Tuesday 12 January 2010

Pseudo-Uniqueness

Pseudo-Uniqueness is a topic that came up during a project I took part in last year. Even though I already brought it up on that project's blog, it is a subject that I think would fit well into the spirit of the Reflection 42 blog, so here it is. Enjoy!

In the project, we were faced with the need to dynamically create unique user identifiers that did not follow a naming pattern (i.e. they needed to be very hard to guess).

So what did we do? Well, I wrote a utility class that creates hashes from input strings, then at user creation we passed the user's data to that class which returned an identifier. Using the MD5 algorithm for hashing, we would get a 32-digit hexadecimal number which would be virtually unique. Now I can hear some of you think "Virtually unique? How can you be sure it will not collide with an already existing ID?". Well, that is a good question and the short answer is that I cannot!

If you are interested, the long answer now follows.

If I generate a 32-digit hexadecimal number, it consists of 128 bits. 128 bits can be set in 2^128 different combinations, which means that the chance of getting the same hash for another user is 1 in 340282366920938463463374607431768211456 (39 digits). Lets say we have an unrealistically busy site, generating a million new users each day. Then after a thousand years we will have generated less than 4*10^11 different hashes. This means that we have a chance of 4*10^11 in 2^128 of calculating an already existing hash at this point with a probabiltiy in the magnitude of 10^-28.

Maybe our site is really successful and continues until no life can exist on earth (8Gyears from now). Well I am not going to bore you with the calculations for this one, but at that point we would have generated 3.2*10^21 different hashes. This means we would have the overwhelming probability of below 10^-18 of hitting one of the hashes we already used.

This is why I say virtually unique. The chance of getting the same ID is so small that it will probably never happen, and if it does the probability is so small that it probably did not. So I now leave it up to you to decide if this is unique enough for you!

By the way, I went with an SHA algorithm generating 160 bits instead, but that's just me!

1 comment:

  1. Fy fan, måste läsa mer här, grym information! :D
    Tack så mycket ^^
    //Victor

    ReplyDelete