The probability of a hash collision (2022)
16 comments
June 22, 2025 · spit2wind
dcminter
I too thought of it (in a good way); "and an iron will" raised a chuckle from me.
I second the recommendation and often nudge colleagues towards that article.
meindnoch
In "Appendix A: Supporting proofs for approximations" the author justifies the approximation of 1-x with e^(-x) by computing lim_{x→0} (1-x)/e^(-x) = 1.
This is wrong. Consider that this criterion is also satisfied by 1-100000x, since lim_{x→0} (1-x)/(1-100000x) = 1, yet 1-100000x is clearly not a good first-order approximation of 1-x around 0.
The proper justification for replacing 1-x with e^-x around 0 comes from comparing the first two terms of their Taylor expansions, in other words, the functions' values at 0 and their first derivatives at 0. Since these match for 1-x and e^-x, each is a good first-order approximation of the other around 0.
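A quick numeric check of this point (a Python sketch with illustrative values of x): the error of e^-x against 1-x shrinks quadratically, because the constant and linear Taylor terms match, while the error of 1-100000x shrinks only linearly, even though both pass the ratio test.

```python
import math

# Both e^-x and 1-100000x satisfy lim_{x->0} (1-x)/f(x) = 1, but only
# e^-x is a good first-order approximation of 1-x: its error is ~x^2/2,
# while the error of 1-100000x is 99999*x, shrinking only linearly.
for x in [1e-2, 1e-3, 1e-4]:
    err_exp = abs(math.exp(-x) - (1 - x))
    err_bad = abs((1 - 100000 * x) - (1 - x))
    print(f"x={x:g}  |e^-x - (1-x)| = {err_exp:.2e}  |(1-100000x) - (1-x)| = {err_bad:.2e}")
```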
perihelions
The quickest way to work around the numeric overflow issues is to use the Stirling approximation of the logarithm of the factorial,
https://en.wikipedia.org/wiki/Stirling's_approximation#Speed...
You can build on that to get good, composable approximations of any of the standard combinatorial functions, in log-space (and recover the values you want by simply exponentiating back). For example, once you've implemented an approximate ln-fac(n) ≈ ln(fac(n)), you immediately get ln-choose, the logarithm of n!/(k!(n-k)!), as simply ln-fac(n) - ln-fac(k) - ln-fac(n-k). Fully composable: if ln-fac() is a numerically good approximation, then so is any reasonable sum or difference of it.
Or: the log of the binomial distribution PMF is simply ln-choose(n,k) + k*ln(p) + (n-k)*ln(1-p).
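A minimal Python sketch of this log-space approach, using the standard library's lgamma in place of a hand-rolled Stirling ln-fac (the function names here are mine):

```python
import math

def ln_fac(n):
    # ln(n!) via lgamma; lgamma itself uses Stirling-style expansions
    # internally for large arguments.
    return math.lgamma(n + 1)

def ln_choose(n, k):
    # ln of n! / (k! (n-k)!), composed exactly as described above.
    return ln_fac(n) - ln_fac(k) - ln_fac(n - k)

def binom_log_pmf(n, k, p):
    # log of the binomial PMF: ln C(n,k) + k ln(p) + (n-k) ln(1-p)
    return ln_choose(n, k) + k * math.log(p) + (n - k) * math.log1p(-p)

# Exponentiating back recovers probabilities that would overflow if
# computed via raw factorials:
print(math.exp(binom_log_pmf(10**6, 500000, 0.5)))  # ≈ 0.000798
```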
sudhirb
A more naive but less complex way to avoid overflows is to use the original factorial expression but work with logs. In a rough experiment in Python with k=10^8 and N=10^17 (numbers plucked out of thin air), the calculation takes ~20s on an M1 MacBook Pro.
This didn't feel too egregious until I looked at the space of inputs and outputs for SHA-256: 2^256 possible outputs, and an absolutely stonking 2^(2^64) - 1 possible inputs (messages of up to 2^64 - 1 bits).
It feels obvious but I'm still surprised by the efficiency of approximations.
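A sketch of that log-space computation in Python, with a small k so it runs instantly (the classic 23-people/365-days case, rather than the k=10^8 experiment above):

```python
import math

def collision_prob_exact(k, N):
    # ln p(no collision) = sum of ln(1 - i/N) for i = 0..k-1;
    # log1p keeps each term accurate when i/N is tiny.
    log_p_no_collision = sum(math.log1p(-i / N) for i in range(k))
    # 1 - p(no collision), computed accurately via expm1.
    return -math.expm1(log_p_no_collision)

# Classic birthday check: 23 people, 365 days gives just over 50%.
print(collision_prob_exact(23, 365))  # ≈ 0.5073
```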
j-pb
The fun part is when you take that approximation and apply it to 2^n.
If you have n bits, your collision probability reaches roughly 0.5 after generating about 2^(n/2) values.
Or put differently: a 128-bit UUID gives you a "good enough" 64-bit distributed autoincrement.
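Checking the rule of thumb with the usual approximation p ≈ 1 - e^(-k²/2N): at k = 2^(n/2) the probability is about 0.39 rather than exactly 0.5 (the exact 50% point sits near 1.18·2^(n/2)), so the rule rounds up the right order of magnitude. A Python sketch:

```python
import math

n = 128                       # bits, as in a UUID-sized space
N = 2.0 ** n
k = 2.0 ** (n / 2)            # the rule-of-thumb count

# p ≈ 1 - exp(-k^2 / 2N); at k = 2^(n/2) this is 1 - e^(-1/2).
p = -math.expm1(-k * k / (2 * N))
print(p)                      # ≈ 0.3935

# The k where p hits exactly 0.5 is sqrt(2 N ln 2), i.e. about
# 1.177 times the rule-of-thumb value.
k_half = math.sqrt(2 * N * math.log(2))
print(k_half / 2 ** (n / 2))  # ≈ 1.177
```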
bravesoul2
Good rule of thumb there
Simon_O_Rourke
I've had the privilege (?) of experiencing one in the wild only once. I was hashing a few columns to create a hash key for a dimensional table in Snowflake, and two wholly different rows clashed.
permalac
Honest question.
How does one write something like this?
I get the interest, and the review process. What I mean is: is this a hobby where someone is passionate about something, or do some employers allow people to work on side projects like this?
I feel my life is mostly about delivering direct value, and I don't really understand where I went wrong on the path to meaningful knowledge.
Any philosophical help will be very welcome; as you correctly guessed, I'm a bit lost.
richardwhiuk
Almost certainly a hobby. An employer would probably want something like this on the company blog so that they get the benefit.
rienbdj
How many values can a UUID v4 take?
How many do you have to generate before a collision becomes a remote possibility?
NickHoff
A UUID v4 is a 128-bit number, but 4 bits are reserved to specify the version and 2 more bits are reserved to specify the variant, which leaves 122 bits of randomness. That means it can take on about 5 x 10^36 possible values. Following the birthday math, you'd have to generate about 103 trillion UUIDs to have a one-in-a-billion chance of a collision.
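Redoing that arithmetic in Python with the small-p birthday approximation p ≈ k²/2N, i.e. k ≈ sqrt(2Np):

```python
import math

N = 2.0 ** 122                 # possible values of a v4 UUID's random bits
print(N)                       # ≈ 5.3e36

# Number of UUIDs for a one-in-a-billion collision chance:
# p ≈ k^2 / 2N  =>  k ≈ sqrt(2 N p)
k = math.sqrt(2 * N * 1e-9)
print(k)                       # ≈ 1.03e14, i.e. about 103 trillion
```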
toast0
The question becomes: how bad is your random?
If your random is not uniformly distributed, you might get duplication from bias.
If your random is set up wrong and you get the same seed multiple times, you might get duplication from that.
If your random is good, the birthday math should hold.
masklinn
A UUID v4 has 122 bits of payload.
Depends on what you consider "a remote possibility" to be (the birthday attack wiki page has a table for various p and powers of 2).
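A Python sketch of the kind of table that wiki page gives, using the standard inversion k ≈ sqrt(2 · 2^n · ln(1/(1-p))):

```python
import math

def values_for_probability(n_bits, p):
    # Number of random n-bit values needed before the collision
    # probability reaches p: k ≈ sqrt(2 * 2^n * ln(1/(1-p))).
    # -log1p(-p) computes ln(1/(1-p)) accurately for small p.
    return math.sqrt(2 * 2.0 ** n_bits * -math.log1p(-p))

# One row per probability, for the 122 random bits of a v4 UUID.
for p in (1e-9, 1e-6, 0.01, 0.5):
    print(f"p={p:<6g} k ≈ {values_for_probability(122, p):.3g}")
```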
rurban
You can do bigints or you can do approximations. I've tried about 8 different implementations in smhasher (for 32-bit to 256-bit hashes). See this massacre:
// Naive multiplication, no accuracy at all
static double ExpectedNBCollisions_Slow ( const double nbH, const double nbBits )
{
    long balls = nbH;
    long double bins = nbBits;
    long double result = 1.0;
    for (long i = 1; i < balls / 2; i++) {
        // take a pair from the front and the end to minimize errors
        result *= ((bins - i) / bins) * ((bins - (nbH - i)) / bins);
    }
    return (double)(nbH * result);
}

// TODO This only works for a low number of collisions
static inline double ExpectedCollisions ( const double balls, const double bins )
{
    return balls - (bins * (1 - pow((bins - 1)/bins, balls)));
}

// Still too inaccurate: https://preshing.com/20110504/hash-collision-probabilities/
static double EstimateNbCollisions_Taylor(const double nbH, const double nbBits)
{
    const long double k = nbH;
    const long double b = nbBits;
    return (double)(k * (1.0 - expl(-0.5 * k * (k - 1.0) / b)));
}

// demerphq: (double(count) * double(count-1)) / pow(2.0,double(sizeof(hashtype) * 8 + 1));
// the very same as our calc. pow 2 vs exp2. Just the high cutoff is missing here.
static double EstimateNbCollisions_Demerphq(const double nbH, const double nbBits)
{
    return (nbH * (nbH - 1)) / pow(2.0, nbBits + 1);
}

// GNU R: qbirthday. rough estimate. FIXME
static double EstimateNbCollisions_R(const double nbH, const double nbBits)
{
    return ceil(exp(((log(nbH) + lgamma(3) + log(-log1p(-0.5)))) / 2));
}

// GNU R: pbirthday. FIXME
/*
static double EstimateNbCollisions_Rp(const double c)
{
    return (1 - prod((c:(c-0.5+1))/rep(2, 0.5)));
}
*/

// The previous best calculation, highly prone to inaccuracies with low results (1.0 - 10.0)
// TODO: return also the error.
static double EstimateNbCollisions_previmpl(const double nbH, const double nbBits)
{
    double exp = exp2(nbBits); // 2 ^ bits
    double result = (nbH * (nbH-1)) / (2.0 * exp);
    if (result > nbH)
        result = nbH;
    // improved floating point accuracy
    if (result <= exp || nbBits > 32)
        return result;
    return result - exp;
}

static double EstimateNbCollisions_fwojcik(const double nbH, const int nbBits)
{
    // If the probability that there are 1 or more collisions (p(C >=
    // 1)) is not much higher than the probability of exactly 1
    // collision (p(C == 1)), then the classically-good approximation
    // of the probability of any collisions is also a good estimate
    // for the expected number of collisions.
    //
    // If there are 2**n buckets and 2**(n-r) hashes, then the ratio
    // of p(C >= 1)/p(C == 1) is about 1/(1-2**(n-2r-1)). This uses
    // the new estimator if that ratio is > 1 + 2**-8. That cutoff
    // minimizes the error around the values we care about.
    if (nbBits - 2.0*log2(nbH) >= 8 - 1) {
        return nbH * (nbH - 1) * exp2(-nbBits-1);
    }
    // The probability that any given hash bucket is empty after nbH
    // insertions is:
    //    pE = ((2**nbBits - 1)/(2**nbBits))**nbH
    // so we compute:
    //    ln(pE) = nbH * ln((2**nbBits - 1)/(2**nbBits))
    //           = nbH * ln(1 - 1/2**(nbBits))
    //           = nbH * ln(1 - 2**(-nbBits))
    //           = nbH * ln(1 + -(2**(-nbBits)))
    // This means the probability that any given hash bucket is
    // occupied after nbH insertions is:
    //    pF = 1 - pE
    //    pF = 1 - exp(ln(pE))
    //    pF = -(exp(ln(pE)) - 1)
    //    pF = -expm1(ln(pE))
    // And the expected number of collisions is:
    //    C = m - n + n * pE
    //    C = m - n * (1 - pE)
    //    C = n * (m/n - 1 + pE)
    //    C = n * (m/n - (1 - pE))
    //    C = n * (m/n - pF)
    //    C = n * (m/n - (-expm1(ln(pE))))
    //    C = n * (m/n + expm1(ln(pE)))
    // Since the format of floats/doubles is k*2**n, multiplying by
    // exp2(x) doesn't lose any precision, and this formulation keeps
    // m/n and pF at the same general orders of magnitude, so it tends
    // to have very good precision. At low hash occupancy, pF is too
    // close to m/n for this formula to work well.
    double logpE = (double)nbH * log1p(-exp2(-nbBits));
    double result = exp2(nbBits) * (exp2(-nbBits) * (double)nbH + expm1(logpE));
    return result;
}
racedude
Good to know, thanks
The author uses writing techniques like those given by Joel Spolsky:
- Rule 1: Be Funny
- Rule 2: Writing a spec is like writing code for a brain to execute
- Rule 3: Write as simply as possible
- Rule 4: Review and reread several times
The author doesn't integrate the humor quite as seamlessly as Joel does, yet it's interesting to see how effective the style is, even for someone still learning it. I commend them for making the topic more accessible. It was probably fun to write and was definitely fun to read!
https://www.joelonsoftware.com/2000/10/15/painless-functiona...