How to Generate a Sequence of Unique Random Integers

December 24, 2012

Suppose we wish to generate a sequence of 10000000 random 32-bit integers with no repeats. How can we do it?

$422253117, 3056114362, 1677071617, 478652086, 2970049140, ...$

I faced this problem recently, and considered several options before finally implementing a custom, non-repeating pseudo-random number generator which runs in O(1) time per number, requires just 8 bytes of storage, and has pretty good distribution. I thought I’d share the details here.

Approaches Considered

There are already several well-known pseudo-random number generators (PRNGs) such as the Mersenne Twister, an excellent PRNG which distributes integers uniformly across the entire 32-bit range. Unfortunately, calling this PRNG 10000000 times does not tend to generate a sequence of 10000000 unique values. According to Hash Collision Probabilities, the probability of all 10000000 random numbers being unique is just:

$1.970768 \times 10^{-10112}$

That’s astronomically unlikely. In fact, the expected number of unique values in such sequences is only about 9988367. You can try it for yourself using Python:

>>> len(set([random.randint(0, 0xffffffff) for i in xrange(10000000)]))
9988432


One obvious refinement is to reject random numbers which are already in the sequence, and continue iterating until we’ve reached 10000000 elements. To check whether a specific value is already in the sequence, we could search linearly, or we could keep a sorted copy of the sequence and use a binary search. We could even track the presence of each value explicitly, using a giant 512 MB bitfield or a sparse bitfield such as a Judy1 array.
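The rejection approach is simple enough to sketch directly. Here's a minimal Python version (my own illustration, not code from this post) that tracks seen values in a set and rejects repeats:

```python
import random

def unique_randoms(count, max_value=0xffffffff, seed=None):
    """Return `count` distinct random integers in [0, max_value],
    rejecting any draw that has already appeared."""
    rng = random.Random(seed)
    seen = set()
    result = []
    while len(result) < count:
        x = rng.randint(0, max_value)
        if x not in seen:  # reject repeats and draw again
            seen.add(x)
            result.append(x)
    return result
```

For 10000000 draws out of a 32-bit range, only around 0.1% of draws collide, so the running time stays nearly linear; the real cost is the memory the set occupies.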

Another refinement: Instead of generating an arbitrary 32-bit integer for each element and hoping it’s unique, we could generate a random index in the range [0, N) where N is the number of remaining unused values. The index would tell us which free slot to take next. We could probably locate each free slot in logarithmic time by implementing a trie suited for this purpose.

Brainstorming some more, an approach based on the Fisher-Yates Shuffle is also quite tempting. Using this approach, we could begin with an array containing all possible 32-bit integers, and shuffle the first 10000000 values out of the array to obtain our sequence. That would require 16 GB of memory. The footprint could be reduced by representing the array as a sparse associative map, such as a JudyL array, storing only those x where A[x] ≠ x. Or, instead of starting with an array of all possible 32-bit integers, we could start with an initial sequence of any 10000000 sorted integers. In an attempt to span the available range of 32-bit values, we could even model the initial sequence as a Poisson process.
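The sparse Fisher-Yates variant can be sketched with an ordinary Python dict standing in for the JudyL array, storing only the entries where A[x] ≠ x (again, my own illustrative sketch):

```python
import random

def sparse_shuffle(count, n=2**32, seed=None):
    """Draw `count` distinct integers from [0, n) via a partial
    Fisher-Yates shuffle over a virtual array where A[x] == x,
    keeping only displaced entries in a dict."""
    rng = random.Random(seed)
    displaced = {}  # sparse view of the virtual 2^32-entry array
    result = []
    for i in range(count):
        j = rng.randrange(i, n)
        # Swap A[i] and A[j], reading through the sparse map.
        a_i = displaced.get(i, i)
        a_j = displaced.get(j, j)
        displaced[i] = a_j
        displaced[j] = a_i
        result.append(a_j)
    return result
```

The dict holds at most two entries per draw, so memory is proportional to the sequence length rather than the 16 GB of the dense array.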

All of the above approaches either run in non-linear time, or require large amounts of storage. Several of them would be workable for a sequence of just 10000000 integers, but it got me wondering whether a more efficient approach, one which scales up to any sequence length, is possible.

A Non-Repeating Pseudo-Random Number Generator

The ideal PRNG for this problem is one which would generate a unique, random integer the first $2^{32}$ times we call it, then repeat the same sequence the next $2^{32}$ times it is called, ad infinitum. In other words, a repeating cycle of $2^{32}$ values. That way, we could begin the PRNG at any point in the cycle, always having the guarantee that the next $2^{32}$ values are repeat-free.

One way to implement such a PRNG is to define a one-to-one function on the integers — a function which maps each 32-bit integer to another, uniquely. Let’s call such a function a permutation. If we come up with a good permutation, all we need is to call it with increasing inputs { 0, 1, 2, 3, … }. We could even begin the input sequence at any value.

For some reason, I remembered from first-year Finite Mathematics that when p is a prime number, $x^2\,\bmod\,p$ has some interesting properties. Numbers produced this way are called quadratic residues, and we can compute them in C using the expression x * x % p. In particular, the quadratic residue of x is unique as long as $2x < p$. For example, when p = 11, the quadratic residues of 0, 1, 2, 3, 4, 5 are all unique:

$\begin{array}{l} 0^2\,\bmod\,11 \equiv 0 \\ 1^2\,\bmod\,11 \equiv 1 \\ 2^2\,\bmod\,11 \equiv 4 \\ 3^2\,\bmod\,11 \equiv 9 \\ 4^2\,\bmod\,11 \equiv 5 \\ 5^2\,\bmod\,11 \equiv 3 \\ \end{array}$

As luck would have it, for the remaining integers, the expression p - x * x % p fits perfectly into the remaining slots. This only works for primes p which satisfy $p \equiv 3\,\bmod\,4$: for such primes, $-1$ is not a quadratic residue modulo p, so negating each residue lands on exactly the values left over.
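These claims are easy to check exhaustively for a small prime such as p = 11:

```python
def small_permutation(x, p=11):
    """Quadratic-residue mapping for a prime p with p % 4 == 3:
    residues for x <= p // 2, negated residues for the rest."""
    residue = (x * x) % p
    return residue if x <= p // 2 else p - residue

# Every value in [0, p) appears exactly once, so this is a permutation.
outputs = [small_permutation(x) for x in range(11)]
assert sorted(outputs) == list(range(11))
```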

This gives us a one-to-one permutation on the integers less than p, where p can be any prime satisfying $p \equiv 3\,\bmod\,4$. Seems like a nice tool for building our custom PRNG.

In the case of our custom PRNG, we want a permutation which works on the entire range of 32-bit integers. However, $2^{32}$ is not a prime number. The largest prime below $2^{32}$ is 4294967291, which happens to satisfy $p \equiv 3\,\bmod\,4$. As a compromise, we can write a C++ function which permutes all integers below this prime, and simply maps the 5 remaining integers to themselves.

unsigned int permuteQPR(unsigned int x)
{
    static const unsigned int prime = 4294967291;
    if (x >= prime)
        return x;  // The 5 integers out of range are mapped to themselves.
    unsigned int residue = ((unsigned long long) x * x) % prime;
    return (x <= prime / 2) ? residue : prime - residue;
}


This function, on its own, is not the world's best permutation -- it tends to cluster output values for certain ranges of input -- but it is one-to-one. As such, we can combine it with other one-to-one functions, such as addition and XOR, to achieve a much better permutation. I found the following expression works reasonably well. The intermediateOffset variable acts as a seed, putting a variety of different sequences at our disposal.

permuteQPR((permuteQPR(x) + intermediateOffset) ^ 0x5bf03635);
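Wrapped in a small class, that expression gives a generator along these lines. This is an illustrative Python sketch (the class and member names are my own, and the posted C++ class may differ in its seeding details):

```python
class NonRepeatingSequence:
    """Walks the inputs seed, seed + 1, seed + 2, ... through the
    permutation permuteQPR((permuteQPR(x) + offset) ^ 0x5bf03635),
    producing a cycle of 2**32 distinct 32-bit values.
    (Illustrative sketch; names are mine, not the posted C++ class.)"""

    PRIME = 4294967291  # largest prime below 2**32; PRIME % 4 == 3

    @classmethod
    def _permute_qpr(cls, x):
        if x >= cls.PRIME:
            return x  # the 5 out-of-range integers map to themselves
        residue = (x * x) % cls.PRIME
        return residue if x <= cls.PRIME // 2 else cls.PRIME - residue

    def __init__(self, seed=0, intermediate_offset=0):
        self._index = seed & 0xffffffff
        self._offset = intermediate_offset & 0xffffffff

    def next(self):
        x = self._permute_qpr(
            ((self._permute_qpr(self._index) + self._offset) & 0xffffffff)
            ^ 0x5bf03635
        )
        self._index = (self._index + 1) & 0xffffffff
        return x
```

Every step (the permutation, the masked addition, and the XOR) is one-to-one on 32-bit values, so consecutive inputs can never collide within a cycle of $2^{32}$ calls.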


On GitHub, I've posted a C++ class which implements a pseudo-random number generator based on this expression.

0xc2ab6929
0xa0e2502c
0x95b7c9eb
0x2fb14c01
0x5d983e09
0xba14e8c1
0x90994968
0x07058db1
0x061acc1f
0x969860fe
0x92f5a666
0xfa1b08af
0x18a3354f
...


I've also posted a working project to verify that this PRNG really does output a cycle of $2^{32}$ unique integers.

So, how does the randomness of this generator stack up? I'm not a PRNG expert, so I put my trust in TestU01, a library for testing the quality of PRNGs, published by the University of Montreal. Here's some test code to put our newly conceived PRNG through its paces. It passes all 15 tests in TestU01's SmallCrush test suite, which I guess is pretty decent. It also passes 140/144 tests in the more stringent Crush suite.

========= Summary results of SmallCrush =========

Version:          TestU01 1.2.3
Generator:        ursu_CreateRSU
Number of statistics:  15
Total CPU time:   00:00:49.95

All tests were passed


Perhaps this approach for generating a sequence of unique random numbers is already known, or perhaps it shares attributes with existing PRNGs. If so, I'd be interested to find out. If you wanted, you could probably adapt it to work on ranges of integers other than $2^{32}$ as well. Surfing around, I noticed that OpenBSD implements another non-repeating PRNG, though I'm not sure whether their implementation is cyclical or covers the entire number space.

Incidentally, this PRNG is used in my next post, This Hash Table Is Faster Than a Judy Array.

Do you know any other way to solve this problem?

The way I’ve classically solved this problem is pretty much a kluge, but I thought I’d post regardless.
You need to sample 10000000 times without repetition from a population of 2^32. Therefore we need a random number with a range expressible in 101875163 (= ceiling(log2(2^32 choose 10000000))) bits. That’s a lot. Plus, there’s overflow if we sample from 0 to 2^101875163-1. However, if we want a fast random number, that overflow is kind of necessary. I’ll spare you the details of how my PRNG works because it isn’t anything special (the algorithm is the classic WELL method [1]). Anyway, I had these methods lying around to begin with, as I needed them for other projects of mine (specifically a rather large scale Bloom filter implementation). My method runs about 100x slower than yours though (pre-optimization). I should also note that it works better if you require many smaller sequences of numbers that don’t repeat; in fact it’s optimized for that use case.
[1] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/lfsr04.pdf

It looks curious. Have you ever heard about pseudo-random permutations? I think this is exactly what you are looking for. Goldreich, in “Foundations of Cryptography”, describes a pseudo-random permutation generator based on DES [1]. A very simple description of this algorithm can be found here [2].

For a long time I was fascinated with this exact problem. Only recently I stumbled upon the paper “Ciphers with Arbitrary Finite Domains” (http://www.cs.ucdavis.edu/~rogaway/papers/subset.pdf) which solves the problem quite elegantly. The magic google search term is “Format-preserving encryption”.

A simple linear congruential generator will have a full period if the parameters are chosen correctly; for instance, the x’s in the series x’ = 69069 x + 1 (mod 2^32) will cycle through all 2^32 possible values before repeating. See Knuth’s book Art of Computer Programming, Volume 2, Chapter 3, for the rules on choosing parameters that will give a full period.

That’s true, but when you test one such generator (using the sample code on p. 139 in the TestU01 documentation), it only passes 12/15 tests in the SmallCrush suite:

========= Summary results of SmallCrush =========

Version:          TestU01 1.2.3
Generator:        ulcg_CreateLCG
Number of statistics:  15
Total CPU time:   00:00:27.05
The following tests gave p-values outside [0.001, 0.9990]:
(eps  means a value < 1.0e-300):
(eps1 means a value < 1.0e-15):

Test                          p-value
----------------------------------------------
1  BirthdaySpacings                 eps
2  Collision                        eps
6  MaxOft                           eps
----------------------------------------------
All other tests were passed
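As a sanity check on the full-period claim itself (separate from the statistical quality), the Hull-Dobell conditions behind it for a modulus of 2^k — c odd and a ≡ 1 (mod 4) — can be verified exhaustively at a smaller power of two:

```python
def lcg_cycle_length(a, c, m, start=0):
    """Count steps until x' = (a * x + c) % m returns to its start."""
    x = (a * start + c) % m
    steps = 1
    while x != start:
        x = (a * x + c) % m
        steps += 1
    return steps

# 69069 % 4 == 1 and c = 1 is odd, so the period is the full modulus.
assert lcg_cycle_length(69069, 1, 2**16) == 2**16
```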


Preshing Random Number Generator.
Nice work again Jeff!

Hi Jeff — another way is to use finite fields of order p^n with p=2 and n=32. Then you get a nice fit with the word size. The math is elegant but I didn’t get as good results with TestU01 as you did with permuteQPR, though. A write-up is here:

http://exegetotrope.blogspot.com/2013/01/cyclic-random-number-generation-using.html

How would MurmurHash3 hold up? It has no collisions for 4-byte keys. To generate the numbers you could just do ‘return murmur(counter++)’.

Sounds a bit like blum blum shub: http://en.wikipedia.org/wiki/Blum_Blum_Shub

Well, except blum blum shub is a cryptographic rng and uses the product of two large primes and also only reveals a single bit per squaring (to prevent discovery of the internal state).

That’s cool to learn Blum Blum Shub also makes use of quadratic residues. Thanks for sharing!

Hi,

There’s more than TestU01 to evaluate PRNGs; you might find other tools useful:

- ent

http://www.fourmilab.ch/random/

- rngtest from package rng-tools

http://sourceforge.net/projects/gkernel/

- dieharder

http://www.phy.duke.edu/~rgb/General/dieharder.php

http://www.phy.duke.edu/~rgb/General/dieharder/dieharder.html

- STS

http://csrc.nist.gov/groups/ST/toolkit/rng/index.html