Algorithm
To generate a random string, concatenate characters drawn randomly from the set of acceptable symbols until the string reaches the desired length.
Implementation
Here's some fairly simple and very flexible code for generating random identifiers. Read the information that follows for important application notes.
public class RandomString {
/**
* Generate a random string.
*/
public String nextString() {
for (int idx = 0; idx < buf.length; ++idx)
buf[idx] = symbols[random.nextInt(symbols.length)];
return new String(buf);
}
public static final String upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
public static final String lower = upper.toLowerCase(Locale.ROOT);
public static final String digits = "0123456789";
public static final String alphanum = upper + lower + digits;
private final Random random;
private final char[] symbols;
private final char[] buf;
public RandomString(int length, Random random, String symbols) {
if (length < 1) throw new IllegalArgumentException();
if (symbols.length() < 2) throw new IllegalArgumentException();
this.random = Objects.requireNonNull(random);
this.symbols = symbols.toCharArray();
this.buf = new char[length];
}
/**
* Create an alphanumeric string generator.
*/
public RandomString(int length, Random random) {
this(length, random, alphanum);
}
/**
* Create an alphanumeric strings from a secure generator.
*/
public RandomString(int length) {
this(length, new SecureRandom());
}
/**
* Create session identifiers.
*/
public RandomString() {
this(21);
}
}
Usage examples
Create an insecure generator for 8-character identifiers:
RandomString gen = new RandomString(8, ThreadLocalRandom.current());
Create a secure generator for session identifiers:
RandomString session = new RandomString();
Create a generator with easy-to-read codes for printing. The strings are longer than full alphanumeric strings to compensate for using fewer symbols:
String easy = RandomString.digits + "ACEFGHJKLMNPQRUVWXYabcdefhijkprstuvwx";
RandomString tickets = new RandomString(23, new SecureRandom(), easy);
Use as session identifiers
Generating session identifiers that are likely to be unique is not good enough, or you could just use a simple counter. Attackers hijack sessions when predictable identifiers are used.
There is tension between length and security. Shorter identifiers are easier to guess, because there are fewer possibilities. But longer identifiers consume more storage and bandwidth. A larger set of symbols helps, but might cause encoding problems if identifiers are included in URLs or re-entered by hand.
The underlying source of randomness, or entropy, for session identifiers should come from a random number generator designed for cryptography. However, initializing these generators can sometimes be computationally expensive or slow, so effort should be made to re-use them when possible.
Use as object identifiers
Not every application requires security. Random assignment can be an efficient way for multiple entities to generate identifiers in a shared space without any coordination or partitioning. Coordination can be slow, especially in a clustered or distributed environment, and splitting up a space causes problems when entities end up with shares that are too small or too big.
Identifiers generated without taking measures to make them unpredictable should be protected by other means if an attacker might be able to view and manipulate them, as happens in most web applications. There should be a separate authorization system that protects objects whose identifier can be guessed by an attacker without access permission.
Care must be also be taken to use identifiers that are long enough to make collisions unlikely given the anticipated total number of identifiers. This is referred to as "the birthday paradox." The probability of a collision, p, is approximately n2/(2qx), where n is the number of identifiers actually generated, q is the number of distinct symbols in the alphabet, and x is the length of the identifiers. This should be a very small number, like 2‑50 or less.
Working this out shows that the chance of collision among 500k 15-character identifiers is about 2‑52, which is probably less likely than undetected errors from cosmic rays, etc.
Comparison with UUIDs
According to their specification, UUIDs are not designed to be unpredictable, and should not be used as session identifiers.
UUIDs in their standard format take a lot of space: 36 characters for only 122 bits of entropy. (Not all bits of a "random" UUID are selected randomly.) A randomly chosen alphanumeric string packs more entropy in just 21 characters.
UUIDs are not flexible; they have a standardized structure and layout. This is their chief virtue as well as their main weakness. When collaborating with an outside party, the standardization offered by UUIDs may be helpful. For purely internal use, they can be inefficient.
In Java 1.7 or later, the standard way to do this is as follows:
import java.util.concurrent.ThreadLocalRandom;
// nextInt is normally exclusive of the top value,
// so add 1 to make it inclusive
int randomNum = ThreadLocalRandom.current().nextInt(min, max + 1);
See the relevant JavaDoc. This approach has the advantage of not needing to explicitly initialize a java.util.Random instance, which can be a source of confusion and error if used inappropriately.
However, conversely there is no way to explicitly set the seed so it can be difficult to reproduce results in situations where that is useful such as testing or saving game states or similar. In those situations, the pre-Java 1.7 technique shown below can be used.
Before Java 1.7, the standard way to do this is as follows:
import java.util.Random;
/**
* Returns a pseudo-random number between min and max, inclusive.
* The difference between min and max can be at most
* <code>Integer.MAX_VALUE - 1</code>.
*
* @param min Minimum value
* @param max Maximum value. Must be greater than min.
* @return Integer between min and max, inclusive.
* @see java.util.Random#nextInt(int)
*/
public static int randInt(int min, int max) {
// NOTE: This will (intentionally) not run as written so that folks
// copy-pasting have to think about how to initialize their
// Random instance. Initialization of the Random instance is outside
// the main scope of the question, but some decent options are to have
// a field that is initialized once and then re-used as needed or to
// use ThreadLocalRandom (if using at least Java 1.7).
//
// In particular, do NOT do 'Random rand = new Random()' here or you
// will get not very good / not very random results.
Random rand;
// nextInt is normally exclusive of the top value,
// so add 1 to make it inclusive
int randomNum = rand.nextInt((max - min) + 1) + min;
return randomNum;
}
See the relevant JavaDoc. In practice, the java.util.Random class is often preferable to java.lang.Math.random().
In particular, there is no need to reinvent the random integer generation wheel when there is a straightforward API within the standard library to accomplish the task.
Best Answer
The Birthday Paradox, or why PRNGs produce duplicates more often than you might think.
There are a couple of issues at play in the OP's problem. One is the birthday paradox as mentioned above and the second is the nature of what you are generating, which does not inherently guarantee that a given number will not be repeated.
The Birthday Paradox applies where given value can occur more than once during the period of the generator - and therefore duplicates can happen within a sample of values. The effect of the Birthday Paradox is that the real likelihood of getting such duplicates is quite significant and the average period between them is smaller than one might otherwise have thought. This dissonance between the perceived and actual probabilities makes the Birthday Paradox a good example of a cognitive bias, where a naive intuitive estimate is likely to be wildly wrong.
A quick primer on Pseudo Random Number Generators (PRNGs)
The first part of your problem is that you are taking the exposed value of a random number generator and converting it to a much smaller number, so the space of possible values is reduced. Although some pseudo-random number generators do not repeat values during their period this transformation changes the domain to a much smaller one. The smaller domain invalidates the 'no repeats' condition so you can expect a significant likelihood of repeats.
Some algorithms, such as the linear congruential PRNG (
A'=AX|M
) do guarantee uniqueness for the entire period. In an LCG the generated value contains the entire state of the accumulator and no additional state is held. The generator is deterministic and cannot repeat a number within the period - any given accumulator value can imply only one possible successive value. Therefore, each value can only occur once within the period of the generator. However, the period of such a PRNG is relatively small - about 2^30 for typical implementations of the LCG algorithm - and cannot possibly be larger than the number of distinct values.Not all PRNG algorithms share this characteristic; some can repeat a given value within the period. In the OP's problem, the Mersenne Twister algorithm (used in Python's random module) has a very long period - much greater than 2^32. Unlike a Linear Congruential PRNG, the result is not purely a function of the previous output value as the accumulator contains additional state. With 32-bit integer output and a period of ~2^19937, it cannot possibly provide a such a guarantee.
The Mersenne Twister is a popular algorithm for PRNGs because it has good statistical and geometric properties and a very long period - desirable characteristics for a PRNG used on simulation models.
Good statistical properties mean that the numbers generated by the algorithm are evenly distributed with no numbers having a significantly higher probability of appearing than others. Poor statistical properties could produce unwanted skew in the results.
Good geometric properies mean that sets of N numbers do not lie on a hyperplane in N-dimensional space. Poor geometric properties can generate spurious correlations in a simulation model and distort the results.
A long period means that you can generate a lot of numbers before the sequence wraps around to the start. If a model needs a large number of iterations or has to be run from several seeds then the 2^30 or so discrete numbers available from a typical LCG implementation may not be sufficient. The MT19337 algorithm has a very long period - 2^19337-1, or about 10^5821. By comparison, the total number of atoms in the universe is estimated at about 10^80.
The 32-bit integer produced by an MT19337 PRNG cannot possibly represent enough discrete values to avoid repeating during such a large period. In this case, duplicate values are likely to occur and inevitable with a large enough sample.
The Birthday Paradox in a nutshell
This problem is originally defined as the probability of any two people in the room sharing the same birthday. The key point is that any two people in the room could share a birthday. People tend to naively misinterpret the problem as the probability of someone in the room sharing a birthday with a specific individual, which is the source of the cognitive bias that often causes people to underestimate the probability. This is the incorrect assumption - there is no requirement for the match to be to a specific individual and any two individuals could match.
The probability of a match occurring between any two individuals is much higher than the probability of a match to a specific individual as the match does not have to be to a specific date. Rather, you only have to find two individuals that share the same birthday. From this graph (which can be found on the Wikipedia page on the subject), we can see that we only need 23 people in the room for there to be a 50% chance of finding two that match in this way.
From the Wikipedia entry on the subject we can get a nice summary. In the OP's problem, we have 4,500 possible 'birthdays', rather than 365. For a given number of random values generated (equating to 'people') we want to know the probability of any two identical values appearing within the sequence.
Computing the likely effect of the Birthday Paradox on the OP's problem
For a sequence of 100 numbers, we have pairs (see Understanding the Problem) that could potentially match (i.e. the first could match with the second, third etc., the second could match the third, fourth etc. and so on), so the number of combinations that could potentially match is rather more than just 100.
From Calculating the Probability we get an expression of . The following snippet of Python code below does a naive evaluation of the probability of a matching pair occurring.
This produces a sensible looking result of p=0.669 for a match occurring within 100 numbers sampled from a population of 4500 possible values. (Maybe someone could verify this and post a comment if it's wrong). From this we can see that the lengths of runs between matching numbers observed by the OP seem to be quite reasonable.
Footnote: using shuffling to get a unique sequence of pseudo-random numbers
See this answer below from S. Mark for a means of getting a guaranteed unique set of random numbers. The technique the poster refers to takes an array of numbers (which you supply, so you can make them unique) and shuffles them into a random order. Drawing the numbers in sequence from the shuffled array will give you a sequence of pseudo-random numbers that are guaranteed not to repeat.
Footnote: Cryptographically Secure PRNGs
The MT algorithm is not cryptographically secure as it is relatively easy to infer the internal state of the generator by observing a sequence of numbers. Other algorithms such as Blum Blum Shub are used for cryptographic applications but may be unsuitable for simulation or general random number applications. Cryptographically secure PRNGs may be expensive (perhaps requiring bignum calculations) or may not have good geometric properties. In the case of this type of algorithm, the primary requirement is that it should be computationally infeasible to infer the internal state of the generator by observing a sequence of values.