I have been given this interview question:

Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.

My analysis:

The size of the file is 4×10⁹×4 bytes = 16 GB.

We can do external sorting, thus letting us know the range of the integers.

My question is: what is the best way to detect a missing integer in such a large sorted set of integers?

My understanding (after reading all the answers):

Assuming we are talking about 32-bit integers, there are 2³² ≈ 4.29 * 10⁹ distinct integers, so a file of four billion integers cannot contain all of them.

Case 1: we have 1 GB = 1 * 10⁹ * 8 bits = 8 billion bits memory.

Solution:

If we use one bit to represent each distinct integer, 2³² bits (512 MB) is enough, so we don't need to sort.

Implementation:

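import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

int radix = 8;
// One bit per 32-bit value: 2^32 / 8 = 2^29 bytes (512 MB).
// (0xffffffff is -1 as a Java int, so 0xffffffff/radix would give a zero-length array.)
byte[] bitfield = new byte[1 << 29];
void F() throws FileNotFoundException {
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        // Treat the value as unsigned so negative ints map to valid indices.
        long n = in.nextInt() & 0xFFFFFFFFL;
        bitfield[(int) (n / radix)] |= (1 << (n % radix));
    }

    for (int i = 0; i < bitfield.length; i++) {
        for (int j = 0; j < radix; j++) {
            if ((bitfield[i] & (1 << j)) == 0) {
                // Convert the unsigned bit position back to a signed int.
                System.out.println((int) ((long) i * radix + j));
                return; // one missing integer is all that was asked for
            }
        }
    }
}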
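
To try the one-bit-per-value idea without allocating 512 MB, here is a minimal sketch on a reduced universe. The class name `BitmapMissingDemo`, the `universeSize` parameter, and the use of `java.util.BitSet` are assumptions made for illustration and are not part of the original code.

```java
import java.util.BitSet;

// Hypothetical demo class: the same one-bit-per-value idea on a small universe.
class BitmapMissingDemo {
    // Returns the smallest value in [0, universeSize) absent from values,
    // or -1 if every value in the universe is present.
    static int findMissing(int[] values, int universeSize) {
        BitSet seen = new BitSet(universeSize);
        for (int v : values) {
            if (v >= 0 && v < universeSize) {
                seen.set(v); // mark v as present
            }
        }
        int candidate = seen.nextClearBit(0); // first bit that is still unset
        return candidate < universeSize ? candidate : -1;
    }

    public static void main(String[] args) {
        int[] values = {0, 1, 2, 4, 5, 7};
        System.out.println(findMissing(values, 8)); // prints 3
    }
}
```

A `BitSet` is indexed by a non-negative `int`, so it cannot cover all 2³² values on its own; the full-size version needs a raw array as in the code above.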

Case 2: 10 MB memory = 10 * 10⁶ * 8 bits = 80 million bits

Solution:

There are 2¹⁶ = 65536 possible 16-bit prefixes, so we build 65536 buckets. Each bucket needs a 4-byte counter, because in the worst case all four billion integers fall into the same bucket. That takes 2¹⁶ * 4 * 8 = 2 million bits.

1. Build the counter for each bucket in a first pass through the file.

2. Scan the buckets and find the first one with fewer than 65536 hits.

3. In a second pass through the file, build new buckets, one per low 16-bit value, for the integers whose high 16-bit prefix is the one found in step 2.

4. Scan the buckets built in step 3 and find the first one that doesn't have a hit.

The code is very similar to the one above.
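
The steps above can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: an `int[]` stands in for the file (each loop over it corresponds to one pass), and the class name `TwoPassMissingDemo` is made up. The counters are `long` because a signed Java `int` cannot hold a count of four billion.

```java
import java.util.BitSet;

// Hypothetical sketch: the int[] stands in for the file; each loop is one pass.
class TwoPassMissingDemo {
    static final int BUCKETS = 1 << 16;     // one bucket per 16-bit prefix
    static final int BUCKET_SIZE = 1 << 16; // distinct values sharing one prefix

    // Returns some 32-bit value that does not occur in data.
    static int findMissing(int[] data) {
        // Pass 1: count how many inputs fall under each high 16-bit prefix.
        long[] counts = new long[BUCKETS];  // 512 KB of counters
        for (int v : data) {
            counts[v >>> 16]++;             // unsigned shift handles negatives
        }

        // Pick the first bucket with fewer than 65536 hits. One must exist
        // because the total number of inputs is below 2^32.
        int prefix = 0;
        while (counts[prefix] >= BUCKET_SIZE) {
            prefix++;
        }

        // Pass 2: mark which low 16-bit suffixes occur under that prefix.
        BitSet seen = new BitSet(BUCKET_SIZE); // 8 KB
        for (int v : data) {
            if ((v >>> 16) == prefix) {
                seen.set(v & 0xFFFF);
            }
        }
        int suffix = seen.nextClearBit(0);  // below 65536 since count < 65536
        return (prefix << 16) | suffix;
    }

    public static void main(String[] args) {
        int[] data = new int[BUCKET_SIZE + 2];
        for (int i = 0; i < BUCKET_SIZE; i++) data[i] = i; // fills prefix 0
        data[BUCKET_SIZE] = 0x10000;     // prefix 1, suffix 0
        data[BUCKET_SIZE + 1] = 0x10001; // prefix 1, suffix 1
        System.out.println(findMissing(data)); // prints 65538
    }
}
```

Duplicates can push a bucket's count to 65536 or more even when that bucket still has gaps, so the sketch only relies on the opposite direction: a count below 65536 proves a gap exists.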

Conclusion:
We reduce memory usage by making more passes over the file.

A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file, or at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.

Best Answer

Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes, in one pass through the input file. At least one of the buckets will have been hit fewer than 2¹⁶ times. Do a second pass to find out which of the possible numbers in that bucket are used already.
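
The claim that some bucket is hit fewer than 2¹⁶ times is a pigeonhole argument. Spelled out, with N as the number of integers in the file and c_p as the count for prefix p (notation introduced here for illustration):

```latex
\sum_{p=0}^{2^{16}-1} c_p \;=\; N \;\le\; 4\times 10^{9} \;<\; 2^{32} \;=\; 2^{16}\cdot 2^{16}
\quad\Longrightarrow\quad
\min_{p}\, c_p \;<\; 2^{16}
```

If every one of the 2¹⁶ counts were at least 2¹⁶, the sum would be at least 2³², which contradicts N < 2³².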

If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.

If "integer" means mathematical integer: Read through the input once and keep track of the ~~largest number~~ length of the longest number you've ever seen. When you're done, output ~~the maximum plus one~~ a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it).
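
A minimal sketch of the mathematical-integer case, under stated assumptions: the input is whitespace-separated decimal text, the class name `LongerNumberDemo` is made up, and the output is the fixed choice of 1 followed by zeros rather than a random number. It reads one character at a time so that a very long number never has to be held in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch for the unbounded-integer case.
class LongerNumberDemo {
    // Returns a decimal string with one more digit than the longest run of
    // digits in the input, so it is larger than every number in the input.
    static String findMissing(Reader in) {
        long maxDigits = 0;
        long run = 0; // length of the digit run currently being read
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c >= '0' && c <= '9') {
                    run++;
                    maxDigits = Math.max(maxDigits, run);
                } else {
                    run = 0; // a separator or sign ends the current run
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Build 10^maxDigits; a real version would stream this to the output
        // instead, since the string itself may not fit in 10 MB.
        StringBuilder sb = new StringBuilder("1");
        for (long i = 0; i < maxDigits; i++) {
            sb.append('0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(findMissing(new StringReader("7 -12345 42"))); // prints 100000
    }
}
```

Leading zeros are counted as digits here, which can only make the result longer than necessary, never wrong.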

Generate an integer that is not among four billion given ones
