Easy interview question got harder: given numbers 1..100, find the missing number(s) given exactly k are missing

algorithm arrays math

I had an interesting job interview experience a while back. The question started really easy:

Q1: We have a bag containing numbers 1, 2, 3, …, 100. Each number appears exactly once, so there are 100 numbers. Now one number is randomly picked out of the bag. Find the missing number.

I've heard this interview question before, of course, so I very quickly answered along the lines of:

A1: Well, the sum of the numbers 1 + 2 + 3 + … + N is (N+1)(N/2) (see Wikipedia: sum of arithmetic series). For N = 100, the sum is 5050.

Thus, if all numbers are present in the bag, the sum will be exactly 5050. Since one number is missing, the sum will be less than this, and the difference is that number. So we can find that missing number in O(N) time and O(1) space.
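A minimal C++ sketch of this sum-based answer follows. The function name `find_missing_one` and the use of `std::vector` as the "bag" are illustrative choices, not part of the original question, and the code assumes the input really is 1..N with exactly one value removed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the single number missing from 1..n, given the n-1 remaining numbers.
// Assumes n*(n+1) fits in 64 bits and that exactly one number is missing.
std::uint64_t find_missing_one(const std::vector<std::uint64_t>& bag, std::uint64_t n) {
    assert(bag.size() + 1 == n);
    const std::uint64_t expected = n * (n + 1) / 2;  // 1 + 2 + ... + n
    std::uint64_t actual = 0;
    for (std::uint64_t x : bag) {
        actual += x;  // one O(N) pass, O(1) extra space
    }
    return expected - actual;
}
```

For example, calling it with the bag {1, 2, 4, 5} and n = 5 gives expected = 15 and actual = 12, so the result is 3.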

At this point I thought I had done well, but all of a sudden the question took an unexpected turn:

Q2: That is correct, but now how would you do this if TWO numbers are missing?

I had never seen/heard/considered this variation before, so I panicked and couldn't answer the question. The interviewer insisted on knowing my thought process, so I mentioned that perhaps we can get more information by comparing against the expected product, or perhaps doing a second pass after having gathered some information from the first pass, etc, but I really was just shooting in the dark rather than actually having a clear path to the solution.

The interviewer did try to encourage me by saying that having a second equation is indeed one way to solve the problem. At this point I was kind of upset (for not knowing the answer beforehand), and asked if this is a general (read: "useful") programming technique, or if it's just a trick/gotcha answer.

The interviewer's answer surprised me: you can generalize the technique to find 3 missing numbers. In fact, you can generalize it to find k missing numbers.

Qk: If exactly k numbers are missing from the bag, how would you find them efficiently?

This was a few months ago, and I still can't figure out what this technique is. Obviously there's an Ω(N) time lower bound since we must scan all the numbers at least once, but the interviewer insisted that the TIME and SPACE complexity of the solving technique (minus the O(N) time input scan) is defined in k, not N.

So the question here is simple:

How would you solve Q2?
How would you solve Q3?
How would you solve Qk?

Clarifications

Generally there are N numbers from 1..N, not just 1..100.
I'm not looking for the obvious set-based solution, e.g. using a bit set, encoding the presence/absence of each number by the value of a designated bit, therefore using O(N) bits in additional space. We can't afford any additional space proportional to N.
I'm also not looking for the obvious sort-first approach. This and the set-based approach are worth mentioning in an interview (they are easy to implement, and depending on N, can be very practical). I'm looking for the Holy Grail solution (which may or may not be practical to implement, but has the desired asymptotic characteristics nevertheless).

So again, of course you must scan the input in O(N), but you can only capture a small amount of information (defined in terms of k, not N), and must then find the k missing numbers somehow.

Best Answer

Here's a summary of Dimitris Andreou's link.

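Keep track of the sums of the i-th powers of the numbers, for i = 1, 2, ..., k. Subtracting these from the corresponding sums over 1..n reduces the problem to solving the system of equations below, where a₁, ..., a_k are the missing numbers and b_i is the sum of their i-th powers: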

a₁ + a₂ + ... + a_k = b₁

a₁² + a₂² + ... + a_k² = b₂

...

a₁^k + a₂^k + ... + a_k^k = b_k

Using Newton's identities, knowing the b_i allows us to compute

c₁ = a₁ + a₂ + ... + a_k

c₂ = a₁a₂ + a₁a₃ + ... + a_{k-1}a_k

...

c_k = a₁a₂ ... a_k
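For reference, the first three of Newton's identities, written with the b_i and c_i used here, followed by a small numeric check. The example values (missing numbers 1, 2 and 4) are chosen purely for illustration.

```latex
\begin{aligned}
c_1 &= b_1 \\
c_2 &= \tfrac{1}{2}\,(c_1 b_1 - b_2) \\
c_3 &= \tfrac{1}{3}\,(c_2 b_1 - c_1 b_2 + b_3) \\[4pt]
\text{For } \{a_1, a_2, a_3\} = \{1, 2, 4\}:\quad
b_1 &= 7,\quad b_2 = 21,\quad b_3 = 73 \\
c_1 &= 7 = 1 + 2 + 4 \\
c_2 &= \tfrac{1}{2}(49 - 21) = 14 = 1\cdot 2 + 1\cdot 4 + 2\cdot 4 \\
c_3 &= \tfrac{1}{3}(98 - 147 + 73) = 8 = 1\cdot 2\cdot 4
\end{aligned}
```

In general, m·c_m is the alternating sum c_{m-1}b₁ - c_{m-2}b₂ + ... ± c₀b_m with c₀ = 1, which is why each step divides by m.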

If you expand the polynomial (x-a₁)...(x-a_k), the coefficients will be exactly c₁, ..., c_k up to sign (see Viète's formulas). Since every polynomial factors uniquely (the ring of polynomials over a field is a Euclidean domain), this means the a_i are uniquely determined, up to permutation.

This completes the proof that remembering the power sums is enough to recover the numbers. For constant k, this is a good approach.

However, when k is varying, the direct approach of computing c₁,...,c_k is prohibitively expensive, since e.g. c_k is the product of all missing numbers, which can be as large as n!/(n-k)!. To overcome this, perform the computations in the field Z_q, where q is a prime such that n <= q < 2n; it exists by Bertrand's postulate. The proof doesn't need to be changed, since the formulas still hold, and factorization of polynomials is still unique. You also need an algorithm for factorization over finite fields, for example the one by Berlekamp or Cantor-Zassenhaus.

High level pseudocode for constant k:

Compute the sums of the i-th powers of the given numbers, for i = 1, ..., k.
Subtract them from the sums of the i-th powers of 1..n to get the sums of the i-th powers of the unknown numbers. Call these sums b_i.
Use Newton's identities to compute the coefficients from the b_i; call them c_i. Basically, c₁ = b₁; c₂ = (c₁b₁ - b₂)/2; see Wikipedia for the exact formulas.
Factor the polynomial x^k - c₁x^(k-1) + c₂x^(k-2) - ... + (-1)^k c_k.
The roots of the polynomial are the needed numbers a₁, ..., a_k.
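As an illustration, here is how these steps might look for k = 2 (the interviewer's Q2), where the polynomial is a quadratic and can be solved directly. The name `find_missing_two` is made up for this sketch, and it assumes N is small enough that the sum of squares fits in 64 bits, so no modular reduction is done.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Finds the two numbers missing from 1..n, returned as (smaller, larger).
// Assumes exactly two numbers are missing and n^3 fits in a signed 64-bit integer.
std::pair<std::int64_t, std::int64_t> find_missing_two(const std::vector<std::int64_t>& bag,
                                                       std::int64_t n) {
    assert(static_cast<std::int64_t>(bag.size()) + 2 == n);
    std::int64_t b1 = n * (n + 1) / 2;                // sum of 1..n
    std::int64_t b2 = n * (n + 1) * (2 * n + 1) / 6;  // sum of squares of 1..n
    for (std::int64_t x : bag) {
        b1 -= x;
        b2 -= x * x;
    }
    // Now b1 = a1 + a2 and b2 = a1^2 + a2^2 for the missing a1, a2.
    const std::int64_t c1 = b1;
    const std::int64_t c2 = (c1 * b1 - b2) / 2;  // Newton's identity: c2 = a1 * a2
    // The missing numbers are the roots of x^2 - c1*x + c2.
    const std::int64_t disc = c1 * c1 - 4 * c2;  // equals (a1 - a2)^2
    const std::int64_t root = std::llround(std::sqrt(static_cast<double>(disc)));
    return {(c1 - root) / 2, (c1 + root) / 2};
}
```

With the bag {1, 2, 4, 6} and n = 6, the sums give b₁ = 8 and b₂ = 34, so c₂ = 15 and the roots of x² - 8x + 15 are 3 and 5.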

For varying k, find a prime q with n <= q < 2n using e.g. Miller-Rabin, and perform the steps with all numbers reduced modulo q.
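The modular variant could be sketched roughly as follows. This is a simplified illustration, not the method from the linked write-up: trial division stands in for Miller-Rabin, and a brute-force scan over 1..n stands in for Berlekamp or Cantor-Zassenhaus root finding. It also picks q strictly greater than n, so that 1..n are nonzero modulo q and division by numbers up to k is always possible. All names are hypothetical, and q is assumed to be below 2^32 so that products fit in 64 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// Modular exponentiation; assumes q < 2^32 so that products fit in 64 bits.
u64 pow_mod(u64 base, u64 exp, u64 q) {
    u64 result = 1 % q;
    base %= q;
    while (exp > 0) {
        if (exp & 1) result = result * base % q;
        base = base * base % q;
        exp >>= 1;
    }
    return result;
}

// Trial division; a stand-in for Miller-Rabin to keep the sketch short.
bool is_prime(u64 x) {
    if (x < 2) return false;
    for (u64 d = 2; d * d <= x; ++d)
        if (x % d == 0) return false;
    return true;
}

// Returns the numbers missing from 1..n in increasing order.
// Assumes the bag holds distinct values from 1..n.
std::vector<u64> find_missing(const std::vector<u64>& bag, u64 n) {
    assert(bag.size() <= n);
    const std::size_t k = n - bag.size();
    u64 q = n + 1;  // assumption of this sketch: q > n, so k < q always holds
    while (!is_prime(q)) ++q;

    // b[i] = sum of i-th powers of the missing numbers, modulo q.
    std::vector<u64> b(k + 1, 0);
    for (u64 x = 1; x <= n; ++x) {
        u64 p = 1;
        for (std::size_t i = 1; i <= k; ++i) { p = p * x % q; b[i] = (b[i] + p) % q; }
    }
    for (u64 x : bag) {
        u64 p = 1;
        for (std::size_t i = 1; i <= k; ++i) { p = p * x % q; b[i] = (b[i] + q - p) % q; }
    }

    // Newton's identities: m * c[m] = c[m-1]*b[1] - c[m-2]*b[2] + ... (alternating signs).
    std::vector<u64> c(k + 1, 0);
    c[0] = 1;
    for (std::size_t m = 1; m <= k; ++m) {
        u64 sum = 0;
        for (std::size_t i = 1; i <= m; ++i) {
            const u64 term = c[m - i] * b[i] % q;
            sum = (i % 2 == 1) ? (sum + term) % q : (sum + q - term) % q;
        }
        c[m] = sum * pow_mod(m, q - 2, q) % q;  // divide by m using Fermat's little theorem
    }

    // Brute-force root search: evaluate x^k - c1 x^(k-1) + ... with Horner's rule.
    std::vector<u64> missing;
    for (u64 x = 1; x <= n; ++x) {
        u64 value = 0;
        for (std::size_t j = 0; j <= k; ++j) {
            const u64 coeff = (j % 2 == 0) ? c[j] : (q - c[j]) % q;
            value = (value * x + coeff) % q;
        }
        if (value == 0) missing.push_back(x);
    }
    return missing;
}
```

The extra state is O(k) numbers, each smaller than q. The brute-force root scan costs O(nk) time, which a proper finite-field factorization algorithm would avoid.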

EDIT: The previous version of this answer stated that instead of Z_q, where q is prime, it is possible to use a finite field of characteristic 2 (q=2^(log n)). This is not the case, since Newton's formulas require division by numbers up to k.

Related Solutions

Given an array of numbers, return array of products of all other numbers (no division)

An explanation of polygenelubricants' method: the trick is to construct the following arrays (shown here for 4 elements)

{              1,         a[0],    a[0]*a[1],    a[0]*a[1]*a[2],  }
{ a[1]*a[2]*a[3],    a[2]*a[3],         a[3],                 1,  }

Both can be built in O(n) by starting at the left and right edges respectively.

Multiplying the two arrays element by element then gives the required result.

My code would look something like this:

int a[N]; // This is the input
int products_below[N]; // products_below[i] = a[0]*...*a[i-1]
int p=1; // running product of the elements seen so far
for(int i=0;i<N;++i) {
  products_below[i]=p;
  p*=a[i];
}

int products_above[N];
p=1;
for(int i=N-1;i>=0;--i) {
  products_above[i]=p;
  p*=a[i];
}

int products[N]; // This is the result
for(int i=0;i<N;++i) {
  products[i]=products_below[i]*products_above[i];
}

If you need O(1) extra space as well (not counting the output array), you can do this, which is less clear IMHO:

int a[N]; // This is the input
int products[N]; // This is the result

// Get the products below the current index
int p=1;
for(int i=0;i<N;++i) {
  products[i]=p;
  p*=a[i];
}

// Get the products above the current index and multiply them in
p=1;
for(int i=N-1;i>=0;--i) {
  products[i]*=p;
  p*=a[i];
}

C++ – Missing number(s) Interview Question Redux

You are only specifying the time complexity, but the space complexity is also important to consider.

The problem complexity can be specified in terms of N (the length of the range) and K (the number of missing elements).

In the question you link, the solution using equations is O(K) in space (or perhaps a bit more?), as you need one equation per unknown value.

There is also the question of preservation: may you alter the list of known elements? In a number of cases this is undesirable, in which case any solution that reorders or consumes the elements must first make a copy, which costs O(N-K) in space.

I cannot see anything faster than a linear solution: you need to read all known elements (N-K) and output all unknown elements (K). Therefore you cannot do better than O(N) in time.

Let us break down the solutions

Destroying, O(N) space, O(N log N) time: in-place sort
Preserving, O(K) space (?), O(N log N) time: equation system
Preserving, O(N) space, O(N) time: counting sort

Personally, though I find the equation system solution clever, I would probably use either of the sorting solutions. Let's face it: they are much simpler to code, especially the counting sort one!

And as far as time goes, in a real execution, I think the "counting sort" would beat all other solutions hands down.

Note: the counting sort does not require the range to be [0, X), any range will do, as any finite range can be transposed to the [0, X) form by a simple translation.
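A minimal sketch of the counting approach, including the translation of an arbitrary range to [0, X). The function name `missing_by_counting` and its signature are illustrative, and a `std::vector<bool>` serves as the presence table since each value occurs at most once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the numbers in [lo, hi] that do not occur in `known`, in increasing order.
// Uses O(N) extra space for the presence table and O(N) time; the input is not modified.
std::vector<int> missing_by_counting(const std::vector<int>& known, int lo, int hi) {
    assert(lo <= hi);
    std::vector<bool> seen(static_cast<std::size_t>(hi - lo) + 1, false);
    for (int x : known) {
        assert(lo <= x && x <= hi);
        seen[static_cast<std::size_t>(x - lo)] = true;  // translate [lo, hi] to [0, hi - lo]
    }
    std::vector<int> missing;
    for (int v = lo; v <= hi; ++v) {
        if (!seen[static_cast<std::size_t>(v - lo)]) missing.push_back(v);
    }
    return missing;
}
```

Called with the elements {4, 5, 3, 1, 7} and the range [1, 7], it reports 2 and 6 as missing.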

EDIT:

Changed the sort to O(N): one needs to have all the elements available to sort them.

Having had some time to think about the problem, I also have another solution to propose. As noted, when N grows (dramatically) the space required might explode. However, if K is small, then we could change our representation of the list, using intervals:

{4, 5, 3, 1, 7}

can be represented as

[1,1] U [3,5] U [7,7]

In the average case, maintaining a sorted list of intervals is much less costly than maintaining a sorted list of elements, and it's as easy to deduce the missing numbers too.

The time complexity is easy: O(N log N), after all it's basically an insertion sort.

Of course what's really interesting is that there is no need to actually store the list, thus you can feed it with a stream to the algorithm.

On the other hand, I have quite a hard time figuring out the average space complexity. The "final" space occupied is O(K) (at most K+1 intervals), but during the construction there will be many more intervals, as we introduce the elements in no particular order.

The worst case is easy enough: N/2 intervals (think odd vs even numbers). I cannot figure out the average case, however. My gut feeling is telling me it should be better than O(N), but I am not that trusting.
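One possible way to sketch the interval idea uses a `std::map` keyed by each interval's lower bound, which gives O(log N) per inserted element. The type alias `Intervals` and the function names are made up for this illustration.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <vector>

// Disjoint closed intervals [key, value], keyed by their lower bound.
using Intervals = std::map<long long, long long>;

// Inserts x, merging with neighbouring intervals when they become adjacent.
// Values that are already covered are ignored.
void add_number(Intervals& iv, long long x) {
    auto next = iv.upper_bound(x);  // first interval starting after x
    long long lo = x, hi = x;
    if (next != iv.begin()) {
        auto prev = std::prev(next);
        if (prev->second >= x) return;  // x is already inside an interval
        if (prev->second + 1 == x) {    // x extends the interval on its left
            lo = prev->first;
            iv.erase(prev);
        }
    }
    if (next != iv.end() && next->first == x + 1) {  // x touches the interval on its right
        hi = next->second;
        iv.erase(next);
    }
    iv[lo] = hi;
}

// Lists the numbers in 1..n that are not covered by any interval.
std::vector<long long> missing_numbers(const Intervals& iv, long long n) {
    std::vector<long long> missing;
    long long expected = 1;  // smallest number not yet accounted for
    for (const auto& entry : iv) {
        for (long long v = expected; v < entry.first; ++v) missing.push_back(v);
        expected = entry.second + 1;
    }
    for (long long v = expected; v <= n; ++v) missing.push_back(v);
    return missing;
}

// Convenience wrapper: feeds a stream of values through add_number, one at a time.
std::vector<long long> missing_from_stream(const std::vector<long long>& stream, long long n) {
    Intervals iv;
    for (long long x : stream) add_number(iv, x);
    return missing_numbers(iv, n);
}
```

Feeding {4, 5, 3, 1, 7} with n = 7 ends with the three intervals [1,1], [3,5] and [7,7], and the gaps between them give the missing numbers 2 and 6.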
