I am trying find the number of unique elements in a vector compared against multiple vectors using C++. The vectors are in sorted order and it can be of size 2,000,000.
Suppose I have,
v1: 5, 8, 13, 16, 20
v2: 2, 4, 6, 8
v3: 20
v4: 1, 2, 3, 4, 5, 6, 7
v5: 1, 3, 5, 7, 11, 13, 15
The number of unique elements in v1 is 1 (i.e. number 16).
I tried two approaches.
-
Added vectors v2,v3,v4 and v5 into a vector of vector. For each element in v1, checked if the element is present in any of the other vectors.
-
Combined all the vectors v2,v3,v4 and v5 using merge sort into a single vector and compared it against v1 to find the unique elements.
Note: sample_vector = v1 and all_vectors_merged contains v2,v3,v4,v5
//Method 1
unsigned int compute_unique_elements_1(vector<unsigned int> sample_vector,vector<vector<unsigned int> > all_vectors_merged)
{
unsigned int duplicate = 0;
for (unsigned int i = 0; i < sample_vector.size(); i++)
{
for (unsigned int j = 0; j < all_vectors_merged.size(); j++)
{
if (std::find(all_vectors_merged.at(j).begin(), all_vectors_merged.at(j).end(), sample_vector.at(i)) != all_vectors_merged.at(j).end())
{
duplicate++;
}
}
}
return sample_vector.size()-duplicate;
}
// Method 2
unsigned int compute_unique_elements_2(vector<unsigned int> sample_vector, vector<unsigned int> all_vectors_merged)
{
unsigned int unique = 0;
unsigned int i = 0, j = 0;
while (i < sample_vector.size() && j < all_vectors_merged.size())
{
if (sample_vector.at(i) > all_vectors_merged.at(j))
{
j++;
}
else if (sample_vector.at(i) < all_vectors_merged.at(j))
{
i++;
unique ++;
}
else
{
i++;
j++;
}
}
if (i < sample_vector.size())
{
unique += sample_vector.size() - i;
}
return unique;
}
Of these two techniques, I see that Method 2 gives faster results.
1) Method 1: Is there a more efficient way to find the elements than running std::find on all the vectors for all the elements in v1.
2) Method 2: Extra overhead in comparing vectors v2,v3,v4,v5 and sorting them.
How can I do this in a better way?
[edit]
Vectors are in sorted order.
Best Answer
Use hash tables. Elements are the key, number of occurrences are the values.