I would like to know how a set is implemented in C++. If I were to implement my own set container without using the STL provided container, what would be the best way to go about this task?
I understand STL sets are based on the abstract data structure of a binary search tree. So what is the underlying data structure? An array?
Also, how does insert()
work for a set? How does the set check whether an element already exists in it?
I read on wikipedia that another way to implement a set is with a hash table. How would this work?
Best Answer
Step debug into
g++
6.4 stdlibc++ sourceDid you know that on Ubuntu's 16.04 default
g++-6
package or a GCC 6.4 build from source you can step into the C++ library without any further setup?By doing that we easily conclude that a Red-black tree used in this implementation.
This makes sense, since
std::set
can be traversed in order, which would not be efficient in if a hash map were used.main.cpp
Compile and debug:
Now, if you step into
s.insert(1)
you immediately reach/usr/include/c++/6/bits/stl_set.h
:which clearly just forwards to
_M_t._M_insert_unique
.So we open the source file in vim and find the definition of
_M_t
:So
_M_t
is of type_Rep_type
and_Rep_type
is a_Rb_tree
.OK, now that is enough evidence for me. If you don't believe that
_Rb_tree
is a Black-red tree, step a bit further and read the algorithm.unordered_set
uses hash tableSame procedure, but replace
set
withunordered_set
on the code.This makes sense, since
std::unordered_set
cannot be traversed in order, so the standard library chose hash map instead of Red-black tree, since hash map has a better amortized insert time complexity.Stepping into
insert
leads to/usr/include/c++/6/bits/unordered_set.h
:So we open the source file in
vim
and search for_M_h
:So hash table it is.
std::map
andstd::unordered_map
Analogous for
std::set
vsstd:unordered_set
: What data structure is inside std::map in C++?Performance characteristics
You could also infer the data structure used by timing them:
Graph generation procedure and Heap vs BST analysis and at: Heap vs Binary Search Tree (BST)
We clearly see for:
std::set
, a logarithmic insertion timestd::unordered_set
, a more complex hashmap pattern:on the zoomed plot, we see that the times are basically constant and going towards 250ns, therefore much faster than the
std::map
, except for very small map sizesSeveral strips are clearly visible, and their inclination becomes smaller whenever the array doubles.
I believe this is due to average linearly increasing linked list walks withing each bin. Then when the array doubles, we have more bins, so shorter walks.