Python – Cost of Dict and Set Built-In Types

complexity, data structures, hashing, python

I have undertaken a project concerning database deduplication. After some research, I found that the Python dict type is in fact a hash map that uses open addressing.

In the deduplication module, we'll have rules that determine whether two records are identical; the rules essentially spell out the attributes that uniquely identify a record (I didn't call it a candidate key, since the database is going to be non-relational/NoSQL). Now let's say we're dealing with a really large dataset. Naturally, hashing is the way to go (I found that advice here).
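For concreteness, here's a minimal sketch of what I have in mind, assuming each record is a plain dict and a "rule" is just a list of attribute names (dedupe, key_attrs, and the sample records are all made up for illustration):

    def dedupe(records, key_attrs):
        """Yield records whose identifying attributes haven't been seen yet."""
        seen = set()
        for record in records:
            # A tuple of hashable values is itself hashable, so the set
            # hashes the key directly; no manual hash computation needed.
            key = tuple(record[attr] for attr in key_attrs)
            if key not in seen:
                seen.add(key)
                yield record

    records = [
        {"id": 1, "email": "a@example.com", "name": "Ann"},
        {"id": 2, "email": "a@example.com", "name": "Ann"},  # duplicate under the rule
        {"id": 3, "email": "b@example.com", "name": "Bob"},
    ]
    print(len(list(dedupe(records, ["email", "name"]))))  # 2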

The question(s):

  • Does the module need to compute a hash and then store it in a dict? Wouldn't that be redundant, since the dict implementation is itself a hash map?
  • How costly is converting a list to a set? That conversion should remove all duplicates, but at this scale, is it practical?
  • What is the cost of a membership check on a dict/set using the "in" keyword? (See the timing sketch after this list, which covers this and the previous question.)
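For what it's worth, here's the rough micro-benchmark I put together for the last two questions. The sizes are arbitrary and the absolute timings are machine-dependent, so treat the numbers as illustrative only:

    import timeit

    n = 1_000_000
    data = list(range(n)) * 2  # 2n items, only n distinct values

    # Building a set visits each element once: O(n) on average.
    build = timeit.timeit(lambda: set(data), number=1)
    print(f"set() over {len(data):,} items: {build:.3f}s")

    s = set(data)
    # Membership: O(1) average for a set vs. an O(n) scan for a list.
    t_set = timeit.timeit(lambda: n - 1 in s, number=1000)
    t_list = timeit.timeit(lambda: n - 1 in data, number=10)
    print(f"'in' on set,  1000 lookups: {t_set:.5f}s")
    print(f"'in' on list,   10 lookups: {t_list:.5f}s")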

Hadoop MapReduce is not an option, at least for now.

I can't really dive into the Python sources to figure this out, as I'm strictly time-limited 😐

Best Answer

http://wiki.python.org/moin/TimeComplexity

That should pretty much cover everything. In short: dict and set insertion, lookup, and "in" checks are O(1) on average (degrading toward O(n) only under pathological hash collisions), and set(my_list) is a single O(n) pass over the list, so it's a practical way to deduplicate hashable items even at scale. What more do you need that's not already on this page?
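To address your first bullet directly: no, you don't need to compute and store hashes yourself; dict and set call hash() on every key internally. A quick way to see this (the Key class below is purely illustrative, not how you'd write real keys):

    class Key:
        def __init__(self, value):
            self.value = value

        def __hash__(self):
            # dict/set call this on insert and on every "in" lookup.
            print(f"hash() called for {self.value!r}")
            return hash(self.value)

        def __eq__(self, other):
            return isinstance(other, Key) and self.value == other.value

    s = {Key("a")}        # prints: hash() called for 'a'
    print(Key("a") in s)  # prints the hash message again, then True

Pre-computing your own hash and keying a dict on it would just make the dict hash your hash a second time.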
