Python – Cost of Dict and Set Built-In Types

complexity, data structures, hashing, python

I have undertaken a project concerning database deduplication. After some research, I found that the Python dict type is in fact a hash map that uses open addressing.

In the deduplication module, we'll have rules that determine whether two records are identical; the rules essentially spell out the attributes that uniquely identify a record (I didn't call it a candidate key, since the database is going to be non-relational/NoSQL). Now let's say we're dealing with a really large dataset. Naturally, hashing is the way to go (I found that advice here).
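For concreteness, here's a minimal sketch of what I have in mind, assuming each record is a plain dict and a "rule" is just a list of attribute names (dedupe, key_attrs, and the sample records are all made up for illustration):

    def dedupe(records, key_attrs):
        """Yield records whose identifying attributes haven't been seen yet."""
        seen = set()
        for record in records:
            # A tuple of hashable values is itself hashable, so the set
            # hashes the key directly; no manual hash computation needed.
            key = tuple(record[attr] for attr in key_attrs)
            if key not in seen:
                seen.add(key)
                yield record

    records = [
        {"id": 1, "email": "a@example.com", "name": "Ann"},
        {"id": 2, "email": "a@example.com", "name": "Ann"},  # duplicate under the rule
        {"id": 3, "email": "b@example.com", "name": "Bob"},
    ]
    print(len(list(dedupe(records, ["email", "name"]))))  # 2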

The question(s):

  • Does the module need to compute a hash and then store it in a dict? Wouldn't that be redundant, since the dict implementation is itself a hash map?
  • How costly is converting a list to a set? That conversion should remove all duplicates, but at this scale, is it practical?
  • What is the cost of a membership check on a dict/set using the "in" keyword? (See the timing sketch after this list, which covers this and the previous question.)
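For what it's worth, here's the rough micro-benchmark I put together for the last two questions. The sizes are arbitrary and the absolute timings are machine-dependent, so treat the numbers as illustrative only:

    import timeit

    n = 1_000_000
    data = list(range(n)) * 2  # 2n items, only n distinct values

    # Building a set visits each element once: O(n) on average.
    build = timeit.timeit(lambda: set(data), number=1)
    print(f"set() over {len(data):,} items: {build:.3f}s")

    s = set(data)
    # Membership: O(1) average for a set vs. an O(n) scan for a list.
    t_set = timeit.timeit(lambda: n - 1 in s, number=1000)
    t_list = timeit.timeit(lambda: n - 1 in data, number=10)
    print(f"'in' on set,  1000 lookups: {t_set:.5f}s")
    print(f"'in' on list,   10 lookups: {t_list:.5f}s")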

Hadoop MapReduce is not an option, at least for now.

I can't really dive into the Python sources to figure this out, as I'm strictly time-limited 😐

Best Answer

http://wiki.python.org/moin/TimeComplexity

That should pretty much cover everything. In short: dict and set insertion, lookup, and "in" checks are O(1) on average (degrading toward O(n) only under pathological hash collisions), and set(my_list) is a single O(n) pass over the list, so it's a practical way to deduplicate hashable items even at scale. What more do you need that's not already on this page?
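To address your first bullet directly: no, you don't need to compute and store hashes yourself; dict and set call hash() on every key internally. A quick way to see this (the Key class below is purely illustrative, not how you'd write real keys):

    class Key:
        def __init__(self, value):
            self.value = value

        def __hash__(self):
            # dict/set call this on insert and on every "in" lookup.
            print(f"hash() called for {self.value!r}")
            return hash(self.value)

        def __eq__(self, other):
            return isinstance(other, Key) and self.value == other.value

    s = {Key("a")}        # prints: hash() called for 'a'
    print(Key("a") in s)  # prints the hash message again, then True

Pre-computing your own hash and keying a dict on it would just make the dict hash your hash a second time.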
