Python Dictionary – Difference Between defaultdict and setdefault

dictionarypythonpython-3.x

In Think Python the author introduces defaultdict. The following is an excerpt from the book regarding defaultdict:

If you are making a dictionary of lists, you can often write simpler
code using defaultdict. In my solution to Exercise 12-2, which you can
get from http://thinkpython2.com/code/anagram_sets.py, I make a
dictionary that maps from a sorted string of letters to the list of
words that can be spelled with those letters. For example, 'opst' maps
to the list ['opts', 'post', 'pots', 'spot', 'stop', 'tops']. Here’s
the original code:

 def all_anagrams(filename):
     d = {}
     for line in open(filename):
         word = line.strip().lower()
         t = signature(word)
         if t not in d:
             d[t] = [word]
         else:
             d[t].append(word) return d

This can be simplified using setdefault, which you might have used in Exercise 11-2:

 def all_anagrams(filename):
     d = {}
     for line in open(filename):
         word = line.strip().lower()
         t = signature(word)
         d.setdefault(t, []).append(word) return d

This solution has the drawback that it makes a new list every time, regardless of whether it is needed. For lists, that’s no big deal, but if the factory function is complicated, it might be. We can avoid this problem and simplify
the code using a defaultdict:

def all_anagrams(filename):
    d = defaultdict(list)
    for line in open(filename):             
        word = line.strip().lower() 
        t = signature(word)
        d[t].append(word)
    return d

Here's the definition of signature function:

def signature(s):
    """Returns the signature of this string.

    Signature is a string that contains all of the letters in order.

    s: string
    """
    # TODO: rewrite using sorted()
    t = list(s)
    t.sort()
    t = ''.join(t)
    return t 

What I understand regarding the second solution is that setdefault checks whether t (the signature of the word) exists as a key, if not, it sets it as a key and sets an empty list as its value, then append appends the word to it. If t exists, setdefault returns its value (a list with at least one item, which is a string representing a word), and append appends the word to this list.

What I understand regarding the third solution is that d, which represents a defaultdict, makes t a key andsets an empty list as its value (if t doesn't already exist as a key), then the word is appended to the list. If t does already exist, its value (the list) is returned, and to which the word is appended.

What is the difference between the second and third solutions? I What it means that the code in the second solution makes a new list every time, regardless of whether it's needed? How is setdefault responsible for that? How does using defaultdict make us avoid this problem? How are the second and third solutions different?

Best Answer

The key line is this one:

d.setdefault(t, []).append(word) return d

d.setdefault is just a function. That function is receiving two arguments: t, and a new list. Both of those arguments are evaluated before setdefault is called, which means the list is constructed, then setdefault is called, and might immediately discard the new list.

The internal implementation of setdefault might look like:

def setdefault(self, key, default):
    try:
        return self[key]
    except KeyError:
        self[key] = default
        return self[key]

Note that if key exists, default is never used! In the case of defaultdict, it only constructs a new object when you need it. An implementation could look something like:

class defaultdict(dict):

    def __init__(self, factory):
        self.factory = factory

    def __missing__(self, key):
        self[key] = self.factory()
        return self[key]

__missing__ is called by dictionaries when an item is not in the dictionary. In this case, an object is only created in that case.

As the author points out, lists are relatively cheap, but what if the object being constructed opened a file? Or a network socket? Then you have that overhead, only to throw it away immediately afterwards.

The downside of defaultdict, of course, is that all items in your dictionary get the same initial value. (The return value of whatever callable is provided to the defaultdict constructor.) Of course, you could always use the above as a sample to have different behavior if all you need is the key.

Note that the real implementations of both setdefault and defaultdict are probably significantly more complex -- I've glossed over many details and there are likely both performance enhancements and code to handle edge cases I haven't thought of.

Related Topic