In Think Python the author introduces defaultdict. The following is an excerpt from the book regarding defaultdict:
If you are making a dictionary of lists, you can often write simpler
code using defaultdict. In my solution to Exercise 12-2, which you can
get from http://thinkpython2.com/code/anagram_sets.py, I make a
dictionary that maps from a sorted string of letters to the list of
words that can be spelled with those letters. For example, 'opst' maps
to the list ['opts', 'post', 'pots', 'spot', 'stop', 'tops']. Here’s
the original code:def all_anagrams(filename): d = {} for line in open(filename): word = line.strip().lower() t = signature(word) if t not in d: d[t] = [word] else: d[t].append(word) return d
This can be simplified using setdefault, which you might have used in Exercise 11-2:
def all_anagrams(filename): d = {} for line in open(filename): word = line.strip().lower() t = signature(word) d.setdefault(t, []).append(word) return d
This solution has the drawback that it makes a new list every time, regardless of whether it is needed. For lists, that’s no big deal, but if the factory function is complicated, it might be. We can avoid this problem and simplify
the code using a defaultdict:def all_anagrams(filename): d = defaultdict(list) for line in open(filename): word = line.strip().lower() t = signature(word) d[t].append(word) return d
Here's the definition of signature
function:
def signature(s):
"""Returns the signature of this string.
Signature is a string that contains all of the letters in order.
s: string
"""
# TODO: rewrite using sorted()
t = list(s)
t.sort()
t = ''.join(t)
return t
What I understand regarding the second solution is that setdefault
checks whether t
(the signature of the word) exists as a key, if not, it sets it as a key and sets an empty list as its value, then append
appends the word to it. If t
exists, setdefault
returns its value (a list with at least one item, which is a string representing a word), and append
appends the word to this list.
What I understand regarding the third solution is that d
, which represents a defaultdict, makes t
a key andsets an empty list as its value (if t
doesn't already exist as a key), then the word is appended to the list. If t
does already exist, its value (the list) is returned, and to which the word is appended.
What is the difference between the second and third solutions? I What it means that the code in the second solution makes a new list every time, regardless of whether it's needed? How is setdefault
responsible for that? How does using defaultdict make us avoid this problem? How are the second and third solutions different?
Best Answer
The key line is this one:
d.setdefault
is just a function. That function is receiving two arguments: t, and a new list. Both of those arguments are evaluated before setdefault is called, which means the list is constructed, then setdefault is called, and might immediately discard the new list.The internal implementation of
setdefault
might look like:Note that if
key
exists,default
is never used! In the case ofdefaultdict
, it only constructs a new object when you need it. An implementation could look something like:__missing__
is called by dictionaries when an item is not in the dictionary. In this case, an object is only created in that case.As the author points out, lists are relatively cheap, but what if the object being constructed opened a file? Or a network socket? Then you have that overhead, only to throw it away immediately afterwards.
The downside of defaultdict, of course, is that all items in your dictionary get the same initial value. (The return value of whatever callable is provided to the defaultdict constructor.) Of course, you could always use the above as a sample to have different behavior if all you need is the key.
Note that the real implementations of both
setdefault
anddefaultdict
are probably significantly more complex -- I've glossed over many details and there are likely both performance enhancements and code to handle edge cases I haven't thought of.