Accumulating dictionaries in Python

I often have a need to count tokens in a corpus. In Python, there are many ways to do this, but currently I most often use defaultdicts:

1
2
3
d = defaultdict(int)
for x in sequence:
  d[x] += 1

I would like to get rid of the for-loop and construct such a dictionary at once. I wrote a dict-derived class to do that, but it can do even more. But first, here is how I would do the above:

1
d = AccumulationDict(lambda x, y: x + y, [(x, 1) for x in sequence])

That’s it!

Notice how it takes a function as its first parameter. This is similar to how defaultdict takes a callable, but instead of taking a 0-arity callable, AccumulationDict takes a binary function. Whenever it “accumulates” a key-value for a key that already exists, the existing value and new value are sent to this function, and the result is what is set as the new value in the dictionary. This function will most likely be addition (rather than the lambda expression one could use operator.add), but it could be anything. Say you’re calculating probabilities of multiple events, you could use operator.mul.

I did not want to break KeyErrors, so accumulating is separate from getting and setting. This means you can still use __setitem__() and update() to reset the values of keys. Accumulation happens in the constructor, in dictionary addition, and with a new accumulate() function. accumulate() is identical to update() in interface, but uses the provided accumulator function to “merge” values when there are key collisions.

The code is below, but it represents a proof-of-concept and could be improved:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
class AccumulationDict(dict):
    def __init__(self, accumulator, *args, **kwargs):
        if not hasattr(accumulator, '__call__'):
            raise TypeError('Accumulator must be a binary function.')
        self.accumulator = accumulator
        self.accumulate(*args, **kwargs)

    def __additem__(self, key, value):
        if key in self:
            self[key] = self.accumulator(self[key], value)
        else:
            self[key] = value

    def __add__(self, other):
        result = AccumulationDict(self.accumulator, self)
        result.accumulate(other)
        return result

    def accumulate(self, *args, **kwargs):
        for arg in args:
            if isinstance(arg, list):
                for (key, value) in arg:
                    self.__additem__(key, value)
            elif isinstance(arg, dict):
                for (key, value) in arg.items():
                    self.__additem__(key, value)
            else:
                raise TypeError('Argument must be of type list or dict.')
        for key in kwargs:
            self.__additem__(key, kwargs[key])
This entry was posted in Programming, Software and tagged , , . Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

3 Comments

  1. Nathan
    Posted 2012.06.26 at 15:36 | Permalink

    For counting things I use the

    1
    collections.Counter

    class. In addition to giving you a one-liner for obtaining counts from a sequence or other iterable, the

    1
    most_common()

    method makes it easy to retrieve the highest-count item(s).

  2. goodmami
    Posted 2012.07.01 at 02:42 | Permalink

    Thanks Nathan,

    I recall avoiding Counter because the behavior was not ideal for what I wanted to do (and this may still be the case, but I can’t remember exactly what the problem was… maybe it was performance-related), but in recent versions it seems quite capable. I can add two counters to get combined counts, instantiate a counter with a generator, etc. Arguably the AccumulatorDict described in this post is more versatile (modulo functions like most_common()), in that it allows the user to define the function to use when adding (or accumulating) values for keys that already exist.

    But thanks for reminding of me Counter. I should try it out next time.

  3. goodmami
    Posted 2013.02.18 at 14:18 | Permalink

    It’s been a while, but I’m coming back to this because I recalled the problem I had with Counter. Actually two problems:

    First, it’s great at incrementing, but not tallying up counts other than 1. For instance, I may already have a list of tokens with counts per line, and I want to find the token counts for a document. Consider sentences like “a dog is a canine” and “a cat is a feline”, with the following counts:

    1
    toks = [('a', 2), ('dog', 1), ('is', 1), ('canine', 1), ('a', 2), ('cat', 1), ('is', 1), ('feline', 1)]

    In this case, I can just use those token counts as input to AccumulationDict and it will add the counts. With counter, it would take (‘a’,2) as a unique token with 2 occurrences (rather than ‘a’ as a token with 4 occurrences). If I have the token-counts-per-line as a list for each line, I could probably create a Counter for each one, then add the counters together, but that might be expensive.

    The second problem is with adding non-integer values. Consider I have a key with a list as the value, and I want to combine lists with the same key. Counter apparently assumes the value will be an integer:

    1
    2
    3
    4
    5
    6
    >>> c = Counter(a=['a']) + Counter(a=['b','c'])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.3/collections/__init__.py", line 621, in __add__
        if newcount > 0:
    TypeError: unorderable types: list() > int()

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>