I often have a need to count tokens in a corpus. In Python, there are many ways to do this, but currently I most often use defaultdicts:
1 2 3 | d = defaultdict(int) for x in sequence: d[x] += 1 |
I would like to get rid of the for-loop and construct such a dictionary at once. I wrote a dict-derived class to do that, but it can do even more. But first, here is how I would do the above:
1 | d = AccumulationDict(lambda x, y: x + y, [(x, 1) for x in sequence]) |
That’s it!
Notice how it takes a function as its first parameter. This is similar to how defaultdict takes a callable, but instead of taking a 0-arity callable, AccumulationDict takes a binary function. Whenever it “accumulates” a key-value for a key that already exists, the existing value and new value are sent to this function, and the result is what is set as the new value in the dictionary. This function will most likely be addition (rather than the lambda expression one could use operator.add), but it could be anything. Say you’re calculating probabilities of multiple events, you could use operator.mul.
I did not want to break KeyErrors, so accumulating is separate from getting and setting. This means you can still use __setitem__() and update() to reset the values of keys. Accumulation happens in the constructor, in dictionary addition, and with a new accumulate() function. accumulate() is identical to update() in interface, but uses the provided accumulator function to “merge” values when there are key collisions.
The code is below, but it represents a proof-of-concept and could be improved:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | class AccumulationDict(dict): def __init__(self, accumulator, *args, **kwargs): if not hasattr(accumulator, '__call__'): raise TypeError('Accumulator must be a binary function.') self.accumulator = accumulator self.accumulate(*args, **kwargs) def __additem__(self, key, value): if key in self: self[key] = self.accumulator(self[key], value) else: self[key] = value def __add__(self, other): result = AccumulationDict(self.accumulator, self) result.accumulate(other) return result def accumulate(self, *args, **kwargs): for arg in args: if isinstance(arg, list): for (key, value) in arg: self.__additem__(key, value) elif isinstance(arg, dict): for (key, value) in arg.items(): self.__additem__(key, value) else: raise TypeError('Argument must be of type list or dict.') for key in kwargs: self.__additem__(key, kwargs[key]) |
3 Comments
For counting things I use the
class. In addition to giving you a one-liner for obtaining counts from a sequence or other iterable, the
method makes it easy to retrieve the highest-count item(s).
Thanks Nathan,
I recall avoiding Counter because the behavior was not ideal for what I wanted to do (and this may still be the case, but I can’t remember exactly what the problem was… maybe it was performance-related), but in recent versions it seems quite capable. I can add two counters to get combined counts, instantiate a counter with a generator, etc. Arguably the AccumulatorDict described in this post is more versatile (modulo functions like most_common()), in that it allows the user to define the function to use when adding (or accumulating) values for keys that already exist.
But thanks for reminding of me Counter. I should try it out next time.
It’s been a while, but I’m coming back to this because I recalled the problem I had with Counter. Actually two problems:
First, it’s great at incrementing, but not tallying up counts other than 1. For instance, I may already have a list of tokens with counts per line, and I want to find the token counts for a document. Consider sentences like “a dog is a canine” and “a cat is a feline”, with the following counts:
In this case, I can just use those token counts as input to AccumulationDict and it will add the counts. With counter, it would take (‘a’,2) as a unique token with 2 occurrences (rather than ‘a’ as a token with 4 occurrences). If I have the token-counts-per-line as a list for each line, I could probably create a Counter for each one, then add the counters together, but that might be expensive.
The second problem is with adding non-integer values. Consider I have a key with a list as the value, and I want to combine lists with the same key. Counter apparently assumes the value will be an integer:
2
3
4
5
6
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.3/collections/__init__.py", line 621, in __add__
if newcount > 0:
TypeError: unorderable types: list() > int()