Short version: What's the best hashing algorithm for a multiset implemented as a dictionary of unordered items?
I'm trying to hash an immutable multiset (which is a bag or multiset in other languages: like a mathematical set except that it can hold more than one of each element) implemented as a dictionary. I've created a subclass of the standard library class collections.Counter, similar to the advice here: Python hashable dicts, which recommends a hash function like so:
class FrozenCounter(collections.Counter): # ... def __hash__(self): return hash(tuple(sorted(self.items()))) Creating the full tuple of items takes up a lot of memory (relative to, say, using a generator) and hashing will occur in an extremely memory intensive part of my application. More importantly, my dictionary keys (multiset elements) probably won't be order-able.
I'm thinking of using this algorithm:
def __hash__(self): return functools.reduce(lambda a, b: a ^ b, self.items(), 0) I figure using bitwise XOR means order doesn't matter for the hash value unlike in the hashing of a tuple? I suppose I could semi-implement the Python tuple-hashing alogrithm on the unordered stream of tuples of my data. See https://github.com/jonashaag/cpython/blob/master/Include/tupleobject.h (search in the page for the word 'hash') -- but I barely know enough C to read it.
Thoughts? Suggestions? Thanks.
(If you're wondering why I'm messing around with trying to hash a multiset: The input data for my problem are sets of multisets, and within each set of multisets, each multiset must be unique. I'm working on a deadline and I'm not an experienced coder, so I wanted to avoid inventing new algorithms where possible. It seems like the most Pythonic way to make sure I have unique of a bunch of things is to put them in a
set(), but the things must be hashable.) What I've gathered from the comments
Both @marcin and @senderle gave pretty much the same answer: use hash(frozenset(self.items())). This makes sense because items() "views" are set-like. @marcin was first but I gave the check mark to @senderle because of the good research on the big-O running times for different solutions. @marcin also reminds me to include an __eq__ method -- but the one inherited from dict will work just fine. This is how I'm implementing everything -- further comments and suggestions based on this code are welcome:
class FrozenCounter(collections.Counter): # Edit: A previous version of this code included a __slots__ definition. # But, from the Python documentation: "When inheriting from a class without # __slots__, the __dict__ attribute of that class will always be accessible, # so a __slots__ definition in the subclass is meaningless." # http://docs.python.org/py3k/reference/datamodel.html#notes-on-using-slots # ... def __hash__(self): "Implements hash(self) -> int" if not hasattr(self, '_hash'): self._hash = hash(frozenset(self.items())) return self._hash
tupletakes a lot of memory? It's just a "pointer" to each item in the dict, not a copy of it, that gets created.