(Or: Enforced Algebraic Structure of Commutative Accumulative Hash Functions)
While there are several documented approaches to defining a hash function for lists and other containers where iteration order is guaranteed, there seems to be less discussion around best practices for defining a hash function for unordered containers. One obvious approach is to simply sum or xor
the hashes of the individual elements of the container. A downside to these approaches is the existence of “problem elements” which hash to 0; when such elements are inserted into any container, that container’s hash will remain unchanged. One might suspect that this is due to the structured nature of addition or xor, and that a more clever choice of hash function on the unordered container could avoid this. In fact, at the end of the post, we’ll mathematically prove a proposition which roughly states that any general purpose method for hashing unordered containers, which can be incrementally updated based on the existing hash, is essentially equivalent to one of the more “obvious” choices in that it has the same algebraic structure, and in particular has the same “problem” elements.
Hash functions are usually valued in unsigned -bit integers, where
is 32 or 64. Formally, we’ll fix a choice of
and define a hash function on a given type to be any function from instances of that type to
, the integers modulo
. One obvious way to construct a hash for a container of a hashable type is to somehow accumulate the individual hashes of its constituent elements. That is, given a hash accumulator
and a global constant starting value , we can define a hash for a container
by
hash(c):
ret = seed
for t in c:
ret = h(ret, hash(t))
return ret
There are many examples of this pattern for ordered containers, such as lists:
- Boost Library’s hash_combine function follows this pattern, with
1
- In Java, the hash code for lists and strings follows this pattern with
- More generally, hashes for an arbitrary block of bytes sometimes follow this approach as well. For example
- The Fowler-Noll-Vo hash functions FNV-1 and FNV-1a use
respectively, where
is a specially chosen prime depending on the number of bits
.
- The Murmur family of hashes nearly fit into this pattern as well, where both the accumulator and individual hashes are some composition of invertible mixing functions, such as bit rotations, XORing with bit rotations, and addition or multiplication by a constant.
- The Fowler-Noll-Vo hash functions FNV-1 and FNV-1a use
A useful property of container hashes following this pattern is that they are incrementally updated: if one knows the current hash of the container, the hash after inserting (and in some cases, deleting) an element can be quickly computed by plugging the right values into the accumulator. Note that this is not possible in some versions of Murmur where the initial depends on the length of the block.
Now let’s turn to unordered containers. If one wants to follow the template above, the hash accumulator will need the additional property of
(Commutativity) .
In existing libraries, the offerings are more modest.
- The Boost documentation advises one to choose a canonical ordering of the elements and use the existing order-dependent hash_combine function.
- Java’s Sets and Maps have a Hashcode method defined as the sum of the individual hashes of the elements.
- Although one can easily check that addition satisfies (Commutativity), it is problematic for Integers, as their default hash is the identity function. So, for example, the sets
and
will have the same hash. This issue can be overcome by using an avalanching hash for the Integer type.
- More seriously, if an individual element of a given type hashes to zero, then inserting or removing it from any container will not alter the container’s hash.
- Although one can easily check that addition satisfies (Commutativity), it is problematic for Integers, as their default hash is the identity function. So, for example, the sets
- An alternative to summing is proposed in https://www.preprints.org/manuscript/201710.0192/v1. Unfortunately, in the notation of that paper, we run into the same issue for elements which hash to
: an arbitrary number of such elements can be inserted or removed from a container without changing the container’s hash.
These issues stand in contrast to the accumulators used in ordered containers: taking Boost’s accumulator,
we see that although for any fixed , there is an
which will not alter the hash (namely
), there is no single
for which the hash is unaltered for all
.
So is there a commutative hash accumulator which avoids this degenerate behavior? The answer, under the right assumptions, is no. In fact, something even stronger holds.
Let us fix notations. Let
be a hash accumulator and a starting seed. We’ll let
denote the curry of with respect to the
input, that is,
If we want to be able to incrementally update the hash of the container upon removal of an element, we will need the additional property of
(Invertibility)
Let denote all values reachable by accumulating hash values starting with
, that is,
For the purposes of analyzing the usefulness of the hash accumulator, we only care about its behavior on . By (Invertibility),
induces a map
where denotes the symmetric group of permutations on
.
Main Proposition.
- If
is a hash accumulator satisfying (Commutativity) and (Invertibility), then there is a subgroup
with
containing the image of
.
- If
is one-to-one, then
, and the image of
is a subgroup. That is,
is simply the binary operation for some group structure on
.
- If
is one-to-one, there exists
such that
for all
.2
Proof: The third point follows immediately from the second by taking to be the identity element of the group structure. The second point follows from the first by observing that the inequalities
must all be equalities.
Now, let be the subgroup of
generated by the image of
. By the Orbit-Stabilizer theorem,
. If we can show
, then we are done, since
as we’ve defined it.
It remains to show . Let
be any non-identity element of
; we will show
does not fix
. Since
is not the identity, there is some
such that
; moreover, by definition of
, there is some
such that
. Thus,
.
N.b. The group structure implied by the Proposition need not be the usual addition on . For example, if we define
, the corresponding group is isomorphic to
, an
-dimensional vector space over
.
Exercise: Show that any abelian group of order can be realized by an appropriately chosen hash accumulator
.
So what’s the takeaway?
If you want a commutative accumulative hash function that does not have these degenerate “no-op” elements, you will have to violate one of the assumptions we made. One possibility is to remove the assumption that the hash accumulator can only depend on the -bit integers
and
. For example, one could let the accumulation state
consist of
bits, where
, and hash it down to
bits on demand; this would have the effect of mapping the
possible accumulation values into a much larger abelian group of order
, where there are fewer relations among the elements of the image if the mapping is chosen correctly.3 Alternatively, one could leverage the internal state of the container in some way, for example by examining the multiplicity of an element being inserted into a multi-set; this has the disadvantage of not being generic. In any case, the main takeaway is that this problem will not be solved by choosing a sufficiently clever accumulation function.
Thanks to Jason Sundram for feedback on an earlier version of this post.
Footnotes
1. See this question if you’re curious about where the magic constant 2654435769 comes from.
2 . The assumption that is one-to-one is fairly reasonable; its purpose is to avoid degenerate counter-examples like
3. See BiggerHash for a bare-bones implementation with ,
, compatible with std::unordered_set.
Thanks for sharing this article — very helpful!
LikeLike