Hashing Unordered Sets: How Far Will Cleverness Take You?

(Or: Enforced Algebraic Structure of Commutative Accumulative Hash Functions)

While there are several documented approaches to defining a hash function for lists and other containers where iteration order is guaranteed, there seems to be less discussion around best practices for defining a hash function for unordered containers. One obvious approach is to simply sum {(+)} or xor {(\oplus)} the hashes of the individual elements of the container. A downside to these approaches is the existence of “problem elements” which hash to 0; when such elements are inserted into any container, that container’s hash will remain unchanged. One might suspect that this is due to the structured nature of addition or xor, and that a more clever choice of hash function on the unordered container could avoid this. In fact, at the end of the post, we’ll mathematically prove a proposition which roughly states that any general purpose method for hashing unordered containers, which can be incrementally updated based on the existing hash, is essentially equivalent to one of the more “obvious” choices in that it has the same algebraic structure, and in particular has the same “problem” elements.

Hash functions are usually valued in unsigned {n}-bit integers, where {n} is 32 or 64. Formally, we’ll fix a choice of {n} and define a hash function on a given type to be any function from instances of that type to {\mathbb{Z}_{2^n}}, the integers modulo {2^n}. One obvious way to construct a hash for a container of a hashable type is to somehow accumulate the individual hashes of its constituent elements. That is, given a hash accumulator

h(a,x):\mathbb{Z}_{2^n}\times \mathbb{Z}_{2^n} \rightarrow \mathbb{Z}_{2^n},

and a global constant starting value {seed}, we can define a hash for a container {c} by
hash(c):
ret = seed
for t in c:
ret = h(ret, hash(t))
return ret

There are many examples of this pattern for ordered containers, such as lists:

  • Boost Library’s hash_combine function follows this pattern, with h(a,x) = a \oplus (x + 2654435769 + a\ll6 + a\gg2).1
  • In Java, the hash code for lists and strings follows this pattern with h(a,x) = 31a+x.
  • More generally, hashes for an arbitrary block of bytes sometimes follow this approach as well. For example
    • The Fowler-Noll-Vo hash functions FNV-1 and FNV-1a use h(a,x) = (a*p)\oplus x\text{ and }h(a,x) = (a\oplus x)*p, respectively, where p is a specially chosen prime depending on the number of bits n.
    • The Murmur family of hashes nearly fit into this pattern as well, where both the accumulator and individual hashes are some composition of invertible mixing functions, such as bit rotations, XORing with bit rotations, and addition or multiplication by a constant.

A useful property of container hashes following this pattern is that they are incrementally updated: if one knows the current hash of the container, the hash after inserting (and in some cases, deleting) an element can be quickly computed by plugging the right values into the accumulator. Note that this is not possible in some versions of Murmur where the initial {seed} depends on the length of the block.

Now let’s turn to unordered containers. If one wants to follow the template above, the hash accumulator will need the additional property of

(Commutativity) {h(h(a,x), y) = h(h(a,y), x)} \text{ for all } a,x,y.

In existing libraries, the offerings are more modest.

  • The Boost documentation advises one to choose a canonical ordering of the elements and use the existing order-dependent hash_combine function.
  • Java’s Sets and Maps have a Hashcode method defined as the sum of the individual hashes of the elements.
    • Although one can easily check that addition satisfies (Commutativity), it is problematic for Integers, as their default hash is the identity function. So, for example, the sets {\{1,2\}} and {\{3\}} will have the same hash. This issue can be overcome by using an avalanching hash for the Integer type.
    • More seriously, if an individual element of a given type hashes to zero, then inserting or removing it from any container will not alter the container’s hash.
  • An alternative to summing is proposed in https://www.preprints.org/manuscript/201710.0192/v1. Unfortunately, in the notation of that paper, we run into the same issue for elements which hash to {(1-q)/r}: an arbitrary number of such elements can be inserted or removed from a container without changing the container’s hash.

These issues stand in contrast to the accumulators used in ordered containers: taking Boost’s accumulator,

h(a,x) = a \oplus (x + 2654435769 + a\ll6 + a\gg2),

we see that although for any fixed {a}, there is an {x} which will not alter the hash (namely {x = -2654435769 - a\ll6 - a\gg2}), there is no single {x} for which the hash is unaltered for all {a}.

So is there a commutative hash accumulator which avoids this degenerate behavior? The answer, under the right assumptions, is no. In fact, something even stronger holds.

Let us fix notations. Let

h(a,x):\mathbb{Z}_{2^n}\times \mathbb{Z}_{2^n} \rightarrow \mathbb{Z}_{2^n}

be a hash accumulator and {s\in\mathbb{Z}_{2^n}} a starting seed. We’ll let

f: \mathbb{Z}_{2^n}\rightarrow (\mathbb{Z}_{2^n} \rightarrow \mathbb{Z}_{2^n})

denote the curry of {h} with respect to the {x} input, that is,

f: x \mapsto (a\mapsto h(a,x)).

If we want to be able to incrementally update the hash of the container upon removal of an element, we will need the additional property of

(Invertibility) \text{For any fixed }{x}\text{, } {f(x)}\text{ is a bijective function.}

Let {\text{\emph{range}}(s)\subset \mathbb{Z}_{2^n}} denote all values reachable by accumulating hash values starting with s, that is,

\text{\emph{range}}(s) = \{h(h(\ldots (h(s, x_1), x_2),\ldots ), x_i) : x_1,\ldots, x_i \in \mathbb{Z}_{2^n}\}.

For the purposes of analyzing the usefulness of the hash accumulator, we only care about its behavior on {\text{\emph{range}}(s)}. By (Invertibility), {f} induces a map

\begin{aligned} f_s: \mathbb{Z}_{2^n}&\rightarrow S_{\text{\emph{range}}(s)} \\ x&\mapsto (a\mapsto h(a,x)),\end{aligned}

where {S_{\text{\emph{range}}(s)}} denotes the symmetric group of permutations on {\text{\emph{range}}(s)}.

Main Proposition.

  • If {h} is a hash accumulator satisfying (Commutativity) and (Invertibility), then there is a subgroup {G\subset S_{\text{\emph{range}}(s)}} with {|G| \leq |\text{\emph{range}}(s)|} containing the image of {f_s}.
  • If {f_s} is one-to-one, then {\text{\emph{range}}(s) = \mathbb{Z}_{2^n}}, and the image of {f_s} is a subgroup. That is, {h} is simply the binary operation for some group structure on {\mathbb{Z}_{2^n}}.
  • If {f_s} is one-to-one, there exists {x} such that {h(a,x) = a} for all {a\in \mathbb{Z}_{2^n}}.2

Proof: The third point follows immediately from the second by taking {x} to be the identity element of the group structure. The second point follows from the first by observing that the inequalities

2^n = |im(f_s)| \leq |G| \leq |\text{\emph{range}}(s)| \leq 2^n

must all be equalities.

Now, let {G} be the subgroup of {S_{range(s)}} generated by the image of {f_s}. By the Orbit-Stabilizer theorem, {|G| = |Orb(s)|*|Stab(s)|}. If we can show {|Stab(s)|=1}, then we are done, since {Orb(s) = \text{\emph{range}}(s)} as we’ve defined it.

It remains to show {|Stab(s)| = 1}. Let {g} be any non-identity element of {G}; we will show {g} does not fix {s}. Since {g} is not the identity, there is some {a\in range(s)} such that {ga \neq a}; moreover, by definition of {range(s)}, there is some {h\in G} such that {hs = a}. Thus, {gs = (h^{-1}gh)s = h^{-1}g(hs) = h^{-1}ga \neq h^{-1}a = s}. \Box

N.b. The group structure implied by the Proposition need not be the usual addition on {\mathbb{Z}_{2^n}}. For example, if we define {h(a,x) = a\oplus x}, the corresponding group is isomorphic to {(\mathbb{Z}_2)^n}, an {n}-dimensional vector space over {\mathbb{Z}_2}.

Exercise: Show that any abelian group of order {2^n} can be realized by an appropriately chosen hash accumulator {h}.

So what’s the takeaway?

If you want a commutative accumulative hash function that does not have these degenerate “no-op” elements, you will have to violate one of the assumptions we made. One possibility is to remove the assumption that the hash accumulator can only depend on the {n}-bit integers {a} and {x}. For example, one could let the accumulation state {a} consist of {m} bits, where {m>n}, and hash it down to {n} bits on demand; this would have the effect of mapping the {2^n} possible accumulation values into a much larger abelian group of order {2^m}, where there are fewer relations among the elements of the image if the mapping is chosen correctly.3 Alternatively, one could leverage the internal state of the container in some way, for example by examining the multiplicity of an element being inserted into a multi-set; this has the disadvantage of not being generic. In any case, the main takeaway is that this problem will not be solved by choosing a sufficiently clever accumulation function.

Footnotes

1. See this question if you’re curious about where the magic constant 2654435769 comes from.

2 . The assumption that {f} is one-to-one is fairly reasonable; its purpose is to avoid degenerate counter-examples like

h(a,x) = \left\{ \begin{array}{ll} a+x \text{ if } x \neq 0\\ a+1 \text{ otherwise. } \end{array} \right.

3. See BiggerHash for a bare-bones implementation with n = 64, m=128, compatible with std::unordered_set.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s