2-adic Logarithms and Fast Exponentiation

In this post, we’re going to investigate an underexplored bridge between computer science and algebraic number theory. To motivate it, consider the analogy between floating point arithmetic and the theoretical real numbers. While floating points can only approximate the precision of a real number, much of what is rigorously established in the theoretical world carries over to the computational world. For example, the chain rule from calculus powers backpropagation of neural nets, and the spectral theorem from linear algebra plays a key role in principal component analysis. Here we are concerned with a similar analogy which applies to integer types rather than floating point types.

double: {\mathbb{R}} :: {\quad}int: (?)

One obvious theoretical candidate is the ring of integers {\mathbb{Z}}. For our purposes, however, a far more useful candidate is the ring of 2-adic integers {\mathbb{Z}_2}. For a primer on this connection and how it is used today to optimize pointer arithmetic in compilers, see this excellent blog post from Dan Piponi. Intuitively, the 2-adic integers are what happens when one performs unsigned integer arithmetic with an infinite number of leading bits. Although this space only exists theoretically, it can serve as a useful mental model in practice: for any finite n, one can “reduce” a 2-adic integer mod 2n to get a standard n-bit number, and this reduction is compatible with addition and multiplication.

Our motivation for using this space is that {\mathbb{Z}_2} has many nice properties which {\mathbb{Z}} lacks, but which are nevertheless shared with the finite reductions {\mathbb{Z}/2^n}. For example, in {\mathbb{Z}_2}, every odd number has a multiplicative inverse, and exact division by that odd number is equivalent to multiplication by its inverse. This is what enables compilers to turn some divisions into multiplications at compile time. In contrast, only {\pm 1} have multiplicative inverses in the usual ring of integers {\mathbb{Z}}.

Another nice property of {\mathbb{Z}_2} is that it is a complete metric space under the 2-adic metric, which enables one to use analytic methods such as Taylor series and a 2-adic version of Newton’s method. There are also 2-adic analogues of the usual exponential and logarithm functions, which we’ll denote by {\exp_2} and {\log_2}. These methods and functions are compatible with passing to the reductions {\mathbb{Z}/2^n}.

Now fix a choice of {n}. We’ll dive more into the details in a bit, but our final goal will be to leverage these methods to derive an algorithm for computing the power of one n-bit integer by another1. This algorithm will only require {O(\sqrt{n})} multiplications. In comparison, the method of repeated squaring requires {O(n)} multiplications on average. In practice, our C++ implementation for {n=64} shows a ~4x speedup over repeated squaring. In a later post, we will derive an algorithm for computing the k-th root of a perfect k-th power integer which is again demonstrably faster than existing approaches using the traditional Newton’s method.

Theoretical Background

Before giving a precise definition, we’ll state some properties of {\log_2} and {\exp_2} and see how they plug into our final algorithm. One thing to keep in mind is that these are not in any sense approximations to the natural logarithm and exponential, but rather analogues which operate in a fundamentally different realm.

The first thing to understand about a function is its domain and range. In most textbooks on p-adic analysis, the 2-adic logarithm would be defined on the entire space of 2-adic numbers. However, since we are trying to do actual computations, we will instead fix an integer {n} (take {n=64} say), and define it for the space of unsigned n-bit integers {\mathbb{Z}/2^n}. With this convention, we can specify the domain and range of {\log_2} with the following function signature:

\displaystyle \log_{2}: 1+4\mathbb{Z}/2^n \rightarrow 4\mathbb{Z}/2^n.

That is, the domain of {\log_2} is the set of n-bit integers congruent to {1(\text{mod }4)}, i.e. whose trailing two bits are 01, and its range is the set of integers congruent to {0 (\text{mod }4)}, i.e. whose trailing two bits are 00.

The 2-adic exponential has the opposite function signature:

\displaystyle \exp_{2}: 4\mathbb{Z}/2^n \rightarrow 1+4\mathbb{Z}/2^n.

Together, these two functions satisfy the inversion formulas

\displaystyle\exp_{2}(\log_{2}(a)) = a,

\displaystyle\log_{2}(\exp_{2}(a)) = a

on their respective domains. They also obey the same sum-product identities as the usual exponential and logarithm:

\displaystyle \log_{2}(ab) = \log_{2}(a)+\log_{2}(b),

\displaystyle \exp_{2}(a+b) = \exp_{2}(a)\exp_{2}(b).

N.b. We are using modulo addition and multiplication for n-bit integers.

Putting this in more theoretical language, {\log_2} exhibits an isomorphism between the group {1+4\mathbb{Z}/2^n} under multiplication and the group {4\mathbb{Z}/2^n} under addition, and {\exp_2} is its inverse.

Aside: So far this all sounds vaguely similar to the usual logarithm and exponential, and our choice of n is analogous to choosing the number of bits of precision in the floating point world. A crucial difference is that both the inversion and sum-product formulas are exact even after we have fixed n. Contrast this with the world of floats where even simple operations like addition incur rounding errors.

So how does this help with our exponentiation and root algorithms? First note that the sum identity above immediately implies

\displaystyle \log_{2}(a^x) = x\log_{2}(a).

Thus, when {a \equiv 1 (\text{mod }4)}, we can compute {a^x} as

\displaystyle a^x = \exp_{2}(\log_{2}(a^x)) = \exp_{2}(x\log_{2}(a)).

One can extend this to all {a \not\equiv 1 (\text{mod }4)} by observing that integer powers of both {-1} and {2} are cheap to compute2.

Summarizing, we have “reduced” the integer exponential calculation to an {\exp_2,} a {\log_2,} and a single multiplication. We will see in a future post that this works just as well if x is a fraction, which enables us to compute roots. Thus, the performance of our exponentiation and root algorithms come down to how efficiently we can compute {\exp_2} and {\log_2}.

Precise Definitions

Now that we’ve seen how these functions plug into our algorithm, let’s actually define them. In short, they are defined by the usual convergent Taylor series:

\displaystyle \log_2(1+x) = x-\frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \ldots
\displaystyle \exp_2(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \ldots

where {x\in 4\mathbb{Z}/2^n}.

Those fractions don’t look like integers. For odd integers in the denominator, the division is really a shorthand for multiplication by the 2-adic multiplicative inverse. For powers of two in the denominator, we do mean the usual notion of division, but our assumption that {x} is a multiple of {4} guarantees that the numerator will always be more “2-divisible” than the denominator, so the result is always still a (2-adic) integer.

What does “convergent” mean in this context? The convergence here is with respect to the {2}-adic topology. Concretely, it means that for any finite {n}, eventually the terms will all be divisible by {2^n}, and so the {n}-bit representations of those terms is 0. Thus, for any fixed {n}, the series is actually a finite sum.

To give a few concrete examples, for {n=64} we have

  • {\log_2(1) = 0}
  • {\log_2(5) =6713115954038056572}
  • {\log_2(9) =6165135171829223912}

as numbers modulo 264.

How many terms do we actually need to compute? That depends on {n}. For a given unsigned integer {a}, define {v_2(a)}, the “2-divisibility” of {a} as follows:

  • The exponent of the largest power of 2 which divides {a}.
  • Equivalently, the number of trailing zeros in the base-2 representation of {a}.

In typical mathematical conventions, {v_2(0)} is set to be {\infty}. In practice, the function ctz and its variants offered by many modern CPU architectures computes {v_2}, although the result is undefined when 0 is passed as an argument.

This function has the following properties:

  • \displaystyle v_2(ab) = v_2(a) + v_2(b)
  • \displaystyle v_2(a+b) = \min(v_2(a),v_2(b))\quad \text{if } v_2(a)\neq v_2(b)
  • \displaystyle v_2(a+b) \geq \min(v_2(a),v_2(b))\quad \text{if } v_2(a)= v_2(b)
  • \displaystyle v_2(a) = v_2(\exp_2(a)-1)\quad \text{for }a\in 4\mathbb{Z}/2^n
  • \displaystyle v_2(a) = v_2(\log_2(1+a))\quad \text{for }a\in 4\mathbb{Z}/2^n

In this notation, we only need to include the terms of the Taylor series for which {v_2(\text{term}) < n}. Getting back to our functions {\log_2(1+x)} and {\exp_2(x)}, we are assuming in both cases that 4 divides {x}, and so therefore {v_2(x) \geq 2}. Using this, one can derive lower bounds for the terms of the series for {\log_2} and {\exp_2}:

\displaystyle v_2\left(\frac{x^k}{k}\right) \geq 2k-\lg(k),
\displaystyle v_2\left(\frac{x^k}{k!}\right) \geq 2k-k = k.

Exercise: Prove these estimates using the above properties.

From here, one can estimate that {\log_2} will require no more than {(n+\lg n)/2} terms, while {\exp_2} will require no more than {n} terms.

More generally, if {v_2(x) \geq m}, we have

\displaystyle v_2\left(\frac{x^k}{k}\right) \geq mk-\lg(k),
\displaystyle v_2\left(\frac{x^k}{k!}\right) \geq mk-k = (m-1)k,

in which case {\log_2} requires no more than {(n+\lg n)/m} terms, and {\exp_2} requires no more than {n/(m-1)} terms. More coarsely, each requires about {n/m} terms.

The Code

First Pass

Now that we have laid the theoretical foundation, we can start working on an implementation. For simplicity, we will now assume {n=64}. The above estimates suggest we’ll need up to about the 35th term of the Taylor series for {\log_2}, and up to the 64th term for {\exp_2}. A more precise calculation yields that for {\log_2}, we only need up to the 32nd term, while for {\exp_2}, we only need up to the 58th term. Without further ado, here is the code.

Great! So we’ve effectively turned exponentiation into multiplication via a pair of functions which… each require over 50 multiplications. Hm. Not great. The problem is that the number of terms of the Taylor series is proportional to the number of bits, and each additional term typically requires one or two additional multiplies3. In the next section, we’ll see how to cut this number down.

Optimizing the number of terms

The key observation is that the number of terms we need to compute depends greatly on our assumption that {v_2(x) \geq 2}. We saw earlier that when {x} has higher 2-adic order, say {v_2(x) \geq m}, the series converges much faster. How can we leverage this faster convergence while still handling all possible input values?

One way to achieve this it to use a lookup table to adjust the lower order bits of {x}, while modifying the final answer with pre-computed values. Let’s make this explicit.

First, pre-compute

\displaystyle \log_2(1+2^2),\\ \log_2(1+2^3),\\ \log_2(1+2^4),\\ \ldots\\ \log_2(1+2^{m-1}).

Then, at runtime, if {v_2(x)\geq k}, we can reduce to the case {v_2(x)\geq k+1} as follows:

\displaystyle \log_2(1+x)= \begin{cases} \log_2\left((1+x)(1+2^k)\right){-}\log_2(1+2^k)\quad & \text{if } v_2(x)= k,\\ \log_2(1+x)\quad & \text{if } v_2(x)\geq k+1.\\ \end{cases} \ \ \ \ \ (7)

\displaystyle \exp_2(x) = \begin{cases} (1+2^k)\exp_2(x-\log_2(1+2^k))\quad & \text{if } v_2(x)= k,\\ \exp_2(x)\quad & \text{if } v_2(x)\geq k+1.\\ \end{cases} \ \ \ \ \ (8)

Exercise: Show that if {v_2(x)= k}, then

  • {v_2((1+x)(1+2^k)-1) \geq k+1}
  • {v_2(x-\log_2(1+2^k)) \geq k+1}.

In both cases, we are able to bump up the valuation of {x} at the cost of a lookup, a subtraction, and a multiplication. Moreover, since the multiplication is always by {1+2^k}, it can be further optimized to a shift-add.

Now how do we choose the number of lookups {m}? From our earlier estimates, the number of Taylor terms we need to compute is about {n/m}. Taylor terms and lookups each require about one multiplication, suggesting the optimal {m} is that which minimizes {m + n/m}. This is minimized by {\sqrt{n}=8}. As it turns out, {m=10} is actually slightly better and only requires 6 terms of the Taylor series for {\log_2} and {\exp_2}, which only require 4 and 5 multiplications respectively.

The code for the optimized power function can be found here, alongside a basic implementation of the power function using repeated squaring.

One last optimization

Before we get to the benchmarks, there is one more optimization to make that is specific to the exponentiation problem, which is to combine ingredients from the repeated squaring and 2-adic approaches. Namely, instead of using a lookup table to compute {\log_2} and {\exp_2} for the lower-order bits, we directly compute the final answer for those bits without passing into “log space” at all. More explicitly, we can decompose the problem as

\displaystyle a^b = a^{b'+2^{m-2}b''} = a^{b'} (a^{2^{m-2}})^{b''}, \ \ \ \ \ (9)

where {b' < 2^{m-2}}. Computing the first term directly with repeated squaring requires fewer operations than using the lookup table approach. For the second term, we still compute it as

\displaystyle (a^{2^{m-2}})^{b''} = \exp_2\left(b''\log_2\left(a^{2^{m-2}}\right)\right).

However, in general when {a} is odd, we have

\displaystyle {a^{2^{(m-2)}}\equiv 1\text{(mod }2^{m})}.

Thus, both {\log_2} and {\exp_2} can be computed directly using only about {n/m} terms of the Taylor series.

The code for this combined approach is here.


Now that we have something a bit more reasonable, let’s put it to the test on the exponentiation problem mentioned at the beginning, comparing to the usual approach of repeated squaring. Our benchmark takes a random 64-bit integer as the exponent and a random odd 64-bit integer as the base. An even base will almost always exponentiate to zero, which unfairly benefits the 2-adic power function as it will return early.

The benchmark code can be found here. For the task of computing {10^6} randomly chosen powers on an Intel i7, we see the following results for the standard repeated squaring approach, the “adic” approach with lookup tables, and the combined approach, respectively:

Standard: 591ms
Adic: 266ms
Combined: 151ms

Closing Thoughts

We have seen that leveraging 2-adic analytic techniques can pay dividends in the computational world. Although we initially set our sights on optimizing a specific computational task, we hope that this exposition makes a compelling case that there is more to be explored in this area.

For example, nearly everything we’ve talked about can be generalized to arithmetic modulo {p^n} for any prime {p}. A natural question is whether this has applications to cryptography, where one is often performing modular exponentiation. One caveat to the methods here is that they will not help with computations mod {p}, but rather with the computations mod {p^n} outside the “purely mod {p}” aspects. This is similar to how we used powers of {-1} and {2} to reduce to the case {1\text{ (mod }4)}. That said, higher prime powers do appear in some cryptographic protocols, e.g. prime power RSA.

We conclude with a few open questions:

  • {\log_2} and {\exp_2} can each be computed in {O(\sqrt{n})} multiplications. Using Harvey-Hoeven multiplication, this yields a complexity upper bound of {O(n^{1.5}\lg n)}. What is the true minimum complexity of {\log_2} and {\exp_2}?
    • See also footnote 3 below.
  • Can the cases {x \not \equiv 1\text{(mod }4\text{)}} be handled more elegantly? Is there an efficient branchless implementation of {\log_2} and {\exp_2}?
  • In the extreme case where {x,y \equiv 1\text{(mod }2^{n/2}\text{)}}, we only need up to the linear term of the Taylor series for {\log_2} and {\exp_2}. Thus, multiplication reduces to addition via the identity

\displaystyle {xy \equiv x+y-1\text{(mod }2^n\text{)}}.

    Is there a more general “2-adic” approach to multiplication that is competitive with FFT-based methods?

Thanks to Nathan Bronson for feedback on an earlier version of this post.


1The benchmarks and complexity analysis are done for randomly chosen {n-}bit integers {a} and {b}. In particular, with probability {\sim 1}, the exponentiation {a^b} will wrap around in {n-} bit space; that is, {a^b > 2^n} as natural numbers, and we are only computing its value mod {2^n}. For choices of {a} and {b} which do not wrap around, repeated squaring will be much faster, since then {b < n} and so at most {O(\lg n)} multiplications are needed.

2Any nonzero integer can be written uniquely as {\pm 2^kx} where {x \equiv 1(\text{mod }4)}. Powers of -1 depend only on the parity of the exponent, and powers of 2 can be computed as a bit-shift.

3It is not obvious or even true in general that the number of multiplications required to compute a partial Taylor series is linear in the number of terms. For example, one can compute the partial sum of the first {2^k} terms of

\displaystyle \frac{1}{1-x} = 1+x+x^2+...

in only {2k-2} multiplications via

\displaystyle 1+x+x^2+...+x^{2^k-1} = (1+x)(1+x^2)...(1+x^{2^{k-1}}).

If {\log_2} and {\exp_2} could be similarly decomposed, our method could be competitive with repeated squaring even for non-wraparound arguments, as well as the Newton-Raphson approach to computing multiplicative inverses. It would be interesting to see a more clever decomposition or a proof that none exists!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s