In this post, we’re going to investigate an underexplored bridge between computer science and algebraic number theory. To motivate it, consider the analogy between floating point arithmetic and the theoretical real numbers. While floating points can only approximate the precision of a real number, much of what is rigorously established in the theoretical world carries over to the computational world. For example, the chain rule from calculus powers backpropagation of neural nets, and the spectral theorem from linear algebra plays a key role in principal component analysis. Here we are concerned with a similar analogy which applies to *integer* types rather than floating point types.

`double`

: :: `int`

: (?)

One obvious theoretical candidate is the ring of integers . For our purposes, however, a far more useful candidate is the ring of *2-adic integers* . For a primer on this connection and how it is used today to optimize pointer arithmetic in compilers, see this excellent blog post from Dan Piponi. Intuitively, the 2-adic integers are what happens when one performs unsigned integer arithmetic with an infinite number of leading bits. Although this space only exists theoretically, it can serve as a useful mental model in practice: for any finite *n*, one can “reduce” a 2-adic integer mod 2^{n} to get a standard *n*-bit number, and this reduction is compatible with addition and multiplication.

Our motivation for using this space is that has many nice properties which lacks, but which are nevertheless shared with the finite reductions . For example, in , *every* odd number has a *multiplicative inverse*, and *exact division* by that odd number is equivalent to multiplication by its inverse. This is what enables compilers to turn some divisions into multiplications at compile time. In contrast, only have multiplicative inverses in the usual ring of integers .

Another nice property of is that it is a *complete metric space *under the *2-adic metric*, which enables one to use analytic methods such as Taylor series and a 2-adic version of Newton’s method. There are also 2-adic analogues of the usual exponential and logarithm functions, which we’ll denote by and . These methods and functions are compatible with passing to the reductions .

Now fix a choice of . We’ll dive more into the details in a bit, but our final goal will be to leverage these methods to derive ** an algorithm for computing the power of one n-bit integer by another^{1}**. This algorithm will only require multiplications. In comparison, the method of repeated squaring requires multiplications on average. In practice, our C++ implementation for shows a ~

**4x**speedup over repeated squaring. In a later post, we will derive

**which is again demonstrably faster than existing approaches using the traditional Newton’s method.**

*an algorithm for computing the k-th root of a perfect k-th power integer*### Theoretical Background

Before giving a precise definition, we’ll state some properties of and and see how they plug into our final algorithm. One thing to keep in mind is that these are ** not** in any sense approximations to the natural logarithm and exponential, but rather analogues which operate in a fundamentally different realm.

The first thing to understand about a function is its domain and range. In most textbooks on p-adic analysis, the 2-adic logarithm would be defined on the entire space of 2-adic numbers. However, since we are trying to do actual computations, we will instead fix an integer (take say), and define it for the space of unsigned *n*-bit integers . With this convention, we can specify the domain and range of with the following function signature:

That is, the domain of is the set of *n*-bit integers congruent to , i.e. whose trailing two bits are `01`

, and its range is the set of integers congruent to , i.e. whose trailing two bits are `00`

.

The 2-adic exponential has the opposite function signature:

Together, these two functions satisfy the inversion formulas

on their respective domains. They also obey the same sum-product identities as the usual exponential and logarithm:

**N.b.** We are using modulo addition and multiplication for *n*-bit integers.

Putting this in more theoretical language, exhibits an isomorphism between the group under multiplication and the group under addition, and is its inverse.

*Aside:* So far this all sounds vaguely similar to the usual logarithm and exponential, and our choice of *n* is analogous to choosing the number of bits of precision in the floating point world. A crucial difference is that both the inversion and sum-product formulas are exact ** even after we have fixed n**. Contrast this with the world of floats where even simple operations like addition incur rounding errors.

So how does this help with our exponentiation and root algorithms? First note that the sum identity above immediately implies

Thus, when , we can compute as

One can extend this to all by observing that integer powers of both and are cheap to compute^{2}.

Summarizing, we have “reduced” the integer exponential calculation to an a and a single multiplication. We will see in a future post that this works just as well if x is a fraction, which enables us to compute roots. Thus, the performance of our exponentiation and root algorithms come down to how efficiently we can compute and .

### Precise Definitions

Now that we’ve seen how these functions plug into our algorithm, let’s actually define them. In short, they are defined by the usual convergent Taylor series:

where .

*Those fractions don’t look like integers.* For odd integers in the denominator, the division is really a shorthand for multiplication by the 2-adic multiplicative inverse. For powers of two in the denominator, we do mean the usual notion of division, but our assumption that is a multiple of guarantees that the numerator will always be more “2-divisible” than the denominator, so the result is always still a (2-adic) integer.

*What does “convergent” mean in this context?* The convergence here is with respect to the -adic topology. Concretely, it means that for any finite , eventually the terms will all be divisible by , and so the -bit representations of those terms is 0. Thus, for any fixed , the series is actually a finite sum.

To give a few concrete examples, for we have

as numbers modulo 2^{64}.

*How many terms do we actually need to compute?* That depends on . For a given unsigned integer , define , the “2-divisibility” of as follows:

- The exponent of the largest power of 2 which divides .
- Equivalently, the number of trailing zeros in the base-2 representation of .

In typical mathematical conventions, is set to be . In practice, the function `ctz`

and its variants offered by many modern CPU architectures computes , although the result is undefined when 0 is passed as an argument.

This function has the following properties:

In this notation, we only need to include the terms of the Taylor series for which . Getting back to our functions and , we are assuming in both cases that 4 divides , and so therefore . Using this, one can derive lower bounds for the terms of the series for and :

,

.

*Exercise*: Prove these estimates using the above properties.

From here, one can estimate that will require no more than terms, while will require no more than terms.

More generally, if , we have

,

,

in which case requires no more than terms, and requires no more than terms. More coarsely, each requires about terms.

**The Code **

### First Pass

Now that we have laid the theoretical foundation, we can start working on an implementation. For simplicity, we will now assume . The above estimates suggest we’ll need up to about the 35th term of the Taylor series for , and up to the 64th term for . A more precise calculation yields that for , we only need up to the 32nd term, while for , we only need up to the 58th term. Without further ado, here is the code.

Great! So we’ve effectively turned exponentiation into multiplication via a pair of functions which… each require over 50 multiplications. Hm. Not great. The problem is that the number of terms of the Taylor series is proportional to the number of bits, and each additional term typically requires one or two additional multiplies^{3}. In the next section, we’ll see how to cut this number down.

### Optimizing the number of terms

The key observation is that the number of terms we need to compute depends greatly on our assumption that . We saw earlier that when has higher 2-adic order, say , the series converges much faster. How can we leverage this faster convergence while still handling all possible input values?

One way to achieve this it to use a *lookup table* to adjust the lower order bits of , while modifying the final answer with pre-computed values. Let’s make this explicit.

First, pre-compute

Then, at runtime, if , we can reduce to the case as follows:

*Exercise*: Show that if , then

- .

In both cases, we are able to bump up the valuation of at the cost of a lookup, a subtraction, and a multiplication. Moreover, since the multiplication is always by , it can be further optimized to a shift-add.

Now how do we choose the number of lookups ? From our earlier estimates, the number of Taylor terms we need to compute is about . Taylor terms and lookups each require about one multiplication, suggesting the optimal is that which minimizes . This is minimized by . As it turns out, is actually slightly better and only requires 6 terms of the Taylor series for and , which only require 4 and 5 multiplications respectively.

The code for the optimized power function can be found here, alongside a basic implementation of the power function using repeated squaring.

### One last optimization

Before we get to the benchmarks, there is one more optimization to make that is specific to the exponentiation problem, which is to combine ingredients from the repeated squaring and 2-adic approaches. Namely, instead of using a lookup table to compute and for the lower-order bits, we directly compute the final answer for those bits without passing into “log space” at all. More explicitly, we can decompose the problem as

where . Computing the first term directly with repeated squaring requires fewer operations than using the lookup table approach. For the second term, we still compute it as

However, in general when is odd, we have

.

Thus, both and can be computed directly using only about terms of the Taylor series.

The code for this combined approach is here.

### Benchmarks

Now that we have something a bit more reasonable, let’s put it to the test on the exponentiation problem mentioned at the beginning, comparing to the usual approach of repeated squaring. Our benchmark takes a random 64-bit integer as the exponent and a random *odd* 64-bit integer as the base. An even base will almost always exponentiate to zero, which unfairly benefits the 2-adic power function as it will return early.

The benchmark code can be found here. For the task of computing randomly chosen powers on an Intel i7, we see the following results for the standard repeated squaring approach, the “adic” approach with lookup tables, and the combined approach, respectively:

```
Standard: 591ms
Adic: 266ms
Combined: 151ms
```

### Closing Thoughts

We have seen that leveraging 2-adic analytic techniques can pay dividends in the computational world. Although we initially set our sights on optimizing a specific computational task, we hope that this exposition makes a compelling case that there is more to be explored in this area.

For example, nearly everything we’ve talked about can be generalized to arithmetic modulo for any prime . A natural question is whether this has applications to cryptography, where one is often performing modular exponentiation. One caveat to the methods here is that they will not help with computations mod , but rather with the computations mod outside the “purely mod ” aspects. This is similar to how we used powers of and to reduce to the case . That said, higher prime powers do appear in some cryptographic protocols, e.g. prime power RSA.

We conclude with a few open questions:

- and can each be computed in multiplications. Using Harvey-Hoeven multiplication, this yields a complexity upper bound of . What is the true minimum complexity of and ?
- See also footnote 3 below.

- Can the cases be handled more elegantly? Is there an efficient branchless implementation of and ?
- In the extreme case where , we only need up to the linear term of the Taylor series for and . Thus, multiplication reduces to addition via the identity

.

- Is there a more general “2-adic” approach to multiplication that is competitive with FFT-based methods?

*Thanks to Nathan Bronson for feedback on an earlier version of this post.*

### Footnotes

^{1}The benchmarks and complexity analysis are done for *randomly chosen* bit integers and . In particular, with probability , the exponentiation will wrap around in bit space; that is, as natural numbers, and we are only computing its value mod . For choices of and which do not wrap around, repeated squaring will be much faster, since then and so at most multiplications are needed.

^{2}Any nonzero integer can be written uniquely as where . Powers of -1 depend only on the parity of the exponent, and powers of 2 can be computed as a bit-shift.

^{3}It is not obvious or even true in general that the number of multiplications required to compute a partial Taylor series is linear in the number of terms. For example, one can compute the partial sum of the first terms of

in only multiplications via

.

If and could be similarly decomposed, our method could be competitive with repeated squaring even for non-wraparound arguments, as well as the Newton-Raphson approach to computing multiplicative inverses. It would be interesting to see a more clever decomposition or a proof that none exists!