Remote-Url: https://cp4space.hatsya.com/2021/09/01/an-efficient-prime-for-number-theoretic-transforms/
Retrieved-at: 2021-12-09 21:36:08.093029+00:00
Complex Projective 4-Space
An efficient prime for number-theoretic transforms
Posted on September 1, 2021 by apgoucher
My new favourite prime is 18446744069414584321.
It is given by p = \Phi_6(x) = x^2 - x + 1, where x = 2^{32}. This means that,
in the finite field \mathbb{F}_p, 2^{32} functions as a primitive 6th root of
unity, and therefore 2 is a primitive 192nd root of unity. It turns out that
this field possesses multiple properties which make it well-suited to
performing number-theoretic transforms (the finite field analogue of the Fast
Fourier Transform).
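As a quick sanity check (code mine, not from the post), modular exponentiation confirms these orders: (2^{32})^3 and 2^{96} are both congruent to -1 modulo p, so 2^{32} and 2 have orders exactly 6 and 192 respectively.

```c
#include <assert.h>
#include <stdint.h>

/* p = 2^64 - 2^32 + 1 */
static const uint64_t P = 0xFFFFFFFF00000001ULL;

/* Modular multiplication via a 128-bit intermediate (GCC/Clang extension). */
static uint64_t mulmod(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) % P);
}

/* Modular exponentiation by repeated squaring. */
static uint64_t powmod(uint64_t base, uint64_t exp) {
    uint64_t result = 1;
    while (exp) {
        if (exp & 1) result = mulmod(result, base);
        base = mulmod(base, base);
        exp >>= 1;
    }
    return result;
}
```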
Arithmetic
Firstly, note that arithmetic is especially convenient in this field. An
element of the field can be represented as an unsigned 64-bit integer, since p
is slightly less than 2^64. We can also efficiently reduce modulo p without
involving multiplication or division. In particular, if we have a non-negative
integer n less than 2^159, then we can write it in the form:
n = A 2^{96} + B 2^{64} + C
where A is a 63-bit integer, B is a 32-bit integer, and C is a 64-bit integer.
Since 2^{96} is congruent to -1 modulo p, we can then rewrite this as:
B 2^{64} + (C - A)
If A happened to be larger than C, and therefore the result of the subtraction
underflowed, then we can correct for this by adding p to the result. Now we
have a 96-bit integer, and wish to reduce it further to a 64-bit integer less
than p. To do this, we note that 2^{64} is congruent to 2^{32} - 1, so we can
multiply B by 2^32 using a binary shift and a subtraction, and then add it to
the result. We might encounter an overflow, but we can correct for that by
subtracting p.
In C/C++, the algorithm is as follows:
[reduce159]
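The listing itself is embedded as an image; below is a sketch reconstructed from the description above (the names are mine, not the post's). For brevity it handles the 128-bit case that arises from multiplying two 64-bit field elements, where A fits in 32 bits, and adds a final conditional subtraction to canonicalize the result into [0, p).

```c
#include <assert.h>
#include <stdint.h>

/* p = 2^64 - 2^32 + 1. Note p = -(2^32 - 1) (mod 2^64), so adding or
   subtracting p in wrapping 64-bit arithmetic is a 32-bit constant. */
static const uint64_t P = 0xFFFFFFFF00000001ULL;
static const uint64_t MASK32 = 0xFFFFFFFFULL;

/* Reduce A*2^96 + B*2^64 + C modulo p, where hi = A*2^32 + B and lo = C
   (the shape produced by a 64x64 -> 128-bit multiply). */
static uint64_t reduce128(uint64_t hi, uint64_t lo) {
    uint64_t A = hi >> 32;
    uint64_t B = hi & MASK32;
    /* 2^96 = -1 (mod p): subtract A from the low word. */
    uint64_t t = lo - A;
    if (lo < A) t -= MASK32;     /* underflow: add p (mod 2^64) */
    /* 2^64 = 2^32 - 1 (mod p): fold B back in with a shift and subtract. */
    uint64_t u = (B << 32) - B;  /* B * (2^32 - 1); cannot overflow, B < 2^32 */
    uint64_t r = t + u;
    if (r < t) r += MASK32;      /* overflow: subtract p (mod 2^64) */
    if (r >= P) r -= P;          /* canonicalize into [0, p) */
    return r;
}

/* Multiplication in F_p using the reduction. */
static uint64_t mulmod_p(uint64_t a, uint64_t b) {
    unsigned __int128 prod = (unsigned __int128)a * b;
    return reduce128((uint64_t)(prod >> 64), (uint64_t)prod);
}
```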
This involves no multiplication or division instructions. Indeed, we can take a
look at the compiled assembly code by using the Godbolt Compiler Explorer:
[reduce159asm]
Observe that, using LLVM as the compiler, the resulting code is branchless
(despite the word 'if' appearing twice in the source code) and all of the
instructions are particularly cheap.
Now, why would we end up with a 159-bit integer in the first place and want to
reduce it? There are two occasions where this subroutine is useful:
* To perform a multiplication modulo p, we use machine instructions to
multiply the two 64-bit operands to give a 128-bit result.
* Left-shifting a 64-bit integer by up to 95 bits gives a 159-bit result.
The latter allows us to cheaply multiply an element of our field by any power
of 2. In particular, if we want to multiply by 2^(96a + b), then do the
following:
* if a is odd, then subtract the input from p (in-place), relying on the fact
that 2^{96} is congruent to -1;
* shift-left by b bits (to give a 159-bit result) and reduce using the
subroutine.
That is to say, multiplying by any 192nd root of unity is 'fast', bypassing the
need to perform a general multiplication. (This is particularly useful for
GPUs, where there isn't a 64-bit multiplier in hardware, so multiplication
expands to many invocations of the hardware multiplier.)
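The two steps can be sketched as follows (a hypothetical helper, not taken from the post; it reuses the 128-bit reduction described in the text and, for simplicity, restricts the shift to b < 64 so the intermediate fits in 128 bits rather than the full 159):

```c
#include <assert.h>
#include <stdint.h>

static const uint64_t P = 0xFFFFFFFF00000001ULL; /* 2^64 - 2^32 + 1 */

/* Reduce hi*2^64 + lo modulo p, per the steps described in the text. */
static uint64_t reduce128(uint64_t hi, uint64_t lo) {
    uint64_t A = hi >> 32, B = hi & 0xFFFFFFFFULL;
    uint64_t t = lo - A;
    if (lo < A) t -= 0xFFFFFFFFULL;   /* underflow: add p (mod 2^64) */
    uint64_t u = (B << 32) - B;       /* B * (2^32 - 1) = B * 2^64 (mod p) */
    uint64_t r = t + u;
    if (r < t) r += 0xFFFFFFFFULL;    /* overflow: subtract p (mod 2^64) */
    if (r >= P) r -= P;
    return r;
}

/* Multiply x (in [0, p)) by 2^(96a + b), 0 <= b < 64, using only a
   conditional negation, a shift, and the reduction subroutine. */
static uint64_t mul_pow2(uint64_t x, unsigned a, unsigned b) {
    if (a & 1) x = (x == 0) ? 0 : P - x;  /* 2^96 = -1 (mod p) */
    if (b == 0) return x;
    unsigned __int128 shifted = (unsigned __int128)x << b;
    return reduce128((uint64_t)(shifted >> 64), (uint64_t)shifted);
}
```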
Other roots of unity
As well as possessing these 192nd roots of unity, the field also contains roots
of unity of any order dividing p - 1. This factorises as follows:
p - 1 = 2^{32} \times 3 \times 5 \times 17 \times 257 \times 65537
where those odd prime factors are the five known Fermat primes. We can perform
a Fourier transform of any length that divides p - 1, since all of the
requisite roots of unity exist, but this will be especially efficient if we
only involve the smaller prime factors.
Most of these other roots of unity are 'slow', meaning that there is no obvious
way to multiply by them without performing a general multiplication. There is,
however, a nice observation by Schönhage: \sqrt{2} = \zeta + \zeta^7, where
\zeta is a primitive 8th root of unity, so we can efficiently multiply by the
square-root of 2 (which is, of course, a primitive 384th root of unity).
A length-N FFT involves the Nth roots of unity, so we would want to use FFTs of
length dividing 192 (or 384 using the Schönhage trick) whenever possible. This
suggests the following algorithm for convolving two sequences:
* choose an FFT length N that divides 3 \times 2^{32} and is large enough to
accommodate the result of the convolution;
* write N as a product of (at most 5) numbers dividing 384. For example, in
the largest case we could write N = 128 \times 128 \times 128 \times 96 \times 64;
* use this decomposition as the basis of a mixed-radix Cooley-Tukey FFT. That
is to say, we apply:
+ N/128 FFTs of length 128;
+ N/128 FFTs of length 128;
+ N/128 FFTs of length 128;
+ N/96 FFTs of length 96;
+ N/64 FFTs of length 64.
Since there are at most 5 rounds of FFTs, we only need to apply arbitrary
twiddle-factors (which can be precomputed and stored in a table) 4 times, i.e.
between successive rounds of FFTs. Compare this to an FFT over the complex
numbers, where there are only 4 'fast' roots of unity (±1 and ±i) and therefore
irrational twiddle factors must be applied much more often.
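As a small check (mine, not from the post), the example decomposition does multiply out to the largest supported length, and each radix divides 384, so every within-round root of unity is 'fast':

```c
#include <assert.h>
#include <stdint.h>

/* The example decomposition of the largest FFT length N = 3 * 2^32 into
   five radices, each dividing 384. */
static const uint64_t RADICES[5] = {128, 128, 128, 96, 64};

static uint64_t radix_product(void) {
    uint64_t n = 1;
    for (int i = 0; i < 5; i++) n *= RADICES[i];
    return n;
}
```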
Fürer's algorithm
Martin Fürer's fast integer multiplication algorithm uses this idea
recursively, expressing the integer multiplication problem as a convolution
over a suitably chosen ring with plenty of fast roots of unity. In particular,
he writes:
We will achieve the desired speed-up by working over a ring with many
"easy" powers of ω. Hence, the new faster integer multiplication algorithm
is based on two key ideas:
* An FFT version is used with the property that most occurring roots of
unity are of low order.
* The computation is done over a ring where multiplications with many low
order roots of unity are very simple and can be implemented as a kind
of cyclic shifts. At the same time, this ring also contains high order
roots of unity.
The field \mathbb{F}_p for p = 2^{64} - 2^{32} + 1 is a fixed-size ring with
these desirable properties. It supports multiplication of integers up to
3 \times 2^{35} bits (12 GB) by writing them in 'balanced base 2^16' (where each
'digit' can be between -32768 and 32767). Each of the two integers will occupy
at most 3 \times 2^{31} digits, so the result can be computed by a convolution
of length 3 \times 2^{32}. The maximum absolute value of a coefficient that
could occur as a result of such a convolution is 3 \times 2^{61} < \frac{1}{2}
p, so the coefficient can be recovered exactly if the convolution is performed
over \mathbb{F}_p by reducing each coefficient into the range [-p/2, +p/2].
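The bound follows from counting: each convolution coefficient is a sum of at most 3 * 2^31 digit products, each of magnitude at most 2^15 * 2^15 = 2^30. A quick verification (mine, not from the post) that this clears p/2:

```c
#include <assert.h>
#include <stdint.h>

static const uint64_t P = 0xFFFFFFFF00000001ULL; /* 2^64 - 2^32 + 1 */

/* At most 3 * 2^31 products per coefficient, each of magnitude at most
   2^30, so a convolution coefficient is bounded by 3 * 2^61 in magnitude. */
static const uint64_t MAX_TERMS = 3ULL << 31;
static const uint64_t MAX_DIGIT = 1ULL << 15;   /* |digit| <= 32768 */
static const uint64_t MAX_COEFF = 3ULL << 61;
```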
Of course, Fürer's paper was interested in the asymptotics of integer
multiplication, so it needs to work for arbitrarily large integers (not just
integers of size at most 12 GB). The remaining idea was therefore to apply this
idea recursively: to split a length-N integer into chunks of size O(log N), and
use Fürer's algorithm itself to handle those chunks (by splitting into chunks
of size O(log log N), and so forth, until reaching a 'base case' where the
integers are sufficiently small that FFT-based methods are unhelpful).
The number of layers of recursion is therefore the number of times you need to
iteratively take the logarithm of N before \log \log \log \cdots \log \log N < K, where K
is the base-case cutoff point. This is called the iterated logarithm, and
denoted \log^{\star} N.
Each layer of recursion in Fürer's algorithm costs a constant factor F more
than the previous layer, so the overall complexity is O(N \log N F^{\log^{\star} N}).
It was the asymptotically fastest multiplication algorithm between its invention
in 2007 (taking the crown from Schönhage-Strassen) and 2019 (when an O(N \log N)
algorithm was discovered by Harvey and van der Hoeven).
Practical integer multiplication
In practice, large integer multiplication tends to use either
Schönhage-Strassen (in the case of the GNU Multiprecision Library), or
floating-point FFTs, or a bunch of number-theoretic transforms over different
primes with a final application of the Chinese Remainder Theorem (in the case
of Alexander Yee's y-cruncher). The primes used by these number-theoretic
transforms don't support any 'fast' roots of unity, though; all of the twiddle
factors are applied using multiplications.
This suggests that a number-theoretic transform over the field of
18446744069414584321 elements may indeed be competitive for the sizes of
integers it supports (e.g. up to 12 GB). For larger integers, we can use the
Schönhage-Strassen algorithm with this prime field as a base case (one layer of
Schönhage-Strassen being more than enough to support multiplication of integers
that can be stored in practice).