Remote-Url: https://cp4space.hatsya.com/2021/09/01/an-efficient-prime-for-number-theoretic-transforms/ Retrieved-at: 2021-12-09 21:36:08.093029+00:00

Complex Projective 4-Space

An efficient prime for number-theoretic transforms

Posted on September 1, 2021 by apgoucher

My new favourite prime is 18446744069414584321. It is given by p = \Phi_6(x) = x^2 - x + 1, where x = 2^{32}. This means that, in the finite field \mathbb{F}_p, 2^{32} functions as a primitive 6th root of unity, and therefore 2 is a primitive 192nd root of unity. It turns out that this field possesses multiple properties which make it well-suited to performing number-theoretic transforms (the finite field analogue of the Fast Fourier Transform).

Arithmetic

Firstly, note that arithmetic is especially convenient in this field. An element of the field can be represented as an unsigned 64-bit integer, since p is slightly less than 2^{64}. We can also efficiently reduce modulo p without involving multiplication or division. In particular, if we have a non-negative integer n less than 2^{159}, then we can write it in the form:

n = A \cdot 2^{96} + B \cdot 2^{64} + C

where A is a 63-bit integer, B is a 32-bit integer, and C is a 64-bit integer. Since 2^{96} is congruent to -1 modulo p, we can then rewrite this as:

B \cdot 2^{64} + (C - A)

If A happened to be larger than C, and therefore the result of the subtraction underflowed, then we can correct for this by adding p to the result. Now we have a 96-bit integer, and wish to reduce it further to a 64-bit integer less than p. To do this, we note that 2^{64} is congruent to 2^{32} - 1, so we can multiply B by 2^{32} using a binary shift and a subtraction, and then add it to the result. We might encounter an overflow, but we can correct for that by subtracting p.

In C/C++, the algorithm is as follows:

[reduce159]

This involves no multiplication or division instructions.
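(The post shows reduce159 as an image; the following is a minimal sketch reconstructed from the description above, assuming an argument order of (A, B, C) and a return value that is a 64-bit representative of n modulo p rather than a canonically reduced one. The code in the image may differ in these details.)

    #include <stdint.h>

    static const uint64_t P = 0xffffffff00000001ULL;   /* p = 2^64 - 2^32 + 1 */

    /* Reduce n = A*2^96 + B*2^64 + C modulo p, where A < 2^63, B < 2^32
       and C < 2^64.  Returns a 64-bit integer congruent to n modulo p. */
    uint64_t reduce159(uint64_t a, uint32_t b, uint64_t c) {
        /* 2^96 == -1 (mod p), so A*2^96 + C == C - A (mod p). */
        uint64_t d = c - a;
        if (a > c) { d += P; }                 /* correct the underflow */

        /* 2^64 == 2^32 - 1 (mod p), so B*2^64 == (B << 32) - B (mod p). */
        uint64_t e = ((uint64_t) b << 32) - (uint64_t) b;

        uint64_t r = d + e;
        if (r < e) { r -= P; }                 /* correct the overflow */
        return r;
    }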
Indeed, we can take a look at the compiled assembly code by using the Godbolt Compiler Explorer:

[reduce159asm]

Observe that, using LLVM as the compiler, the resulting code is branchless (despite the word 'if' appearing twice in the source code) and all of the instructions are particularly cheap.

Now, why would we end up with a 159-bit integer in the first place and want to reduce it? There are two occasions where this subroutine is useful (both are sketched in code at the end of this section):

* To perform a multiplication modulo p, we use machine instructions to multiply the two 64-bit operands to give a 128-bit result.
* Left-shifting a 64-bit integer by up to 95 bits gives a 159-bit result.

The latter allows us to cheaply multiply an element of our field by any power of 2. In particular, if we want to multiply by 2^{96a + b}, then do the following:

* if a is odd, then subtract the input from p (in-place), relying on the fact that 2^{96} is congruent to -1;
* shift-left by b bits (to give a 159-bit result) and reduce using the subroutine.

That is to say, multiplying by any 192nd root of unity is 'fast', bypassing the need to perform a general multiplication. (This is particularly useful for GPUs, where there isn't a 64-bit multiplier in hardware, so a 64-bit multiplication expands into many invocations of the narrower hardware multiplier.)
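Both uses of the reduction are easy to express on top of it. Here is a minimal sketch, assuming the reduce159 routine and the constant P from the sketch above, hypothetical helper names mulmod_p and mul_pow2, an input x already reduced into [0, p), and the GCC/Clang unsigned __int128 extension for the 128-bit product; none of this is taken from the post's own code.

    #include <stdint.h>

    /* General multiplication modulo p: form the 128-bit product and feed
       its pieces (A = top 32 bits, B = next 32 bits, C = low 64 bits)
       into reduce159. */
    uint64_t mulmod_p(uint64_t x, uint64_t y) {
        unsigned __int128 prod = (unsigned __int128) x * y;
        uint64_t lo = (uint64_t) prod;
        uint64_t hi = (uint64_t) (prod >> 64);
        return reduce159(hi >> 32, (uint32_t) hi, lo);
    }

    /* Multiply x (in [0, p)) by 2^k modulo p for 0 <= k < 192, using only
       shifts, additions/subtractions and one reduction.  Write k = 96a + b. */
    uint64_t mul_pow2(uint64_t x, unsigned k) {
        unsigned a = k / 96, b = k % 96;
        if (a & 1) { x = P - x; }              /* 2^96 == -1 (mod p) */

        /* Perform the left shift by b < 96 as a shift by r < 32 bits plus a
           shift by q whole 32-bit words, so no single shift count reaches 64. */
        unsigned r = b % 32, q = b / 32;       /* q is 0, 1 or 2 */
        uint64_t lo = x << r;                  /* low 64 bits of x * 2^r  */
        uint64_t hi = r ? (x >> (64 - r)) : 0; /* high 32 bits of x * 2^r */

        uint64_t w[5] = {0, 0, 0, 0, 0};       /* 32-bit words of x * 2^b */
        w[q]     = lo & 0xffffffffULL;
        w[q + 1] = lo >> 32;
        w[q + 2] = hi;

        /* Reassemble as A*2^96 + B*2^64 + C (a 159-bit value) and reduce. */
        return reduce159(w[3] | (w[4] << 32),  /* A */
                         (uint32_t) w[2],      /* B */
                         w[0] | (w[1] << 32)); /* C */
    }

For example, mul_pow2(x, 32) multiplies x by the primitive 6th root of unity 2^{32} without any hardware multiplication.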
Other roots of unity

As well as possessing these 192nd roots of unity, the field also contains roots of unity of any order dividing p - 1. This factorises as follows:

p - 1 = 2^{32} \times 3 \times 5 \times 17 \times 257 \times 65537

where those odd prime factors are the five known Fermat primes. We can perform a Fourier transform of any length that divides p - 1, since all of the requisite roots of unity exist, but this will be especially efficient if we only involve the smaller prime factors.

Most of these other roots of unity are 'slow', meaning that there is no obvious way to multiply by them without performing a general multiplication. There is, however, a nice observation by Schönhage: \sqrt{2} = \zeta + \zeta^7, where ζ is a primitive 8th root of unity, so we can efficiently multiply by the square root of 2 (which is, of course, a primitive 384th root of unity).

A length-N FFT involves the Nth roots of unity, so we would want to use FFTs of length dividing 192 (or 384 using the Schönhage trick) whenever possible. This suggests the following algorithm for convolving two sequences:

* choose an FFT length N that divides 3 \times 2^{32} and is large enough to accommodate the result of the convolution;
* write N as a product of (at most 5) numbers dividing 384. For example, in the largest case we could write N = 128 \times 128 \times 128 \times 96 \times 64;
* use this decomposition as the basis of a mixed-radix Cooley-Tukey FFT. That is to say, we apply:
  + N/128 FFTs of length 128;
  + N/128 FFTs of length 128;
  + N/128 FFTs of length 128;
  + N/96 FFTs of length 96;
  + N/64 FFTs of length 64.

Since there are at most 5 rounds of FFTs, we only need to apply arbitrary twiddle factors (which can be precomputed and stored in a table) 4 times, i.e. between successive rounds of FFTs. Compare this to an FFT over the complex numbers, where there are only 4 'fast' roots of unity (±1 and ±i) and therefore irrational twiddle factors must be applied much more often.

Fürer's algorithm

Martin Fürer's fast integer multiplication algorithm uses this idea recursively, expressing the integer multiplication problem as a convolution over a suitably chosen ring with plenty of fast roots of unity. In particular, he writes:

    We will achieve the desired speed-up by working over a ring with many "easy" powers of ω. Hence, the new faster integer multiplication algorithm is based on two key ideas:

    * An FFT version is used with the property that most occurring roots of unity are of low order.
    * The computation is done over a ring where multiplications with many low order roots of unity are very simple and can be implemented as a kind of cyclic shifts. At the same time, this ring also contains high order roots of unity.

The field \mathbb{F}_p for p = 2^{64} - 2^{32} + 1 is a fixed-size ring with these desirable properties. It supports multiplication of integers up to 3 \times 2^{35} bits (12 GB) by writing them in 'balanced base 2^{16}' (where each 'digit' can be between -32768 and 32767). Each of the two integers will occupy at most 3 \times 2^{31} digits, so the result can be computed by a convolution of length 3 \times 2^{32}. The maximum absolute value of a coefficient that could occur as a result of such a convolution is 3 \times 2^{61} < \frac{1}{2} p, so the coefficient can be recovered exactly if the convolution is performed over \mathbb{F}_p by reducing each coefficient into the range [-p/2, +p/2].
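To make the balanced digit encoding concrete, here is a minimal sketch of one way to produce it, assuming the input is already split into ordinary (unsigned) base-2^{16} digits stored little-endian; the function name and array layout are illustrative only, not anything from the post.

    #include <stdint.h>
    #include <stddef.h>

    /* Convert unsigned base-2^16 digits into 'balanced base 2^16' digits
       lying in [-32768, 32767], propagating a carry of at most 1. */
    void to_balanced_base(const uint16_t *in, int32_t *out, size_t n) {
        int32_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            int32_t d = (int32_t) in[i] + carry;   /* d in [0, 65536] */
            if (d >= 32768) { d -= 65536; carry = 1; }
            else            { carry = 0; }
            out[i] = d;                            /* now in [-32768, 32767] */
        }
        /* A final carry of 1 needs one extra output digit; a real
           implementation would size the output array accordingly. */
    }

Each output digit has absolute value at most 32768 = 2^{15}, which is what gives the 3 \times 2^{61} bound on the convolution coefficients above.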
Of course, Fürer's paper was interested in the asymptotics of integer multiplication, so it needs to work for arbitrarily large integers (not just integers of size at most 12 GB). The remaining idea was therefore to apply this construction recursively: to split a length-N integer into chunks of size O(log N), and use Fürer's algorithm itself to handle those chunks (by splitting into chunks of size O(log log N), and so forth, until reaching a 'base case' where the integers are sufficiently small that FFT-based methods are unhelpful).

The number of layers of recursion is therefore the number of times you need to iteratively take the logarithm of N before \log \log \cdots \log N < K, where K is the base-case cutoff point. This is called the iterated logarithm, and denoted \log^{\star} N. Each layer of recursion in Fürer's algorithm costs a constant factor F more than the previous layer, so the overall complexity is O(N \log N \cdot F^{\log^{\star} N}). It was the asymptotically fastest multiplication algorithm between its invention in 2007 (taking the crown from Schönhage-Strassen) and 2016 (when it was superseded by Harvey, van der Hoeven and Lecerf's even faster algorithm); a genuinely O(N log N) algorithm followed in 2019.

Practical integer multiplication

In practice, large integer multiplication tends to use either Schönhage-Strassen (in the case of the GNU Multiprecision Library), or floating-point FFTs, or a bunch of number-theoretic transforms over different primes with a final application of the Chinese Remainder Theorem (in the case of Alexander Yee's y-cruncher). The primes used by these number-theoretic transforms don't support any 'fast' roots of unity, though; all of the twiddle factors are applied using multiplications.

This suggests that a number-theoretic transform over the field of 18446744069414584321 elements may indeed be competitive for the sizes of integers it supports (e.g. up to 12 GB). For larger integers, we can use the Schönhage-Strassen algorithm with this prime field as a base case (one layer of Schönhage-Strassen being more than enough to support multiplication of integers that can be stored in practice).