Remote-Url: https://ciechanow.ski/exposing-floating-point/
Retrieved-at: 2022-01-25 16:53:00.984802+00:00

Despite everyday use, floating point numbers are often understood in a hand-wavy manner and their behaviorraises many eyebrows.
Over the course of this article I’d like to show that things aren’t actually that complicated.This blog post is a companion to my recently launched website –float.exposed. Other than exploiting the absurdity of present daylist of top level domains, it’s intended to be a handy tool for inspecting floating point numbers. While I encourage you to play with it, the purpose of many of its elements may be exotic at first. By the time we’ve finished, however, all of them will hopefully become familiar.On a technical note, by floating point I’m referring to the ubiquitousIEEE 754binary floating point format. Typeshalf,float, anddoubleare understood to be binary16, binary32, and binary64 respectively. There wereother formatsback in the day, but whatever device you’re reading this on ispretty much guaranteedto use IEEE 754.With the formalities out of the way, let’s start at the shallow end of the pool.We’ll begin with the very basics of writing numeric values. The initial steps may seem trivial, but starting from the first principles will help us build a working model of floating point numbers.Decimal NumbersConsider the number 327.849. Digits to the left of the decimal point represent increasing powers of ten, while digits to the right of the decimal point represent decreasing powers of ten:Even though this notation is very natural, it has a few disadvantages:small numbers like 0.000000000653 require skimming over many zeros before they start “showing” actually useful digitsit’s hard to estimate the magnitude of large numbers like 7298345251 at a glanceat some point the distant digits of a number become increasingly less significant and could often be dropped, yet for big numbers we don’t save any space by replacing them with zeros, e.g. 7298000000By “small” and “big” numbers I’m referring to theirmagnitudeso −4205 is understood to be bigger than 0.03 even though it’s to the left of it on the real number line.Scientific notationsolves all these problems. It shifts the decimal point to right after the first non-zero digit and sets the exponent accordingly:Scientific notation has three major components: the sign (+), the significand (3.27849), and the exponent (2). For positive values the “+” sign is often omitted, but we’ll keep it around for the sake of verbosity. Note that the “10” simply shows that we’re dealing with base-10 system. The aforementioned disadvantages disappear:the 0-heavy small number is presented as 6.53×10−10with all the pesky zeros removedjust by looking at the first digit and the exponent of 7.298345251×109we know that number is roughly 7 billionwe can drop the unwanted distant digits from the tail to get 7.298×109Continuing with the protagonist of this section, if we’re only interested in 4 most significant digits we can round the number using one of themany rounding rules:The number of digits shown describes the precision we’re dealing with. A number with 8 digits of precision could be printed as:Binary NumbersWith the familiar base-10 out of the way, let’s look at the binary numbers. The rules of the game are exactly the same, it’s just that the base is 2 and not 10. Digits to the left of the binary point represent increasing powers of two, while digits to the right of the binary point represent decreasing powers of two:When ambiguous I’ll use2to mean the number is in base-2. As such, 10002is not a thousand, but 23i.e. eight. To get the decimal value of the discussed 1001.01012we simply sum up the powers of two that have the bit set: 8 + 1 + 0.25 + 0.0625, ending up with the value of 9.3125.Binary numbers can use scientific notation as well. Since we’re shifting the binary point by three places, the exponent ends up having the value of 3:Similarly to scientific notation in base-10, we also moved the binary point to right after the first non-zero digit of the original representation. However, since the only non-zero digit in base-2 system is 1,everynon-zero binary number in scientific notation starts with a 1.We can round the number to a shorter form:Or show that we’re more accurate by storing 11 binary digits:If you’ve grasped everything that we’ve discussed so far then congratulations – you understand how floating point numbers work.Floating points numbers are just numbers in base-2 scientific notation with the following two restrictions:limited number of digits in the significandlimited range of the exponent – it can’t be greater than some maximum limit and also can’t be less than some minimum limitThat’s (almost) all there is to them.Different floating point types have different number of significand digits and allowed exponent range. For example, afloathas 24 binary digits (i.e. bits) of significand and the exponent range of [−126, +127], where “[” and “]” denote inclusivity of the range (e.g. +127 is valid, but +128 is not). Here’s a number with a decimal value of −616134.5625 that can fit in afloat:−1.00101100110110001101001×219Unfortunately, the number of bits of significand in afloatis limited, so some real values may not be perfectly representable in the floating point form. A decimal number 0.2 has the following base-2 representation:Theoverline(technically known asvinculum) indicates a forever repeating value. The 25thand later significant digits of the perfect base-2 scientific representation of that number won’t fit in afloatand have to be accounted for by rounding the remaining bits. The full significand:1.100110011001100110011001100Will be rounded to:1.10011001100110011001101After multiplication by the exponent the resulting number has adifferentdecimal value than the perfect 0.2:0.20000000298023223876953125If we tried rounding the full significand down:1.10011001100110011001100The resulting number would be equal to:0.199999988079071044921875​No matter what we do, the limited number of bits in the significand prevents us from getting the correct result. This explains why some decimal numbers don’t have their exact floating point representation.Similarly, since the value of the exponent is limited, many huge and many tiny numbers won’t fit in afloat: neither 2200nor 2−300can be represented since they don’t fall into the allowed exponent range of [−126, +127].Knowing the number of bits in the significand and the allowed range of the exponent we can start encoding floating point numbers into their binary representation. We’ll use the number −2343.53125 which has the following representation in base-2 scientific notation:The SignThe sign is easy – we just need 1 bit to express whether the number is positive or negative. IEEE 754 uses the value of0for the former and1for the latter. Since the discussed number is negative we’ll use one:The SignificandFor the significand of afloatwe need 24 bits. However, per whatwe’ve already discussed, the first digit of the significand in base-2 is always 1, so the format cleverly skips it to save a bit. We just have to remember it’s there when doing calculations. We copy the remaining 23 digits verbatim while filling in the missing bits at the end with 0s:The leading “1” we skipped is often referred to as an “implicit bit”.The ExponentSince the exponent range of [−126, +127] allows 254 possible values, we’ll need 8 bits to store it. To avoid special handling of negative exponent values we’ll add a fixedbiasto make sure no encoded exponent is negative.To obtain abiasedexponent we’ll use the bias value of 127. While 126 would work for regular range of exponents, using 127 will let us reserve a biased value of 0 forspecial purposes. Biasing is just a matter of shifting all values to the right:The bias in afloatFor the discussed number we have to shift its exponent of 11 by 127 to get 138, or 100010102and that’s what we will encode as the exponent:Putting it All TogetherTo conform with the standard we’ll put the sign bit first, then the exponent bits, and finally, the significand bits. While seemingly arbitrary, the order is part of the standard’singenuity. By sticking all the pieces together afloatis born:11000101000100100111100010000000The entire encoding occupies 32 bits. To verify we did things correctly we can fire up LLDB and let the hackytype punningdo its work:(lldb) p -2343.53125f
(float) $0 = -2343.53125

(lldb) p/t *(uint32_t *)&$0
(uint32_t) $1 = 0b11000101000100100111100010000000While neither C nor C++ standardstechnicallyrequire afloator adoubleto be represented using IEEE 754 format, the rest of this article will sensibly assume so.The same procedure of encoding a number in base-2 scientific notation can be repeated for almost any number, however, some of them require special handling.Thefloatexponent range allows 254 different values and with a bias of 127 we’re left with two yet unused biased exponent values: 0 and 255. Both are employed for very useful purposes.A Map of FloatsA dry description doesn’t really paint a picture, so let’s present all the special values visually. In the following plot every dot represents a unique positivefloat:All the special valuesIf you have trouble seeing color you canswitch to the alternative version.If you don’t have trouble seeing color you canswitch to the color version.Notice the necessary truncation of a large part of exponents and of a gigantic part of significand values. At your current viewing size you’d have to scroll through roughlywindow widths to see all the values of the significand.We’ve already discussed all the unmarked dots — the normal floats. It’s time to dive into the remaining values.ZeroAfloatnumber with biased exponent value of 0andall zeros in significand is interpreted as positive or negative 0. The arbitrary value of sign (shown as_) decides which 0 we’re dealing with:_0000000000000000000000000000000Yes, the floating point standard specifies both +0.0 and −0.0. This concept is actually useful because it tells us from which “direction” the 0 was approached as a result of storing value too small to be represented in afloat. For instance-10e-30f / 10e30fwon’t fit in afloat, however, it will produce the value of-0.0.When working with zeros note that0.0 == -0.0is true even though the two zeros have different encoding. Additionally,-0.0 + 0.0is equal to0.0, so by default the compiler can’t optimizea + 0.0into justa, however, you canset flagsto relax the strict conformance.InfinityAfloatnumber with maximum biased exponent valueandall zeros in significand is interpreted as positive or negative infinity depending on the value of the sign bit:_1111111100000000000000000000000Infinity arises as a result of rounding a value that’s too large to fit in the type (assuming default rounding mode). In case of afloat, any number in base-2 scientific notation with exponent greater than 127 will become infinity. You can also use macroINFINITYdirectly.The positive and negative zeros become useful again since dividing a positive value by +0.0 will produce a positive infinity, while dividing it by −0.0 will produce a negative infinity.Operations involvingfinitenumbers and infinities are actually well defined and follow common sense property of keeping infinities infinite:any finite value added to or subtracted from ±infinity ends up as ±infinityany finite positive value multiplied by ±infinity ends up as ±infinity, while any finite negative value multiplied by ±infinity flips its sign to ∓infinitydivision by a finite non-zero value works similarly to multiplication (think of division as multiplication by an inverse)square root of a +infinity is +infinityany finite value divided by ±infinity will become ±0.0 depending on the signs of the operandsIn other words, infinities are so big that any shifting or scaling won’t affect their infinite magnitude, only their sign may flip. However, some operations throw a wrench into that simple rule.NaNsAfloatnumber with maximum biased exponent valueandnon-zero significand is interpreted as NaN – Not a Number:The easiest way to obtain NaN directly is by usingNANmacro. In practice though, NaN arises in the following set of operations:±0.0 multiplied by ±infinity−infinity added to +infinity±0.0 divided by ±0.0±infinity divided by ±infinitysquare root of a negative number (−0.0 is fine though!)If the floating point variable is uninitialized, it’s also somewhat likely to contain NaNs. By default the result of any operation involving NaNs will result in a NaN as well. That’soneof the reasons why compiler can’t optimize seemingly simple cases likea + (b - b)into justa. Ifbis NaN the result of the entire operationhas tobe NaN too.NaNs are not equal to anything, even to themselves. If you were to look at your compiler’s implementation ofisnanfunction you’d see something likereturn x != x;.It’s worth pointing out how many different NaN values there are – afloatcan store 223−1 (over 8 million) different NaNs, while adoublefits 252−1 (over 4.5 quadrillion) different NaNs. It may seem wasteful, but the standard specifically made the pool large for, quote, “uninitialized variables and arithmetic-like enhancements”. You can read about one of those uses inAnnie Cherkaev’s very interesting“the secret life of NaN”. Her article also discusses the concepts of quiet and signaling NaNs.Maximum & MinimumThe exponent range limit puts some constraints on the minimum and the maximum value that can be represented with afloat. The maximum value of that type is 2128− 2104(3.40282347×1038). The biased exponent is one short of maximum value and the significand is all lit up:01111111011111111111111111111111The smallestnormalfloatis 2−126(roughly 1.17549435×10−38). Its biased exponent is set to 1 and the significand is cleared out:00000000100000000000000000000000In C the minimum and maximum values can be accessed withFLT_MINandFLT_MAXmacros respectively. WhileFLT_MINis the smallest normal value, it’s notthesmallest value afloatcan store. We can squeeze things down even more.SubnormalsWhen discussing base-2 scientific notation we assumed the numbers were normalized, i.e. the first digit of the significand was 1:+1.00101100110110001101001×219The range of subnormals (also known as denormals) relaxes that requirement. When the biased exponent is set to 0, the exponent is interpreted as −126 (not−127 despite the bias), and the leading digit is assumed to be 0:+0.00000000000110001101001×2−126The encoding doesn’t change, when performing calculations we just have to remember that this time the implicit bit is 0 and not 1:00000000000000000000110001101001While subnormals let us store smaller values than the minimum normal value, it comes at the cost of precision. As the significand decreases we effectively have fewer bits to work with which is more apparent after normalization:The classic example for the need for subnormals is based on simple arithmetic. If two floating point values are equal to each other:Then by simply rearranging the terms it follows that their difference should be equal to 0:Without subnormal values that simple assumption would not be true! Considerxset to a valid normalfloatnumber:+1.01100001111101010000101×2−124Andyas:+1.01100000011001011100001×2−124The numbers are distinct (observe the last few bits of significand). Their difference is:Which is outside of the normal range of afloat(notice the exponent value smaller than −126). If it wasn’t for subnormals the difference after rounding would be equal to 0, thus implying the equality of not equal numbers.On a historical note, subnormals were very controversial part of the IEEE 754 standardization process, you can read about it more in“An Interview with the Old Man of Floating-Point”.Due to the fixed number of bits in the significand floating point numbers can’t store arbitrarily precise values. Moreover, the exponential part causes the distribution of values in afloatto be uneven. In the picture below each tick on the horizontal axis represents a unique float value:ChunkyfloatvaluesNotice how the powers of 2 are special – they define the transition points for the change of “chunkiness”. The distance between representablefloatvalues in between neighboring powers of two (i.e. between 2nand 2n + 1) are constant and we can jump between them by changing the significand by 1 bit.The larger the exponent the “larger” the 1 bit of significand is. For example, the number 0.5 has the exponent value of −1 (since 2−1is 0.5) and 1 bit of its significand jumps by 2−24. For the number 1.0 the step is equal to 2−23. The width of the jump at 1.0 has a dedicated name –machine epsilon. For afloatit can be accessed viaFLT_EPSILONmacro.Starting at 223(decimal value of 8388608) increasing significand by 1 increases the decimal value of float by 1.0. As such, 224(16777216 in base-10) is the limit of the range of integers that can be stored in afloatwithout omittinganyof them. The next float has the value of 16777218, the value of 16777217 can’t be represented in afloat:The end of the gapless regionNote that the type can handlesomelarger integers as well, however, 224defines the end of the gapless region.With a fixed exponent increasing the significand by 1 bit jumps between equidistant float values, however, the format has more tricks up its sleeve. Consider 2097151.875 stored in afloat:01001001111111111111111111111111Ignoring the division into three parts for a second, we can think of the number as a string of 32 bits. Let’s try interpreting them as a 32-bit unsigned integer:01001001111111111111111111111111As a quick experiment, let’s add one to the value…01001010000000000000000000000000…and put the bits verbatim back into thefloatformat:01001010000000000000000000000000We’ve just obtained the value of 2097152.0 which is the next representablefloat– the type can’t storeanyother values between this and the previous one.Notice how adding one overflowed the significand and added one to the exponent value. This is the beauty of putting the exponent partbeforethe significand. It lets us easily obtain the next/previous representable float (away/towards zero) by simply increasing/decreasing its raw integer value.Incrementing the integer representation of the maximumfloatvalue by one? You get infinity. Decrementing the integer form of the minimumfloat? You enter the world of subnormals. Decrease it for the smallest subnormal? You get zero. Things fall into place just perfectly. The two caveats with this trick is that it won’t jump from +0.0 to −0.0 and vice versa, moreover, infinities will “increment” to NaNs, and the last NaN will increment to zero.So far we’ve focused our discussion on afloat, but its popular bigger cousindoubleand the less commonhalfare also worth looking at.DoubleIn base-2 scientific notation adoublehas 53 digits of significand and exponent range of [−1022, +1023] resulting in an encoding with 11 bits dedicated to exponent and 52 bits to significand to form a 64-bit encoding:1011111101001011000101101101100100111101101110100010001101101000HalfHalf-float is used relatively often in computer graphics. In base-2 scientific notation ahalfhas 11 digits of significand and exponent range of [−14, +15] resulting in an encoding with 5 bits dedicated to exponent and 10 bits to significand creating a 16-bit type:halfis really compact, but also has very small range of representable values. Additionally, given only 5 bits of the exponent, almost 1/32 of the possiblehalfvalues are dedicated to NaNs.Larger TypesIEEE 754 specifies128-bit floating point format, however, native hardware support isvery limited. Some compilers willlet you use itwhen__float128type is used, but the operations are usually done in software.The standard also suggests equations for obtaining the number of exponent and significand bits in higher precision formats (e.g.256-bit), but I think it’s fair to say those are rather impractical.Same BehaviorWhile all IEEE 754 types have different lengths, they all behave the same way:±0.0 always has all the bits of the exponent and the significand set to zero±infinity has all ones in the exponent and all zeros in the significandNaNs have all ones in the exponent and a non-zero significandthe encoded exponent of subnormals is 0The only difference between the types is in how many bits they dedicate to the exponent and to the significand.While in practice many floating point calculations are performed using the same type throughout, a type change is often unavoidable. For example, JavaScript’sNumberis just adouble, however, WebGL deals withfloatvalues. Conversions to a larger and a smaller type behave differently.Conversion to a Larger TypeSince adoublehas more bits of the significand and of the exponent than afloatand so does afloatcompared to ahalfwe can be sure that converting a floating-point value to a higher precision type will maintain the exact stored value.Let’s see how this pans out for ahalfvalue of 234.125. Its binary representation is:The same number stored in afloathas the following representation:01000011011010100010000000000000And in adouble:0100000001101101010001000000000000000000000000000000000000000000Note that the new significand bits in a larger format are filled with zeros which simply follows from scientific notation. The new exponent bits are filled with 0s when the highest bit is 1, and with 1s when the highest bit is 0 (you can see it by changing type e.g. for0.11328125) – a result of unbiasing the value with original bias then biasing again with the new bias.Conversion to a Smaller TypeThe following should be fairly unsurprising, but it’s worth going through an example. Consider adoublevalue of−282960.039306640625:1100000100010001010001010100000000101000010000000000000000000000When converting to afloatwe have to account for the significand bits that don’t fit which is by default done usingrounding-to-nearest-evenmethod. As such, the same number stored in afloathas the following representation:11001000100010100010101000000001The decimal value of this float is−282960.03125, i.e. a different number than the one stored in adouble. Converting to ahalfproduces:What happened here? The exponent value of 18 that fits perfectly fine in afloatis too large for the maximum exponent of 15 that ahalfcan handle and the resulting value is −infinity.Converting from a higher to a lower precision floating point type will maintain the exact value if the significand bits that don’t fit in the smaller type are 0sandthe exponent value can be represented in the smaller type. If we were to convert the previously examined234.125from adoubleto afloator to ahalfit would keep its exact value in all three types.A Sidenote on RoundingWhileround-half-up(“If the fraction is .5 – round up”) is the common rounding rule used in everyday life, it’s actually quite flawed. Consider the results of the following made up survey:725 responders said their favorite color isred275 responders said their favorite color isgreenThe distribution of votes is 72.5% and 27.5% respectively. If we wanted to round the percentages to integer values and were to use round-half-up we’d end up with the following outcome: 73% and 28%. To everyone’s dissatisfaction we just made the survey results add up to 101%.Round-to-nearest-even solves this problem by, unsurprisingly, rounding to nearest even value. 72.5% becomes 72%, 27.5% becomes 28%. The expected sum of 100% is restored.Conversion of Special ValuesNeither NaNs nor infinities follow the usual conventions. Their special rule is very straightforward: NaNs remain NaNs and infinities remain infinities in all the type conversions.Working with floating point numbers often requires printing their value so that it can be restored accurately — every bit should maintain its exact value. When it comes toprintf-style formatting characters,%fand%eare commonly used. Sadly, they often fail to maintain enough precision:1234567floatf0=3.0080111026763916015f;floatf1=3.0080118179321289062f;printf("%f\n",f0);printf("%f\n",f1);printf("%e\n",f0);printf("%e\n",f1);Produces:3.008011
3.008011
3.008011e+00
3.008011e+00However, those two floating point numbers arenotthe same and store different values.f0is:01000000010000001000001101000001Andf1differs fromf0by 3:01000000010000001000001101000100The usual solution to this problem is to specify the precision manually to the maximum number of digits. We can useFLT_DECIMAL_DIGmacro (value of 9) for this purpose:12345floatf0=3.0080111026763916015f;floatf1=3.0080118179321289062f;printf("%.*e\n",FLT_DECIMAL_DIG,f0);printf("%.*e\n",FLT_DECIMAL_DIG,f1);Yields:3.008011102e+00
3.008011817e+00Unfortunately, it will print the long form even for simple values, e.g.3.0fwill be printed as3.000000000e+00. It seems thatthere is no wayto configure the printing of floating point values to automatically maintain exact number ofdecimaldigits needed to accurately represent the value.Hexadecimal FormLuckily, hexadecimal form comes to the rescue. It uses%aspecifier and prints the shortest, exact representation of floating point number in a hexadecimal form:12345floatf0=3.0080111026763916015f;floatf1=3.0080118179321289062f;printf("%a\n",f0);printf("%a\n",f1);Produces:0x1.810682p+1
0x1.810688p+1The hexadecimal constant can be used verbatim in code or as an input toscanf\strtofon any reasonable compiler and platform. To verify the results we can fire up LLDB one more time:(lldb) p 0x1.810682p+1f
(float) $0 = 3.0080111

(lldb) p 0x1.810688p+1f
(float) $1 = 3.00801182

(lldb) p/t *(uint32_t *)&$0
(uint32_t) $2 = 0b01000000010000001000001101000001

(lldb) p/t *(uint32_t *)&$1
(uint32_t) $3 = 0b01000000010000001000001101000100The hexadecimal form is exact and concise – each set of four bits of the significand is converted to the corresponding hex digit. Using our example values:1000becomes8,0001becomes1and so on. An unbiased exponent just follows the letterp. You can find more details about the%aspecifier in“Hexadecimal Floating-Point Constants”.Nine digits may be enough tomaintainthe exact value, but it’s nowhere near the number of digits required to show the floating point number in itsfulldecimal glory.While not every decimal number can be represented using floating point numbers (the infamous 0.1),everyfloating point number has its own exact decimal representation. The following example is done on ahalfsince it’s much more compact, but the method is equivalent for afloatand adouble.Let’s consider the value of 3.142578125 stored in ahalf:The equivalent value in scientific base-2 notation is:Firstly, we can convert the significand part to an integer by multiplying it by 1:Which we an cleverly expand:To obtain an integer times a power of two:Then we can combine the fractional part with the exponent part:And in decimal form:We can get rid of the power of two by multiplying it by a cleverly written value of 1 yet another time:We can pair every 2 with every 5 to obtain:Putting back all the pieces together we end up with a product of two integers and a shift of a decimal place encoded in the power of 10:10−9×59×1609 = 3.142578125Coincidentally, the trick of multiplying by 5−n×5nalso explains why negative powers of 2 are just powers of 5 with a shifted decimal place (e.g. 1/4 is 25/100, and 1/16 is 625/10000).Even though the exact decimal representation always exists, it’s often cumbersome to use – some small numbers that can be stored in adoublehaveover 760 significant digitsof decimal representation!My article is just a drop in the sea of resources about floating point numbers. Perhaps the most thorough technical write-ups on floating point numbers is“What Every Computer Scientist Should Know About Floating-Point Arithmetic”. While very comprehensive, I find it difficult to get through. Almost five years have passed since I firstmentioned iton this blog and, frankly, I’ve still limited my engagement to mostly skimming through it.One of the most fascinating resources out there isBruce Dawson’s amazingseries of posts. Bruce dives into a ton of details about the format and its behavior. I consider many of his articles a must-read for any programmer who deals with floating point numbers on a regular basis, but if you only have time for one I’d go with“Comparing Floating Point Numbers, 2012 Edition”.Exploring Binarycontains many detailedarticles on floating point format. As a delightful example, itdemonstratesthat the maximum number of significant digits in the decimal representation of afloatis 112, while adoublerequires up to 767 digits.For a different look on floating point numbers I recommendFabien Sanglard’s“Floating Point Visually Explained”– it shows an interesting concept of the exponent interpreted as a sliding window and the significand as an offset into that window.Even though we’re done, I encourage you to go on. Any of the mentioned resources should let you discover something more in the vast space of floating point numbers.The more I learn about IEEE 754 the more enchanted I feel.William Kahanwith the aid of Jerome Coonen and Harold Stone created something truly beautiful and ever-lasting.I genuinely hope this trip through the details of floating point numbers made them a bit less mysterious and showed you some of that beauty.