
CHAPTER 2
What is a Number?
In the study of numerical methods, one would like to clarify what a number is. First, we remind
the reader of some fundamental mathematics. Next, we consider integers and operations on integers.
We also consider real numbers and some computer approximations thereof. Finally, we examine how
operations are performed on these approximations, and why certain errors arise.
1. Two Views of a Number
1.1. Ring Theory. In mathematics, there are many approaches to numbers. In Algebra, there is
a well-developed theory of particular sets called rings. Elements of a ring can be added and
multiplied, subject to a number of rules:
• (a + b) + c = a + (b + c) for all a, b, c ∈ R (“associativity of +”)
• a + b = b + a for all a, b ∈ R (“commutativity of +”)
• There is 0 ∈ R, such that a + 0 = a for all a ∈ R (“additive identity element”)
• For each a ∈ R, there is −a ∈ R, such that a + (−a) = 0 (−a is the “additive inverse” of a)
• (a · b) · c = a · (b · c) for all a, b, c ∈ R (“associativity of ·”)
• There is 1 ∈ R, such that a · 1 = a and 1 · a = a for all a ∈ R (“multiplicative identity element”)
• a · (b + c) = (a · b) + (a · c) for all a, b, c ∈ R (“left distributivity”)
• (b + c) · a = (b · a) + (c · a) for all a, b, c ∈ R (“right distributivity”).
Common examples of rings are the integers, the real numbers, and the complex numbers. One could hence see “numbers” as elements of a ring. Unfortunately, the most common representation of numbers on computers
does not follow these rules:
x = 1.0 + (0.001 - 1.0)
y = (1.0 + 0.001) - 1.0
print(x, y, x == y)             # False: the two results differ
print("%.16f %.16f" % (x, y))
Listing 2.1. Floating-point numbers violate associativity of +
1.2. Positional Number Systems. From grade school, we encounter positional number systems.
For example,
422 = 4 × 10^2 + 2 × 10^1 + 2 × 10^0.
The number system is called positional because a digit’s meaning depends on its position: e.g., the digit
2 above stands once for 20 = 2 × 10^1 and once for 2 = 2 × 10^0. In general, we will represent a number by a
sequence of digits (d_n d_{n−1} . . . d_1 d_0), where d_i ∈ S. Here S is a finite set of symbols; its size |S| = b is called
the base of the system. The decimal number system uses 10 symbols: S = {0, 1, . . . , 9}. An integer N is
represented as follows:
N = (d_n d_{n−1} . . . d_0)_{10} = Σ_{i=0}^{n} d_i b^i = d_n × 10^n + d_{n−1} × 10^{n−1} + · · · + d_0 × 10^0.
Computers store information in two-state devices, i.e., devices that are either on or off. Such a device
is said to contain 1 binary digit (bit) of information. Since computers store information in two-state
devices, they use the binary number system to represent numbers. The binary system uses the symbols
S = {0, 1} and so has b = 2.
Example 2.1 (Binary Numbers). The binary number N = 1001101_2 in expanded form is
N = 1 × 2^6 + 0 × 2^5 + 0 × 2^4 + 1 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0
  = 64 + 0 + 0 + 8 + 4 + 0 + 1
  = 77_{10}.
♦
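Such conversions are easy to check in Python, whose built-ins int (with an explicit base) and bin translate in both directions:

print(int('1001101', 2))   # 77: parse a base-2 digit string
print(bin(77))             # '0b1001101': back to a binary string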
Since each bit can have two states, a string of n bits can have 2 · 2 · · · 2 = 2^n distinct bit patterns. For
example,
• a byte of 8 bits can contain one of 2^8 different bit patterns and so can represent 2^8 = 256
different things;
• in older computers (IA-32), a word of 4 bytes or 32 bits could represent 2^{32}, or about 4 billion,
different things;
• in most computers since 2003 (x86-64) and many mobile phones since 2013 (ARMv8-A), a word
of 8 bytes or 64 bits could represent 2^{64}, or about 18,446,744,073 billion, different things;
• in some computers, a word of 16 bytes or 128 bits could represent 2^{128} different things.
There are many ways to interpret a string of bits; computers do not use a plain positional system,
especially beyond integers.
2. Integers
We may view computer memory as a set of boxes called words, each of which contains w bits. To store
an integer in a word, we allow the first bit to represent the sign of the number: 0 = + and 1 = −. The
remaining w − 1 bits are used to represent the magnitude of the number. This means we can represent
2^{w−1} different magnitudes along with their signs in a w-bit word.
Before we can actually store integers (or indeed any information) in a word we have to choose an encoding
or representation that maps the 2^w bit patterns to a subset of the integers. We prefer encodings that:
• are one-one correspondences (exactly one encoding for each integer represented);
• make it easy to find the encoding of −x, given the encoding of x.
For example, if w = 3 we could have the following encodings (labelled E1–E5):
Bit pattern:      000  001  010  011  100  101  110  111
E1 Unsigned:        0    1    2    3    4    5    6    7
E2 Signed:         +0   +1   +2   +3   −0   −1   −2   −3
E3 Fixed-point:     0  1/8  1/4  3/8  1/2  5/8  3/4  7/8
E4 1's compl.:     +0   +1   +2   +3   −3   −2   −1   −0
E5 2's compl.:      0   +1   +2   +3   −4   −3   −2   −1
In what follows, let the notation [x] stand for the encoding of x. For example, in E1, we have [3] = 011.
• In unsigned, we represent only non-negative integers x, without the sign.
• In signed, we represent the absolute value of x by the w − 1 bits d_{w−2} d_{w−3} . . . d_0. The sign
is represented by d_{w−1}. If the integer x is non-negative we let d_{w−1} = 0; if x is non-positive
we let d_{w−1} = 1. The representation of minus 0 is questionable and might be considered
inadmissible (this is not a one-one correspondence).
• In fixed point encoding, we again represent only non-negative numbers, but this time fractions
in the range 0 to 7/8 in steps of 1/8 (regard the three bits as being after the “decimal” point,
e.g., 0.101_2 = 1 × 2^{−1} + 0 × 2^{−2} + 1 × 2^{−3} = 5/8). Notice that in the fixed point encoding,
unlike in the floating point encoding introduced in the next section, the decimal point is always
in the same place.
• In ones complement or representation modulo 2^w − 1, given an integer x such that
−2^{w−1} + 1 ≤ x ≤ 2^{w−1} − 1,
we let
[x] =  x,             if x > 0,
       x + 2^w − 1,   if x < 0,
       0 or 2^w − 1,  if x = 0.
The existence of two zeros (incl. minus 0 encoded by 2^w − 1) is questionable.
Given a representation of a non-negative integer x, one obtains a representation of −x as
follows: replace all zeros by ones and all ones by zeros. This is called the ones complement
representation of negative integers.
• In twos complement or representation modulo 2^w, given an integer x such that
−2^{w−1} ≤ x ≤ 2^{w−1} − 1,
we let
[x] =  x,          if x ≥ 0,
       x + 2^w,    if x < 0.
Given a representation of a non-negative integer x, one obtains a representation of −x as follows:
replace all zeros by ones and all ones by zeros (i.e., apply the logic operation not to each bit),
then add 1. That is, do a ones complement and add 1, carrying to the left if necessary.
The name twos complement for this representation of negative integers is not perfect: only
the rightmost bits up to and including the least significant one are complemented with respect
to 2 (i.e., left unchanged) and all the other bits are complemented with respect to 1.
The twos complement representation is popular as it allows subtraction of one binary
number from another using only bitwise not and the addition operation. This is useful in
microprocessors for which subtraction is not available or would be too complex to perform
efficiently.
Note that there is one extra negative number, −4 in this case (it would be −128 if w = 8,
or −2^{31} if w = 32), for which there is no corresponding positive number.
Here are the numbers −4, −3, . . . , 1, 2, 3 with their representations on a 32-bit PC, encoded as a (signed)
int. This is a 32-bit twos complement encoding.
-4 : 11111111111111111111111111111100
-3 : 11111111111111111111111111111101
-2 : 11111111111111111111111111111110
-1 : 11111111111111111111111111111111
 0 : 00000000000000000000000000000000
 1 : 00000000000000000000000000000001
 2 : 00000000000000000000000000000010
 3 : 00000000000000000000000000000011
For example, to get the representation of −12, start with that of 12:

     000···0001100    (start with 12)
−→   111···1110011    (bitwise not)
−→   111···1110100    (add 1, carrying bit to left, giving −12)
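As a small sketch (the helper twos_complement below is ours, not a library function), both the modulo-2^w definition and the not-then-add-1 recipe can be checked in Python:

def twos_complement(x, w=8):
    # representation modulo 2**w, exactly as in the definition above
    return format(x % (1 << w), '0{}b'.format(w))

x = 12
not_x = x ^ ((1 << 8) - 1)         # bitwise not within w = 8 bits
print(twos_complement(-x))         # 11110100
print(twos_complement(not_x + 1))  # 11110100: ones complement plus 1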
It should be clear from these 5 example encodings that we may store any type of number in a w-bit
word. All we need is the appropriate encoding.
2.1. The Range and Distribution of Integers. Most computers use encoding E5 above (twos
complement) with the word length w = 32 by default. The 2^{32} integers are in the range
−2^{31} ≤ i ≤ +2^{31} − 1
or, approximately,
−2 × 10^9 ≤ i ≤ +2 × 10^9.
These integers are distributed uniformly across this range. If the result of an arithmetic operation is
an integer outside this range then an integer overflow occurs. The reaction of the computer to an
integer overflow depends on the machine-compiler system used. For example, if the result of an integer
calculation is i = 2^{31} + 1, then some systems will silently wrap it around to −2^{31} + 1 and continue, some will give a
warning, and some will give a fatal error.
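Python's own int is arbitrary-precision and never overflows, but the wrap-around behaviour can be observed with numpy's fixed-width integers (a sketch assuming numpy is installed):

import numpy as np

i = np.int32(2**31 - 1)   # largest 32-bit twos complement integer
print(i + np.int32(1))    # silently wraps to -2147483648 (numpy may emit a warning)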
3. Real Numbers
Beyond integers, one often needs to approximate numbers with potentially infinite binary and decimal
representations (e.g., 1/3 = 0.333 · · · , √2) on a machine with a finite amount of memory. This often requires
very non-trivial algorithms or some loss of precision.
3.1. Rational Numbers. For rational numbers, which can be expressed as a fraction of two integers, such as 1/3, it is easy to store the two integers. Quite surprisingly, few computer programs actually
do this, because arithmetic over such a representation does not come with hardware
support. Let us illustrate this in Python:
from fractions import Fraction
print(3 * Fraction(1, 3))   # exactly 1: no rounding occurs
Alternatively, one can represent some rational numbers by extending positional number systems. There,
we represent numbers with fractional parts by allowing the index i of the digits d_i to be negative. Thus
N with a fractional part is represented as follows:
N = (d_n d_{n−1} . . . d_0 . d_{−1} . . . d_{−m})_{10} = Σ_{i=−m}^{n} d_i b^i
  = d_n × 10^n + · · · + d_0 × 10^0 + d_{−1} × 10^{−1} + · · · + d_{−m} × 10^{−m}.
The standard way of writing such numbers is to omit m and n and put a decimal point between the
digits d_0 and d_{−1}. To read such a number we must count the number of digits n before the decimal
point, and the number of digits m after the decimal point. If we need to avoid confusion, we may write
the base as a subscript after the number, e.g., 123_{10}.
Example 2.2 (Binary Numbers II). The fractional binary number N = 1001.101_2 in expanded form is
N = 1 × 2^3 + 0 × 2^2 + 0 × 2^1 + 1 × 2^0 + 1 × 2^{−1} + 0 × 2^{−2} + 1 × 2^{−3}
  = 8 + 0 + 0 + 1 + 0.5 + 0 + 0.125
  = 9.625_{10}.
♦
Example 2.3 (Binary Numbers III). The fractional decimal number N = 1.1_{10} in its binary expansion:
N = 1 × 2^0 + 0 × 2^{−1} + 0 × 2^{−2} + 0 × 2^{−3} + 2^{−4} + 2^{−5} + 0 × 2^{−6} + 0 × 2^{−7} + 2^{−8} + 2^{−9} + · · ·
  ≈ 1.00011001100110011001100110011001100110011001100110011001100110
  ≈ 1.0999999999999999999132638262011596452794037759304046630859375.
♦
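One can also inspect the value a Python double actually stores for 1.1; the Decimal constructor accepts a float and displays its exact value:

from decimal import Decimal

print(Decimal(1.1))   # exact value of the double nearest to 1.1
print((1.1).hex())    # its bit pattern: 0x1.199999999999ap+0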
print(1.1 + 2.2, (1.1 + 2.2) == 3.3)
print("%.16f %.16f %.16f" % (1.1, 2.2, 1.1 + 2.2))
Listing 2.2. A motivating example from Python Decimal documentation
Exercise 2.4. What positional number system would it take to encode 1/3 precisely?
3.2. Irrational Numbers. For irrational numbers, such as √2, the situation is very different. No
matter how we pick the positional number system, the digits come from a finite alphabet, and a machine
can only ever store finitely many of them; there are only countably many finite strings over a finite alphabet.
There are, however, uncountably many irrational numbers. Hence, there is no hope of encoding all real
numbers exactly, at the same time, on a digital computer.
One option is to use symbolic representations thereof, but these may slow the computation down considerably:
import sympy

print(sympy.sqrt(8))             # 2*sqrt(2), kept exact and symbolic
x = sympy.symbols('x')
print(sympy.solve(x**2 - 2, x))  # [-sqrt(2), sqrt(2)]
Listing 2.3. First example of the use of sympy.
Alternatively, one has to be content with an approximation. In theory, one can use an approximation
of arbitrary precision. This can be implemented in Python using the module decimal. In practice, there
is a widespread encoding, the floating-point number system, with well-defined error characteristics. This
encoding has been standardised by the Institute of Electrical and Electronics Engineers (IEEE) and is
built into most modern computers.
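For instance, a minimal sketch of arbitrary-precision arithmetic with decimal, here set to 50 significant digits:

from decimal import Decimal, getcontext

getcontext().prec = 50          # work with 50 significant digits
print(Decimal(1) / Decimal(3))  # 0.333... to 50 digits
print(Decimal(2).sqrt())        # sqrt(2) to 50 digits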
3.3. Floating Point Numbers. A floating point number system tries to represent (part of) the
real numbers R in a way that a computer can handle, and which is useful over as wide a range of numbers
as possible. All floating point number systems represent a real number x as
x = ±.m × b^e = ±( m_1/b^1 + m_2/b^2 + · · · + m_t/b^t ) × b^e = ±.(m_1 m_2 . . . m_t)_b × b^e,
where m is called the mantissa, e is called the exponent, t is the precision or number of digits in the
mantissa, and b is the base. For example, the decimal number 321.456 has the floating point representation
+.321456 × 10^{+3}.
Each digit m_i of the mantissa is in the range 0 ≤ m_i ≤ b − 1. If m_1 ≠ 0 then the number is called
normalised; otherwise, it is called subnormal. The exponent must lie in the range e_min ≤ e ≤ e_max. Both
the mantissa and the exponent are represented as integers.
The number 0.0 is represented as
0.0 = .(00 . . . 0)_b × b^0.
This is a subnormal number because m_1 = 0.
3.4. Storage of Floating Point Numbers. A floating point number has a sign, a mantissa, and
an exponent (whose sign is not explicitly written). Thus, the number x = (±, m, e) has three parts and
these are stored in a w-bit word. The sign occupies 1 bit, the mantissa t bits, and the exponent w − t − 1
bits.
So that the exponent field can represent both positive and negative exponents, we add a bias to the actual
exponent, to ensure that the stored exponent is always non-negative. This simplifies the comparison of
exponents.
The actual (unbiased) exponent is needed for input and output only. We allow for half the possible
exponents to be positive, half to be non-positive. Thus for an exponent of w − t − 1 bits, the bias is the
largest integer representable with w − t − 2 bits, that is, 2^{w−t−2} − 1.
A number that can be represented exactly in a floating point number system is called representable.
Because there is a finite number of these representable numbers, there are infinitely many numbers that
cannot be represented exactly.
For example, the number x = 221.4922 = 0.2214922 × 10^3 cannot be represented in the base-10 system
with a 4-digit mantissa and e_min = −8, e_max = 8, because x has a 7-digit mantissa. To store this number
the mantissa must be reduced to 4 digits. The simplest way to do this is to chop off the last 3 digits
and store the number x̂ = 0.2214 × 10^3, which is representable. The (more commonly used) alternative
to chopping is to round the number up or down to the nearest representable number, thus storing x as
x̂ = 0.2215 × 10^3.
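Both behaviours can be reproduced with the decimal module, which lets one pick the rounding mode explicitly:

from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

x = Decimal('0.2214922')
print(x.quantize(Decimal('0.0001'), rounding=ROUND_DOWN))     # 0.2214 (chopping)
print(x.quantize(Decimal('0.0001'), rounding=ROUND_HALF_UP))  # 0.2215 (rounding)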
3.5. The Range of Floating-Point Numbers. The magnitudes of the floating point numbers
have the range
.min(m) × b^{e_min} ≤ |x| ≤ .max(m) × b^{e_max}.
Now, if x is normalised, then
.min(m) = .(10 . . . 0)_b = b^{−1}
and
.max(m) = .((b − 1)(b − 1) . . . (b − 1))_b = 1 − b^{−t}.
Hence the range of the floating point numbers is
b^{e_min − 1} ≤ |x| ≤ (1 − b^{−t}) × b^{e_max},
or approximately (since 1 − b^{−t} ≈ 1)
b^{e_min − 1} < |x| < b^{e_max}.
Of course, it is possible for the true result of an operation to be so large or so small that it falls outside
the available floating-point range.
• underflow: number too small
• overflow: number too large
Underflow/overflow occurs, roughly speaking, when the result of an arithmetic operation is so small/large
that it cannot be stored in its intended destination format without suffering a rounding error that is
larger than usual.
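Both effects can be seen directly with Python's IEEE doubles:

x = 1e308
print(x * 10)   # inf: overflow past the largest double

y = 5e-324      # smallest positive (subnormal) double
print(y / 2)    # 0.0: underflow to zero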
3.6. The Distribution of Floating-Point Numbers. The distribution of the floating point numbers in this range is not uniform. To see this, consider the simple base-10 floating point system with a
3-digit mantissa, e_min = 0 and e_max = 4. This has the range 10^{−1} ≤ |x| < 10^4, i.e., 0.1 ≤ |x| < 10000.
It also has gaps both large and small: for example, no floating-point numbers exist between the
numbers 2120.0 and 2130.0.
For a given exponent e the mantissa represents 10^t − 10^{t−1} = 900 normalised numbers uniformly distributed between .(10 . . . 0)_b = b^{−1} = .1 and .((b − 1)(b − 1) . . . (b − 1))_b = 1 − b^{−t} = .999.
Hence, for e = 0 we have the 900 numbers uniformly distributed from .100 × 10^0 to .999 × 10^0. The
spacing between these numbers is 1/1000.
For e = 1 we have the 900 numbers uniformly distributed from .100 × 10^1 to .999 × 10^1, i.e., 1 to 10.
The spacing between these numbers is 1/100.
For e = 2 we have the 900 numbers uniformly distributed from .100 × 10^2 to .999 × 10^2, i.e., 10 to 100.
The spacing between these numbers is 1/10.
We can see that, in general, when the exponent is small the spacing between floating point numbers is
small; and when it is large the spacing is large.
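The growth of the spacing with magnitude can be measured directly; a sketch using math.nextafter (available from Python 3.9):

import math

for x in (0.1, 1.0, 10.0, 100.0):
    gap = math.nextafter(x, math.inf) - x  # distance to the next double up
    print(x, gap)                          # the gap grows with |x|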
[Number line with marks at .1, 1, 10, and 100.]
Figure 2.1. The distribution of the floating point numbers.
3.7. Machine Epsilon. The spacing between any pair of adjacent normalised numbers can be
defined in terms of the spacing about 1.0. This spacing is called machine epsilon, ε_m. Machine epsilon
can be defined as
(1) ε_m = the distance between 1.0 and the next higher floating point number, or
(2) ε_m = the smallest number such that 1.0 + ε_m when chopped is distinct from 1.0.
Using the first definition, we have
ε_m = .(00 . . . 1)_b × b^1 = b^{−t} × b^1 = b^{1−t},
and this can be used in the following lemma of [Higham, 2002, p. 41]:
Lemma 2.5 (Floating Point Spacing). The spacing between a normalised floating point number x and an
adjacent normalised number is at least b^{−1} ε_m |x| and at most ε_m |x|, unless x or its neighbour is 0.
This lemma clearly shows that the spacing between adjacent floating point numbers varies with the
magnitude of x. Thus the minimum spacing is b^{e_min − t} and the maximum spacing is b^{e_max − t}.
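For IEEE doubles, machine epsilon can be checked in Python: sys.float_info.epsilon reports 2^{−52}, and the halving loop below estimates the same value under round-to-nearest:

import sys

print(sys.float_info.epsilon)  # 2.220446049250313e-16 = 2**-52

eps = 1.0
while 1.0 + eps / 2 != 1.0:    # halve until 1.0 + eps/2 rounds back to 1.0
    eps /= 2
print(eps)                     # matches sys.float_info.epsilon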
3.8. IEEE Arithmetic. The Institute of Electrical and Electronics Engineers (IEEE) standard
754 was published in 1985. It specifies how hardware and software combine to perform well-defined
operations on standard floating point numbers. Many processors have built-in floating point operations
that include +, −, ×, /, √, along with comparison and conversion operations, on one, two, or three
“formats” of numbers. These are:
• Double 8-byte
– 53 = 52 + 1 significant bits mantissa, 11-bit exponent
– Default in Python, C/C++/Java double, Fortran REAL*8
• Single 4-byte
– 24 = 23 + 1 significant bits mantissa, 8-bit exponent
– e.g., Python numpy.float32, C/C++/Java float, Fortran REAL*4,
• Quad 16-byte
– 113 significant bits mantissa, 15-bit exponent
– The implementation varies. Python may have numpy.float128, C/C++ may have
long double or not, depending on the platform.
Let us illustrate this in Python:
import numpy

x = 1.0                   # IEEE double by default
y = numpy.float32(1.0)    # single
z = numpy.float128(1.0)   # extended/quad, availability is platform-dependent
print(x, y, z)
The focus will be on 8-byte floating-point numbers.
Type     Unit roundoff                      Smallest                          Largest
Single   u_r = 2^{−24} ≈ 5.96 × 10^{−8}     2^{−126} ≈ 1.18 × 10^{−38}        (2 − 2^{−23}) × 2^{127} ≈ 3.40 × 10^{38}
Double   u_r = 2^{−53} ≈ 1.11 × 10^{−16}    2^{−1022} ≈ 2.23 × 10^{−308}      (2 − 2^{−52}) × 2^{1023} ≈ 1.80 × 10^{308}
Quad     u_r = 2^{−113} ≈ 9.63 × 10^{−35}   2^{−16382} ≈ 3.36 × 10^{−4932}    (2 − 2^{−112}) × 2^{16383} ≈ 1.19 × 10^{4932}
Since the sign of floating point numbers is given by the leading bit, the range for negative numbers is
just the negation of the above values.
Although IEEE single precision floating point numbers are becoming less common (outside GPGPUs),
let’s use them to demonstrate the limits of floating-point numbers. Single precision numbers cannot
represent:
(1) Negative numbers less than −(2 − 2^{−23}) × 2^{127} (negative overflow)
(2) Negative numbers greater than −2^{−126} (negative underflow)
(3) Zero (treated as a special case, all bits of both exponent and mantissa set to 0: so you can have
both +0 and −0)
(4) Positive numbers less than 2^{−126} (positive underflow)
(5) Positive numbers greater than (2 − 2^{−23}) × 2^{127} (positive overflow)
Let us explain the reasoning. For IEEE single precision with an 8-bit exponent, the bias is 127 = 2^{8−1} − 1.
For IEEE double precision with an 11-bit exponent the bias is 1023 = 2^{11−1} − 1 (see §3.4). For example, in
single precision, a stored value of 100 would mean a true exponent of 100 − 127 = −27; a stored value
of 150 means an exponent of 150 − 127 = 23. In fact, exponents of −127 (stored as all 0s) and +128
(stored as all 1s) are reserved for special numbers.
The mantissa is composed of an implicit leading bit and the fraction bits. Since the only non-zero digit
in binary is 1, a normalised binary number must look like 1.d_1 d_2 . . . d_n. Thus we do not
need to store the leading 1; the 23 bits (or 52 in double precision) of the mantissa can all be used to store
the part of the number after the “decimal” point. We call this hidden bit normalisation: thus single and
double precision actually have 24 and 53 bits for the mantissa, respectively.
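The sign, biased exponent, and fraction fields can be pulled apart with the struct module; a sketch for single precision:

import struct

bits = struct.unpack('>I', struct.pack('>f', 1.0))[0]  # raw 32 bits of 1.0
sign = bits >> 31
biased_e = (bits >> 23) & 0xFF         # stored exponent; subtract the bias 127
fraction = bits & ((1 << 23) - 1)      # 23 stored bits; the leading 1 is hidden
print(sign, biased_e - 127, fraction)  # 0 0 0, i.e., +1.0...0 * 2**0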
Exercise 2.6. Try to prove that √2 is an irrational number, i.e., that there exist no integers p, q such that
the fraction p/q = √2. How many different proofs can you find?
Exercise 2.7. Implement the sequence x_0 = 4, x_1 = 4.25, x_n = 108 − (815 − 1500/x_{n−2})/x_{n−1}, which
has been suggested by Muller et al. [2010], first using the default floating-point numbers, and second
using Decimal with increasing precision. What is the correct value?
Bibliography
Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied
Mathematics, second edition, 2002. doi: 10.1137/1.9780898718027. URL http://epubs.siam.org/doi/abs/10.1137/1.9780898718027.
Jean-Michel Muller, Nicolas Brisebarre, Florent de Dinechin, Claude-Pierre Jeannerod, Vincent Lefèvre,
Guillaume Melquiond, Nathalie Revol, Damien Stehlé, and Serge Torres. Handbook of Floating-Point
Arithmetic. Birkhäuser Boston, 2010. ISBN 978-0-8176-4704-9.