Standard precision: 32 bits (4 bytes)
Double precision: 64 bits (8 bytes)
Real Numbers of Standard Precision
Declaration in the C programming language:
float
IEEE 754 layout (32 bits):
P | sign bit (P=1 negative, P=0 positive) |
Characteristic | binary exponent + 127 (8 bits; the offset avoids storing a negative exponent) |
Mantissa | normalized, with only one bit in front of the binary point (23+1 bits) |
Example: representation of the decimal number 5.75
5.75_10 = 101.11_2 * 2^0 = 1.0111_2 * 2^2
Because every normalized binary number (except zero) has the form 1.xxxxx, the leading 1 carries no information. It is therefore not stored in the computer and is called the hidden bit. This frees one extra bit for the mantissa and so allows higher precision.
P (sign) = 0 (positive number)
Binary exponent = 2, so K = 2 + 127 = 129 = (1000 0001)_2
Mantissa (whole) 1.0111
Mantissa (without hidden bit) 0111
Result: 0 10000001 01110000000000000000000
or 0100 0000 1011 1000 0000 0000 0000 0000
4 0 B 8 0 0 0 0 (hexadecimal)
Examples:
2 = 10_2 * 2^0 = 1_2 * 2^1 = 0100 0000 0000 0000 ... 0000 0000 = 4000 0000 hex
P = 0, K = 1 + 127 = 128 (10000000), M = (1.) 000 0000 ... 0000 0000
-2 = -10_2 * 2^0 = -1_2 * 2^1 = 1100 0000 0000 0000 ... 0000 0000 = C000 0000 hex
Same as 2, but P = 1
4 = 100_2 * 2^0 = 1_2 * 2^2 = 0100 0000 1000 0000 ... 0000 0000 = 4080 0000 hex
Same mantissa, BE = 2, K = 2 + 127 = 129 (10000001)
6 = 110_2 * 2^0 = 1.1_2 * 2^2 = 0100 0000 1100 0000 ... 0000 0000 = 40C0 0000 hex
1 = 1_2 * 2^0 = 0011 1111 1000 0000 ... 0000 0000 = 3F80 0000 hex
K = 0 + 127 = 127 (01111111).
0.75 = 0.11_2 * 2^0 = 1.1_2 * 2^-1 = 0011 1111 0100 0000 ... 0000 0000 = 3F40 0000 hex
Special case: 0
Normalizing the number 0 cannot produce the form 1.xxxxx, so zero is stored as all zero bits:
0 = 0 00000000 0000 ... 0000, which would otherwise be read as 1.0_2 * 2^-127
Range and precision of real numbers:
For a real number of standard precision, the characteristic (8 bits) lies in the interval [0, 255].
K = 0 is reserved to represent zero
K = 255 is reserved to represent infinity
Since BE = K - 127, the binary exponent of a normalized number lies in the interval [-126, 127].
The smallest positive normalized number different from zero that can be represented is:
1.0_2 * 2^-126 ~ 1.175494350822 * 10^-38
and the biggest is:
1.11111111111111111111111_2 * 2^127 ~ 2^128 = 3.402823669209 * 10^38
Precision: 24 binary digits
2^24 ~ 10^x, so 24 log 2 ~ x log 10, which gives x ~ 24 log 2 ~ 7.224719895936
so about the first 7 decimal digits are valid.
Representation on the number line: [figure]
Numerical error:
Not all bits can be used in a calculation:
Example: 0.0001_10 + 0.9900_10
0.0001_10 : (1.)10100011011011100010111_2 * 2^-14
0.9900_10 : (1.)11111010111000010100011_2 * 2^-1
When adding, the binary points must be aligned one under the other:
 .000000000000011010001101 * 2^0   (only 11 of 24 bits!)
+.111111010111000010100011 * 2^0
=.111111010111011100110000 * 2^0 = 0.9900999069214_10
Real Numbers of Double Precision
Declaration in the C programming language:
double
P | sign bit (P=1 negative, P=0 positive) |
Characteristic | binary exponent + 1023 (11 bits) |
Mantissa | normalized (52+1 bits) |
Range:
K lies in [0, 2047].
K = 0 is reserved to represent zero
K = 2047 is reserved to represent infinity
BE = K - 1023
BE lies in [-1022, 1023]
The smallest positive normalized number different from zero that can be represented is:
1.0_2 * 2^-1022 ~ 2.225073858507 * 10^-308
and the biggest is:
1.1111.....111111_2 * 2^1023 ~ 2^1024 = 1.797693134862316 * 10^308
Precision: 53 binary digits
2^53 ~ 10^x, so 53 log 2 ~ x log 10, which gives x ~ 53 log 2 ~ 15.95458977019
so nearly the first 16 decimal digits are valid.
There is also:
long double 80 bits
Characteristic: 15 bits
Binary exponent: Characteristic - 16383
Real constants
1. 2.34 9e-8 8.345e+25 double
2.f 2.34F -1.34e5f float
1.L 2.34L -2.5e-37L long double
sorry for delay...reupped lesson 2 with pictures for better understanding (float and double precision display)
I am searching for a site like this but has materials/tutorials for writing J2SE Java App. I will be grateful if anyone can help me
Nice short overview. Thx for that
the polarizer
Guess I'll have to look elsewhere for a programming tutorial. This one assumes the user already understands many concepts.
Well, actually it doesn't. Try following it from Lesson 1 (found on this website) and take a few days for all the tutorials. Go slowly through one lesson at a time and when you're stuck, just post the question, and I'll be more than glad to help you! It was written in a Way so you don't need to have any pre programming knowledge.
[quote]
It was written in a Way so you don't need to have any pre programming knowledge.
[/quote]
binary exponent, mantissa, real number and normalised are not terms that you often hear down the pub. I agree with chris, this is more of a reference manual for seasoned programmers than a way to introduce basic concepts to a newbie.
spent a couple hours over lessons 1 + 2 and cross ref Wikipedia for definitions and further examples- all hangs together; thanks
I kinda have to agree with the others. I have spent hours looking over lessons 1 & 2 and I don't understand it. You need more definitions. Thanks for the other lessons though, they are good.
Grath
Hi!
I'm looking at a pretty simple problem that involves reading numbers from a file and storing them in a different file with a small amount of processing and in somewhat different order, but in IEEE standard integer and floating point formats. I'd like to use a compiler that already stores the numbers in the correct formats, so that in most cases I can just copy the bytes, but in a few cases some bit fiddling will still be needed. So the question is, which Windows XP based C and C++ compilers that support the IEEE standard number formats would you recommend?
Most of my programming experience is with DOS based Turbo Pascal, but I did one big application in DOS based C a long time ago, and have intended to upgrade to a Windows based C compiler for years, but until now never had any applications (all small and simple) for which C would be better than my very old versions of Turbo Pascal. So now is the time to upgrade, because the application would be much more convenient to write with a C compiler that already stores the numbers in the proper format. With the Turbo Pascal compiler, I would have to reformat every number, which is very easy for the integers, and not too hard for the floats, but why waste time this way? The switch to C is well overdue.