Lesson 2: IEEE Standards

Published Saturday, February 11, 2006 by Vurdlak

This lesson is the next and final step before we start to code. It explains how real numbers are encoded and stored in a computer using the IEEE formats for standard (single) and double precision. The normalization procedure is shown step by step and is easy to follow.

Representation of Real Numbers in a Computer

Standard precision: 32 bits (4 bytes)

Double precision: 64 bits (8 bytes)

Real Numbers of Standard Precision

Declaration in programming language C:

float

IEEE (Institute of Electrical and Electronics Engineers) standard 754 for the representation of real numbers in standard precision:

P	sign bit (P=1 negative, P=0 positive), 1 bit
Characteristic	binary exponent + 127 (so that no negative exponent has to be stored), 8 bits
Mantissa	normalized (exactly one bit in front of the binary point), 23 bits stored

Example: representation of the decimal number 5.75

5.75₁₀ = 101.11₂ * 2⁰ = 1.0111₂ * 2²

Because the normalized form of every binary number (except zero) looks like 1.xxxxx, the leading 1 carries no information. It is therefore not stored and is referred to as the hidden bit. This frees one extra bit of storage and so gives higher precision.

P = sign = 0 (positive number)

Binary exponent = 2, so K = 2 + 127 = 129 = (1000 0001)₂

Mantissa (whole) 1.0111

Mantissa (without hidden bit) 0111

Result: 0 10000001 01110000000000000000000

or 0100 0000 1011 1000 0000 0000 0000 0000

4 0 B 8 0 0 0 0 (hexadecimal)

Examples:

2 = 10₂ * 2⁰ = 1₂ * 2¹ = 0100 0000 0000 0000 ... 0000 0000 = 4000 0000 hex

P = 0, K = 1 + 127 = 128 (10000000), M = (1.) 000 0000 ... 0000 0000

-2 = -10₂ * 2⁰ = -1₂ * 2¹ = 1100 0000 0000 0000 ... 0000 0000 = C000 0000 hex

Same as 2, but P = 1

4 = 100₂ * 2⁰ = 1₂ * 2² = 0100 0000 1000 0000 ... 0000 0000 = 4080 0000 hex

Same mantissa as 2, but BE = 2, K = 2 + 127 = 129 (10000001)

6 = 110₂ * 2⁰ = 1.1₂ * 2² = 0100 0000 1100 0000 ... 0000 0000 = 40C0 0000 hex

1 = 1₂ * 2⁰ = 0011 1111 1000 0000 ... 0000 0000 = 3F80 0000 hex

K = 0 + 127 (01111111).

.75 = 0.11₂ * 2⁰ = 1.1₂ * 2^-1 = 0011 1111 0100 0000 ... 0000 0000 = 3F40 0000 hex

Special Case - 0:

Normalization of the number 0 cannot produce the form 1.xxxxx

0 = 0 00000000 0000 ... 0000 (all bits zero; read literally the pattern would mean 1.0₂ * 2^-127, but it is reserved for zero)

Range and precision of Real Numbers:

For a real number of standard precision, the characteristic (8 bits) can take values in the interval [0,255].

K = 0 is reserved for zero (and for denormalized numbers)

K = 255 is reserved for infinity (and NaN)

Since BE = K - 127, BE lies in the interval [-126,127].

Smallest positive normalized number that can be represented:

1.0₂ * 2^‑126 ≈ 1.175494350822*10^‑38

and the largest is:

1.11111111111111111111111₂ * 2¹²⁷ ≈ 3.402823466385*10³⁸ (just below 2¹²⁸ ≈ 3.402823669209*10³⁸)

Precision: 24 binary digits

2²⁴ ≈ 10^x  →  24 log 2 ≈ x log 10  →  x ≈ 24 log₁₀ 2 ≈ 7.224719895936

so roughly the first 7 decimal digits are reliable.

[Figure: representation on the number line]

Numerical error:

Not all bits can be used during a calculation:

Example: 0.0001₁₀ + 0.9900₁₀

0.0001₁₀ : (1.)10100011011011100010111₂ * 2^-14

0.9900₁₀ : (1.)11111010111000010100011₂ * 2^-1

When adding, the binary points must be aligned one underneath the other:

  .000000000000011010001101 * 2⁰   (only 11 of the 24 bits are left!)
+ .111111010111000010100011 * 2⁰
= .111111010111011100110000 * 2⁰ = 0.9900999069214₁₀

The exact sum would be 0.9901.

Real Numbers of Double Precision

Declaration in programming language C:

double

P	sign bit (P=1 negative, P=0 positive), 1 bit
Characteristic	binary exponent + 1023 (11 bits)
Mantissa	normalized (52 stored bits + 1 hidden bit).

Range:

K ∈ [0,2047].

K = 0 is reserved for zero (and for denormalized numbers)

K = 2047 is reserved for infinity (and NaN)

BE = K - 1023

BE ∈ [-1022,1023]

Smallest positive normalized number that can be represented:

1.0₂ * 2^‑1022 ≈ 2.225073858507*10^‑308

and the largest is:

1.1111.....111111₂ * 2¹⁰²³ ≈ 2¹⁰²⁴ ≈ 1.797693134862316*10³⁰⁸

Precision: 53 binary digits

2⁵³ ≈ 10^x  →  53 log 2 ≈ x log 10  →  x ≈ 53 log₁₀ 2 ≈ 15.95458977019

so nearly the first 16 decimal digits are reliable.

There is also:

long double 80 bits (the x87 extended format; the actual size depends on the compiler)

Characteristic: 15 bits

Binary exponent: Characteristic - 16383

Real constants

1.	2.34	9e-8	8.345e+25	double

2.f	2.34F	-1.34e5f	float

1.L	2.34L	-2.5e-37L	long double


9 Responses to “Lesson 2: IEEE Standards”

Anonymous on 4:14 AM

sorry for delay...reupped lesson 2 with pictures for better understanding (float and double precision display)
Anonymous on 9:26 PM

I am searching for a site like this but has materials/tutorials for writing J2SE Java App. I will be grateful if anyone can help me
Anonymous on 10:48 AM

Nice short overview. Thx for that

the polarizer
Anonymous on 1:30 PM

Guess I'll have to look elsewhere for a programming tutorial. This one assumes the user already understands many concepts.
Anonymous on 1:16 AM

Well, actually it doesn't. Try following it from Lesson 1 (found on this website) and take a few days for all the tutorials. Go slowly through one lesson at a time, and when you're stuck, just post the question, and I'll be more than glad to help you! It was written in a Way so you don't need to have any pre programming knowledge.
Anonymous on 8:28 PM

[quote]
It was written in a Way so you don't need to have any pre programming knowledge.
[/quote]

binary exponent, mantissa, real number and normalised are not terms that you often hear down the pub. I agree with chris, this is more of a reference manual for seasoned programmers than a way to introduce basic concepts to a newbie.
Anonymous on 12:29 PM

spent a couple hours over lessons 1 + 2 and cross ref Wikipedia for definitions and further examples - all hangs together; thanks
Anonymous on 8:14 PM

I kinda have to agree with the others. I have spent hours looking over lessons 1 & 2 and I don't understand it. You need more definitions. Thanks for the other lessons though, they are good.

Grath
Anonymous on 3:57 AM

Hi!

I'm looking at a pretty simple problem that involves reading numbers from a file and storing them in a different file with a small amount of processing and in somewhat different order, but in IEEE standard integer and floating point formats. I'd like to use a compiler that already stores the numbers in the correct formats, so that in most cases I can just copy the bytes, but in a few cases some bit fiddling will still be needed. So the question is, which Windows XP based C and C++ compilers that support the IEEE standard number formats would you recommend?

Most of my programming experience is with DOS based Turbo Pascal, but I did one big application in DOS based C a long time ago, and have intended to upgrade to a Windows based C compiler for years, but until now never had any applications (all small and simple) for which C would be better than my very old versions of Turbo Pascal. So now is the time to upgrade because the application would be much more convenient to write with a C compiler that already stores the numbers in the proper format. With the Turbo Pascal compiler, I would have to reformat every number, which is very easy for the integers, and not too hard for the floats, but why waste time this way? The switch to C is well overdue.
