Floating-point numbers – if you’ve programmed, you’ve probably used these, right? Floating-point numbers are numbers that support decimal points. Now, hopefully we already have an idea of how computers store numbers. If you don’t, there are video resources you can refer to. Watch those first to get an idea of how integers – whole numbers – are stored by computers, then come back here and we’ll talk about decimals.

All right, you’re watching another random Wednesday episode on 0612 TV! Hello and welcome back to another random Wednesday episode! Today, let’s talk floating point. Clearly, decimal numbers are not that easy

to represent. We’re gonna first very briefly talk about fixed-point numbers. You know how, when you’re working with a normal integer, you’ve got your different bits, and each bit basically represents whether you want to switch a particular power of two on or off? So this combination gives you these values, which you add together, and that gives you this answer.

The easiest way to do decimal numbers is to simply change up those powers of two on top. What I can say is: let’s assign a decimal point smack in the middle. The left side of the decimal point is where our normal powers of two start. On the right side, we have negative powers of two. What this means is: this is a half, this is a quarter, this is an eighth, this is a sixteenth. So we can do that, and you can start to have decimal numbers.

But watch what happens! Originally we could represent 0 to 255 with an unsigned 8-bit number, so we had a fairly good range. But now, because of how we’ve reallocated our bits here, they represent a much smaller range of numbers. Because I essentially only have a four-bit integer on the left, I only have 0 to 15, and the right side doesn’t give me all the decimal numbers I could possibly want within this range.

For example, if I wanted 2.5, I could do that: I get 2 on the left side the usual way, and for point five, the 2^-1 bit is switched on – that’s a half. So in that context it’s all well and good. Everything works just fine. But that’s just because I chose an example that worked.
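If you’d like to see this scheme in code, here’s a minimal Python sketch of the 4.4 fixed-point layout from this example (the function name and the 4/4 split are just illustrative):

```python
def fixed_4_4_to_decimal(bits: str) -> float:
    """Interpret an 8-bit string as an unsigned 4.4 fixed-point number."""
    assert len(bits) == 8
    int_part = int(bits[:4], 2)         # weights 2^3 down to 2^0
    frac_part = int(bits[4:], 2) / 16   # weights 2^-1 down to 2^-4
    return int_part + frac_part

print(fixed_4_4_to_decimal("00101000"))  # 2.5 (2 on the left, 1/2 on the right)
```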

Let’s take a look at some other numbers that will not work with this very simple fixed-point scheme. Starting with something like 16: clearly you can’t do that, because you only have four integer bits. The biggest number you can represent is 15, so in this situation there is an overflow. The same idea applies to the decimal portion as well: if you want to represent 2 to the power of -5, you’re out of luck, because there is no such bit. And of course, nothing’s stopping us from having both problems in the same number, so you can get overflows on both ends.

These are just simple examples for this particular fixed-point representation. For all three of these problems, we could technically attempt a fix by simply adding more bits: we have 8 bits in total here, and if we had 16, then of course we’d have more numbers on each side. However, there is still great inflexibility. For example, we wouldn’t have a problem storing this value if we were able to shift the decimal point somewhere else. In fact, as long as it shifts by just one place, this will be OK – but unfortunately these are fixed-point numbers, and that doesn’t happen.

There are other restrictions at play here as well. For example, if you try to represent 0.2 – as a fraction, that’s 1/5 – there is no possible sum of negative powers of 2 that can exactly represent it. This problem runs deep: when we move on to floating-point numbers, it cannot be solved either. So really, we’re just including this here for completeness’ sake.
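You can see this directly in Python – even a 64-bit double can only store the nearest binary approximation of 0.2:

```python
# 1/5 has no finite binary expansion, so the stored value is only close:
print(f"{0.2:.20f}")     # prints more digits than "0.2" – the tail isn't zero
print(0.1 + 0.2 == 0.3)  # False – the rounding errors don't cancel out
```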

So fixed-point numbers are a good step towards having decimal numbers, but not quite good enough, because most numbers cannot be represented properly. This, ladies and gentlemen, is where the floating-point number comes into play.

We’re now going to enter the world of 32-bit numbers, all right? Every number we talk about from this point on is made up of a total of 32 bits – that’s how most computers deal with it anyway. Here’s the idea: instead of using the entire 32 bits to represent one number, we’re going to break those bits into three parts, and they’re going to represent three different numbers in, essentially, a mathematical equation that we can eventually evaluate.

You see, the ingenious way in which a floating-point number makes use of its 32 bits is this: the first bit represents a sign, just like in a signed number. The next eight bits represent something called an exponent, while the remainder is something called a mantissa. We use these three numbers in an equation like this. The sign simply determines whether we have a positive or negative number. The mantissa refers to the body of the number itself, and the exponent is used as a “2 to the power of something”.

What’s really cool about this is that, no matter what the mantissa is, you can play with the exponent and get a very small number or a very big number. That’s why it’s called floating point: the number doesn’t have a fixed decimal point somewhere within the bit string. Instead it uses the exponent, allowing you to shift the decimal point to basically anywhere – and that’s the power.
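As a quick aside, you can inspect this layout yourself. Here’s an illustrative Python sketch (the function name is my own) that packs a number as a 32-bit float and splits out the three fields, using 17.125 – the number this video works through:

```python
import struct

def float32_fields(x: float):
    """Pack x as a 32-bit float, then split the bits 1 / 8 / 23."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # stored (biased) exponent
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits
    return sign, exponent, mantissa

print(float32_fields(17.125))   # (0, 131, 589824)
print(float32_fields(-17.125))  # (1, 131, 589824) – only the sign changes
```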

Now that, in a nutshell, is how floating-point numbers work, and if you only want a surface understanding, we can stop here. But not me – we’re gonna delve even deeper into this. We’re gonna construct our own floating-point number. This is where things get a little bit math-sy and messy at the same time, so prepare yourselves.

How I’m gonna do this is I’m going to fix a number to start off with. I’m not gonna tell you what the number is, but I’m gonna show you the 32-bit bit string. It looks like this, and I’ve already separated the sign, exponent and mantissa parts, so we already have our three parts. Let’s now try to figure out how each part actually works.

The sign is the easiest part: if that bit is 0, the number is positive; if that bit is 1, the number is negative. Done! One third of the problem clear. Let’s move on to the exponent.

The exponent is interesting, because you read it off just like any old unsigned 8-bit number. So let’s go ahead and read it out – using our powers and doing our usual math, we get a number. Now, that’s a huge number for an exponent, and there’s a reason for that. Don’t forget: that field is an unsigned number, but when dealing with floating-point numbers, we also want negative exponents to make small numbers. How the format deals with this – how it reintroduces sign into the equation – is that the stored number is offset. The actual exponent is the value you read off, minus 127. What this means is: if you see the stored value 0, the actual exponent represented is -127. If you see 128, the actual exponent is 1. Hopefully that makes sense. We’ve offset the number so that it can represent both positive and negative exponents; you just have to do a bit of math to recover the actual value you’re supposed to have. So, exponent done.
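In code, recovering the actual exponent is just a subtraction. A tiny Python sketch (the names are illustrative):

```python
BIAS = 127  # offset used by the 32-bit float exponent field

def actual_exponent(stored: int) -> int:
    """Recover the true exponent from the stored 8-bit value."""
    return stored - BIAS

print(actual_exponent(0))    # -127, the smallest stored value
print(actual_exponent(128))  # 1
```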

Let’s now move on to our mantissa – our largest part, consisting of 23 bits. Here’s how the mantissa works. If you cast your mind back to fixed-point numbers, the mantissa works the same way: 2^-1, 2^-2, and so on – it’s all negative powers, starting from -1. So it’s basically our usual bit math again, but this time we need to introduce one more thing, and that’s our 2 to the power of 0. As it turns out, while there is no bit for it, it is on by default. Therefore, no matter what the rest of the mantissa says, we always add 1 to it.

How I’m gonna approach the next step is I’m gonna convert all those numbers into fractions. The reason I convert them into fractions is that I don’t want to do decimal math just yet – remember, we’re discussing how decimals work, so we don’t have decimals to work with. We’ll leave everything as fractions, and what this allows us to do is plug those fractions into our final equation. Remember our final equation: we’ve got our sign, we’ve got 2 to the power of the exponent we’ve calculated, we’ve got our mantissa, and we multiply everything together. Again, I’m gonna do this in terms of fractions until the very end.

Once we solve this part, we end up with one single gigantic fraction that looks like this. Since we have one fraction, that’s essentially a division, which allows us to derive our final value – and that is this decimal value. What we’ve done is work our way from the binary representation all the way back to the original decimal: that bit string up there gives us this decimal value. So that’s pretty cool – we’ve just cracked a floating-point number.
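The whole evaluation can be written as one line of Python. This sketch handles normal numbers only (zeros, infinities, NaNs and denormals follow special rules), and the field values below are the ones read off the example bit string:

```python
def decode_float32(sign: int, exponent: int, mantissa: int) -> float:
    """(-1)^sign * (1 + mantissa/2^23) * 2^(exponent - 127), normal numbers only."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)

# Sign 0, stored exponent 10000011 (131), mantissa 00010010000000000000000 (589824):
print(decode_float32(0, 131, 589824))  # 17.125
```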

Of course, we can do the reverse, and I’m gonna go a little bit faster through that, because it’s a lot of divisions. The way we turn a decimal number into a bit string is also fairly straightforward. If converting from binary to decimal is repeated multiplication – as in, the powers of two – then converting back to binary is repeated division.

For this part, we start by taking the integer portion and repeatedly dividing it by two. Each time we divide, we care more about the remainder than the actual result of the division. In this case, 17 divided by 2 gives us a remainder of 1. We carry on with the quotient 8 and repeat the procedure. If we keep going, we end up with a set of remainders that can only be either 0 or 1, and we keep going until the quotient reaches 0. Now, it happens that for this particular example the bit string reads the same in either direction, but strictly speaking you need to read it upwards – the order is important, and you’ve got to start from the bottom. Anyway, that’s our integer portion done.
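Here’s that procedure as a short Python sketch (the function name is illustrative):

```python
def int_to_binary(n: int) -> str:
    """Repeatedly divide by 2; the remainders, read bottom-up, are the bits."""
    bits = []
    while n > 0:
        n, r = divmod(n, 2)  # quotient carries on, remainder is the next bit
        bits.append(str(r))
    return "".join(reversed(bits)) or "0"

print(int_to_binary(17))  # 10001 – happens to read the same in both directions
```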

Let’s move on to the decimal portion. Because we are now dealing with negative powers of two, we again do the inverse: this time, we double the number at each step. Doubling gives us another number, and our “remainder”, so to speak, is the integer portion of that number. If it’s zero, then the bit is zero, and we simply carry on. Rinse and repeat, and at some point – in this case – we end up at a value that is 1 or greater. When that happens, the bit becomes 1, and we subtract one before carrying on. It happens that in this case, because the result is exactly one, we get zero after the subtraction, so we stop; if it isn’t, you have to continue the process. Because this example is fairly simple, we’re done. For the decimal portion, we read off the bits from top to bottom.

As mentioned, this is an extremely simple case, but some decimal values keep going nonstop. Take 0.7: as you can see, no matter what you do, you will never land exactly on one. If you get a value like 1.6, you take out the one and continue with 0.6 – and as it turns out, this never ends. We know this for sure because we reach a point that leads us back to essentially the same value: it’s a pattern that repeats itself, which tells us this particular sequence will go on forever. There are two ways to stop the process: either when we recognize a repetition like this, or when we have enough bits to work with. Since our mantissa has a limited length, we don’t have to keep going – once we have enough bits, that’s that.
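The doubling procedure for the fractional part, again as an illustrative Python sketch, with a bit limit standing in for the mantissa length:

```python
def frac_to_binary(frac: float, max_bits: int = 23) -> str:
    """Repeatedly double; whenever the result reaches 1, emit a 1 and subtract it."""
    bits = []
    while frac > 0 and len(bits) < max_bits:
        frac *= 2
        if frac >= 1:
            bits.append("1")
            frac -= 1
        else:
            bits.append("0")
    return "".join(bits)

print(frac_to_binary(0.125))   # 001 – terminates
print(frac_to_binary(0.7, 8))  # 10110011 – cut off at 8 bits; 0.7 never terminates
```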

So what we have now is one bit string for the integer portion and one for the decimal portion. If we assembled these two parts together, we’d essentially have a fixed-point number. If we were dealing with fixed point, we could stop here – but we’re doing floating point, so we have to fit everything into the mold of sign, mantissa and exponent.

Let’s start by figuring out the exponent. What we have right now can be expressed as our number multiplied by 2 to the power of 0. That of course just means multiplied by 1, so the whole thing doesn’t change. But what we can do is shift the decimal point. Every shift to the left increases the exponent by 1; every shift to the right decreases it. In this case, because the number is quite large, we shift to the left, and our goal is to keep doing so until the decimal point ends up here – we stop when there is only a single 1 before the decimal point. This, by the way, is why we can always assume there is a 1: if the leading bit were 0, we could stop at a different place and express that with a different exponent.

What this means is we technically already have everything. We know the sign – we can figure that out by just inspecting the original number. We also know the mantissa: it’s basically everything that comes after the decimal point. And our exponent is simply the power up here. Of course, we need to do one more step with the exponent: remember, the exponent is offset, so we add 127 to it, giving us 131, which we can then convert back to binary. Again, that’s the repeated-division thing – I won’t go through the steps again, as it’s the same procedure. At the end of the day, 131 gives us this bit string, and that is our exponent.

Since we now have all three parts, we can assemble everything: starting with the sign, which is of course 0; then the exponent we just calculated, with the value 131; and finally our mantissa. This needs to total 32 bits, so we simply pad out the rest of the mantissa with zeros. Those bits would represent the rest of the negative powers of two, but we don’t need them, so we just leave them as zero – and that’s basically it. What we have here is the same bit string we used earlier to get the value 17.125. So there you have it! That is your floating-point number.
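We can check the assembled bit string against what the machine actually stores. An illustrative Python sketch:

```python
import struct

def float32_bit_string(x: float) -> str:
    """Show the 32 bits of x grouped as sign | exponent | mantissa."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"

print(float32_bit_string(17.125))
# 0 10000011 00010010000000000000000
```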

Now, we’ve only discussed 32-bit floating-point numbers today, but the logic works the same if you’re dealing with a double. A double is a floating-point number in 64 bits. In two words: it’s a double – we use double the space, so we get better precision. The sign remains one bit, but we get a few more exponent bits and a whole lot more mantissa bits! What this means is that the whole discarding thing happens much later down the line, and as a result we can represent decimal values much more accurately. That’s why we usually like doubles: there tend to be fewer problems with them, assuming of course your system supports them and can handle them quickly. So yeah, that’s it! That’s floating-point numbers.
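You can observe that precision difference by round-tripping a value through a 32-bit float in Python:

```python
import struct

# Force 0.7 through a 32-bit float, then compare with the 64-bit double:
as_float32 = struct.unpack(">f", struct.pack(">f", 0.7))[0]
print(f"float32: {as_float32:.17f}")  # the truncated-mantissa version
print(f"float64: {0.7:.17f}")         # much closer to 0.7
print(as_float32 == 0.7)              # False – float32 lost mantissa bits
```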

If you want to play with floating-point numbers – if you want to see the working and the math – I have set up a little program to do this on the website. So go ahead and click the link on screen or in the video description, and play with it! Have fun, and take a look at how the decimal numbers are broken down. Anyway, that’s all there is for this particular episode. I hope you found it useful, but until next time, you’re watching 0612TV with NERDfirst.net.

There is an edge case called "Denormal Numbers" that has not been accounted for in this video. Thanks to @PRANAnomaly for sharing! I'll be working on adding more information, probably in a separate video.
