Floating Point Numbers


Floating-point numbers – if you’ve programmed,
you’ve probably used these, right? Floating point numbers are numbers that support
decimal points. Now, we have an idea of how computers store
numbers hopefully. If you don’t, there are video resources that
you can refer to. Watch that video first to get an idea of how
integers, whole numbers, are being stored by computers, and then come back here and
we’ll talk about the decimals all right. You’re watching another random Wednesday episode
on 0612 TV! Hello and welcome back to another random Wednesday
episode! Today let’s talk floating point. Clearly, decimal numbers are not that easy
to represent. We’re gonna first very briefly talk about
fixed point numbers. You know how when you’re working with a normal
integer, well, you’ve got your different bits, and each bit basically represents whether
you want to switch on or off a particular power of two, right? So this combination is gonna give you these
values that you add together, and that gives you this answer. The easiest way to do decimal numbers is to
simply change up those powers of two on top. What I can say is let’s assign a decimal point
smack in the middle. The left side of the decimal point is where
our normal powers of two start. On the right side we have negative powers
of two. Yes, what this means is this is a half, this is a quarter, this is an eighth, this is a sixteenth. So yeah we can do that right, and you can
start to have decimal numbers. But watch what happens! Originally we could represent zero to 255
with an unsigned 8-bit number. So we’ve got a fairly good range, but now
because of how we’ve sort of you know reallocated our bits here, they now represent a much smaller
range of numbers right? Because I essentially only have a four bit
integer on the left, I only have 0 to 15 and on my right side, this doesn’t give me all
the decimal numbers that I can possibly represent within this range. For example if I wanted 2.5 I could do that
right, I’ll get two on the left side by you know doing it the usual way, and if I want
point-five, then the 2^-1 bit will be switched on – that’s half. So in that context it’s all well and good. Everything works just fine. But that’s just because I chose an example
that worked. Let’s take a look at some other numbers that
will not work with this very simple fixed point scheme. Starting first with something like 16. Well clearly you can’t do that because you
only have four integer bits. The biggest number you can represent is 15,
so there is an overflow in this situation. The same idea actually applies on the decimal
portion as well. If you want to represent 2 to the power of
-5 you’re out of luck because there is no such bit. Of course nothing’s stopping us from having
both on the same number right, so you get overflows on both ends. These examples are just simple ones for this
particular fixed point representation. For all three of these problems, we could
technically attempt to solve it by simply adding more bits. We have 8 bits total here, if we have 16 then
of course we have more numbers on each side. However there is still great inflexibility. For example we wouldn’t have a problem storing
this value if we were able to shift the decimal point somewhere else. In fact as long as it shifts by just one place
then this will be ok, but unfortunately these are fixed point numbers and that doesn’t happen. Of course there are other restrictions at
play here as well. For example, if you try to represent 0.2, as a fraction that’s 1/5, and
there is no finite combination of negative powers of 2 that can exactly represent this. Now this problem runs deep: when we move
on to floating point numbers, this problem cannot be solved either. So really we’re just including this here for
completeness’ sake. So fixed point numbers are a good step towards
having decimal numbers, but not quite good enough because most numbers cannot be represented
properly. This, ladies and gentlemen is where the floating
point number comes into play. We’re gonna now enter the world of 32-bit
numbers all right? So yeah every number we’re gonna talk about
from this point on is made up of a total of 32 bits, that’s how most computers deal with
it anyway. Here’s the idea – instead of using the entire
32 bits to represent one number, we’re gonna break up these bits into three parts, and
they’re gonna represent three different numbers in essentially a mathematical equation, that
we can eventually evaluate. You see, the ingenious way in which a floating-point
number makes use of its 32 bits is, the first bit represents a sign just like in a signed
number. The next eight bits represent this thing called
an exponent, while the remainder is this thing called a mantissa. We use these three numbers in an equation
like this. Of course, the sign simply determines whether,
well, we’re gonna have a positive or negative number. The mantissa refers to the body of the number
itself, and the exponent is used as a “2 to a power of something”. What’s really cool about this is that, well,
no matter what the mantissa is, you can play with the exponent and get a very small number
or a very big number. That’s why it’s kind of floating point. It doesn’t have a fixed decimal point somewhere
within the bit string. Instead it uses the exponent, allowing you
to shift the decimal points to basically anywhere and that’s the power. Now that in a nutshell is how floating-point
numbers work, and if you only want a surface understanding then we can stop here. But not me, because we’re gonna delve even
deeper into this – we’re gonna construct our own floating point number. This is where things get a little bit math-sy
and messy at the same time so, you know, prepare yourselves. How I’m gonna do this right is I’m going to
just fix a number to start off with. I’m not gonna tell you what the number is,
but I’m gonna show you the 32-bit bit string. It looks like this and I’ve already separated
the sign, exponent and mantissa parts, so yeah we already have three parts – Let’s now
try and figure out how each part actually works. The sign is the easiest part – if that bit
is 0 the number is positive, if that bit is 1 the number is negative. Done! One third of the problem clear. Let’s move on to the exponent. Now the exponent is interesting because how
you read it off it’s just like any old unsigned 8-bit number. So let’s go ahead and read it out – as you
can see, if we were to use our powers, do our usual math, we get a number. Now that’s a huge number for an exponent,
and there’s a reason why that is. Don’t forget – that is an unsigned number,
but while we’re dealing with floating-point numbers, we will want negative exponents to
make small numbers. How they deal with this problem – how they
reintroduce the sign back into the equation is that the number is offset. In fact the actual exponent is the value you
get minus 127. What this means is if you see that number
as zero, then the actual exponent represented is negative 127. If you see 128 the actual exponent is one. Hopefully that makes sense to you. We’ve offset that number so that you can represent
positive and negative numbers. You just got to do a bit of math to recover
the actual number you’re supposed to have. So exponent done. Let’s now move on to our mantissa – Our largest
part consisting of 23 bits. Here’s how our mantissa works. If you cast your mind back to fixed point
numbers, well our mantissa works the same way. 2^-1, 2^-2, it’s all negative powers starting
from -1. So yeah it’s basically our usual bit math
again, but this time we need to introduce one more thing and that’s our 2 to the power
of 0. As it turns out while there is no bit for
it, it is on by default. Therefore no matter what the rest of the mantissa
says we always add 1 to it. Now how I’m gonna approach the next step is
I’m gonna convert all those numbers into fractions. The reason why I convert them into fractions
is because I don’t want to do decimal math just yet. Remember we’re discussing how decimals work
right, we don’t have decimals to do that. So we’ll leave everything as fractions and
what this allows us to do is to plug these fractions into our final equation. Remember our final equation right, we’ve got
our sign, we’ve got our two to the power of the exponent that we’ve calculated, we’ve
got our mantissa and we need to multiply everything together. Again I’m gonna do this in terms of fractions
until the very end. Essentially once we solve this part, we end
up with one single gigantic fraction that looks like this. Since we have one fraction that’s essentially
a division which allows us to derive our final value, and that is this decimal value. What we’ve done is we’ve just worked our way
from the binary representation all the way down back to the original decimal. That bit string up there gives us this decimal
value. So that’s pretty cool, we’ve just cracked
a floating-point number. Of course we can do the reverse and I’m gonna
go a little bit faster through that because it’s a lot of divisions. But yeah the way in which we turn a decimal
number into a bit string is also fairly straightforward. If converting from binary to decimal is repeated
multiplication as in the powers of two, then converting backwards to binary is repeated
division. For this part we’ll start by taking the integer
portion and just repeatedly dividing it by two. Each time we divide, we are more concerned
about the remainder than the actual result of the division. In this case 17 divided by
2 gives us a remainder of 1. We proceed on with the quotient 8 and we basically
repeat this procedure. If we keep going we’ll end up with a set of
remainders that can only be either 0 or 1. We’ll have to keep going until we stop at
0. Now, it happens that for this particular example
this bit string can be read in whichever direction and it looks the same, but strictly speaking
you’re gonna need to read this upwards. The order is important, you gotta start at
the bottom. Anyway this is our integer portion done, let’s
move on to the decimal portion. Now because we are now dealing with negative
powers of two we are again doing the inverse, so how this is done is we are essentially
doing doubles each time. Doubling the number gives us, well another
number and essentially our result for that part, our “remainder” so to speak checks the
integer portion of this number. If it’s zero then the result is zero and we
simply carry on. Rinse and repeat and essentially what happens
is, well in this case we’ll end up at a value that is 1 or greater. When you get that, the bit being reflected
here will become 1, and we’ll subtract one from this before carrying on. It happens that in this case because it’s
perfectly one, after the subtraction we get zero so we stop. If it’s not then you have to continue the
process. Now in this case because this example is fairly
simplistic, we’re done. For the decimal portion we read off the bits
from top to bottom. Now as mentioned this is an extremely simple
case, but there may be some decimal values that keep on going nonstop. In this case we use zero point seven and as
you can see, no matter what you do you will never end up at one. If you get a value that’s say 1.6 right, you’ll
take out the one, you’ll continue on with 0.6 and as it turns out, this never ends. We know this for sure because well, we have
a point that leads us back to essentially the same thing. It’s a pattern that repeats itself. This tells us for sure that this particular
sequence will go on forever. There are two ways in which we can stop this
process – Either when we recognize a repetition like this, or when we have enough bits to
work with. Since our mantissa has a limited length we
don’t have to keep on going. Once we have enough bits that’s that. So what we essentially have now is one bit
string for the integer portion, and one bit string for the decimal portion. Essentially we have a fixed point number if
we were to assemble these two parts together. So yeah if we were dealing with fixed point
we could stop here, but what we’re doing here is floating points so we have to fit everything
into the mold of sign, mantissa and exponent. Let’s start by figuring out the exponent. Now essentially what we have right now can
be expressed as this value multiplied by 2 to the power of 0. This of course just means multiplying by 1, right? And the whole thing just doesn’t change. But what we can do is we can shift the decimal
point. Every shift to the left increases the exponent
by 1. Every shift to the right decreases the exponent. Now in this case because the number is, well
quite large, we are shifting to the left and our goal is to keep on doing this until our
decimal point ends up here. Essentially we want to stop when there is
only one one before the decimal point. This by the way is why we always assume there
is a 1. If it was 0 then we can stop at a different
place and express that with a different exponent. What it means is we technically already have
everything. Of course we know the sign right, we can figure
that out by just inspecting the original number. But we also know the mantissa. It’s basically everything that comes after
the decimal point. And our exponent is simply the power up here. Of course we need to do one more step with
the exponent right? Remember our exponent is offset so we need
to add 127 to this number, giving us 131, which we can then convert back to binary. So again we’re doing that repeated division
thing, right, I won’t go through the steps with you again, it’s the same set of steps. At the end of the day 131 gives us this bit
string – That is our exponent. Since we now have all three parts we can now
assemble everything together starting with the sign which is of course 0. The exponent which we have just calculated with
the value of 131, and finally our mantissa. Of course this needs to total up to 32 bits
so we simply pad out the rest of the mantissa with zeros. So these continue to represent the rest of your
negative powers of two, but we don’t need them. We don’t use them so yeah, we just leave them
as zero and that’s basically it. What we have here is the same bit string that
we used earlier to get the value 17.125. So there you have it! That is your floating point number. Now we’ve only discussed 32-bit floating point
numbers today, but the logic works the same if you are dealing with a double. A double is a floating point number in 64
bits. In other words, it’s a double right, we use
double the space, so we get better precision. The sign bit remains one bit, we’ve got a
few more exponent bits, and a whole lot more mantissa bits! What this means of course is that well the
whole discarding thing happens much later down the line, and as a result of that we
can have much better decimal values. That’s why usually we like doubles, right,
because yeah there tends to be less problems with that, assuming of course your system
supports it and it’s able to do it quick. So yeah that’s it! That’s floating point numbers. If you want to play with floating point numbers,
if you want to see the working and math, I have set up a little program to do this on
the website. So yeah go ahead and click the link on screen
or in the video description and, well play with it! Have fun with it, take a look at how the decimal
numbers are being broken down. Anyway that’s all there is for this particular
episode, I hope you found it useful but until next time, you’re watching 0612TV with NERDfirst.net.
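
To make the fixed point part of the episode concrete, here is a minimal Python sketch of the 8-bit scheme with four integer bits and four fraction bits. The function name `to_fixed_4_4` and the constants are made up for illustration; they are not from the video. It accepts 2.5 and rejects the three problem cases mentioned: 16 (too big), 2 to the power of -5 (too small) and 0.2 (not a finite sum of negative powers of two).

```python
from fractions import Fraction

INT_BITS = 4    # bits left of the point: 2^3 .. 2^0
FRAC_BITS = 4   # bits right of the point: 2^-1 .. 2^-4


def to_fixed_4_4(value):
    """Return the 8-bit pattern for value, or raise if it does not fit exactly."""
    # Multiplying by 2^4 moves the point four places to the right, so an
    # exactly representable value becomes a whole number.
    scaled = Fraction(value) * 2**FRAC_BITS
    if scaled.denominator != 1:
        raise ValueError(f"{value} needs bits smaller than 2^-{FRAC_BITS}")
    if not 0 <= scaled < 2**(INT_BITS + FRAC_BITS):
        raise ValueError(f"{value} does not fit in {INT_BITS} integer bits")
    bits = format(int(scaled), "08b")
    return bits[:INT_BITS] + "." + bits[INT_BITS:]


print(to_fixed_4_4(2.5))   # 0010.1000

for bad in (16, 2**-5, Fraction(1, 5)):
    try:
        to_fixed_4_4(bad)
    except ValueError as err:
        print("cannot store:", err)
```

`Fraction` is used so the check stays exact; passing 0.2 as a plain Python float would also be rejected, because that float is itself only an approximation of 1/5.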
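
The decoding walkthrough (sign, then exponent minus 127, then mantissa with the implicit 1) can be sketched in Python as follows. This is only a sketch under assumptions: `decode_float32` is a hypothetical helper name, it works in fractions until the very end just like in the episode, and it ignores special cases such as the denormal numbers mentioned in the comments, as well as infinities and NaN. The bit string used is the one for 17.125 from the episode, and the last lines cross-check the result against Python's own `struct` module.

```python
import struct
from fractions import Fraction


def decode_float32(bits):
    """Decode a 32-bit pattern given as text (normal numbers only)."""
    bits = bits.replace(" ", "")
    if len(bits) != 32:
        raise ValueError("expected exactly 32 bits")
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:9], 2) - 127        # undo the offset of 127
    mantissa = Fraction(1)                     # the 2^0 bit that is on by default
    for i, bit in enumerate(bits[9:], start=1):
        if bit == "1":
            mantissa += Fraction(1, 2**i)      # 2^-1, 2^-2, ... 2^-23
    return sign * Fraction(2)**exponent * mantissa


pattern = "0 10000011 00010010000000000000000"
value = decode_float32(pattern)
print(value, float(value))   # 137/8 17.125

# Cross-check: let Python interpret the same four bytes as a float.
raw = int(pattern.replace(" ", ""), 2).to_bytes(4, "big")
packed = struct.unpack(">f", raw)[0]
print(packed)                # 17.125
```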
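
Going the other way, the steps described in the episode (repeated division for the integer portion, repeated doubling for the decimal portion, shifting the point, then adding 127 to the exponent) might look like this in Python. All function names here are hypothetical. The sketch assumes the magnitude is at least 1 and small enough to fit, and it simply discards extra bits as described in the episode, whereas real hardware rounds, so the last bit can differ for values like 0.7 that never terminate.

```python
from fractions import Fraction


def int_to_bits(n):
    """Repeated division by 2; the remainders are read from the bottom up."""
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)
        remainders.append(str(r))
    return "".join(reversed(remainders)) or "0"


def frac_to_bits(frac, max_bits):
    """Repeated doubling; the integer part of each result is the next bit."""
    bits = []
    while frac != 0 and len(bits) < max_bits:
        frac *= 2
        if frac >= 1:
            bits.append("1")
            frac -= 1          # take out the one and carry on
        else:
            bits.append("0")
    return "".join(bits)


def encode_float32(text):
    """Encode a decimal string with magnitude >= 1, discarding extra bits."""
    value = Fraction(text)
    sign = "1" if value < 0 else "0"
    value = abs(value)
    whole = int(value)
    if whole < 1:
        raise ValueError("this sketch only handles magnitudes of 1 or more")
    int_bits = int_to_bits(whole)
    frac_bits = frac_to_bits(value - whole, 23)
    shift = len(int_bits) - 1               # places the point moves to the left
    exponent = format(shift + 127, "08b")   # add the offset of 127
    # Drop the leading 1 (it is implicit), keep 23 bits, pad with zeros.
    mantissa = (int_bits[1:] + frac_bits)[:23].ljust(23, "0")
    return f"{sign} {exponent} {mantissa}"


print(encode_float32("17.125"))            # 0 10000011 00010010000000000000000
print(frac_to_bits(Fraction("0.7"), 8))    # 10110011
```

The second print shows the first eight bits of 0.7; the `0011` pattern keeps repeating, which is why the loop needs the `max_bits` cut-off.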


9 thoughts on “Floating Point Numbers”

  1. There is an edge case called "Denormal Numbers" that has not been accounted for in this video. Thanks to @PRANAnomaly for sharing! I'll be working on adding more information, probably in a separate video.

  2. You definitely earned this sub. This was an amazingly edited video and the information was thoroughly explained, much appreciated! If you want to find the range of this representation is it take all 1s for all the exponent bits for the largest possible exponent, take 1s for all the mantissa bits, add one to the mantissa and then complete the calculation from negative sign to positive sign?

  3. Could you do a video including the rounding when the FPN is equal 0.5, my function I've written can't get that right cause I can't figure out which data I'm fiddling with and when

  4. I have been struggling with this chapter for days and you just explained it in 15 minutes ?? I wish I had found this video sooner. Thank you so so much, this was extremely helpful 😀
