In-depth understanding of floating-point number implementation principles, range and precision, and the problem of large numbers eating small numbers (loongknown's blog)

In-depth understanding of floating-point number implementation principles, number range and precision, and the problem of large numbers eating small numbers

After reading this article, you will have a solid understanding of how floating-point numbers are implemented: normalized representation, denormalized representation, +/-0, INF, NaN, and the range and precision of floating-point numbers. You will also understand why large numbers eat small numbers, and the corresponding solutions.

A few days ago, a colleague and I discussed floating-point precision and the problem of large numbers eating small numbers. I took the opportunity to write this article on floating-point implementation principles and that problem, for the convenience of anyone who runs into similar questions later.

Floating point number implementation principle

Here we take IEEE-754 single-precision floating-point numbers as an example, and double-precision floating-point numbers are similar.

Single-precision floating-point numbers are stored in computers as 32 bits. These 32 bits are divided into three parts: sign bit (Sign), exponent bit (Exponent) and mantissa bit (Mantissa, Fraction):

  • Sign bit (S): indicates the sign of the value, i.e., positive or negative. Occupies 1 bit.
  • Exponent bits (E): indicate the exponent part of the value. Occupies 8 bits.
  • Mantissa bits (M, also called the fraction): indicate the significand part of the value. Occupies 23 bits.

V = (-1)^S * M * 2^E

[Figure: bit layout of an IEEE-754 single-precision float: 1 sign bit | 8 exponent bits | 23 mantissa bits]

Normalized representation

Mantissa part:

Normalized (normal) representation means that the number must first be normalized, i.e., written in the form 1.xx * 2^exp. For example:

The binary representation of 0.09375 is 0.00011, and its normalized representation is 1.1 * 2^(-4).

Here you can also see why normalized representation is needed: the mantissa is fixed at 23 bits, and any leading zeros in it would waste significant bits. Normalization removes these leading zeros, effectively gaining more significant bits and improving the representable precision of floating-point numbers.

In addition, normalized representation has the property that the first bit of the significand is always 1, so this 1 can be omitted from storage (the implicit leading bit).


The output result is: 30000000.0000000000

Scheme 2: Segmented summation

The essence of segmented summation is to avoid adding a large running total directly to a small increment.

 #include <stdio.h>

  int main() {
      double sum = 0.0;
      double x = 1.0;
      /* Sum each segment into its own partial accumulator, then combine
         the partials, so the large running total is never added directly
         to the small increment x. */
      for (int seg = 0; seg < 3; seg++) {
          double part = 0.0;
          for (int i = 0; i < 10000000; i++) {
              part += x;
          }
          sum += part;
      }
      printf("%1.10f\n", sum);
      return 0;
  }
 

The output result is: 30000000.0000000000

Scheme 3: Kahan summation

The idea of Kahan's summation algorithm: record the significant bits that the smaller number loses while being added to the larger number, and compensate for the loss in the next addition. Kahan summation is described in more detail on Wikipedia [2] for readers who want to dig deeper.

 #include <stdio.h>

  int main() {
      /* float is used throughout: mixing float and double here would
         round away exactly the low-order bits that eps is meant to keep. */
      float sum = 0.0f;
      float x = 1.0f;
      float eps = 0.0f;   /* running compensation: low-order bits lost so far */
      float t = 0.0f;
      for (int i = 0; i < 30000000; i++) {
          float y = x - eps;      /* re-inject the previously lost bits */
          t = sum + y;            /* big + small: low bits of y may be lost */
          eps = (t - sum) - y;    /* what was actually added minus what we meant to add */
          sum = t;
      }
      printf("%1.10f\n", sum);
      return 0;
  }
 

The output result is: 30000000.0000000000

At this point, the implementation principles of floating-point numbers and the problem of large numbers eating small numbers should be clear.

Reference

  • [1] Computer Systems: A Programmer's Perspective (CS:APP)
  • [2] https://en.wikipedia.org/wiki/Kahan_summation_algorithm

