From:  johndcook.com/blog

For a set of positive probabilities p_i summing to 1, their entropy is defined as

$$H = -\sum_i p_i \log p_i.$$

(For this post, log will mean log base 2, not natural log.)

This post looks at a couple questions about computing entropy. First, are there any numerical problems computing entropy directly from the equation above?
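Taking the definition at face value, the direct computation looks like this (a minimal Python sketch; the function name and sample distribution are mine):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of positive probabilities summing to 1."""
    return -sum(p * math.log2(p) for p in probs)

# A uniform distribution over 4 outcomes has entropy log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # → 2.0
```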

Second, imagine you don’t have the p_i values directly but rather counts n_i that sum to N. Then p_i = n_i/N. To apply the equation directly, you’d first need to compute N, then make a second pass through the data to compute the entropy. If you have a large amount of data, could you compute the entropy in one pass?

To address the second question, note that

$$\sum_i p_i \log p_i = \sum_i \frac{n_i}{N} \log \frac{n_i}{N} = \frac{1}{N} \sum_i n_i \log n_i - \log N,$$

so $H = \log N - \frac{1}{N} \sum_i n_i \log n_i$, and you can accumulate N and the sum of n_i log n_i in the same pass.
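In code, the one-pass approach amounts to accumulating both sums in a single loop (a sketch, with names of my choosing):

```python
import math

def entropy_one_pass(counts):
    """Entropy in bits from raw counts, accumulating N and sum(n * log2 n)
    in a single pass over the data."""
    total = 0
    weighted = 0.0
    for n in counts:
        total += n
        weighted += n * math.log2(n)
    return math.log2(total) - weighted / total

# Counts 1, 1, 2 correspond to probabilities 1/4, 1/4, 1/2.
print(entropy_one_pass([1, 1, 2]))  # → 1.5
```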

One of the things you learn in numerical analysis is to look carefully at subtractions: subtracting two nearly equal numbers can result in a loss of precision. Could the two terms above be nearly equal? Maybe, if the n_i are ridiculously large. Not just astronomically large (astronomically large numbers, like the number of particles in the universe, are fine) but ridiculously large: numbers whose logarithms approach the limits of machine-representable numbers. (If we’re only talking about numbers as big as the number of particles in the universe, their logs will be at most three-digit numbers.)

Now to the problem of computing the sum of n_i log n_i. Could the order of the terms matter? The same question applies to the first part of the post if we look at summing p_i log p_i. In general, you’ll get better accuracy summing a lot of positive numbers by sorting them and adding from smallest to largest, and worse accuracy by summing largest to smallest. If summing a sorted list gives essentially the same result in either direction, summing the list in any other order should too.
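The effect is easy to probe. The sketch below compares ascending and descending summation against math.fsum, which returns a correctly rounded sum; the harmonic-series terms are my stand-in for values spread over several orders of magnitude:

```python
import math

# Terms spanning six orders of magnitude, loosely like power-law count data.
terms = [1.0 / k for k in range(1, 1_000_001)]

asc = sum(sorted(terms))                 # smallest to largest
desc = sum(sorted(terms, reverse=True))  # largest to smallest
exact = math.fsum(terms)                 # correctly rounded reference

print(abs(asc - exact), abs(desc - exact))
```

For data like this, both orderings land very close to the correctly rounded sum; the point is to measure the gap rather than guess at it.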

To test the methods discussed here, I used two sets of count data, one on the order of a million counts and the other on the order of a billion counts. Both data sets had approximately a power law distribution, with counts varying over seven or eight orders of magnitude. For each data set I computed the entropy four ways: two equations times two orders. The two equations were the definition in terms of probabilities and the one-pass formula in terms of counts; the two orders were summing smallest to largest and largest to smallest.

For the smaller data set, all four methods produced the same answer to nine significant figures. For the larger data set, all four methods produced the same answer to seven significant figures. So at least for the kind of data I’m looking at, it doesn’t matter how you calculate entropy, and you might as well use the one-pass algorithm to get the result faster.
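I don't have the original data sets, but the experiment is easy to approximate with synthetic power-law counts (the generator, function names, and seed below are all my own assumptions, not the original setup):

```python
import math
import random

def entropy_from_probs(counts, ascending=True):
    """Definition: convert counts to probabilities, then sum p * log2(p),
    with terms ordered by magnitude."""
    total = sum(counts)
    terms = sorted((n / total * math.log2(n / total) for n in counts),
                   key=abs, reverse=not ascending)
    return -sum(terms)

def entropy_from_counts(counts, ascending=True):
    """One-pass formula: log2(N) - (1/N) * sum of n * log2(n)."""
    total = sum(counts)
    terms = sorted((n * math.log2(n) for n in counts),
                   key=abs, reverse=not ascending)
    return math.log2(total) - sum(terms) / total

random.seed(42)
# Synthetic counts spread over roughly seven orders of magnitude.
counts = [int(10 ** random.uniform(0, 7)) + 1 for _ in range(10_000)]

# Four ways: two equations times two summation orders.
results = [f(counts, ascending=a)
           for f in (entropy_from_probs, entropy_from_counts)
           for a in (True, False)]
print(results)
```

On data like this the four results agree to many significant figures, consistent with the observation above.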
