https://www.elastic.co/cn/blog/frame-of-reference-and-roaring-bitmaps

http://roaringbitmap.org/

February 18, 2015 | Engineering

Frame of Reference and Roaring Bitmaps

Postings lists

It may surprise you if you are new to search engine internals, but one of the most important building blocks of a search engine is the ability to efficiently compress and quickly decode sorted lists of integers. Why is this useful? As you may know, Elasticsearch shards, which are Lucene indices under the hood, split the data that they store into segments, which are regularly merged together. Inside each segment, every document is given an identifier between 0 and the number of documents in the segment (up to 2^31 - 1). This is conceptually like an index in an array: it is stored nowhere, but it is enough to identify an item. Segments store data about documents sequentially, and a doc ID is the index of a document in a segment. So the first document in a segment has a doc ID of 0, the second has 1, and so on until the last document, whose doc ID equals the total number of documents in the segment minus one.

Why are these doc IDs useful? An inverted index needs to map each term to the list of documents that contain it, called a postings list, and the doc IDs we just discussed are a perfect fit because they can be compressed efficiently.
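To make the term-to-postings-list mapping concrete, here is a minimal sketch in Python. It is only an illustration: the function name `build_inverted_index`, the whitespace tokenizer, and the toy segment are assumptions made for this example and are not how Lucene builds its index.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    index = defaultdict(list)
    # The doc ID is simply the position of the document in the segment.
    for doc_id, text in enumerate(docs):
        # set() ensures a doc ID is added only once per term.
        for term in set(text.lower().split()):
            # Documents are visited in order, so each list stays sorted.
            index[term].append(doc_id)
    return dict(index)

# Hypothetical three-document segment.
segment = ["the quick fox", "the lazy dog", "quick quick dog"]
index = build_inverted_index(segment)
print(index["quick"])  # [0, 2]
print(index["dog"])    # [1, 2]
```

Because doc IDs are assigned sequentially, every postings list comes out sorted without an explicit sort step.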

Frame Of Reference

In order to be able to compute intersections and unions efficiently, we require that these postings lists are sorted. A nice side-effect of this decision is that postings lists can be compressed with delta-encoding.
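Why sorting helps can be seen in a short sketch of a linear-time intersection. This is a textbook two-pointer merge written for illustration; the name `intersect` and the second list are assumptions, and a real engine may use more elaborate strategies.

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b)) time."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Doc ID is present in both lists.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a[i] cannot appear later in b, because b is sorted.
            i += 1
        else:
            j += 1
    return result

print(intersect([73, 300, 302, 332, 343, 372], [2, 300, 343, 400]))  # [300, 343]
```

Each step advances at least one pointer, so neither list is ever scanned twice. A union can be computed with the same kind of merge.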

For instance, if your postings list is [73, 300, 302, 332, 343, 372], the list of deltas would be [73, 227, 2, 30, 11, 29]. What is interesting to note here is that all deltas are between 0 and 255, so you only need one byte per value. This is the technique that Lucene uses to encode your inverted index on disk. Postings lists are split into blocks of 256 doc IDs, and each block is compressed separately using delta-encoding and bit packing: Lucene computes the maximum number of bits required to store the deltas in a block, adds this information to the block header, and then encodes all deltas of the block using this number of bits. This encoding technique is known as Frame Of Reference (FOR) in the literature and has been used since Lucene 4.1.

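Here is an example with a block size of 3 (instead of 256 in practice): the deltas above split into the blocks [73, 227, 2] and [30, 11, 29]. The largest delta in the first block is 227, which needs 8 bits, so the block takes 3 bytes. The largest delta in the second block is 30, which needs 5 bits, so its 15 bits fit in 2 bytes.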
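The block-of-3 encoding can also be sketched in Python. This is a simplified illustration, not Lucene's on-disk format: the function names, the `(bits, count, payload)` tuple standing in for the block header, and the big-endian packing order are all assumptions made for this example.

```python
def delta_encode(doc_ids):
    """Replace each doc ID by its difference from the previous one."""
    deltas, prev = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas

def for_encode(doc_ids, block_size=3):
    """Delta-encode, then bit-pack each block with its own bit width."""
    deltas = delta_encode(doc_ids)
    blocks = []
    for start in range(0, len(deltas), block_size):
        block = deltas[start:start + block_size]
        bits = max(block).bit_length()  # width needed by the largest delta
        packed = 0
        for value in block:
            packed = (packed << bits) | value
        n_bytes = (bits * len(block) + 7) // 8  # round up to whole bytes
        blocks.append((bits, len(block), packed.to_bytes(n_bytes, "big")))
    return blocks

def for_decode(blocks):
    """Unpack every block and undo the delta-encoding."""
    doc_ids, prev = [], 0
    for bits, count, payload in blocks:
        packed = int.from_bytes(payload, "big")
        mask = (1 << bits) - 1
        for k in range(count - 1, -1, -1):  # first value sits in the high bits
            prev += (packed >> (k * bits)) & mask
            doc_ids.append(prev)
    return doc_ids

postings = [73, 300, 302, 332, 343, 372]
blocks = for_encode(postings)
for bits, count, payload in blocks:
    print(bits, len(payload))  # prints "8 3" then "5 2"
assert for_decode(blocks) == postings
```

The six doc IDs end up in 5 bytes of payload plus the per-block bit width, compared with 24 bytes if each one were stored as a 4-byte integer.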
