ES scroll (the Elasticsearch cursor) solves the deep pagination problem.

Why

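When Elasticsearch answers a request, it must determine the order of the docs and arrange the results accordingly. If the requested page number is small (assume 20 docs per page), this is no problem. If the page number is large, say page 20, Elasticsearch has to fetch all docs from page 1 through page 20, then throw away the docs of pages 1 through 19 to obtain the docs of page 20.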
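A quick back-of-the-envelope sketch of that cost. The function name `docs_to_sort` is made up for illustration, and the multiplication by the shard count is an assumption that goes beyond the text above: it reflects the usual description that every shard produces its own top `from + size` docs and the coordinating node merges them.

```python
def docs_to_sort(page: int, page_size: int = 20, shards: int = 1) -> int:
    """Upper bound on the docs collected and sorted to serve one page."""
    if page < 1 or page_size < 1 or shards < 1:
        raise ValueError("page, page_size and shards must all be >= 1")
    # Each shard has to produce the docs of pages 1..page (from + size).
    per_shard = page * page_size
    # Assumption: the coordinating node merges the candidates of every shard
    # before discarding everything that belongs to the earlier pages.
    return per_shard * shards


print(docs_to_sort(20))                # page 20 on a single shard
print(docs_to_sort(20, shards=5))      # same page, assuming 5 shards
print(docs_to_sort(5000, shards=5))    # a genuinely deep page
```

Only `page_size` docs are ever returned to the caller, so the work thrown away grows linearly with the page number.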

How it works

Scrolling allows us to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left. It’s a bit like a cursor in a traditional database.

A scrolled search takes a snapshot in time. Changes made to the index after the initial search are not visible to the scroll.

<code>It does this by keeping old data files around.
</code>

The cost of deep pagination is the global sort. If sorting is disabled by sorting on _doc, Elasticsearch simply returns the next batch of results from every shard that still has results to return.

Context keep-alive time (long enough for one batch) and scroll_id (always the latest)

Set the scroll value to the length of time we want to keep the scroll window open.

It tells Elasticsearch how long it should keep the “search context” alive.

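The scroll expiry time is refreshed every time we run a scroll request, so it should be neither too long (abandoned contexts linger as garbage) nor too short (the context times out). It only needs to be long enough to process one batch of data.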

<code>// First request
GET /old_index/_search?scroll=1m
{
    "query": { "match_all": {}},
    "sort" : ["_doc"], // the most efficient sort order
    "size": 1000
}
// The response includes _scroll_id, a base-64 encoded string.

// Subsequent requests
GET /_search/scroll
{
    "scroll": "1m",
    "scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5NDpkUmpiR2FjOFNhNnlCM1ZDMWpWYnRROzEwOTk1OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MTA5OTM6ZFJqYkdhYzhTYTZ5QjNWQzFqVmJ0UTsxMTE5MDpBVUtwN2lxc1FLZV8yRGVjWlI2QUVBOzEwOTk2OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MDs="
}</code>

The scroll parameter tells Elasticsearch how long it should keep the search context alive. It only needs to be long enough to process the previous batch of results, because each scroll request sets a new expiry time.
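To make the "each request sets a new expiry time" rule concrete, here is a toy model. `FakeSearchContext` is a hypothetical in-memory stand-in, not anything from Elasticsearch itself, and times are plain seconds on an imaginary clock.

```python
class FakeSearchContext:
    """Toy model of a search context whose expiry is refreshed per request."""

    def __init__(self, keep_alive_s: float, now: float = 0.0) -> None:
        self.keep_alive_s = keep_alive_s
        self.expires_at = now + keep_alive_s

    def scroll(self, now: float) -> bool:
        """Return True and push the expiry forward if the context is alive."""
        if now > self.expires_at:
            # A real cluster would answer with SearchContextMissingException.
            return False
        self.expires_at = now + self.keep_alive_s
        return True


ctx = FakeSearchContext(keep_alive_s=60)   # like scroll=1m
print(ctx.scroll(now=50))    # batch processed in 50 s: still alive
print(ctx.scroll(now=100))   # 100 s in total, but only 50 s since the last call
print(ctx.scroll(now=200))   # 100 s since the last call: context is gone
```

The total running time of the job does not matter. Only the gap between two consecutive scroll requests has to stay below the keep-alive value.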

An open search context prevents the old segments from being deleted while they are still in use.

Note: keeping older segments alive means that more file handles (file descriptors) are needed.

To check how many search contexts are open (the open_contexts field):

<code>GET _nodes/stats/indices/search</code>

Clear scroll API

Search contexts are automatically removed when the scroll timeout has been exceeded.

<code>// Clear all search contexts (clearing only some is possible but pointless):
DELETE _search/scroll/_all</code>

size

When scanning, the size is applied to each shard, so the actual maximum batch size is size * number_of_primary_shards.

Otherwise (a regular scroll), size is the total number of hits returned per batch.

End of the scroll

The scroll is finished when no more hits are returned. Each call to the scroll API returns the next batch of results until there are no more results left to return, i.e. the hits array is empty.

Use cases

Scrolling is not intended for real-time user requests, but rather for processing large amounts of data.

Snapshot-like behavior

The results that are returned from a scroll request reflect the state of the index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.

Aggregations

If the request specifies aggs, only the initial search response will contain the aggs results.

Order does not matter

Use this when the order of the returned documents is irrelevant.

Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:

<code>GET /_search?scroll=1m
{
  "sort": [
    "_doc"
  ]
}</code>

slice scroll

A scroll can be split into multiple slices, which can then be consumed independently.
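A minimal sketch of what the per-slice requests might look like. The `slice` object with `id` and `max` fields is the shape I recall from the sliced scroll documentation of Elasticsearch 5.0 and later, so treat it as an assumption and check it against your version. The helper name `sliced_scroll_bodies` is made up, and requiring at least two slices is my own guard.

```python
import json
from typing import Any, Dict, List


def sliced_scroll_bodies(max_slices: int, size: int = 1000) -> List[Dict[str, Any]]:
    """Build one initial search body per slice (hypothetical helper)."""
    if max_slices < 2:
        raise ValueError("slicing only makes sense with at least 2 slices")
    return [
        {
            # Assumed request shape: slice number `id` out of `max` slices.
            "slice": {"id": slice_id, "max": max_slices},
            "query": {"match_all": {}},
            "sort": ["_doc"],
            "size": size,
        }
        for slice_id in range(max_slices)
    ]


# Each body would be sent as its own `GET /old_index/_search?scroll=1m`
# request, and each response carries its own scroll_id to follow up on.
for body in sliced_scroll_bodies(2):
    print(json.dumps(body))
```

Every slice is then scrolled like an ordinary scroll, for example one worker per slice.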

scanning and standard scroll

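A scanning scroll differs from a standard scroll in several ways:

1. Scanning scroll results are not sorted; they come back in the order in which the docs were indexed.

2. Scanning scroll does not support aggregations.

3. With a scanning scroll, the “hits” list of the initial response contains no results.

4. If “size” is set in the initial scanning scroll request, it is applied per shard. With size=3 and 5 shards, each batch returns at most 3*5=15 results.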
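The arithmetic of point 4 as a tiny sketch. `max_batch_size` is a hypothetical helper for illustration only, not an Elasticsearch API.

```python
def max_batch_size(size: int, primary_shards: int, scanning: bool) -> int:
    """Largest number of hits a single scroll response can carry."""
    if size < 0 or primary_shards < 1:
        raise ValueError("size must be >= 0 and primary_shards >= 1")
    # Scanning scroll: size applies to every shard separately.
    # Standard scroll: size is the total for the whole batch.
    return size * primary_shards if scanning else size


print(max_batch_size(3, 5, scanning=True))    # the example from point 4
print(max_batch_size(3, 5, scanning=False))   # same request as a standard scroll
```

When moving from a scanning scroll to a standard one, size has to be raised accordingly to keep the same batch size.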

Example
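A minimal sketch of the whole flow: one initial search, then repeated scroll requests that always pass the most recent scroll_id, stopping when the hits array is empty. The HTTP layer is deliberately left out. `send` is a hypothetical callable (method, path, body) returning the parsed JSON response, and `make_fake_transport` is an in-memory stand-in so the loop can run without a cluster. The fake rejects stale scroll ids, which is stricter than a real cluster needs to be.

```python
from typing import Any, Callable, Dict, Iterator, List

# Hypothetical transport signature: (method, path, body) -> parsed response.
Transport = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def scroll_all(send: Transport, index: str, keep_alive: str = "1m",
               size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield every hit of `index`, one scroll batch at a time."""
    response = send("GET", f"/{index}/_search?scroll={keep_alive}",
                    {"query": {"match_all": {}}, "sort": ["_doc"], "size": size})
    while response["hits"]["hits"]:          # an empty hits array ends the scroll
        yield from response["hits"]["hits"]
        # Always pass the scroll_id of the most recent response.
        response = send("GET", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": response["_scroll_id"]})


def make_fake_transport(docs: List[Dict[str, Any]]) -> Transport:
    """In-memory stand-in for a cluster, for demonstration only."""
    state = {"pos": 0, "size": 0, "token": 0}

    def send(method: str, path: str, body: Dict[str, Any]) -> Dict[str, Any]:
        if path.endswith("/_search/scroll"):
            if body["scroll_id"] != f"id-{state['token']}":
                raise RuntimeError("stale scroll_id")
        else:
            state["pos"], state["size"] = 0, body["size"]
        batch = docs[state["pos"]:state["pos"] + state["size"]]
        state["pos"] += len(batch)
        state["token"] += 1                  # the id changes on every call
        return {"_scroll_id": f"id-{state['token']}", "hits": {"hits": batch}}

    return send


docs = [{"_id": str(i)} for i in range(5)]
ids = [hit["_id"] for hit in scroll_all(make_fake_transport(docs), "old_index", size=2)]
print(ids)
```

Against a real cluster, `send` would wrap whatever HTTP client is in use. The loop itself stays the same.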

Common problems

Does the scroll_id stay the same?

<code>The scroll_id may change over the course of multiple calls and so it is required to always pass the most recent scroll_id as the scroll_id for the subsequent request.
</code>

Exception: SearchContextMissingException

SearchContextMissingException[No search context found for id [721283]];

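Cause: the scroll keep-alive time was set too short, so the search context had already expired when the next scroll request arrived.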

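A look at the source code (2.1.2)

How the scroll_id is generated:

…search.type.TransportSearchHelper#buildScrollId(…) takes three arguments: the search type, the result information, and the query parameters. See also TransportSearchQueryThenFetchAction.AsyncAction.finishHim().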
