Statistical Analysis with ElasticSearch Aggregations (Repost)

https://blog.csdn.net/cs729298/article/details/68926969

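ElasticSearch's selling points are well known: a distributed search engine built on Lucene, a friendly RESTful API, and so on.

Most articles focus on the ELK Stack and full-text search. This post uses a small case study to show what ElasticSearch Aggregations can do for statistical analysis.

The form looks like this (screenshot not preserved):

Requirement: compute statistics over the collected questionnaire responses. Possible breakdowns include:

Responses collected per week / day / hour (this can be rendered as a bar chart; everybody loves a dashboard)
The same, restricted to a time range (for example, the last 90 days)
Among users who chose answer A for question 1, the share of each answer to the other questions
Among users who chose answer A for question 1 and answer B for question 2, the share of each of their other answers

The first two requirements query root-level fields of the document; the remaining ones search fields of the sub-documents.

Visualization uses Chart.js and Twitter Bootstrap. The glue language is, naturally, PHP, "the best language in the world". Installation and startup are simple enough to skip.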

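1. First Encounter

As with a newcomer learning Postgres, the steps are:

Create an index ("index" is both a noun and a verb; here it is a noun)
Define a mapping (roughly equivalent to a schema)
Import the data with the bulk API
Query (this is where ElasticSearch shows its strength)

Creating the index and defining the mapping

Creating an index in ElasticSearch is cheap. The code below creates the index and specifies its mapping in the same call.

Only the key parts of the code are shown (you were not going to run it anyway).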

require 'vendor/autoload.php'; # assumes elasticsearch-php was installed with Composer

$client = Elasticsearch\ClientBuilder::create()->build();

$params = [
    'index' => 'your_awesome_data',
    'body' => [
        'mappings' => [
            'ur_radio_answers' => [ # mapping type name
                'properties' => [
                    'answer_id' => [ # field name
                        'type' => 'string', # field type (optional; elasticsearch guesses it if omitted)
                        'index' => 'not_analyzed' # do not tokenize this field; store and match the exact value
                    ],
                    'user_id' => ['type' => 'string', 'index' => 'not_analyzed'],
                    'questions' => [
                        'type' => 'nested', # each answered question becomes a nested document
                        'properties' => [
                            'page_id' => [
                                'type' => 'string',
                                'index' => 'not_analyzed'
                            ],
                            'question_id' => ['type' => 'string', 'index' => 'not_analyzed'],
                            'question' => ['type' => 'string', 'index' => 'not_analyzed'],
                            'option' => ['type' => 'string', 'index' => 'not_analyzed']
                        ]
                    ],
                    'start_at' => [
                        'type' => 'date',
                        'format' => 'yyyy-MM-dd HH:mm:ss'
                    ],
                    'ended_at' => ['type' => 'date', 'format' => 'yyyy-MM-dd HH:mm:ss']
                ]
            ]
        ]
    ]
];

$client->indices()->create($params);
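
The mapping above appears to target the ElasticSearch 1.x/2.x line. From 5.0 on, string with not_analyzed became the keyword type, and from 7.0 on the mapping type level (ur_radio_answers here) is deprecated and left out of the create-index body by default. As a hedged sketch, assuming a 7.x or later cluster, the equivalent body could be built like this in Python. The helper names are illustrative only, and the resulting dict would be sent as the JSON payload of the create-index request.

```python
import json


def keyword():
    """Exact-value string field: the successor of string + not_analyzed."""
    return {"type": "keyword"}


def timestamp():
    """Date field using the same format as the original mapping."""
    return {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"}


def build_mapping():
    """Create-index body without a mapping type (assumption: ElasticSearch 7.x+)."""
    return {
        "mappings": {
            "properties": {
                "answer_id": keyword(),
                "user_id": keyword(),
                "questions": {
                    "type": "nested",
                    "properties": {
                        "page_id": keyword(),
                        "question_id": keyword(),
                        "question": keyword(),
                        "option": keyword(),
                    },
                },
                "start_at": timestamp(),
                "ended_at": timestamp(),
            }
        }
    }


mapping = build_mapping()
print(json.dumps(mapping, indent=2))
```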

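Importing data with the bulk API

There is not much to see in this code; the point is simply to use the bulk API when importing data in batches.

bulk is the API for inserting documents in batches. Typically a few thousand Documents are sent together, because every single insert would otherwise cost one HTTP request.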

$client = Elasticsearch\ClientBuilder::create()->build();
$connect = new mysqli('localhost', 'root', 'STUPIDPASSWORD', 'db');

$max = 823880;  # highest wd_oaid to import
$cursor = 1000; # rows with wd_oaid <= $cursor are skipped

while ($cursor < $max) {
    $result = $connect->query("select * from raw_answer_265033 where wd_oaid > {$cursor} order by wd_oaid asc limit 1000");
    if (!$result || $result->num_rows === 0) {
        break; # nothing left to read; avoids looping forever
    }

    $params = ['body' => []];
    while ($obj = $result->fetch_array()) {
        $pages = json_decode($obj['wd_answer_json']);
        $answer = [
            'answer_id' => $obj['wd_oaid'],
            'user_id' => $obj['wd_uin'],
            'questions' => [],
            'ip' => $obj['wd_ip'],
            # H is the 24-hour clock, matching HH in the mapping's date format
            'start_at' => date('Y-m-d H:i:s', $obj['wd_starttime']),
            'ended_at' => date('Y-m-d H:i:s', $obj['wd_endtime'])
        ];

        # keep only the checked option of every question as a nested document
        foreach ($pages as $page) {
            foreach ($page->questions as $question) {
                foreach ($question->options as $option) {
                    if (isset($option->checked) && $option->checked == 1) {
                        $answer['questions'][] = [
                            'page_id' => $page->id,
                            'question_id' => $question->id,
                            'question' => trim(strip_tags(htmlspecialchars_decode($question->title))),
                            'option' => trim(strip_tags(htmlspecialchars_decode($option->text))),
                        ];
                    }
                }
            }
        }

        $cursor = $obj['wd_oaid'];

        # each document takes two entries in the bulk body: the action, then the source
        $params['body'][] = [
            'index' => ['_index' => 'your_awesome_data', '_type' => 'ur_radio_answers'] # must match the mapping type
        ];
        $params['body'][] = $answer;
    }

    # this is the important part: one HTTP request per batch of up to 1000 documents
    $response = $client->bulk($params);
}
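
On the wire, a bulk request body is newline-delimited JSON: for every document an action line, then the document source on the next line, with a newline after the last line. The following is a minimal Python sketch of that batching and serialization, not the PHP client's internals. The helper names chunked and to_bulk_body and the sample documents are made up for illustration; the index and type names are taken from the mapping above.

```python
import json
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def to_bulk_body(docs, index, doc_type):
    """Serialize docs as NDJSON: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the body must end with a newline


# Hypothetical sample documents, only to exercise the helpers.
docs = [{"answer_id": str(i), "user_id": "u%d" % i, "questions": []} for i in range(2500)]
bodies = [
    to_bulk_body(batch, "your_awesome_data", "ur_radio_answers")
    for batch in chunked(docs, 1000)
]
print(len(bodies))  # number of HTTP requests needed for the 2500 sample documents
```

Batching by a fixed count keeps the example simple; very large documents might call for batching by payload size instead.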

After the glue code above has assembled it, a single Document looks like this when it is indexed:

{
    "answer_id": "192013",
    "user_id": "2971957289",
    "questions": [  # an array whose length varies per response (in ElasticSearch these are Nested Documents)
        {
            "page_id": "p-12-Y1cU",
            "question_id": "q-35-gJ9a",
            "question": "八月飘香香满园（打一地名）",
            "option": "桂林"
        },
        {
            "page_id": "p-1-e8fe",
            "question_id": "q-4-irlF",
            "question": "遥知不是雪，为有暗香来（打一《红楼梦》人名）",
            "option": "王作梅"
        },
        {
            "page_id": "p-2-8jI8",
            "question_id": "q-48-WG7d",
            "question": "单刀赴会 （打一《水浒传》人名）",
            "option": "林冲"
        }
    ],
    "ip": "223.88.92.21",
    "start_at": "2016-02-21 12:02:01",
    "ended_at": "2016-02-21 13:18:15"
}

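Below is the response to an empty search. The took property is the query time: this empty query took 42 ms. hits.total is the number of matching Documents, about 820,000 here, which shows that the bulk insert succeeded.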

{
    "took": 42,
    "timed_out": false,
    "_shards": { "total": 5, "successful": 5, "failed": 0 },
    "hits": { "total": 822880, "max_score": 1.0, "hits": [ # search hits omitted ] }
}

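Querying

That was all preparation. Now for the requirement:

With no filter conditions at all, compute the share of each option for every question.

Here is the query: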

{
    "aggs": {
        "answers": {
            "nested": {
                "path": "questions"
            },
            "aggs": {
                "questions": {
                    "terms": {
                        "field": "questions.question",
                        "size": 100,
                        "order": {
                            "_count": "desc"
                        }
                    },
                    "aggs": {
                        "options": {
                            "terms": {
                                "field": "questions.option",
                                "size": 100,
                                "order": {
                                    "_count": "desc"
                                }
                            }
                        }
                    }
                }
            }
        },
        "dates": {
            "date_histogram": {
                "field": "ended_at",
                "interval": "day",
                "min_doc_count": 0
            },
            "aggs": {
                "user_count": {
                    "cardinality": {
                        "field": "answer_id"
                    }
                }
            }
        }
    }
}

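Here is the response. It took only 155 ms and returned two sets of statistics (dates and answers) in a single request.

The aggregations used in this query are explained in a later section.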

{
    "took": 155,
    "timed_out": false,
    "_shards": { "total": 5, "successful": 5, "failed": 0},
    "hits": {"total": 822880, "max_score": 0, "hits": []},
    "aggregations": {
        "dates": {
            "buckets": [
                {"key_as_string": "2016-02-22 00:00:00", "key": 1456099200000, "doc_count": 573855, "user_count": {"value": 613589}},
                {"key_as_string": "2016-02-23 00:00:00", "key": 1456185600000, "doc_count": 35533,  "user_count": {"value": 32221}}
                # further buckets like the two above omitted
            ]
        },
        "answers": {
            "doc_count": 2738528,
            "questions": {
                "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0,
                "buckets": [
                    {   "key": "千条线，万条线， 掉到水里看不见（打一自然现象）",
                        "doc_count": 166145,
                        "options": {
                            "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0,
                            "buckets": [
                                {"key": "雨", "doc_count": 147481},
                                {"key": "雪", "doc_count": 11717},
                                {"key": "雾", "doc_count": 6947}
                            ]
                        }
                    },
                    {   "key": "细白嫩肉裹紫衣，霜儿一打不成器（打一蔬菜）",
                        "doc_count": 164585,
                        "options": {
                            "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0,
                            "buckets": [
                                {"key": "茄子", "doc_count": 136404},
                                {"key": "紫薯", "doc_count": 19811},
                                {"key": "萝卜", "doc_count": 8370}
                            ]
                        }
                    },
                    {   "key": "八月飘香香满园（打一地名）",
                        "doc_count": 164571,
                        "options": {
                            "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0,
                            "buckets": [
                                {"key": "桂林", "doc_count": 148744},
                                {"key": "厦门", "doc_count": 8963},
                                {"key": "青岛", "doc_count": 6864}
                            ]
                        }
                    }
                    # similar buckets omitted
                ]
            }
        }
    }
}
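
The response carries raw counts, while the requirement asks for shares. Turning one into the other is a division per bucket; a minimal Python sketch follows, using one bucket copied from the response above. The function name option_shares is made up for illustration.

```python
def option_shares(question_bucket):
    """Return {option: percentage} for one bucket of the `questions` terms aggregation."""
    total = question_bucket["doc_count"]
    return {
        opt["key"]: round(100.0 * opt["doc_count"] / total, 2)
        for opt in question_bucket["options"]["buckets"]
    }


# One bucket copied from the response above.
bucket = {
    "key": "千条线，万条线， 掉到水里看不见（打一自然现象）",
    "doc_count": 166145,
    "options": {
        "buckets": [
            {"key": "雨", "doc_count": 147481},
            {"key": "雪", "doc_count": 11717},
            {"key": "雾", "doc_count": 6947},
        ]
    },
}

shares = option_shares(bucket)
for option, percent in shares.items():
    print(option, percent)
```

Dividing by the question bucket's doc_count works here because sum_other_doc_count is 0, so the listed options account for every answer to that question.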

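Visualized directly, the result looks like this (figure not preserved):

A change of requirement: how did the users who chose option A for question 1 answer the other questions?

Only the query part is shown here, with aggs omitted. This is the query: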

{
    "query": {
        "filtered": {
            "query": {
                "nested": {
                    "path": "questions",
                    "query": {
                        "bool": {
                            "must": [
                                {
                                    "term": {
                                        "questions.question": {
                                            "value": "千条线，万条线， 掉到水里看不见（打一自然现象）"
                                        }
                                    }
                                },
                                {
                                    "term": {
                                        "questions.option": {
                                            "value": "雨"
                                        }
                                    }
                                }
                            ]
                        }
                    }
                }
            },
            "filter": {
                "and": [
                    {
                        "range": {
                            "ended_at": {
                                "from": "2016-02-14 00:00:00",
                                "to": "2016-03-15 23:59:59"
                            }
                        }
                    }
                ]
            }
        }
    },
    "aggs": {
        # ... (omitted)
    }
}

Here is the response; the time taken is of the same order, still very fast.

{
    "took": 63,
    "timed_out": false,
    "_shards": { "total": 5, "successful": 5, "failed": 0 },
    "hits": { "total": 147481, "max_score": 0, "hits": [ #... ] }
}
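
Each nested document holds a single question, so the requirement that combines question 1 and question 2 cannot be expressed inside one nested clause: it needs one nested clause per question, combined at the root level. Below is a minimal Python sketch that builds such a query body. It assumes the bool query with a filter clause, which replaced filtered and and from ElasticSearch 2.0 on (both were removed in 5.0), and gte/lte range bounds. The function names are made up for illustration; the two question/option pairs are taken from the aggregation response earlier.

```python
import json


def answered(question, option, path="questions"):
    """Nested clause matching responses where `question` was answered with `option`."""
    return {
        "nested": {
            "path": path,
            "query": {
                "bool": {
                    "filter": [
                        {"term": {path + ".question": question}},
                        {"term": {path + ".option": option}},
                    ]
                }
            },
        }
    }


def build_query(choices, start=None, end=None):
    """AND together one nested clause per (question, option) pair, plus an optional date range."""
    filters = [answered(question, option) for question, option in choices]
    bounds = {}
    if start:
        bounds["gte"] = start
    if end:
        bounds["lte"] = end
    if bounds:
        filters.append({"range": {"ended_at": bounds}})
    return {"query": {"bool": {"filter": filters}}}


body = build_query(
    [
        ("千条线，万条线， 掉到水里看不见（打一自然现象）", "雨"),
        ("八月飘香香满园（打一地名）", "桂林"),
    ],
    start="2016-02-14 00:00:00",
    end="2016-03-15 23:59:59",
)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

For a rolling window such as the last 90 days, an ElasticSearch date-math expression (for example now-90d/d as the start value) could be passed in place of a fixed timestamp.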

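Aggregations

ElasticSearch aggregations fall into two families, Metrics and Bucket. The query above uses both; each is described below.

Metrics aggregations compute a value directly, like the SQL functions sum(), min(), max(), avg() and count().

Bucket aggregations do not produce a metric directly. Instead they create a set of buckets (each reporting how many documents fell into it), and each bucket can then be aggregated further with Sub-Aggregations.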

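Nested Aggregation

Used by aggs.answers. This aggregation produces no statistic of its own beyond a doc_count; it only tells ElasticSearch that a field is Nested, so that the aggregations inside it run on the nested documents.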

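Date Histogram Aggregation

aggs.dates in the example uses Date Histogram. It is the most commonly used aggregation and applies whenever the data has a time field. Typical uses:

Counts per month / week / day / hour / minute. The period does not have to be a single unit; it can also be every 2 days, every 3 hours, etc.
If a point in time has no data, ElasticSearch can still emit a bucket for it with a count of 0 (this is what min_doc_count: 0 in the query asks for)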
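
To make the bucketing concrete, here is a rough in-memory analogue in Python of a daily date histogram with min_doc_count set to 0. It only illustrates the semantics, not how ElasticSearch computes the result. The function name and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime, timedelta


def daily_histogram(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count timestamps per calendar day, emitting empty days between the first and last."""
    counts = Counter(datetime.strptime(ts, fmt).date() for ts in timestamps)
    if not counts:
        return []
    day, last = min(counts), max(counts)
    buckets = []
    while day <= last:
        # Days without data still get a bucket, with doc_count 0.
        buckets.append({"key_as_string": day.isoformat(), "doc_count": counts.get(day, 0)})
        day += timedelta(days=1)
    return buckets


# Made-up ended_at values; no response ended on 2016-02-22.
sample = ["2016-02-21 13:18:15", "2016-02-21 20:00:00", "2016-02-23 09:30:00"]
histogram = daily_histogram(sample)
for bucket in histogram:
    print(bucket)
```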

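Terms Aggregation

Used twice under aggs.answers (once for questions, once for options). It is the equivalent of SQL's group by and belongs to the Bucket Aggregations.

Cardinality Aggregation

The equivalent of SQL's count(distinct FIELD); it belongs to the Metrics Aggregations. The value it returns is an approximation.

One more important concept: aggregating inside an aggregation, or Sub-Aggregations

In the example, aggs.answers.aggs.questions first groups by question and then groups the answers within each question (see aggs.answers.aggs.questions.aggs.options). Without Sub-Aggregations there would be no way to place the answers under their question.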
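
As an illustration of what terms with a sub-aggregation and cardinality compute, here is a small in-memory analogue in Python. It is a sketch of the semantics only; the function names and the three sample responses are made up, shaped like the indexed documents.

```python
from collections import Counter, defaultdict


def terms_with_sub_terms(responses):
    """Group nested answers by question, then by option within each question."""
    per_question = defaultdict(Counter)
    for response in responses:
        for answer in response["questions"]:
            per_question[answer["question"]][answer["option"]] += 1
    # Largest buckets first, like "order": {"_count": "desc"}.
    ordered = sorted(per_question.items(), key=lambda item: -sum(item[1].values()))
    return [
        {
            "key": question,
            "doc_count": sum(options.values()),
            "options": [{"key": k, "doc_count": n} for k, n in options.most_common()],
        }
        for question, options in ordered
    ]


def cardinality(responses, field):
    """Exact distinct count of a root-level field (ElasticSearch's version is approximate)."""
    return len({response[field] for response in responses})


# Tiny made-up data set.
responses = [
    {"answer_id": "1", "questions": [{"question": "Q1", "option": "A"}, {"question": "Q2", "option": "X"}]},
    {"answer_id": "2", "questions": [{"question": "Q1", "option": "A"}]},
    {"answer_id": "3", "questions": [{"question": "Q1", "option": "B"}]},
]
buckets = terms_with_sub_terms(responses)
distinct_answers = cardinality(responses, "answer_id")
print(buckets[0])
print(distinct_answers)
```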

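2. Day-to-Day Use

Once the data has been imported, what does routine maintenance involve?

Inserting a new Document, the equivalent of SQL's insert
Updating an existing Document, the equivalent of SQL's update
Deleting a Document, i.e. SQL's delete

Inserting a single Document (for example, when a user has just completed a questionnaire)

The examples below are all copied from the official documentation.

curl -XPUT 'localhost:9200/customer/external/1' -d '
{
  "name": "John Doe"
}'

Updating an existing Document

curl -XPOST 'localhost:9200/customer/external/1/_update' -d '
{
  "doc": { "name": "Jane Doe" }
}'

Deleting a Document. No surprises: as you can see, it uses the DELETE method, very RESTful.

curl -XDELETE 'localhost:9200/customer/external/2'

As long as the set of fields does not change, routine use is much like using MySQL; there is no great difference.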

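Summary

Query time

This is the key point. Real-time computation really matters (otherwise even validating an idea is expensive). In ElasticSearch, a search over several million rows completes in tens to hundreds of milliseconds.

Initial import time

Reading everything from MySQL and loading it into ElasticSearch took 420 seconds (7 minutes). With simpler document structures it can go faster still (tens of thousands of documents per second).

Disk usage

This example holds 3.6 million Documents (each sub-document counts as one) and occupies only 434.4 MB.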
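
The figures above can be cross-checked against the responses shown earlier with a few lines of Python. This is a back-of-the-envelope sketch: it assumes the 3.6 million is hits.total plus the nested doc_count, and it reads "MB" as 10^6 bytes.

```python
top_level_docs = 822880      # hits.total in the responses above
nested_docs = 2738528        # doc_count of the nested "answers" aggregation
import_seconds = 420         # import time quoted above
index_bytes = 434.4 * 10**6  # assumption: "MB" means 10^6 bytes

total_docs = top_level_docs + nested_docs
responses_per_second = top_level_docs / import_seconds
docs_per_second = total_docs / import_seconds
bytes_per_doc = index_bytes / total_docs

print(total_docs)  # 3561408, i.e. the "3.6 million" quoted above
print(round(responses_per_second))
print(round(docs_per_second))
print(round(bytes_per_doc))
```

Under these assumptions the import ran at roughly two thousand questionnaire responses per second, and each stored document costs on the order of a hundred-odd bytes.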

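Miscellaneous

ElasticSearch really is fast, especially for data analysis; do not be fooled by the "search" in its name.

It can search and aggregate millions or tens of millions of records in real time without taking much space, so it is easy to build a poor man's Google Analytics.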

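Why is ElasticSearch this fast? @wentao, a former IEG colleague, wrote a series of articles on the subject that are well worth reading:

The Secrets of Time-Series Databases (1): Introduction http://www.infoq.com/cn/articles/database-timestamp-01
The Secrets of Time-Series Databases (2): Indexing http://www.infoq.com/cn/articles/database-timestamp-02
The Secrets of Time-Series Databases (3): Loading and Distributed Computation http://www.infoq.com/cn/articles/database-timestamp-03
Selection Criteria for Time-Series Databases http://km.oa.com/group/24825/articles/show/223511
ElasticSearch test report https://segmentfault.com/a/1190000002688549
MongoDB test report https://segmentfault.com/a/1190000002690548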
