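1. Introduction to term vectors

The term vectors API returns statistics about the terms in a particular field of a document.

term information: term frequency in the field, term positions, start and end offsets, term payloads

term statistics: returned when term_statistics=true; total term frequency (how many times the term occurs across all documents) and document frequency (how many documents contain the term)

field statistics: document count (how many documents contain this field), sum of document frequencies (the sum of the document frequencies of all terms in this field), and sum of total term frequencies (the sum of the total term frequencies of all terms in this field)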
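
To make the definitions above concrete, here is a minimal plain-Python sketch that computes the same kinds of numbers for a tiny two-document corpus. It does not call Elasticsearch; the function names, the whitespace tokenizer, and the corpus are assumptions made for illustration, and details such as deleted documents and shards are ignored.

```python
from collections import Counter

# Hypothetical corpus: one text field per document, keyed by document id.
docs = {
    "1": "hello test test test",
    "2": "other hello test ...",
}


def tokenize(text):
    # Assumption: lowercase and split on whitespace.
    return text.lower().split()


def term_vector_stats(docs, doc_id):
    # Term counts per document.
    per_doc = {d: Counter(tokenize(t)) for d, t in docs.items()}

    doc_freq = Counter()  # number of documents containing each term
    ttf = Counter()       # total occurrences of each term across all documents
    for counts in per_doc.values():
        doc_freq.update(counts.keys())
        ttf.update(counts)

    field_statistics = {
        "doc_count": sum(1 for c in per_doc.values() if c),
        "sum_doc_freq": sum(doc_freq.values()),
        "sum_ttf": sum(ttf.values()),
    }
    # Per-term numbers for the requested document only.
    terms = {
        term: {"term_freq": tf, "doc_freq": doc_freq[term], "ttf": ttf[term]}
        for term, tf in per_doc[doc_id].items()
    }
    return {"field_statistics": field_statistics, "terms": terms}


stats = term_vector_stats(docs, "1")
print(stats["field_statistics"])
for term, values in sorted(stats["terms"].items()):
    print(term, values)
```

Here term_freq is local to the requested document, while doc_freq and ttf are computed over the whole corpus, which is the distinction between term information and term statistics.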

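GET /twitter/tweet/1/_termvectors
GET /twitter/tweet/1/_termvectors?fields=text

Term statistics and field statistics are not exact: deleted documents are not taken into account, so a document that has been deleted may still be counted.

In practice this API is rarely used. It is mainly useful for exploring data: for example, to see how many documents contain a given term such as 大话西游 (a film title), or how many documents have a value in a given field such as film_desc (a film description).

2. Index-time term vector experiment

Term vectors involve many term and field statistics, and there are two ways to collect them:

(1) index-time: configure term_vector in the mapping, and the term and field information is generated and stored when documents are indexed
(2) query-time: no term vector information has been stored beforehand; when you request the term vectors, they are computed on the fly and returned

The commands in this section are meant to be copied rather than typed by hand. The point is not search or aggregation syntax but how to collect term vector information, how to read it, and how to use it to explore data.

PUT /my_index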
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        },
        "fullname": {
          "type": "text",
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

PUT /my_index/my_type/1
{
  "fullname": "Leo Li",
  "text": "hello test test test "
}

PUT /my_index/my_type/2
{
  "fullname": "Leo Li",
  "text": "other hello test ..."
}

GET /my_index/my_type/1/_termvectors
{
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "",
  "_version": 1,
  "found": true,
  "took": 10,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 6,
        "doc_count": 2,
        "sum_ttf": 8
      },
      "terms": {
        "hello": {
          "doc_freq": 2,
          "ttf": 2,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5,
              "payload": "d29yZA=="
            }
          ]
        },
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 10,
              "payload": "d29yZA=="
            },
            {
              "position": 2,
              "start_offset": 11,
              "end_offset": 15,
              "payload": "d29yZA=="
            },
            {
              "position": 3,
              "start_offset": 16,
              "end_offset": 20,
              "payload": "d29yZA=="
            }
          ]
        }
      }
    }
  }
}

Each payload is Base64-encoded: d29yZA== decodes to "word", the token type that the type_as_payload filter stores for tokens produced by the whitespace tokenizer.

3. Query-time term vector experiment

The fullname field is not mapped with term_vector, so its term vectors are computed on the fly when requested.

GET /my_index/my_type/1/_termvectors
{
  "fields": ["fullname"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

In general, if circumstances allow, query-time term vectors are enough: whatever data you want to explore, compute it on the spot.

4. Term vectors for a manually supplied document

GET /my_index/my_type/_termvectors
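{
  "doc": {
    "fullname": "Leo Li",
    "text": "hello test test test"
  },
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

Supplying a document by hand is not really about the document itself. It is a way to supply the terms you want to look up, such as hello test, by placing them in a field. The text is analyzed into terms, and for each term the statistics are computed against the documents already in the index.

This is quite useful: it lets you choose exactly which terms to explore. For example, you could look up the statistics for the term 大话西游.

5. Specifying an analyzer for generating term vectors

GET /my_index/my_type/_termvectors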
{
  "doc": {
    "fullname": "Leo Li",
    "text": "hello test test test"
  },
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true,
  "per_field_analyzer": {
    "text": "standard"
  }
}

6. Terms filter

GET /my_index/my_type/_termvectors
{
  "doc": {
    "fullname": "Leo Li",
    "text": "hello test test test"
  },
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true,
  "filter": {
    "max_num_terms": 3,
    "min_term_freq": 1,
    "min_doc_freq": 1
  }
}

The filter uses the term statistics to select which terms appear in the term vector result.
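
The selection a terms filter performs can be pictured roughly as follows. This is a plain-Python sketch under stated assumptions, not Elasticsearch's implementation: the function name, the sample statistics, and the simplified tf-idf score used for ranking are all illustrative, and the real scoring may differ.

```python
import math


def filter_terms(terms, doc_count, max_num_terms, min_term_freq, min_doc_freq):
    """Select terms by threshold, then keep the highest-scoring ones.

    terms maps each term to {"term_freq": int, "doc_freq": int}.
    """
    # Drop terms that are too rare in the document or in the index.
    kept = {
        term: s
        for term, s in terms.items()
        if s["term_freq"] >= min_term_freq and s["doc_freq"] >= min_doc_freq
    }

    def score(item):
        # Assumption: a simplified tf-idf stand-in for the real score.
        _, s = item
        idf = math.log(1 + doc_count / s["doc_freq"])
        return s["term_freq"] * idf

    ranked = sorted(kept.items(), key=score, reverse=True)
    return [term for term, _ in ranked[:max_num_terms]]


# Hypothetical statistics for three terms in one document.
terms = {
    "hello": {"term_freq": 1, "doc_freq": 2},
    "test": {"term_freq": 3, "doc_freq": 2},
    "rare": {"term_freq": 1, "doc_freq": 1},
}
selected = filter_terms(
    terms, doc_count=2, max_num_terms=2, min_term_freq=1, min_doc_freq=2
)
print(selected)
```

With min_doc_freq set to 2, the term that occurs in only one document is dropped before ranking.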

This is also useful when exploring data: for example, you can filter out terms whose frequency is too low and simply ignore them.

7. Multi term vectors

GET /_mtermvectors
{
  "docs": [
    {
      "_index": "my_index",
      "_type": "my_type",
      "_id": "",
      "term_statistics": true
    },
    {
      "_index": "my_index",
      "_type": "my_type",
      "_id": "",
      "fields": [
        "text"
      ]
    }
  ]
}

GET /my_index/_mtermvectors
{
  "docs": [
    {
      "_type": "my_type",
      "_id": "",
      "fields": [
        "text"
      ],
      "term_statistics": true
    },
    {
      "_type": "my_type",
      "_id": ""
    }
  ]
}

GET /my_index/my_type/_mtermvectors
{
  "docs": [
    {
      "_id": "",
      "fields": [
        "text"
      ],
      "term_statistics": true
    },
    {
      "_id": ""
    }
  ]
}

GET /_mtermvectors
{
  "docs": [
    {
      "_index": "my_index",
      "_type": "my_type",
      "doc": {
        "fullname": "Leo Li",
        "text": "hello test test test"
      }
    },
    {
      "_index": "my_index",
      "_type": "my_type",
      "doc": {
        "fullname": "Leo Li",
        "text": "other hello test ..."
      }
    }
  ]
}
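
When there are many manually supplied documents, a request body like the last one can be generated rather than written by hand. The following is a minimal sketch that only builds and prints the JSON body. The helper name build_mtermvectors_body is hypothetical, no HTTP client is used, and the _type key is kept only to match the requests above (to my knowledge, later Elasticsearch versions removed mapping types).

```python
import json


def build_mtermvectors_body(index, doc_type, artificial_docs,
                            fields=None, term_statistics=False):
    """Build a multi term vectors body for manually supplied documents."""
    docs = []
    for source in artificial_docs:
        entry = {"_index": index, "_type": doc_type, "doc": source}
        if fields:
            entry["fields"] = list(fields)
        if term_statistics:
            entry["term_statistics"] = True
        docs.append(entry)
    return {"docs": docs}


body = build_mtermvectors_body(
    "my_index",
    "my_type",
    [
        {"fullname": "Leo Li", "text": "hello test test test"},
        {"fullname": "Leo Li", "text": "other hello test ..."},
    ],
    fields=["text"],
    term_statistics=True,
)
# The printed JSON can be sent as the body of a /_mtermvectors request.
print(json.dumps(body, indent=2))
```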
