es study notes (3): an introduction to analyzers, and installing and using the IK Chinese analyzer

What is analysis

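Analysis is the process of converting text into individual terms (tokens). By default es only segments English text into words; Chinese is not supported, and each Chinese character is split off as a separate token.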
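As a rough illustration of that behaviour, the Python sketch below mimics how the two scripts end up being treated: runs of Latin letters and digits stay together, while every Chinese character becomes its own token. The function name `rough_standard_tokens` and the regular expression are assumptions made for this example; this is not how es implements its tokenizer.

```python
import re

# One token per run of ASCII letters/digits, or per single CJK ideograph.
# A simplified stand-in for illustration only, not the real es tokenizer.
_TOKEN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def rough_standard_tokens(text):
    """Return lowercased tokens, splitting Chinese text character by character."""
    return [m.group(0).lower() for m in _TOKEN.finditer(text)]


print(rough_standard_tokens("good good study"))
print(rough_standard_tokens("中文分词"))
```

The English sentence comes back as three words, while the four-character Chinese string comes back as four single-character tokens, which is why a dedicated Chinese analyzer is needed.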

Example

POST http://192.168.247.8:9200/_analyze

{

	"analyzer":"standard",

	"text":"good good study"

}

# Response

{

    "tokens": [

        {

            "token": "good",

            "start_offset": 0,

            "end_offset": 4,

            "type": "<ALPHANUM>",

            "position": 0

        },

        {

            "token": "good",

            "start_offset": 5,

            "end_offset": 9,

            "type": "<ALPHANUM>",

            "position": 1

        },

        {

            "token": "study",

            "start_offset": 10,

            "end_offset": 15,

            "type": "<ALPHANUM>",

            "position": 2

        }

    ]

}

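To run analysis against a specific index (when both analyzer and field are given, the explicit analyzer takes precedence over the one mapped to the field):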

POST /my_doc/_analyze

{

    "analyzer": "standard",

    "field": "name",

    "text": "text文本"

}
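Both forms of the request above can also be issued from a script. The following is a minimal sketch using only the Python standard library; the helper names `build_analyze_request` and `send` are made up for this example, and the host address is the one used in this post, so adjust it to your own node.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text, index=None):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the token strings from the JSON response."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [t["token"] for t in payload["tokens"]]


req = build_analyze_request("http://192.168.247.8:9200", "standard", "good good study")
print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes it easy to inspect the URL and body first; calling `send(req)` against a running node returns just the token strings.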

Built-in es analyzers

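standard: the default analyzer. Text is split into words, and uppercase is converted to lowercase.
simple: splits on non-letter characters. Uppercase is converted to lowercase.
whitespace: splits on whitespace only. Case is left unchanged.
stop: like simple, but also removes stop words that carry little meaning, such as the/a/an/is…
keyword: no tokenization. The whole text is treated as one single term.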
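To make the differences concrete, here is a small Python approximation of the five analyzers listed above. The regular expressions and the shortened stop word list are assumptions for illustration; the real analyzers follow Unicode segmentation rules and handle many cases (apostrophes, for instance) that this sketch ignores.

```python
import re

# A few entries from the English stop word list, enough for the demo.
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to"}


def standard(text):
    # Approximation: group word characters together, then lowercase.
    return [t.lower() for t in re.findall(r"\w+", text)]


def simple(text):
    # Split on anything that is not a letter, then lowercase.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def whitespace(text):
    # Split on whitespace only; case is left untouched.
    return text.split()


def stop(text):
    # Same as simple, with stop words removed.
    return [t for t in simple(text) if t not in STOP_WORDS]


def keyword(text):
    # No tokenization: the whole input is a single term.
    return [text]


sample = "The 2 QUICK Brown-Foxes"
for analyzer in (standard, simple, whitespace, stop, keyword):
    print(f"{analyzer.__name__:<10} {analyzer(sample)}")
```

Note how only simple and stop drop the digit, only whitespace keeps the hyphenated word and the original case, and only stop drops "the".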

Setting up the IK Chinese analyzer

Download

Github：https://github.com/medcl/elasticsearch-analysis-ik

The IK release you pick must match your es version. Mine is 7.5.1.

Unzip

[root@localhost software]# ls

elasticsearch-7.5.1-linux-x86_64.tar.gz  elasticsearch-analysis-ik-7.5.1.zip

[root@localhost software]# unzip elasticsearch-analysis-ik-7.5.1.zip -d /usr/local/elasticsearch-7.5.1/plugins/ik
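Because es refuses to load a plugin built for a different version, it can be worth checking the two archives before restarting. The sketch below simply compares the x.y.z version embedded in each file name; the helper names are hypothetical, and it assumes the archives follow the naming shown in the listing above.

```python
import re

_VERSION = re.compile(r"(\d+\.\d+\.\d+)")


def version_in(filename):
    """Extract the first x.y.z version number found in a file name."""
    match = _VERSION.search(filename)
    if match is None:
        raise ValueError(f"no version found in {filename!r}")
    return match.group(1)


def versions_match(es_archive, ik_archive):
    """Return True when both archive names carry the same version."""
    return version_in(es_archive) == version_in(ik_archive)


print(versions_match(
    "elasticsearch-7.5.1-linux-x86_64.tar.gz",
    "elasticsearch-analysis-ik-7.5.1.zip",
))
```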

Restart es

What is the difference between ik_max_word and ik_smart?

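ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting every possible combination. Suited to Term Query.
ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,国歌”. Suited to Phrase queries.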
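The contrast between the two modes can be imitated with a toy dictionary matcher. This is only a sketch of the idea, not IK's actual algorithm: `fine_grained` emits every dictionary word found at every position, `coarse_grained` does a greedy longest match from left to right, and the dictionary is a hand-picked subset chosen for this example, so the fine-grained output is shorter than the real ik_max_word list quoted above.

```python
def fine_grained(text, dictionary):
    """Emit every dictionary word found at every position (ik_max_word-like)."""
    longest = max(len(word) for word in dictionary)
    spans = []
    covered = [False] * len(text)
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + longest) + 1):
            if text[start:end] in dictionary:
                spans.append((start, end))
                covered[start:end] = [True] * (end - start)
    # Characters that belong to no dictionary word become single-character tokens.
    spans += [(i, i + 1) for i, hit in enumerate(covered) if not hit]
    # Order by start offset, longer words first.
    spans.sort(key=lambda s: (s[0], s[0] - s[1]))
    return [text[a:b] for a, b in spans]


def coarse_grained(text, dictionary):
    """Greedy longest match from left to right (ik_smart-like)."""
    longest = max(len(word) for word in dictionary)
    tokens, pos = [], 0
    while pos < len(text):
        for end in range(min(len(text), pos + longest), pos, -1):
            # Fall back to a single character when nothing longer matches.
            if text[pos:end] in dictionary or end == pos + 1:
                tokens.append(text[pos:end])
                pos = end
                break
    return tokens


# Hypothetical mini dictionary, just enough for the example sentence.
WORDS = {"中华人民共和国", "中华人民", "中华", "华人", "人民共和国",
         "人民", "共和国", "共和", "国歌"}

print(fine_grained("中华人民共和国国歌", WORDS))
print(coarse_grained("中华人民共和国国歌", WORDS))
```

The fine-grained list favours recall at index time, while the coarse list produces non-overlapping tokens, which is what makes it the more natural fit for phrase matching.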

Test

POST http://192.168.247.8:9200/_analyze

{

	"analyzer":"ik_max_word",

	"text":"上下班做公交"

}

# Response

{

    "tokens": [

        {

            "token": "上下班",

            "start_offset": 0,

            "end_offset": 3,

            "type": "CN_WORD",

            "position": 0

        },

        {

            "token": "上下",

            "start_offset": 0,

            "end_offset": 2,

            "type": "CN_WORD",

            "position": 1

        },

        {

            "token": "下班",

            "start_offset": 1,

            "end_offset": 3,

            "type": "CN_WORD",

            "position": 2

        },

        {

            "token": "做",

            "start_offset": 3,

            "end_offset": 4,

            "type": "CN_CHAR",

            "position": 3

        },

        {

            "token": "公交",

            "start_offset": 4,

            "end_offset": 6,

            "type": "CN_WORD",

            "position": 4

        }

    ]

}
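The offsets in this response are character positions in the original text, and with ik_max_word several tokens can cover the same characters. A short Python check makes both points visible; the token triples are copied from the response above, and the helper names are made up for this example.

```python
text = "上下班做公交"

# (token, start_offset, end_offset) triples copied from the response above.
tokens = [
    ("上下班", 0, 3),
    ("上下", 0, 2),
    ("下班", 1, 3),
    ("做", 3, 4),
    ("公交", 4, 6),
]


def check_offsets(text, tokens):
    """Return True if every token equals the slice of text its offsets describe."""
    return all(text[start:end] == token for token, start, end in tokens)


def overlapping(tokens):
    """Return pairs of tokens whose offset ranges overlap."""
    pairs = []
    for i, (tok_a, start_a, end_a) in enumerate(tokens):
        for tok_b, start_b, end_b in tokens[i + 1:]:
            if start_a < end_b and start_b < end_a:
                pairs.append((tok_a, tok_b))
    return pairs


print(check_offsets(text, tokens))
print(overlapping(tokens))
```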

Custom Chinese dictionary

1. Open IKAnalyzer.cfg.xml and configure it as follows

	<!-- Users can configure their own extension dictionary here -->

	<entry key="ext_dict">custom.dic</entry>

2. After saving, create custom.dic in the same directory

[esuser@localhost config]$  cat custom.dic

崔神

牛皮
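The dictionary is a plain text file with one word per line, and it should be saved as UTF-8. If the word list comes from somewhere else, a small script can generate the file; the sketch below is one way to do it, with `write_dictionary` being a hypothetical helper and a temporary directory standing in for IK's config directory.

```python
from pathlib import Path
import tempfile


def write_dictionary(path, words):
    """Write one word per line, UTF-8 encoded, skipping blanks and duplicates."""
    seen = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.append(word)
    Path(path).write_text("\n".join(seen) + "\n", encoding="utf-8")
    return seen


# Demo in a temporary directory; on a real node the target would be
# the directory that holds IKAnalyzer.cfg.xml.
with tempfile.TemporaryDirectory() as tmp:
    dic = Path(tmp) / "custom.dic"
    write_dictionary(dic, ["崔神", "牛皮", "崔神", " "])
    print(dic.read_text(encoding="utf-8"))
```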

3. Restart es

4. Test

POST http://192.168.247.8:9200/_analyze

{

	"analyzer":"ik_smart",

	"text":"崔神牛皮"

}

# Response

{

    "tokens": [

        {

            "token": "崔神",

            "start_offset": 0,

            "end_offset": 2,

            "type": "CN_WORD",

            "position": 0

        },

        {

            "token": "牛皮",

            "start_offset": 2,

            "end_offset": 4,

            "type": "CN_WORD",

            "position": 1

        }

    ]

}
