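Using the IK Analyzer with an Elasticsearch Cluster

Installing the IK analysis plugin

ES cluster environment

Three Ubuntu 14.04.2 LTS virtual machines under VMware
JDK 1.8.0_66
Elasticsearch 2.3.1
elasticsearch-jdbc-2.3.1.0
IK analyzer 1.9.1
Cluster name: my-application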

The virtual machines are assigned as follows:

Virtual machine | IP | Node
----|----|----
search1 | 192.168.235.133 | node-1
search2 | 192.168.235.134 | node-2
search3 | 192.168.235.135 | node-3
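
The layout above could map onto a per-node elasticsearch.yml roughly as sketched here. This is not the author's actual configuration: the keys are standard Elasticsearch 2.x settings, but binding network.host to each VM's IP and listing all three machines for unicast discovery are assumptions, and render_node_config is a hypothetical helper name.

```shell
# Hypothetical helper: print a minimal elasticsearch.yml for one node.
# Usage: render_node_config <node-number> <ip>
render_node_config() {
    node_no=$1
    node_ip=$2
    # Cluster name and IPs come from the table above; the rest is assumed.
    cat <<EOF
cluster.name: my-application
node.name: node-${node_no}
network.host: ${node_ip}
discovery.zen.ping.unicast.hosts: ["192.168.235.133", "192.168.235.134", "192.168.235.135"]
EOF
}
```

For example, `render_node_config 2 192.168.235.134` prints the settings for search2, which could then be reviewed and merged into that node's config/elasticsearch.yml.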

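Downloading and building the IK analyzer

Download the IK analyzer zip package from GitHub:

https://github.com/myitroad/elasticsearch-analysis-ik

Unpack it and import it into IntelliJ IDEA as a Maven project.

Building the package

In the IntelliJ IDEA terminal, run the following Maven commands:

mvn clean
mvn compile
mvn package

This produces the following file under F:\workspace_idea\elasticsearch-analysis-ik-master\target\releases:

elasticsearch-analysis-ik-1.9.1.zip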

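Uploading the IK analyzer

Upload the zip package above to one Elasticsearch node (any one will do, for example node-1) and unpack it into:

/home/es/cluster/elasticsearch-2.3.1/plugins/ik

Analysis runs on whichever node holds a shard, so the other nodes will most likely need the same plugin before they can host an index that uses the IK analyzers.

The resulting ik directory looks like this: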

ik
├── commons-codec-1.9.jar
├── commons-logging-1.2.jar
├── config
│   └── ik
│       ├── custom
│       │   ├── ext_stopword.dic
│       │   ├── mydict.dic
│       │   ├── single_word.dic
│       │   ├── single_word_full.dic
│       │   ├── single_word_low_freq.dic
│       │   └── sougou.dic
│       ├── IKAnalyzer.cfg.xml
│       ├── main.dic
│       ├── preposition.dic
│       ├── quantifier.dic
│       ├── stopword.dic
│       ├── suffix.dic
│       └── surname.dic
├── elasticsearch-analysis-ik-1.9.1.jar
├── httpclient-4.4.1.jar
├── httpcore-4.4.1.jar
└── plugin-descriptor.properties
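
Before restarting a node, it can help to confirm that the unpacked directory matches the layout above. The following is a minimal sketch; check_ik_plugin is a hypothetical helper, and the set of files it treats as required is an assumption drawn from the listing, not an official checklist.

```shell
# Hypothetical helper: verify that an unpacked IK plugin directory
# contains the key files from the listing above.
# Usage: check_ik_plugin /path/to/plugins/ik
check_ik_plugin() {
    plugin_dir=$1
    status=0
    for f in plugin-descriptor.properties \
             config/ik/IKAnalyzer.cfg.xml \
             config/ik/main.dic; do
        if [ ! -f "$plugin_dir/$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    # Expand the glob into the positional parameters; if nothing matches,
    # the pattern stays literal and the -f test below fails.
    set -- "$plugin_dir"/elasticsearch-analysis-ik-*.jar
    if [ ! -f "$1" ]; then
        echo "missing: elasticsearch-analysis-ik-*.jar"
        status=1
    fi
    return $status
}
```

Running `check_ik_plugin /home/es/cluster/elasticsearch-2.3.1/plugins/ik` prints one line per missing file and returns a non-zero status if anything is absent.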

Configuring the dictionaries (IK ships with a Sogou dictionary)

Edit $ES_HOME/plugins/ik/config/ik/IKAnalyzer.cfg.xml

and add the following entry:

<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic;custom/sougou.dic</entry>

Then restart node-1.
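
A mistyped path in ext_dict is easy to miss, so a quick check before the restart may save a round trip. The sketch below assumes the entry sits on a single line, as in the example above, and that the listed paths are relative to the directory containing IKAnalyzer.cfg.xml (which matches the directory listing earlier). check_ext_dict is a hypothetical helper.

```shell
# Hypothetical helper: report whether each dictionary named in the
# ext_dict entry of IKAnalyzer.cfg.xml exists on disk.
# Usage: check_ext_dict /path/to/config/ik/IKAnalyzer.cfg.xml
check_ext_dict() {
    cfg=$1
    base=$(dirname "$cfg")
    status=0
    # Take the text between <entry key="ext_dict"> and </entry>, then
    # split the semicolon-separated list into one path per line.
    dicts=$(sed -n 's:.*<entry key="ext_dict">\(.*\)</entry>.*:\1:p' "$cfg" | tr ';' '\n')
    # Word splitting is intentional here; paths with spaces are not supported.
    for d in $dicts; do
        if [ -f "$base/$d" ]; then
            echo "ok: $d"
        else
            echo "missing: $d"
            status=1
        fi
    done
    return $status
}
```

With the configuration above, `check_ext_dict "$ES_HOME/plugins/ik/config/ik/IKAnalyzer.cfg.xml"` should print three "ok:" lines if all dictionaries are in place.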

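Testing IK segmentation

Passing raw Chinese text to the _analyze API in the query string may produce garbled output, so the Chinese text is URL-encoded here.

%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA is the URL-encoded form of "我是中国人" ("I am Chinese").

If "我是中国人" is used directly to test segmentation, the response may come back garbled. The requests in this section target myindex, so that index needs to exist first (it is created with curl -XPUT in the import section further down).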
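
The encoded string can also be produced on the command line instead of with an online converter. The function below is a minimal sketch that percent-encodes every byte of its argument; urlencode is a name chosen here for illustration, and encoding ASCII characters as well is more verbose than necessary but still valid.

```shell
# Percent-encode every byte of a string so it can be used as a URL
# query parameter value.
# Usage: urlencode '我是中国人'
urlencode() {
    # od dumps the bytes as hex, tr strips spaces and newlines,
    # sed prefixes each hex pair with %, and the last tr upper-cases it.
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n' | sed 's/../%&/g' | tr 'a-f' 'A-F'
    echo
}
```

For UTF-8 input, `urlencode '我是中国人'` prints %E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA, which can be embedded in a request such as `curl -XGET "localhost:9200/myindex/_analyze?analyzer=ik_max_word&text=$(urlencode '我是中国人')&pretty"`.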

Using IK's ik_max_word analyzer (finest-grained segmentation)

es@search1:~/cluster/elasticsearch-2.3.1$ curl -XGET 'localhost:9200/myindex/_analyze?analyzer=ik_max_word&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA&pretty'

The response:

{
  "tokens" : [ {
    "token" : "我是",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "我",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "是中国人",
    "start_offset" : 1,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "中国人",
    "start_offset" : 2,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "中国",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "国人",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "人",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 6
  } ]
}

Using IK's ik_smart analyzer (coarsest-grained segmentation)

es@search1:~/cluster/elasticsearch-2.3.1$ curl -XGET 'localhost:9200/myindex/_analyze?analyzer=ik_smart&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA&pretty'

{
  "tokens" : [ {
    "token" : "我是",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "中国人",
    "start_offset" : 2,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 1
  } ]
}
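
When comparing the two analyzers, the token text is usually all that matters. The sketch below pulls just the tokens out of a pretty-printed _analyze response; extract_tokens is a hypothetical helper, and it relies on the `"token" : "..."` line format shown in the responses above rather than on real JSON parsing.

```shell
# Hypothetical helper: read a pretty-printed _analyze response on stdin
# and print one token per line.
extract_tokens() {
    # Matches lines of the form:    "token" : "我是",
    sed -n 's/.*"token" : "\([^"]*\)".*/\1/p'
}
```

Piping the ik_smart request through it, for example `curl -s -XGET 'localhost:9200/myindex/_analyze?analyzer=ik_smart&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA&pretty' | extract_tokens`, would reduce the response above to the two lines 我是 and 中国人.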

Importing MySQL data with the IK analyzer

Creating the myindex index

On node-1, run:

curl -XPUT 'localhost:9200/myindex?pretty'

Write the MySQL-to-ES import script mysql-es-all.sh (it can be stored anywhere):

#!/bin/sh

# Locations of the elasticsearch-jdbc importer's bin and lib directories.
bin=/home/es/cluster/elasticsearch-2.3.1/elasticsearch-jdbc-2.3.1.0/bin
lib=/home/es/cluster/elasticsearch-2.3.1/elasticsearch-jdbc-2.3.1.0/lib

# Pipe the import definition (connection, SQL, index settings, mapping)
# into the JDBC importer. The java command must stay on contiguous
# lines so that the backslash continuations work.
echo '
{
    "type" : "jdbc",
    "jdbc" : {
        "locale" : "zh_CN",
        "statefile" : "statefile.json",
        "timezone" : "GMT+8",
        "autocommit" : true,
        "elasticsearch" : {
            "cluster" : "my-application",
            "host" : "192.168.235.133",
            "port" : "9300"
        },
        "index" : "myindex",
        "type" : "mytype",
        "url" : "jdbc:mysql://10.110.1.47:3306/ispider_data",
        "user" : "root",
        "password" : "xxx",
        "sql" : "select uuid as _id,title,content,release_time from JCY_VOICE_NEWS_INFO",
        "metrics" : {
            "enabled" : true,
            "interval" : "5m"
        },
        "index_settings" : {
            "index" : {
                "number_of_shards" : 2,
                "number_of_replicas" : 2
            }
        },
        "type_mapping": {
            "mytype" : {
                "properties" : {
                    "title" : {
                        "type" : "string",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_max_word",
                        "include_in_all": "true"
                    },
                    "content" : {
                        "type" : "string",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_max_word",
                        "include_in_all": "true"
                    },
                    "release_time":{
                        "type":"date",
                        "store":"no",
                        "format":"YYYY-MM-dd HH:mm:ss",
                        "index":"not_analyzed",
                        "include_in_all":"true"
                    }
                }
            }
        }
    }
}
' | java \
    -cp "${lib}/*" \
    -Dlog4j.configurationFile="${bin}/log4j2.xml" \
    org.xbib.tools.Runner \
    org.xbib.tools.JDBCImporter
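
The script above stores the MySQL password in plain text. One possible variation, sketched here, is to read the connection fields from environment variables. MYSQL_URL, MYSQL_USER and MYSQL_PASSWORD are names invented for this example, not settings known to elasticsearch-jdbc, and render_jdbc_connection is a hypothetical helper.

```shell
# Hypothetical helper: print the "url", "user" and "password" lines of the
# import definition, taking values from the environment.
# No JSON escaping is done, so values must not contain " or \ characters.
render_jdbc_connection() {
    # Defaults mirror the values used in the script above; the password
    # has no default and the call aborts with a message if it is unset.
    cat <<EOF
        "url" : "${MYSQL_URL:-jdbc:mysql://10.110.1.47:3306/ispider_data}",
        "user" : "${MYSQL_USER:-root}",
        "password" : "${MYSQL_PASSWORD:?set MYSQL_PASSWORD first}",
EOF
}
```

To use it, the single-quoted echo in the script would have to become an unquoted here-document (or the definition be assembled in pieces), since shell expansions are not performed inside single quotes.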

Make the script executable and run it:

es@search1:~/cluster/elasticsearch-2.3.1$ chmod +x mysql-es-all.sh

es@search1:~/cluster/elasticsearch-2.3.1$ ./mysql-es-all.sh
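
Once the import finishes, a quick search against one of the IK-analyzed fields is a reasonable sanity check. The helper below only builds the request body for a match query; match_query is a hypothetical name, and the field and search text used in the example are illustrative.

```shell
# Hypothetical helper: print a match query body for one field.
# The text is inserted verbatim, so it must not contain double quotes
# or backslashes (no JSON escaping is performed).
# Usage: match_query <field> <text>
match_query() {
    printf '{"query":{"match":{"%s":"%s"}}}\n' "$1" "$2"
}
```

For example, `curl -XPOST 'localhost:9200/myindex/mytype/_search?pretty' -d "$(match_query title 中国)"` searches the title field, and `curl 'localhost:9200/myindex/_count?pretty'` reports how many documents were indexed.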

References

IK Analysis for Elasticsearch
https://github.com/myitroad/elasticsearch-analysis-ik

[LNMP] Full-text search solution: distributed Elasticsearch + MySQL
http://www.jianshu.com/p/638ff7b848cc

Fixing garbled Chinese text in Elasticsearch (the _analyze process)
http://www.52brt.com/2015/09/19/Elasticsearch中文乱码问题的解决/

Online encoding converter
http://tool.oschina.net/encode?type=4
