Reposted from: https://elasticstack.blog.csdn.net/article/details/114261636

Elasticsearch is a very widely used search engine. It tokenizes text to provide full-text search. In practice, some text contains emoji such as smiley faces or animals. How can we search for those emoji?

🏻 => 🏻, light skin tone, skin tone, type 1–2
🏼 => 🏼, medium-light skin tone, skin tone, type 3
🏽 => 🏽, medium skin tone, skin tone, type 4
🏾 => 🏾, medium-dark skin tone, skin tone, type 5
🏿 => 🏿, dark skin tone, skin tone, type 6
♪ => ♪, eighth, music, note
♭ => ♭, bemolle, flat, music, note
♯ => ♯, dièse, diesis, music, note, sharp
😀 => 😀, face, grin, grinning face
😃 => 😃, face, grinning face with big eyes, mouth, open, smile
😄 => 😄, eye, face, grinning face with smiling eyes, mouth, open, smile
😁 => 😁, beaming face with smiling eyes, eye, face, grin, smile
😆 => 😆, face, grinning squinting face, laugh, mouth, satisfied, smile
😅 => 😅, cold, face, grinning face with sweat, open, smile, sweat
🤣 => 🤣, face, floor, laugh, rofl, rolling, rolling on the floor laughing, rotfl
😂 => 😂, face, face with tears of joy, joy, laugh, tear
🙂 => 🙂, face, slightly smiling face, smile
🙃 => 🙃, face, upside-down
😉 => 😉, face, wink, winking face
🐅 => 🐅, tiger
🐆 => 🐆, leopard
🐴 => 🐴, face, horse
🐎 => 🐎, equestrian, horse, racehorse, racing
🦄 => 🦄, face, unicorn
🦓 => 🦓, stripe, zebra
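🦌 => 🦌, deer

The list above shows a variety of emoji together with their keywords. If we search for grin, for example, documents that contain a matching emoji should be found as well. This article shows how to make emoji searchable.

Installation

If you have not installed Elasticsearch and Kibana yet, see the earlier article “Elastic:菜鸟上手指南” (Elastic: a beginner's guide).

We also need the ICU analysis plugin. For details, see the earlier article “Elasticsearch:ICU 分词器介绍” (Elasticsearch: an introduction to the ICU tokenizer). From the root of the Elasticsearch installation directory, run:

./bin/elasticsearch-plugin install analysis-icu

Once the plugin is installed, Elasticsearch must be restarted for it to take effect. To confirm that the plugin is present, run:

./bin/elasticsearch-plugin list

The two commands print the following:

$ ./bin/elasticsearch-plugin install analysis-icu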
-> Installing analysis-icu
-> Downloading analysis-icu from elastic
[=================================================] 100%
-> Installed analysis-icu
$ ./bin/elasticsearch-plugin list
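analysis-icu

With the ICU plugin installed and Elasticsearch restarted, we can start experimenting.

Searching for emoji

Let's begin with a simple experiment:

GET /_analyze
{
"tokenizer": "icu_tokenizer",
"text": "I live in 🇨🇳 and I'm 👩‍🚀"
}

The request above uses icu_tokenizer to tokenize “I live in 🇨🇳 and I'm 👩‍🚀”. The 👩‍🚀 emoji is special because it is a combination of the more classic 👩 and 🚀 emoji, joined by a zero-width joiner. The flag of China is special too: it is a combination of the regional indicator symbols 🇨 and 🇳. So this is not only about splitting Unicode code points correctly; the tokenizer really understands emoji. The request returns:

{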
"tokens" : [
{
"token" : "I",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "live",
"start_offset" : 2,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "in",
"start_offset" : 7,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : """🇨🇳""",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<EMOJI>",
"position" : 3
},
{
"token" : "and",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "I'm",
"start_offset" : 20,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : """👩‍🚀""",
"start_offset" : 24,
"end_offset" : 29,
"type" : "<EMOJI>",
"position" : 6
}
]
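}

The emoji are clearly tokenized correctly, so they can be searched.

In practice, we usually want more than a search for the emoji characters themselves. Suppose we want to search a document like the one below. (Do not index it yet: if this request runs first, Elasticsearch auto-creates emoji-capable with default settings, and the later PUT /emoji-capable that defines the custom analyzer fails because the index already exists.)

PUT emoji-capable/_doc/1
{
"content": "I like 🐅"
}

The document contains a 🐅, that is, a tiger. We would like a search for tiger to find this document too. How can we do that?

On GitHub there is a project at https://github.com/jolicode/emoji-search/. It contains a directory, https://github.com/jolicode/emoji-search/tree/master/synonyms, which holds synonym files. Download one of them, https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-en.txt, into the config directory of the local Elasticsearch installation:

config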
├── analysis
│ ├── cldr-emoji-annotation-synonyms-en.txt
│ └── emoticons.txt
├── elasticsearch.yml
...

On my machine:

$ pwd
/Users/liuxg/elastic1/elasticsearch-7.11.0/config
$ tree -L 3
.
├── analysis
│ └── cldr-emoji-annotation-synonyms-en.txt
├── elasticsearch.keystore
├── elasticsearch.yml
├── jvm.options
├── jvm.options.d
├── log4j2.properties
├── role_mapping.yml
├── roles.yml
├── users
└── users_roles

The cldr-emoji-annotation-synonyms-en.txt file contains synonyms for common emoji. For example:

😀 => 😀, face, grin, grinning face
😃 => 😃, face, grinning face with big eyes, mouth, open, smile
😄 => 😄, eye, face, grinning face with smiling eyes, mouth, open, smile
😁 => 😁, beaming face with smiling eyes, eye, face, grin, smile
😆 => 😆, face, grinning squinting face, laugh, mouth, satisfied, smile
😅 => 😅, cold, face, grinning face with sweat, open, smile, sweat
....

With the file in place, we create the index as follows:

PUT /emoji-capable
{
"settings": {
"analysis": {
"filter": {
"english_emoji": {
"type": "synonym",
"synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
}
},
"analyzer": {
"english_with_emoji": {
"tokenizer": "icu_tokenizer",
"filter": [
"english_emoji"
]
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "english_with_emoji"
}
}
}
}

Above, we define the english_with_emoji analyzer and assign the same analyzer to the content field. We can try it out with the _analyze API:

GET emoji-capable/_analyze
{
"analyzer": "english_with_emoji",
"text": "I like 🐅"
}

The command returns:

{
"tokens" : [
{
"token" : "I",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "like",
"start_offset" : 2,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : """🐅""",
"start_offset" : 7,
"end_offset" : 9,
"type" : "SYNONYM",
"position" : 2
},
{
"token" : "tiger",
"start_offset" : 7,
"end_offset" : 9,
"type" : "SYNONYM",
"position" : 2
}
]
}

Besides 🐅, the analyzer also emits the token tiger. In other words, a search for either one will find this document. Likewise:

GET emoji-capable/_analyze
{
"analyzer": "english_with_emoji",
"text": "😀 means happy"
}

It returns:

{
"tokens" : [
{
"token" : """😀""",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "face",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "grin",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "grinning",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "means",
"start_offset" : 3,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "face",
"start_offset" : 3,
"end_offset" : 8,
"type" : "SYNONYM",
"position" : 1
},
{
"token" : "happy",
"start_offset" : 9,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 2
}
]
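}

This shows that a search for face, grinning, or grin will also return this document. The extra face token at position 1 most likely comes from the multi-word synonym grinning face, whose second word is placed at the following position.

Now we index the following two documents:

PUT emoji-capable/_doc/1
{
"content": "I like 🐅"
}

PUT emoji-capable/_doc/2
{
"content": "😀 means happy"
}

We then search them like this:

GET emoji-capable/_search
{
"query": {
"match": {
"content": "🐅"
}
}
}

or:

GET emoji-capable/_search
{
"query": {
"match": {
"content": "tiger"
}
}
}

Both queries return the first document:

{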
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.8514803,
"hits" : [
{
"_index" : "emoji-capable",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.8514803,
"_source" : {
"content" : """I like 🐅"""
}
}
]
}
}

Similarly, we can run the following search:

GET emoji-capable/_search
{
"query": {
"match": {
"content": "😀"
}
}
}

or:

GET emoji-capable/_search
{
"query": {
"match": {
"content": "grin"
}
}
}

Both queries return the second document:

{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.8514803,
"hits" : [
{
"_index" : "emoji-capable",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.8514803,
"_source" : {
"content" : """😀 means happy"""
}
}
]
}
}
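
To make the role of the synonym file more concrete, here is a minimal Python sketch of how one explicit mapping line ("emoji => emoji, keyword, ...") expands a token stream. It is only an illustration under simplifying assumptions, not how Elasticsearch implements its synonym filter: the helper names parse_synonym_line, load_synonyms and expand are hypothetical, only single-token left-hand sides are handled, and token positions and offsets are ignored.

```python
def parse_synonym_line(line):
    """Parse one explicit mapping of the form 'lhs => rhs1, rhs2, ...'.

    Returns (lhs, [rhs, ...]), or None for blank lines, '#' comments and
    lines without '=>' (equivalent-synonym lines are out of scope here).
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    lhs, sep, rhs = line.partition("=>")
    if not sep:
        return None
    return lhs.strip(), [term.strip() for term in rhs.split(",") if term.strip()]


def load_synonyms(lines):
    """Build a lookup table {lhs: [rhs, ...]} from synonym file lines."""
    table = {}
    for line in lines:
        parsed = parse_synonym_line(line)
        if parsed:
            lhs, terms = parsed
            table[lhs] = terms
    return table


def expand(tokens, table):
    """Replace every mapped token by the distinct words of its synonyms."""
    out = []
    for token in tokens:
        if token not in table:
            out.append(token)
            continue
        expanded = []
        for term in table[token]:
            # A multi-word synonym such as "grinning face" contributes
            # each of its words; duplicates within one expansion are dropped.
            for word in term.split():
                if word not in expanded:
                    expanded.append(word)
        out.extend(expanded)
    return out


sample_lines = [
    "# two lines in the style of cldr-emoji-annotation-synonyms-en.txt",
    "😀 => 😀, face, grin, grinning face",
    "🐅 => 🐅, tiger",
]
table = load_synonyms(sample_lines)
print(expand(["I", "like", "🐅"], table))        # ['I', 'like', '🐅', 'tiger']
print(expand(["😀", "means", "happy"], table))
```

Because the emoji itself appears on the right-hand side of each mapping, the original character stays in the token stream next to its keywords, which is why both the emoji and the word find the same document.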

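
The tokenizer experiment earlier in the article relies on two composed emoji: a flag built from two regional indicator symbols and an emoji built with a zero-width joiner. The Python sketch below lists the code points behind such sequences and counts their UTF-16 code units, which is the unit used by start_offset and end_offset in the _analyze output. It assumes the flag of China (regional indicators C and N) and the woman astronaut sequence (woman, zero-width joiner, rocket); the helper names describe and utf16_length are hypothetical.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def utf16_length(text):
    """Count UTF-16 code units, the unit used by the token offsets."""
    return len(text.encode("utf-16-le")) // 2


# Assumed inputs, written with escapes so the composition is explicit.
flag = "\U0001F1E8\U0001F1F3"             # regional indicator C + regional indicator N
astronaut = "\U0001F469\u200D\U0001F680"  # woman + zero-width joiner + rocket

for label, sequence in (("flag", flag), ("astronaut", astronaut)):
    print(label, utf16_length(sequence), "UTF-16 code units")
    for code_point, name in describe(sequence):
        print("   ", code_point, name)
```

The flag takes 4 code units and the joined sequence takes 5, which matches the offset spans 10 to 14 and 24 to 29 reported for the two EMOJI tokens.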