Configuring a Custom Analyzer in an Elasticsearch Mapping

July 19, 2022, 12:13:33

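Multi-fields

By default, dynamic mapping indexes a string as a text field with a keyword sub-field.

When should you use multi-fields?

  • Exact matching on values such as company names
  • Analyzing the same content with different analyzers
    • Different languages
    • Searching on a pinyin field
    • A field can also use different analyzers for indexing and for searching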
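
The cases above can be sketched as a single mapping. This is a minimal illustration, assuming Elasticsearch 7.x or later (typeless mappings). The index name companies, the field names company_name and description, the sub-field names keyword and english, and the value "Elastic NV" are all hypothetical. Pairing standard at index time with simple at search time is only there to show the search_analyzer parameter; it is not a recommendation.

```json
// Hypothetical index: one text field with two sub-fields, one field with a separate search analyzer
PUT companies
{
  "mappings": {
    "properties": {
      "company_name": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" },
          "english": { "type": "text", "analyzer": "english" }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple"
      }
    }
  }
}

// Exact match against the un-analyzed keyword sub-field
GET companies/_search
{
  "query": {
    "term": { "company_name.keyword": "Elastic NV" }
  }
}
```

Each sub-field is addressed as parent.sub-field (for example company_name.english), so the same source value can be queried in several ways without being stored twice in the document.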

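Exact values vs. full text

The difference: exact values are not analyzed. They are indexed as a single term without tokenization, whereas full text is split into terms by an analyzer.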
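
One way to see the difference is to run the same text through the built-in keyword analyzer (which treats the input as an exact value) and the standard analyzer (full text). The sample text is an assumption chosen for illustration.

```json
// Exact value: the whole input should come back as one token, "Elasticsearch Server"
POST _analyze
{
  "analyzer": "keyword",
  "text": "Elasticsearch Server"
}

// Full text: the input is split and lowercased into "elasticsearch" and "server"
POST _analyze
{
  "analyzer": "standard",
  "text": "Elasticsearch Server"
}
```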

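Introduction to custom analyzers

A custom analyzer is assembled from three kinds of components, applied in this order: character filters, a tokenizer, and token filters.

Character Filters

Character filters process the text before it reaches the tokenizer, for example by adding, removing, or replacing characters. Multiple character filters can be configured. They affect the position and offset information that the tokenizer produces.

Built-in character filters:

  • HTML strip (html_strip): removes HTML tags
  • Mapping (mapping): replaces strings
  • Pattern replace (pattern_replace): replaces text matched by a regular expression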

Examples

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}

// Use a mapping char filter to replace characters
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}

// Replace emoticons
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [":) => happy", ":( => sad"]
    }
  ],
  "text": "my today :) ,but :( !!!"
}

// Replace with a regular expression
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.baidu.com"
}
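
Since several character filters can be configured at once, the following sketch chains two of the filters shown above. They are applied in the order listed, so the HTML tags are stripped first and the hyphen is then replaced. The sample text is an assumption.

```json
// Chain two char filters: html_strip runs first, then the mapping filter
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    "html_strip",
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "<b>123-456</b>"
}
```

The start_offset and end_offset values in the response refer to the original text, which is one place where the effect of character filters on offsets becomes visible.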

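Tokenizer

A tokenizer splits the original text into terms (tokens) according to a set of rules.

Built-in Elasticsearch tokenizers

whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy

You can also develop a plugin in Java to implement your own tokenizer.

Example

// Split a file path into its hierarchy levels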
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/a/b/c/d/e"
}
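
To compare two of the other built-in tokenizers, the sketch below sends the same text through standard and uax_url_email. The sample sentence and the email address in it are assumptions for illustration.

```json
// standard: the email address is broken into several tokens
POST _analyze
{
  "tokenizer": "standard",
  "text": "Email me at john.smith@global-international.com"
}

// uax_url_email: works like standard, but keeps URLs and email addresses as single tokens
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}
```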

Token Filters

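Token filters add, modify, or remove the terms (tokens) output by the tokenizer.

Built-in token filters

lowercase / stop / synonym (adds synonyms)

Examples

// whitespace tokenizer + stop filter: split on whitespace, then remove stop words such as "in", "the", and "on"
// (the capitalized "The" is kept, because the stop filter is case-sensitive by default)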
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}

// With lowercase placed before stop, "The" is lowercased first and then removed as a stop word
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase","stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
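
The synonym filter from the list above has no example yet, so here is a minimal sketch. The index name synonym_demo, the filter name my_synonyms, the analyzer name synonym_analyzer, and the synonym rule itself are hypothetical. The filter is defined in the index settings and then exercised through the index's _analyze endpoint.

```json
// Hypothetical index with a synonym token filter that treats "quick" and "fast" as equivalent
PUT synonym_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["quick, fast"]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  }
}

// The response should contain both "quick" and "fast" at the same position
POST synonym_demo/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "A quick fox"
}
```

Placing lowercase before the synonym filter means the synonym rules only need to be written in lowercase.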

Setting up a custom analyzer

// Create an index whose settings define a custom analyzer
PUT my_inx
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_inx/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
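
An analyzer defined in the index settings only takes effect once a field in the mapping refers to it. The sketch below assumes Elasticsearch 7.x or later (typeless mappings) and a hypothetical text field named comment; it adds that field to the my_inx index and assigns my_custom_analyzer to it.

```json
// Hypothetical field "comment" that uses the custom analyzer at index and search time
PUT my_inx/_mapping
{
  "properties": {
    "comment": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}

// Check how the field analyzes text, resolving the analyzer through the mapping
POST my_inx/_analyze
{
  "field": "comment",
  "text": "I'm a :) person, and you?"
}
```

The analyzer of an existing field cannot be changed later through the mapping API, so it is usually chosen when the field is first created.
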
  • Author: gnufre
  • Original link: https://blog.csdn.net/gnufre/article/details/106230967
    Updated: July 19, 2022, 12:13:33