How do I strip apostrophes?

Date: 2017-02-02 14:13:31

Tags: elasticsearch

As defined here:

The apostrophe token filter strips all characters after an apostrophe, including the apostrophe itself.
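To make the quoted definition concrete, here is a minimal Python sketch of what it describes for tokens that have already been produced by a tokenizer. It is an illustration only, not the Lucene implementation; the names `APOSTROPHES`, `strip_after_apostrophe` and `apply_filter` are hypothetical, and treating both the straight apostrophe and the typographic one (U+2019) as apostrophes is an assumption.

```python
# Illustrative sketch only: mimics the behavior described in the definition above.
# Assumption: both "'" and the typographic apostrophe U+2019 count as apostrophes.
APOSTROPHES = ("'", "\u2019")


def strip_after_apostrophe(token: str) -> str:
    """Return the token cut at its first apostrophe, or unchanged if it has none."""
    positions = [i for i in (token.find(ch) for ch in APOSTROPHES) if i != -1]
    return token[:min(positions)] if positions else token


def apply_filter(tokens):
    """Apply the truncation to every token in a token stream."""
    return [strip_after_apostrophe(t) for t in tokens]


print(apply_filter(["apple", "banana'orange", "kiwi"]))
# prints ['apple', 'banana', 'kiwi']
```

The filter only ever sees whole tokens, so it depends on a tokenizer having run before it.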

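I'm trying to strip apostrophes and the characters that follow them. When there is only a single apostrophe, the filter doesn't strip anything at all. And when there are multiple consecutive apostrophes, the word in question gets split, but nothing after the apostrophes is removed. Clearly, I must be missing something.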

Input with a single apostrophe:

POST localhost:9200/_analyze?
{
    "filter": ["apostrophe"],
    "text": "apple banana'orange kiwi"
}

Output:

{
  "tokens": [
    {
      "token": "apple",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "banana'orange",
      "start_offset": 6,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "kiwi",
      "start_offset": 20,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Input with multiple consecutive apostrophes:

{
    "filter": ["apostrophe"],
    "text": "apple banana''orange kiwi"
}

Output:

{
  "tokens": [
    {
      "token": "apple",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "banana",
      "start_offset": 6,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "orange",
      "start_offset": 14,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "kiwi",
      "start_offset": 21,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

1 answer:

Answer 0 (score: 1)

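If you specify only the token filter, it won't work: the standard analyzer kicks in and tokenizes your input, and the apostrophe token filter is ignored. If you add the explain parameter, you get more information about what is going on: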

curl -XPOST 'localhost:9200/_analyze?pretty&filter=apostrophe&explain' -d "apple banana'orange kiwi"
{
  "detail" : {
    "custom_analyzer" : false,
    "analyzer" : {
      "name" : "standard",
      "tokens" : [ {
        "token" : "apple",
        "start_offset" : 0,
        "end_offset" : 5,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "bytes" : "[61 70 70 6c 65]",
        "positionLength" : 1
      }, {
        "token" : "banana'orange",
        "start_offset" : 6,
        "end_offset" : 19,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "bytes" : "[62 61 6e 61 6e 61 27 6f 72 61 6e 67 65]",
        "positionLength" : 1
      }, {
        "token" : "kiwi",
        "start_offset" : 20,
        "end_offset" : 24,
        "type" : "<ALPHANUM>",
        "position" : 2,
        "bytes" : "[6b 69 77 69]",
        "positionLength" : 1
      } ]
    }
  }
}

As you can see, the above simply uses the standard analyzer.
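If you want to check this programmatically rather than by eye, a small Python sketch along these lines could summarize an explain response. It relies only on the fields visible in the explain output in this answer (`custom_analyzer`, `analyzer`, `tokenizer`, `tokenfilters`); the helper name `describe_analysis` and the trimmed sample dictionaries are hypothetical.

```python
def describe_analysis(response: dict) -> str:
    """Summarize which analysis components an _analyze explain response reports."""
    detail = response["detail"]
    if not detail.get("custom_analyzer"):
        # A built-in analyzer handled the text; requested token filters were not applied.
        return "analyzer only: " + detail["analyzer"]["name"]
    filters = [f["name"] for f in detail.get("tokenfilters", [])]
    return "tokenizer: {}, filters: {}".format(
        detail["tokenizer"]["name"], ", ".join(filters) or "none"
    )


# Trimmed-down samples shaped like the explain output in this answer (token lists omitted).
filter_only = {"detail": {"custom_analyzer": False, "analyzer": {"name": "standard"}}}
with_tokenizer = {
    "detail": {
        "custom_analyzer": True,
        "tokenizer": {"name": "standard"},
        "tokenfilters": [{"name": "apostrophe"}],
    }
}

print(describe_analysis(filter_only))     # analyzer only: standard
print(describe_analysis(with_tokenizer))  # tokenizer: standard, filters: apostrophe
```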

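To fix this, you just need to specify at least a tokenizer. With the standard tokenizer it works as expected: the explain output below now reports a custom analyzer made of the standard tokenizer and the apostrophe token filter, and the apostrophe and everything after it are stripped.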

curl -XPOST 'localhost:9200/_analyze?pretty&tokenizer=standard&filter=apostrophe&explain' -d "apple banana'orange kiwi"
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ {
        "token" : "apple",
        "start_offset" : 0,
        "end_offset" : 5,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "bytes" : "[61 70 70 6c 65]",
        "positionLength" : 1
      }, {
        "token" : "banana'orange",
        "start_offset" : 6,
        "end_offset" : 19,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "bytes" : "[62 61 6e 61 6e 61 27 6f 72 61 6e 67 65]",
        "positionLength" : 1
      }, {
        "token" : "kiwi",
        "start_offset" : 20,
        "end_offset" : 24,
        "type" : "<ALPHANUM>",
        "position" : 2,
        "bytes" : "[6b 69 77 69]",
        "positionLength" : 1
      } ]
    },
    "tokenfilters" : [ {
      "name" : "apostrophe",
      "tokens" : [ {
        "token" : "apple",
        "start_offset" : 0,
        "end_offset" : 5,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "bytes" : "[61 70 70 6c 65]",
        "positionLength" : 1
      }, {
        "token" : "banana",
        "start_offset" : 6,
        "end_offset" : 19,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "bytes" : "[62 61 6e 61 6e 61]",
        "positionLength" : 1
      }, {
        "token" : "kiwi",
        "start_offset" : 20,
        "end_offset" : 24,
        "type" : "<ALPHANUM>",
        "position" : 2,
        "bytes" : "[6b 69 77 69]",
        "positionLength" : 1
      } ]
    } ]
  }
}
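The same fix can presumably be applied to the request-body form used in the question by adding a tokenizer key next to the filter. The Python sketch below only builds such a request; the body keys mirror the question's body (`filter`, `text`) plus a `tokenizer` entry, and whether your Elasticsearch version accepts exactly these keys in the body is an assumption to verify. The helper name `build_analyze_request` is hypothetical.

```python
import json
import urllib.request


def build_analyze_request(text, host="http://localhost:9200"):
    """Build (but do not send) a POST to _analyze with a tokenizer and a token filter."""
    body = {
        "tokenizer": "standard",    # without a tokenizer, the standard analyzer takes over
        "filter": ["apostrophe"],   # same key as in the question's request body
        "text": text,
    }
    return urllib.request.Request(
        host + "/_analyze?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request("apple banana'orange kiwi")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it, a node must be listening on localhost:9200:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```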