Elasticsearch n-gram analyzer: searching on partial MAC addresses

Date: 2015-04-08 17:26:38

Tags: ruby-on-rails elasticsearch n-gram

Using Elasticsearch (with Rails), I am trying, so far without success, to index and search a field containing MAC addresses that use hyphens as separators:

24-A4-3C-02-37-26

Searching for the whole MAC address (not analyzed) works fine, but I cannot get partial matching to work with a custom analyzer.

I have tested many options, including adjusting the min/max gram sizes, without success.

With the mapping, settings and query shown further down, I get the following results:

Box.search(q: "24-A4-3C-02-37-26").results.map(&:macaddress)

This produces a strange result:

["24-A4-3C-02-37-xx", "DC-9F-DB-F6-B2-xx", "C4-10-8A-13-53-xx", "C4-10-8A-13-54-xx", "C4-10-8A-13-52-xx"]

If I drop the last octet ("24-A4-3C-02-37"), I get:

["DC-9F-DB-F6-B2-xx", "C4-10-8A-13-53-xx", "C4-10-8A-13-52-xx"]

Which is wrong.

I have checked the analyzer using the _analyze API, and it looks just fine:

curl "localhost:9205/boxes/_analyze?analyzer=ngram_analyzer&pretty=true" -d "24-A4-3C-02-37-26"

Which yields:

{
  "tokens" : [ {
    "token" : "24",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "24-",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "24-A",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 3
  }, {
  .........
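
The token list above is consistent with edge n-gram tokenization when `token_chars` is not set: the whole value is treated as one string and every prefix between `min_gram` and `max_gram` characters long is emitted. A minimal Ruby sketch of that behaviour follows; `edge_ngrams` is a hypothetical helper written for illustration, not part of Elasticsearch or any client library.

```ruby
# Sketch of edge n-gram tokenization: emit every prefix of the input whose
# length lies between min_gram and max_gram (inclusive).
# `edge_ngrams` is a hypothetical helper, not an Elasticsearch API.
def edge_ngrams(text, min_gram: 2, max_gram: 17)
  upper = [max_gram, text.length].min
  (min_gram..upper).map { |len| text[0, len] }
end

tokens = edge_ngrams("24-A4-3C-02-37-26")
puts tokens.first(3).inspect  # ["24", "24-", "24-A"]
puts tokens.last              # the full 17-character address
```

Note that the hyphens stay inside the tokens, so a query term only lines up with one of them if it reaches the index as an unsplit prefix of the address.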

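So I can only guess that something is wrong with my actual query. I even tried replacing the hyphens with their ASCII codes, and escaping them.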

@search_definition[:query] = {
  multi_match: {
    query: options[:q],
    fields: [
      "macaddress.ngram",
      "macaddress.sortable^5",
        ...

My settings are as follows:

settings analysis: {
  analyzer: {
    ngram_analyzer: {
      type: 'custom',
      tokenizer: 'my_tokenizer',
    }
  },
  tokenizer: {
    my_tokenizer: {
      type: "edgeNGram",
      min_gram: 2,
      max_gram: 17,
      # token_chars: [ "letter", "digit" ]
    }
  }
} do

  mapping do
    indexes :macaddress, type: 'multi_field', fields: {
      raw: { type: "string" },
      sortable: { type: "string", index: "not_analyzed" },
      ngram: { type: "string", index_analyzer: :ngram_analyzer } #, search_analyzer: 'keyword' }
    }
  end
end

Can anyone suggest how I can get this to work?

1 Answer:

Answer 0 (score: 1)

I have verified this with the following settings:

PUT test
    {
        "settings" : {
            "analysis" : {
                "analyzer" : {
                    "ngram_analyzer" : {
                        "type": "custom",
                        "tokenizer" : "my_tokenizer"
                    }
                },
                "tokenizer" : {
                    "my_tokenizer" : {
                        "type" : "edgeNGram",
                        "min_gram" : "2",
                        "max_gram" : "17"
                    }
                }
            }
        },
        "mappings": {
          "boxes":{
            "properties": {
              "macaddress":{
                "type": "multi_field",
                "fields": {
                  "raw":{
                    "type": "string"
                  },
                  "sortable":{
                    "type": "string",
                    "index": "not_analyzed"
                  },
                  "ngram":{
                    "type": "string",
                    "index_analyzer": "ngram_analyzer"
                  }
                }
              }
            }
          }
        }
    }

And some sample data:

PUT test/boxes/1
{
  "macaddress":"24-A4-3C-02-37-26"
}
PUT test/boxes/2
{
  "macaddress":"24-A4-3C-02-37-54"
}
PUT test/boxes/3
{
  "macaddress":"24-A4-3C-02-38-23"
}
PUT test/boxes/4
{
  "macaddress":"34-A4-3C-02-38-23"
}

Search query:

GET test/boxes/_search
{
  "query": {
    "multi_match": {
      "query": "24-A4-3C-02",
      "fields": ["macaddress.ngram",
      "macaddress.sortable^5"]
    }
  }
}
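
For the Rails side, the same request body can be built as a plain Ruby Hash. The sketch below also sets the `analyzer` option of `multi_match` to `keyword`, on the assumption that the query string should reach the index as a single unsplit token; that choice is a suggestion to try, not something verified here. `mac_search_definition` is a hypothetical helper name.

```ruby
require "json"

# Hypothetical helper that only builds the search body; sending it to
# Elasticsearch is left out.
def mac_search_definition(term)
  {
    query: {
      multi_match: {
        query: term,
        # Assumption: "keyword" keeps the query string as one token, so
        # "24-A4-3C-02" is compared as-is against the indexed prefixes.
        analyzer: "keyword",
        fields: ["macaddress.ngram", "macaddress.sortable^5"]
      }
    }
  }
end

puts JSON.pretty_generate(mac_search_definition("24-A4-3C-02"))
```

The printed JSON can be pasted into the GET request above to compare results with and without the `analyzer` line.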

The result is:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 0.047079325,
      "hits": [
         {
            "_index": "test",
            "_type": "boxes",
            "_id": "1",
            "_score": 0.047079325,
            "_source": {
               "macaddress": "24-A4-3C-02-37-26"
            }
         },
         {
            "_index": "test",
            "_type": "boxes",
            "_id": "2",
            "_score": 0.047079325,
            "_source": {
               "macaddress": "24-A4-3C-02-37-54"
            }
         },
         {
            "_index": "test",
            "_type": "boxes",
            "_id": "3",
            "_score": 0.047079325,
            "_source": {
               "macaddress": "24-A4-3C-02-38-23"
            }
         }
      ]
   }
}
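
One possible reading of these hits, offered as a hypothesis rather than a trace of what Elasticsearch did: the mapping sets only `index_analyzer`, so the query string may be analyzed differently from the indexed value. The Ruby sketch below simulates that mismatch for the four sample documents. The index side uses edge n-grams as configured; the query side is a crude stand-in (lowercase, split on non-alphanumerics) for a default analyzer, which is an assumption and not the real implementation.

```ruby
# Index side: edge n-grams (prefixes of length 2..17) of the raw address.
def index_tokens(mac, min_gram: 2, max_gram: 17)
  (min_gram..[max_gram, mac.length].min).map { |len| mac[0, len] }
end

# Query side: ASSUMPTION, a rough approximation of a default analyzer that
# lowercases the text and splits on anything that is not a letter or digit.
def query_tokens(text)
  text.downcase.split(/[^a-z0-9]+/).reject(&:empty?)
end

docs = {
  1 => "24-A4-3C-02-37-26",
  2 => "24-A4-3C-02-37-54",
  3 => "24-A4-3C-02-38-23",
  4 => "34-A4-3C-02-38-23"
}

query = query_tokens("24-A4-3C-02")
docs.each do |id, mac|
  shared = index_tokens(mac) & query
  puts "#{id}: #{shared.inspect}"
end
```

Under these assumptions, documents 1 to 3 share only the token `24` with the query and document 4 shares nothing, which would be consistent with three hits carrying identical scores. If that is what happens, a query such as "A4-3C" would not be a reliable partial match, so it may be worth checking the query-time analysis with the _analyze API before relying on this setup.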