Elasticsearch cross_fields with edge n-gram analyzer

Asked: 2016-05-12 18:59:28

Tags: java amazon-web-services elasticsearch full-text-search search-engine

I have 999 documents that I am using to experiment with Elasticsearch.

My type mapping has an analyzed field f4 with the following analyzer configuration:

  "myNGramAnalyzer" => [
       "type" => "custom",
        "char_filter" => ["html_strip"],
        "tokenizer" => "standard",
        "filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"]
  ]

My filter is defined as follows:

  "filter" => [
        "ngram_filter" => [
            "type" => "edgeNGram",
            "min_gram" => "2",
            "max_gram" => "20"
        ]
  ]
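To see why this setup hurts relevance, it helps to look at what the `edgeNGram` filter actually emits. Below is a minimal Python sketch of edge n-gram generation with `min_gram=2`, `max_gram=20` (an illustration only, not Elasticsearch's implementation):

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Emit leading-edge n-grams of a token, like Elasticsearch's edgeNGram filter."""
    token = token.lower()  # the analyzer lowercases before the ngram filter runs
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

# Every "ProjN" value shares the same leading grams.
print(edge_ngrams("Proj1"))   # ['pr', 'pro', 'proj', 'proj1']
print(edge_ngrams("Proj42"))  # ['pr', 'pro', 'proj', 'proj4', 'proj42']
```

Since every f4 value begins with "Proj", all 999 documents end up sharing the indexed terms `pr`, `pro`, and `proj`.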

Field f4 holds values like "Proj1", "Proj2", "Proj3", and so on.

Now when I search with cross_fields for the string "proj1", I expect the document with "Proj1" to come back at the top of the response with the highest score, but it does not. The content of all the remaining documents is nearly identical.

I also don't understand why it matches all 999 documents.

Here is my search:

{
    "index": "myindex",
    "type": "mytype",
    "body": {
        "query": {
            "multi_match": {
                "query": "proj1",
                "type": "cross_fields",
                "operator": "and",
                "fields": "f*"
            }
        },
        "filter": {
            "term": {
                "deleted": "0"
            }
        }
    }
}

My search response is:

{
    "took": 12,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 999,
        "max_score": 1,
        "hits": [{
            "_index": "myindex",
            "_type": "mytype",
            "_id": "42",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "125650","f3": "BH.1511AI.001",
                "f4": "Proj42",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        }, {
            "_index": "myindex",
            "_type": "mytype",
            "_id": "47",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "137946","f3": "BH.152096.001",
                "f4": "Proj47",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        }, 
        //.......
        //.......
        //MANY RECORDS IN BETWEEN HERE
        //.......
        //.......
        {
            "_index": "myindex",
            "_type": "mytype",
            "_id": "1",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "142095","f3": "BH.705215.001",
                "f4": "Proj1",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        //.......
        //.......
        //MANY RECORDS IN BETWEEN HERE
        //.......
        //.......
        }]
    }
}

What am I doing wrong or missing? (Apologies for the lengthy question; I tried to leave out as much unnecessary code as possible.)

EDIT:

Term vector response:

{
    "_index": "myindex",
    "_type": "mytype",
    "_id": "10",
    "_version": 1,
    "found": true,
    "took": 9,
    "term_vectors": {
        "f4": {
            "field_statistics": {
                "sum_doc_freq": 5886,
                "doc_count": 999,
                "sum_ttf": 5886
            },
            "terms": {
                "pr": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "pro": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj1": {
                    "doc_freq": 111,
                    "ttf": 111,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj10": {
                    "doc_freq": 11,
                    "ttf": 11,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                }
            }
        }
    }
}
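The term vectors above explain the flat scores: the grams `pr`, `pro`, and `proj` have a `doc_freq` of 999 out of 999 documents, so their inverse document frequency is near zero and they carry almost no signal; `proj1` (doc_freq 111) is the only discriminating term. A rough sketch using the classic Lucene TF-IDF formula (an approximation; the exact scoring model depends on your Elasticsearch version):

```python
import math

def idf(doc_freq, num_docs):
    # Classic Lucene TF-IDF similarity: idf = 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

num_docs = 999
print(idf(999, num_docs))  # ~1.0 -> 'pr', 'pro', 'proj' barely affect ranking
print(idf(111, num_docs))  # ~3.2 -> 'proj1' is the only term that discriminates
```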

EDIT 2:

Mapping for field f4:

"f4" : {
    "type" : "string",
    "index_analyzer" : "myNGramAnalyzer",
    "search_analyzer" : "standard"
}

I have updated the query to use the standard analyzer at search time, which improved the results, but they are still not what I expect.

Instead of 999 (all documents), it now returns 111 documents, such as "Proj1", "Proj11", "Proj111", …, "Proj181", … and so on.

Still, "Proj1" sits somewhere in the middle of the results rather than at the top.

2 Answers:

Answer 0 (score: 1)

There is no index_analyzer mapping parameter anymore (it was removed in Elasticsearch 2.0). Among the mapping parameters you can use analyzer and search_analyzer. Try the following steps to get it working.

Create myindex with the analyzer settings:

PUT /myindex
{
   "settings": {
     "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "edge_ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "myNGramAnalyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "char_filter": "html_strip",
               "filter": [
                  "lowercase",
                  "standard",
                  "asciifolding",
                  "stop",
                  "snowball",
                  "ngram_filter"
               ]
            }
         }
      }
   }
}

Add the mapping to mytype (for simplicity I have only mapped the relevant fields):

PUT /myindex/_mapping/mytype
{
   "properties": {
      "f1": {
         "type": "string"
      },
      "f4": {
         "type": "string",
         "analyzer": "myNGramAnalyzer",
         "search_analyzer": "standard"
      },
      "deleted": {
         "type": "string"
      }
   }
}
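The key point of this mapping is the asymmetry: the document value is expanded into edge n-grams at index time, while the query string is left whole at search time. A small Python sketch (an illustration, not Elasticsearch's actual matching code) of why the query `proj1` then matches `Proj12` on a single exact term instead of on every shared prefix:

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Simulate the index-time edgeNGram expansion."""
    token = token.lower()
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

# Index time: the document value is expanded into edge n-grams.
indexed_terms = set(edge_ngrams("Proj12"))  # {'pr', 'pro', 'proj', 'proj1', 'proj12'}

# Search time: the standard analyzer only tokenizes and lowercases; no n-grams.
query_terms = {"proj1"}

# Matching now happens on the single term 'proj1', not on every prefix.
print(query_terms & indexed_terms)  # {'proj1'}
```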

Index some data:

PUT myindex/mytype/1
{
    "f1":"396",
    "f4":"Proj12" ,
    "deleted": "0"
}

PUT myindex/mytype/2
{
    "f1":"42",
    "f4":"Proj22" ,
    "deleted": "1"
}

Now try the query:

GET myindex/mytype/_search
{
   "query": {
      "bool": {
         "must": {
            "multi_match": {
               "query": "proj1",
               "type": "cross_fields",
               "operator": "and",
               "fields": "f*"
            }
         },
         "filter": {
            "term": {
               "deleted": "0"
            }
         }
      }
   }
}

It should return document #1. It works for me in Sense; I am using an Elasticsearch 2.x version.

Hope this helps :)

Answer 1 (score: 0)

After spending several hours searching for a solution, I finally got it working.

So I kept everything exactly as described in my question, using the n-gram analyzer while indexing the data. The only change I needed was to turn my search query into a bool query that combines the existing multi-match query with a query on the _all field.

Now my search for "Proj1" returns results in the order "Proj1", "Proj121", "Proj11", and so on.

Although this is not the exact order "Proj1", "Proj11", "Proj121", etc., it is still very close to the result I wanted.
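The answer does not show its final query, but the change it describes can be sketched as a client-style search body like the question uses. This is a hypothetical reconstruction (the `_all` clause and field names are assumptions based on the description above), not the author's exact code:

```python
# Hypothetical reconstruction: wrap the original cross_fields multi_match
# in a bool query together with a match on the _all field.
body = {
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "proj1",
                        "type": "cross_fields",
                        "operator": "and",
                        "fields": "f*",
                    }
                },
                # Additional clause on _all, which is analyzed with the
                # default analyzer, so it rewards a whole-term match.
                {"match": {"_all": "proj1"}},
            ]
        }
    }
}
```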