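Elasticsearch: multiple fields with partial and exact match

Date: 2015-04-22 19:01:21

Tags: elasticsearch

Our Account model has first_name, last_name, and ssn (social security number) fields.

I want to do partial matching on first_name and last_name, but exact matching on ssn. So far I have this: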

settings analysis: {
  filter: {
    # 3..50 character n-grams for partial matching on names
    substring: {
      type: "nGram",
      min_gram: 3,
      max_gram: 50
    },
    # exactly 9 characters, intended for the ssn
    ssn_string: {
      type: "nGram",
      min_gram: 9,
      max_gram: 9
    }
  },
  analyzer: {
    index_ngram_analyzer: {
      type: "custom",
      tokenizer: "standard",
      filter: ["lowercase", "substring"]
    },
    search_ngram_analyzer: {
      type: "custom",
      tokenizer: "standard",
      filter: ["lowercase", "substring"]
    },
    # defined here, but not referenced by the mapping below
    ssn_ngram_analyzer: {
      type: "custom",
      tokenizer: "standard",
      filter: ["ssn_string"]
    }
  }
}

mapping do
  [:first_name, :last_name].each do |attribute|
    indexes attribute, type: 'string',
                       index_analyzer: 'index_ngram_analyzer',
                       search_analyzer: 'search_ngram_analyzer'
  end

  # ssn is indexed as a single, unanalyzed term
  indexes :ssn, type: 'string', index: 'not_analyzed'
end

My search looks like this:

query: {
  multi_match: {
    fields: ["first_name", "last_name", "ssn"],
    query: query,
    type: "cross_fields",
    operator: "and"
  }
}

This works:

 Account.search("erik").records.to_a

And so does this (for Erik Smith):

 Account.search("erik smi").records.to_a

As does the ssn on its own:

 Account.search("111112222").records.to_a

But this does not:

 Account.search("erik 111112222").records.to_a

Am I indexing or querying incorrectly?

Thanks for any help!

1 Answer:

Answer 0 (score: 1)

Does it have to be a single query string? If not, I would do something like this:

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "ngram_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "ngram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "_all": {
            "enabled": true,
            "index_analyzer": "ngram_analyzer",
            "search_analyzer": "standard"
         },
         "properties": {
            "first_name": {
               "type": "string",
               "include_in_all": true
            },
            "last_name": {
               "type": "string",
               "include_in_all": true
            },
            "ssn": {
               "type": "string",
               "index": "not_analyzed",
               "include_in_all": false
            }
         }
      }
   }
}

Notice the use of the _all field. I included first_name and last_name in _all, but not ssn. Also, ssn is not analyzed at all, since I want to do exact matches against it.
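
To see why partial terms such as "eri" and "smi" can match, here is a rough Ruby simulation of what the ngram_analyzer above produces at index time. It is only a sketch under stated assumptions: splitting on whitespace stands in for the standard tokenizer, and the method names ngrams and index_tokens are made up for illustration; they are not part of Elasticsearch or any gem.

```ruby
# Rough simulation of a "lowercase" + "ngram" filter chain (not Elasticsearch itself).
def ngrams(token, min_gram, max_gram)
  grams = []
  (0...token.length).each do |start|
    (min_gram..max_gram).each do |len|
      break if start + len > token.length
      grams << token[start, len]
    end
  end
  grams
end

# Whitespace splitting approximates the standard tokenizer for simple names.
def index_tokens(text, min_gram: 2, max_gram: 20)
  text.downcase.split.flat_map { |word| ngrams(word, min_gram, max_gram) }.uniq
end

# Only first_name and last_name go into _all, so only they are n-grammed.
tokens = index_tokens("Erik Smith")

# The search side uses the standard analyzer: lowercased whole words, no n-grams.
search_terms = "eri smi".downcase.split

# With operator "and", every search term must be among the indexed tokens.
puts search_terms.all? { |term| tokens.include?(term) }
```

Because the search analyzer does not produce n-grams, the query terms are compared as whole words against the indexed n-grams, which keeps short fragments of the query from matching unrelated documents.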

I indexed a couple of documents for illustration:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"first_name":"Erik","last_name":"Smith","ssn":"111112222"}
{"index":{"_id":2}}
{"first_name":"Bob","last_name":"Jones","ssn":"123456789"}

Then I can query on partial names and filter by the exact ssn:

POST /test_index/doc/_search
{
   "query": {
      "filtered": {
         "query": {
            "match": {
               "_all": {
                   "query": "eri smi",
                   "operator": "and"
               }
            }
         },
         "filter": {
            "term": {
               "ssn": "111112222"
            }
         }
      }
   }
}
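
Since the question uses a Ruby model, the same request body can be sketched as a Ruby hash. The helper name account_query is hypothetical, and the body uses the same filtered query syntax as the request above. With the elasticsearch-model gem, a hash like this can be passed to the model's search method in place of a plain string.

```ruby
require "json"

# Hypothetical helper: builds the filtered query shown above as a Ruby hash.
# name_fragment is matched against the n-grammed _all field,
# ssn is matched exactly with a term filter.
def account_query(name_fragment, ssn)
  {
    query: {
      filtered: {
        query: {
          match: {
            _all: { query: name_fragment, operator: "and" }
          }
        },
        filter: {
          term: { ssn: ssn }
        }
      }
    }
  }
end

body = account_query("eri smi", "111112222")
puts JSON.pretty_generate(body)

# Assumed usage with the Account model from the question:
#   Account.search(body).records.to_a
```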

And I get back what I expect:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.8838835,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.8838835,
            "_source": {
               "first_name": "Erik",
               "last_name": "Smith",
               "ssn": "111112222"
            }
         }
      ]
   }
}

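If you need to be able to search with a single query string (no filter), you could also include the ssn field in _all, but with this setup it will also match on partial strings (such as 111112), which may not be what you want.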
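
A different option, which is not part of the setup above, is to keep ssn out of _all and split the single input string on the client before building the query. The sketch below assumes an ssn is always entered as exactly nine digits with no separators; the helper name split_account_query is made up for illustration.

```ruby
# Hypothetical helper: separates a 9-digit ssn from the name terms in one input string.
# Assumption: the ssn is typed as exactly nine digits without dashes or spaces.
def split_account_query(input)
  terms = input.to_s.split
  ssn_terms, name_terms = terms.partition { |term| term.match?(/\A\d{9}\z/) }
  { name: name_terms.join(" "), ssn: ssn_terms.first }
end

parts = split_account_query("erik 111112222")
p parts
```

The name part can then go into the match on _all and the ssn part, when present, into the term filter, so the ssn is still only ever matched exactly.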

If you only want to match prefixes (i.e., search terms that start at the beginning of a word), you should use edge ngrams.
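
The difference between the two can be sketched in plain Ruby. This is an illustration of the idea only, not the Elasticsearch implementation, and the method names edge_ngrams and all_ngrams are made up for the example.

```ruby
# Edge n-grams: only substrings anchored at the start of the token.
def edge_ngrams(token, min_gram, max_gram)
  upper = [max_gram, token.length].min
  (min_gram..upper).map { |len| token[0, len] }
end

# Plain n-grams: substrings starting at every position in the token.
def all_ngrams(token, min_gram, max_gram)
  (0...token.length).flat_map do |start|
    upper = [max_gram, token.length - start].min
    (min_gram..upper).map { |len| token[start, len] }
  end
end

edge = edge_ngrams("smith", 2, 20)
full = all_ngrams("smith", 2, 20)

puts edge.include?("smi") # a prefix, so it is an edge n-gram
puts edge.include?("mit") # starts mid-word, so it is not an edge n-gram
puts full.include?("mit") # but it is a plain n-gram
```

Edge n-grams also produce far fewer tokens per word, which keeps the index smaller when prefix matching is all that is needed.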

I wrote a blog post about using ngrams that might help you out: http://blog.qbox.io/an-introduction-to-ngrams-in-elasticsearch

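Here is the code I used for this answer. I tried a few different things, including the setup I posted here, and another one that includes ssn in _all but uses edge ngrams. Hope this helps: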

http://sense.qbox.io/gist/b6a31c929945ef96779c72c468303ea3bc87320f