Elasticsearch - River and nGrams

Date: 2012-10-27 19:49:44

Tags: database lucene couchdb elasticsearch n-gram

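I'm using ES with the river plugin because my data lives in CouchDB, and I'm trying to query with nGrams. I have basically everything I need working, except that the query breaks as soon as someone types a space. This is because ES tokenizes the query and splits it on whitespace.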

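Here is what I need to do:

  • Query on part of the text in a string:

    Query: "Hello Wor" Returns: "Hello World", "Hello Word" / Excludes: "Hello", "World", "Word"

  • Sort the results by criteria I specify;

  • Be case-insensitive.

Here is what I have done so far, following "How to search for a part of a word with ElasticSearch":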

curl -X PUT  'localhost:9200/_river/myDB/_meta' -d '
{
"type" : "couchdb",
"couchdb" : {
    "host" : "localhost",
    "port" : 5984,
    "db" : "myDB",
    "filter" : null
},
   "index" : {
    "index" : "myDB",
    "type" : "myDB",
    "bulk_size" : "100",
    "bulk_timeout" : "10ms",
    "analysis" : {
               "index_analyzer" : {
                          "my_index_analyzer" : {
                                        "type" : "custom",
                                        "tokenizer" : "standard",
                                        "filter" : ["lowercase", "mynGram"]
                          }
               },
               "search_analyzer" : {
                          "my_search_analyzer" : {
                                        "type" : "custom",
                                        "tokenizer" : "standard",
                                        "filter" : ["standard", "lowercase", "mynGram"]
                          }
               },
               "filter" : {
                        "mynGram" : {
                                   "type" : "nGram",
                                   "min_gram" : 2,
                                   "max_gram" : 50
                        }
               }
    }
}
}
'

Then I add a mapping for sorting:

curl -s -XGET 'localhost:9200/myDB/myDB/_mapping' 
{
"sorting": {
       "Title": {
            "fields": {
                "Title": {
                     "type": "string"
                  }, 
                "untouched": {
                    "include_in_all": false, 
                    "index": "not_analyzed", 
                    "type": "string"
                    }
               }, 
              "type": "multi_field"
         },
        "Year": {
              "fields": {
                   "Year": {
                       "type": "string"
                       }, 
                       "untouched": {
                           "include_in_all": false, 
                           "index": "not_analyzed", 
                           "type": "string"
                         }
                     }, 
                    "type": "multi_field"
        }
     }
    }
   }'
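
For reference, a minimal sketch of how a search body could sort on the not_analyzed sub-field from the mapping above. It assumes the sub-field is addressed as Title.untouched; the helper name sorted_search and the use of a query_string query are illustrative choices, not part of my actual setup.

```python
import json


def sorted_search(query_text, sort_field="Title.untouched", order="asc"):
    """Build a search body that queries Title and sorts on an unanalyzed sub-field.

    Assumption: the multi_field sub-field is reachable as "Title.untouched".
    """
    return {
        "query": {
            "query_string": {"default_field": "Title", "query": query_text}
        },
        # Sorting on the not_analyzed copy keeps whole titles in order,
        # instead of sorting on individual analyzed tokens.
        "sort": [{sort_field: {"order": order}}],
    }


body = sorted_search("Hello")
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the request body to localhost:9200/myDB/myDB/_search.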

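I have included everything I used to get this far. With this setup, which I think should work, the space is still used to split my query whenever I try to get results. For example: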

  http://localhost:9200/myDB/myDB/_search?q=Title:(Hello%20Wor)&pretty=true

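This returns anything containing "Hello" and "Wor" (I don't normally use the parentheses, but I saw them in an example, and the results look very similar either way).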
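
To illustrate why the space matters, here is a rough Python simulation of the analysis chain. It is only a sketch: splitting on whitespace stands in for the standard tokenizer, and the ngrams and analyze helpers are hypothetical, not ES code. The point it shows is that each word is turned into grams separately, so no gram ever spans the space.

```python
def ngrams(token, min_gram=2, max_gram=50):
    """Return every substring of token with length in [min_gram, max_gram]."""
    grams = set()
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


def analyze(text, min_gram=2, max_gram=50):
    """Simplified stand-in for: tokenize on whitespace, lowercase, nGram filter."""
    grams = set()
    for token in text.lower().split():
        grams |= ngrams(token, min_gram, max_gram)
    return grams


query_grams = analyze("Hello Wor")
for title in ["Hello World", "Hello", "World"]:
    shared = query_grams & analyze(title)
    print(title, "shares", len(shared), "grams with the query")
```

Under this simplification, "Hello" and "World" on their own still share grams with the query, which is consistent with the unwanted matches described above.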

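Any help is really appreciated, because this is driving me crazy.

Update: In the end I realized I did not need an nGram at all. A normal index will do; replacing the spaces in the query with 'AND' does the job.

Example: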

 Query: "Hello World"  --->  Replaced as "(*Hello And World*)"
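
A minimal sketch of that replacement in Python. The function name to_and_query is hypothetical, and the sketch writes the operator as uppercase AND, which is the spelling the Lucene query syntax recognizes as an operator. Special characters in the user input are not escaped here.

```python
from urllib.parse import quote


def to_and_query(user_input):
    """Join the whitespace-separated terms with AND and wrap them in (* ... *)."""
    terms = user_input.split()
    if not terms:
        return ""
    return "(*" + " AND ".join(terms) + "*)"


rewritten = to_and_query("Hello World")
# URL-encode the rewritten query before putting it in the q= parameter.
url = "http://localhost:9200/myDB/myDB/_search?q=Title:" + quote(rewritten)
print(rewritten)
print(url)
```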

1 answer:

Answer 0: (score: 1)

I don't have an Elasticsearch setup at hand right now, but maybe this part of the docs helps?

http://www.elasticsearch.org/guide/reference/query-dsl/match-query.html

Types of Match Queries

boolean

The default match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text. The operator flag can be set to or or and to control the boolean clauses (defaults to or).

The analyzer can be set to control which analyzer will perform the analysis process on the text. It defaults to the field's explicit mapping definition, or the default search analyzer.

fuzziness can be set to a value (depending on the relevant type; for string types it should be a value between 0.0 and 1.0) to construct fuzzy queries for each term analyzed. The prefix_length and max_expansions can be set in this case to control the fuzzy process. If the fuzzy option is set, the query will use constant_score_rewrite as its rewrite method. The rewrite parameter controls how the query will get rewritten.

Here is an example when providing additional parameters (note the slight change in structure, message is the field name):

{
    "match" : {
        "message" : {
            "query" : "this is a test",
            "operator" : "and"
        }
    }
}
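
Applied to the question, a sketch of the same match query built in Python might look like this. The field name Title comes from the question; the helper name match_all_terms is hypothetical, and I have not run this against a live cluster.

```python
import json


def match_all_terms(field, text):
    """Build a match query that requires every analyzed term (operator: and)."""
    return {
        "query": {
            "match": {
                field: {"query": text, "operator": "and"}
            }
        }
    }


body = match_all_terms("Title", "Hello World")
print(json.dumps(body, indent=2))
```

With operator set to and, a document has to contain every term produced by analyzing the text, rather than any one of them.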