How to percolate simple_query_string / query_string queries

Time: 2019-10-31 21:58:10

Tags: elasticsearch elasticsearch-percolate

Index:

{
    "settings": {
        "index.percolator.map_unmapped_fields_as_text": true
    },
    "mappings": {
        "properties": {
            "query": {
                "type": "percolator"
            }
        }
    }
}

This test percolator query works:

{
    "query": {
        "match": {
            "message": "blah"
        }
    }
}

This query does not work:

{
    "query": {
        "simple_query_string": {
            "query": "bl*"
        }
    }
}

Result:

{"took":15,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.13076457,"hits":[{"_index":"my-index","_type":"_doc","_id":"1","_score":0.13076457,"_source":{"query":{"match":{"message":"blah"}}},"fields":{"_percolator_document_slot":[0]}}]}}

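Why does this simple_query_string query not match the document?

1 answer:

Answer 0: (score: 3)

I don't quite understand what you are asking. Maybe you don't fully understand the percolator? Here is an example I tried just now.

Let's assume you have an index, call it test, into which you want to index some documents. This index has the following mapping (just a random test index I have in my test setup):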

{  
    "settings": {
        "analysis": {
          "filter": {
            "email": {
              "type": "pattern_capture",
              "preserve_original": true,
              "patterns": [
                "([^@]+)",
                "(\\p{L}+)",
                "(\\d+)",
                "@(.+)",
                "([^-@]+)"
              ]
            }
          },
          "analyzer": {
            "email": {
              "tokenizer": "uax_url_email",
              "filter": [
                "email",
                "lowercase",
                "unique"
              ]
            }
          }
        }
      },
    "mappings": {
        "properties": {
            "code": {
                "type": "long"
            },
            "date": {
                "type": "date"
            },
            "part": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "val": {
                "type": "long"
            },
            "email": {
              "type": "text",
              "analyzer": "email"
            }
        }
    }
}

You will notice that it has a custom email analyzer, which splits something like foo@bar.com into these tokens: foo@bar.com, foo, bar.com, bar, com.
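To see where those tokens come from, here is a rough Python approximation of the email analyzer. It is only a sketch under stated assumptions: the function name email_tokens is made up for illustration, the uax_url_email tokenizer is assumed to hand over the whole address as a single token, and Python's re has no \p{L}, so [^\W\d_]+ stands in for it. Token order in Elasticsearch itself may differ.

```python
import re

# Stand-ins for the "patterns" of the pattern_capture filter in the mapping.
# Assumption: [^\W\d_]+ (Unicode letters) approximates \p{L}+.
PATTERNS = [r"([^@]+)", r"([^\W\d_]+)", r"(\d+)", r"@(.+)", r"([^-@]+)"]


def email_tokens(value):
    """Approximate the custom email analyzer for one email-like token."""
    tokens = [value]  # preserve_original: true keeps the input token
    for pattern in PATTERNS:
        # One token is emitted for every capture-group match.
        tokens.extend(m.group(1) for m in re.finditer(pattern, value))
    lowered = [t.lower() for t in tokens]  # lowercase filter
    return list(dict.fromkeys(lowered))    # unique filter, first-seen order


print(email_tokens("foo@bar.com"))
```

For an authoritative answer, the _analyze API on the real index is the thing to check; this sketch only mirrors the filter chain by hand.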

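As the documentation says, you can create a separate percolator index that holds only your percolator queries, not the documents themselves. Even though the percolator index does not contain documents, it should still contain the mapping of the index that is supposed to hold them (test in our case).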

Here is the mapping of the percolator index (I called it percolator_index), which also has the special analyzer used for splitting the email field:

{
    "settings": {
        "analysis": {
          "filter": {
            "email": {
              "type": "pattern_capture",
              "preserve_original": true,
              "patterns": [
                "([^@]+)",
                "(\\p{L}+)",
                "(\\d+)",
                "@(.+)",
                "([^-@]+)"
              ]
            }
          },
          "analyzer": {
            "email": {
              "tokenizer": "uax_url_email",
              "filter": [
                "email",
                "lowercase",
                "unique"
              ]
            }
          }
        }
      },
    "mappings": {
        "properties": {
            "query": {
                "type": "percolator"
            },
            "code": {
                "type": "long"
            },
            "date": {
                "type": "date"
            },
            "part": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "val": {
                "type": "long"
            },
            "email": {
              "type": "text",
              "analyzer": "email"
            }
        }
    }
}

Its mapping and settings are almost identical to those of my original index. The only difference is the additional query field of type percolator added to the mapping.
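Because the percolator index body is just the document index body plus one percolator field, it can be derived instead of copied by hand. The sketch below assumes the index body is available as a Python dict; the helper name to_percolator_index and the shortened test_index dict are illustrative only, not part of any Elasticsearch client.

```python
import copy


def to_percolator_index(doc_index_body, query_field="query"):
    """Return a copy of an index body with a percolator field added."""
    body = copy.deepcopy(doc_index_body)  # leave the original untouched
    props = body.setdefault("mappings", {}).setdefault("properties", {})
    props[query_field] = {"type": "percolator"}
    return body


# Shortened stand-in for the test index (analysis settings omitted here).
test_index = {
    "mappings": {
        "properties": {
            "part": {"type": "text"},
            "email": {"type": "text", "analyzer": "email"},
        }
    }
}

percolator_index = to_percolator_index(test_index)
print(sorted(percolator_index["mappings"]["properties"]))
```

Copying the settings along with the mappings matters here, since the email field refers to the custom email analyzer by name.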

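The query you are interested in, simple_query_string, should go into a document inside percolator_index. Like this:

PUT /percolator_index/_doc/1?refresh
{
    "query": {
        "simple_query_string": {
            "query": "month foo@bar.com",
            "fields": ["part", "email"]
        }
    }
}

To make it more interesting, I listed the email field in fields so that the query searches it specifically (by default, all fields are searched).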
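If many such queries need to be registered, the stored document can be built by a small helper. This is only a sketch: stored_query is a hypothetical name, and the helper merely produces the JSON body; sending it to PUT /percolator_index/_doc/<id> is left to whatever HTTP or Elasticsearch client is in use.

```python
import json


def stored_query(text, fields=None):
    """Build the body of a percolator document holding a simple_query_string."""
    sqs = {"query": text}
    if fields:
        # Restrict the search to these fields; omitting the key keeps the
        # default behaviour described above.
        sqs["fields"] = list(fields)
    return {"query": {"simple_query_string": sqs}}


body = stored_query("month foo@bar.com", ["part", "email"])
print(json.dumps(body, indent=4))
```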

Now the goal is to percolate a document that should eventually go into the test index against the simple_query_string query stored in the percolator index. For example:

GET /percolator_index/_search
{
    "query": {
        "percolate": {
            "field": "query",
            "document": {
                "date": "2004-07-31T11:57:52.000Z",
                "part": "month",
                "code": 109,
                "val": 0,
                "email": "foo@bar.com"
            }
        }
    }
}

Obviously, what is under document is your future (not yet existing) document. It is matched against the simple_query_string defined above and results in a match:

{
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.39324823,
        "hits": [
            {
                "_index": "percolator_index",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.39324823,
                "_source": {
                    "query": {
                        "simple_query_string": {
                            "query": "month foo@bar.com",
                            "fields": [
                                "part",
                                "email"
                            ]
                        }
                    }
                },
                "fields": {
                    "_percolator_document_slot": [
                        0
                    ]
                }
            }
        ]
    }
}

What if I percolate this document instead:

{
    "query": {
        "percolate": {
            "field": "query",
            "document": {
                "date": "2004-07-31T11:57:52.000Z",
                "part": "month",
                "code": 109,
                "val": 0,
                "email": "foo"
            }
        }
    }
}

(Note that the email is just foo.) The result is:

{
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.26152915,
        "hits": [
            {
                "_index": "percolator_index",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.26152915,
                "_source": {
                    "query": {
                        "simple_query_string": {
                            "query": "month foo@bar.com",
                            "fields": [
                                "part",
                                "email"
                            ]
                        }
                    }
                },
                "fields": {
                    "_percolator_document_slot": [
                        0
                    ]
                }
            }
        ]
    }
}

Note that the score is slightly lower than for the first percolated document. Presumably that is because foo (my email) matches only one of the terms produced by analyzing foo@bar.com, whereas foo@bar.com matches all of them (hence the higher score).
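The presumed reason for the score difference can be illustrated by counting how many analyzed terms the query and each document share on the email field. This is a hand-rolled approximation, not Lucene's scoring: email_tokens and shared_terms are hypothetical helpers, the regex [^\W\d_]+ stands in for \p{L}+, and the tokenizer is assumed to pass the address through as one token.

```python
import re

# Approximation of the pattern_capture patterns from the mapping.
PATTERNS = [r"([^@]+)", r"([^\W\d_]+)", r"(\d+)", r"@(.+)", r"([^-@]+)"]


def email_tokens(value):
    """Approximate the email analyzer: capture groups, lowercase, unique."""
    tokens = [value]
    for pattern in PATTERNS:
        tokens.extend(m.group(1) for m in re.finditer(pattern, value))
    return list(dict.fromkeys(t.lower() for t in tokens))


def shared_terms(query_email, doc_email):
    """Terms produced by both the query text and the document value."""
    return sorted(set(email_tokens(query_email)) & set(email_tokens(doc_email)))


for doc_email in ("foo@bar.com", "foo"):
    common = shared_terms("foo@bar.com", doc_email)
    print(doc_email, len(common), common)
```

Under these assumptions the first document shares five terms with the query and the second shares only foo, which is consistent with both documents matching while the second scores lower.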

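I am not sure which analyzer you are talking about, though. I think the example above covers the only "analyzer" issue or unknown that could be a bit confusing.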