Question

I have millions of documents to index. Each document has the fields doc_id and doc_title, plus several doc_content fields.

import requests

index = 'test'

JSON = {
    "mappings": {
        "properties": {
            "doc_id":      {"type": "keyword"},
            "doc_title":   {"type": "text"   },
            "doc_content": {"type": "text"   }
        }
    }
}

r = requests.put(f'http://127.0.0.1:9200/{index}', json=JSON)

To minimize the size of the index, I store doc_title and doc_content as separate documents.

docs = [
    {"doc_id": 1, "doc_title": "good"},
    {"doc_id": 1, "doc_content": "a"},
    {"doc_id": 1, "doc_content": "b"},

    {"doc_id": 2, "doc_title": "good"},
    {"doc_id": 2, "doc_content": "c"},
    {"doc_id": 2, "doc_content": "d"},

    {"doc_id": 3, "doc_title": "bad"},
    {"doc_id": 3, "doc_content": "a"},
    {"doc_id": 3, "doc_content": "e"}
]

for doc in docs:
    r = requests.post(f'http://127.0.0.1:9200/{index}/_doc', json=doc)

Query_1:

JSON = {
    "query": {
        "match": {
            "doc_title": "good"
        }
    }
}

r = requests.get(f'http://127.0.0.1:9200/{index}/_search', json=JSON)

[x['_source'] for x in r.json()['hits']['hits']]

[{'doc_id': 1, 'doc_title': 'good'}, {'doc_id': 2, 'doc_title': 'good'}]

Query_2:

JSON = {
    "query": {
        "match": {
            "doc_content": "a"
        }
    }
}

r = requests.get(f'http://127.0.0.1:9200/{index}/_search', json=JSON)

[x['_source'] for x in r.json()['hits']['hits']]

[{'doc_id': 1, 'doc_content': 'a'}, {'doc_id': 3, 'doc_content': 'a'}]

How can I combine query_1 and query_2?

I need something like this:

JSON = {
    "query": {
        "bool": {
            "must": [
                {"match": {"doc_title": "good"}},
                {"match": {"doc_content": "a"}}
            ]
        }
    }
}

r = requests.get(f'http://127.0.0.1:9200/{index}/_search', json=JSON)

[x['_source'] for x in r.json()['hits']['hits']]

[]

Desired result:

[{'doc_id': 1, 'doc_title': 'good', 'doc_content': 'a'}]

Answer 1

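Separating doc_title and doc_content is bad practice; you are not really reducing anything. Each split record is indexed as its own independent document, so no single document ever contains both fields, and a bool/must over both can never match.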

Go with:

docs = [
    {"doc_id": 1, "doc_title": "good", "doc_content": ["a", "b"]},
    {"doc_id": 2, "doc_title": "good", "doc_content": ["c", "d"]},
    {"doc_id": 3, "doc_title": "bad", "doc_content": ["a", "e"]}
]

for doc in docs:
    r = requests.post(f'http://127.0.0.1:9200/{index}/_doc', json=doc)

and your query will work as expected. After all, a and b are supposed to belong to doc_id=1 anyway, aren't they?
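If the data already exists in the split form from the question, it can be regrouped before indexing. The following is only a minimal sketch: `merge_by_doc_id` is a hypothetical helper name (not part of any library), and it assumes the split records fit in memory, which may not hold for millions of documents without batching.

```python
def merge_by_doc_id(split_docs):
    """Group split records into one document per doc_id (hypothetical helper)."""
    merged = {}  # dicts keep insertion order in Python 3.7+
    for rec in split_docs:
        doc = merged.setdefault(
            rec["doc_id"], {"doc_id": rec["doc_id"], "doc_content": []}
        )
        if "doc_title" in rec:
            doc["doc_title"] = rec["doc_title"]
        if "doc_content" in rec:
            doc["doc_content"].append(rec["doc_content"])
    return list(merged.values())


# The split records from the question.
split_docs = [
    {"doc_id": 1, "doc_title": "good"},
    {"doc_id": 1, "doc_content": "a"},
    {"doc_id": 1, "doc_content": "b"},
    {"doc_id": 2, "doc_title": "good"},
    {"doc_id": 2, "doc_content": "c"},
    {"doc_id": 2, "doc_content": "d"},
    {"doc_id": 3, "doc_title": "bad"},
    {"doc_id": 3, "doc_content": "a"},
    {"doc_id": 3, "doc_content": "e"},
]

docs = merge_by_doc_id(split_docs)
```

The resulting `docs` list has one entry per doc_id and can be posted with the same loop as above.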

UPDATE - syntax for making contents nested

PUT test
{
  "mappings": {
      "properties": {
        "contents": {
          "type": "nested",
          "properties": {
            "doc_content": {
              "type": "text"
            }
          }
        },
        "doc_id": {
          "type": "keyword"
        },
        "doc_title": {
          "type": "text"
        }
      }

  }
}

POST test/_doc
{
  "doc_id": 1,
  "doc_title": "good",
  "contents": [
    {"doc_content": "a"},
    {"doc_content": "b"}
  ]
}

GET test/_search
{
  "_source": ["doc_title", "inner_hits"], 
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "doc_title": "good"
          }
        },
        {
          "nested": {
            "path": "contents",
            "query": {
              "match": {
                "contents.doc_content": "a"
              }
            },
            "inner_hits": {}
          }
        }
      ]
    }
  }
}
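For the `requests`-based code in the question, the same search body could be built as a Python dict. This is a sketch rather than a definitive version: `build_query` is a hypothetical helper name, and unlike the request above it asks for doc_id in `_source` as well, on the assumption that the flat result should carry the id. The actual HTTP call is left commented out because it needs a running cluster.

```python
def build_query(title, content):
    """Build a bool query: title match plus nested content match (hypothetical helper)."""
    return {
        # Only return these top-level fields; matched contents come via inner_hits.
        "_source": ["doc_id", "doc_title"],
        "query": {
            "bool": {
                "must": [
                    {"match": {"doc_title": title}},
                    {
                        "nested": {
                            "path": "contents",
                            "query": {
                                "match": {"contents.doc_content": content}
                            },
                            # Ask Elasticsearch to report which nested objects matched.
                            "inner_hits": {},
                        }
                    },
                ]
            }
        },
    }


body = build_query("good", "a")

# Assumes the same local endpoint and index name as in the question:
# r = requests.get(f'http://127.0.0.1:9200/{index}/_search', json=body)
```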

yielding

[
  {
    "_index":"test",
    "_type":"_doc",
    "_id":"sySOoXEBdiyDG0RsIq21",
    "_score":0.98082924,
    "_source":{
      "doc_title":"good"               <------
    },
    "inner_hits":{
      "contents":{
        "hits":{
          "total":1,
          "max_score":0.6931472,
          "hits":[
            {
              "_index":"test",
              "_type":"_doc",
              "_id":"sySOoXEBdiyDG0RsIq21",
              "_nested":{
                "field":"contents",
                "offset":0
              },
              "_score":0.6931472,
              "_source":{
                "doc_content":"a"          <-----
              }
            }
          ]
        }
      }
    }
  }
]
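To get from this response to the flat shape asked for in the question, the top-level `_source` can be combined with the `_source` of each inner hit. A minimal sketch, assuming a hits list shaped like the one above; `flatten_hits` is a hypothetical helper name, and doc_id only appears in the rows if it was included in the `_source` filter of the request.

```python
def flatten_hits(hits):
    """Merge each hit's _source with each matching nested inner hit (hypothetical helper)."""
    rows = []
    for hit in hits:
        inner = (
            hit.get("inner_hits", {})
            .get("contents", {})
            .get("hits", {})
            .get("hits", [])
        )
        for inner_hit in inner:
            row = dict(hit.get("_source", {}))  # copy the top-level fields
            row.update(inner_hit.get("_source", {}))  # add the matched nested fields
            rows.append(row)
    return rows


# A trimmed copy of the response shown above.
sample_hits = [
    {
        "_source": {"doc_title": "good"},
        "inner_hits": {
            "contents": {"hits": {"hits": [{"_source": {"doc_content": "a"}}]}}
        },
    }
]

rows = flatten_hits(sample_hits)
print(rows)  # [{'doc_title': 'good', 'doc_content': 'a'}]
```

With a live response, the call would be `flatten_hits(r.json()['hits']['hits'])`.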
