Question

We index HTML documents that may contain links to other documents. We are using Elasticsearch, and most keyword searches work smoothly, which is great.

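We are now adding more complex searches similar to Google's site: or link: operators: basically, we want to retrieve documents that link to a specific URL or even to a domain. (If document A contains a link to http://a.site.tld/path/, then searching for link:http://a.site.tld should return it.)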

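We are now trying to work out the best way to achieve this. So far, we have extracted the links from each document and added a links field to it. We set links to not be analyzed. We can then run searches that match an exact URL, such as link:http://a.site.tld/path/, but of course link:http://a.site.tld returns no results.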

Our initial idea is to create a similar new field, linkedDomains ... but is there possibly a better solution?
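
For illustration, the linkedDomains idea could be sketched as below. This is only a rough sketch: the helper name linked_domains and the sample document are hypothetical, and it assumes that "domain" means the scheme plus host of each extracted link.

```python
from urllib.parse import urlsplit


def linked_domains(links):
    """Return the distinct scheme://host prefixes of the given URLs, in order.

    Hypothetical helper: assumes a "domain" is the scheme plus host part.
    """
    seen = []
    for link in links:
        parts = urlsplit(link)
        if not parts.netloc:
            continue  # skip relative or malformed links
        domain = f"{parts.scheme}://{parts.netloc}"
        if domain not in seen:
            seen.append(domain)
    return seen


# Hypothetical document carrying the already extracted links field.
doc = {
    "links": [
        "http://a.site.tld/path/",
        "http://a.site.tld/other",
        "http://b.site.tld/",
    ]
}
doc["linkedDomains"] = linked_domains(doc["links"])
print(doc["linkedDomains"])
```

The extra field would then be indexed as not analyzed, just like links, so that a domain search becomes an exact match against linkedDomains.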

Answer 1

You could try the Path Hierarchy Tokenizer:

Define the mapping as follows:

PUT /link-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path-analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "doc": {      
      "properties": {
        "link": {
          "type": "string",
          "index_analyzer": "path-analyzer"          
        }
      }
    }
  }
}

Index a document:

POST /link-demo/doc
{
    "link": "http://a.site.tld/path/"
}

The following term query returns the indexed document:

POST /link-demo/_search?pretty
{
    "query": {
        "term": {
           "link": {
              "value": "http://a.site.tld"
           }
        }
    }
}

To see how the value is indexed:

GET link-demo/_analyze?analyzer=path-analyzer&text="http://a.site.tld/path"&pretty

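This shows the following (the leading \" in each token comes from the quotes around the text parameter in the request, not from the tokenizer):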

{
  "tokens" : [ {
    "token" : "\"http:",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "\"http:/",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "\"http://a.site.tld",
    "start_offset" : 0,
    "end_offset" : 18,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "\"http://a.site.tld/path\"",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  } ]
}
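
To make the token list above easier to reason about, here is a small Python sketch that approximates what the tokenizer emits with the default / delimiter. It is a simplified emulation written for this answer, not Elasticsearch code; the function name path_hierarchy_tokens is hypothetical, and edge cases (for example a trailing delimiter) may differ from the real tokenizer.

```python
def path_hierarchy_tokens(text, delimiter="/"):
    """Approximate the prefixes emitted by a path_hierarchy tokenizer."""
    tokens = []
    for i, ch in enumerate(text):
        # A token ends just before each delimiter; an empty prefix is skipped.
        if ch == delimiter and i > 0:
            tokens.append(text[:i])
    # The full input is emitted as the last token.
    if text:
        tokens.append(text)
    return tokens


tokens = path_hierarchy_tokens("http://a.site.tld/path")
print(tokens)

# The term query is not analyzed, so it matches only if the searched
# value is exactly one of the indexed tokens.
print("http://a.site.tld" in tokens)
```

Because http://a.site.tld is one of the emitted prefixes, the term query shown earlier finds the document, while a partial host such as http://a.site would not match.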

Elasticsearch: searching for a domain in URLs