Question

I need to match content against a list of words (for profanity matching). A simple example of what I need:

{
  "bool": {
    "should": [
      { "term": { "content": "word1" }},
      { "term": { "content": "word2" }}
           :
      { "term": { "content": "word1001" }}
    ]
  }
}

The words I am looking for, 'word1', 'word2', ... 'word1001', are listed in a field of another type.

What I need to achieve is something like:

{
  "bool": {
    "should": [
      { "term": { "content": banned_words.word }},
    ]
  }
}

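The words I need to match may number in the thousands, and the bool query above does not seem like the most efficient approach. However, I could not find an alternative.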

Answer 1

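An alternative to matching all the bad words at query time is to match them at index time using the synonym token filter, and so flag the documents that contain bad words.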

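All you have to do is store the bad words in a file on the file system (in the Elasticsearch home folder; note that synonyms_path is resolved relative to the config directory):

analysis/badwords.txt: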

word1 => BADWORD      <--- pick whatever you want the badword to be replaced with
word2 => BADWORD
...
word1000 => BADWORD
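
With thousands of words, writing the rules by hand is tedious. A minimal sketch of generating the file, assuming the banned words are available one per line in a plain text file (banned_words.txt is a hypothetical name, seeded here with three sample words) and that the result is then copied to wherever Elasticsearch reads analysis/badwords.txt from:

```shell
# Hypothetical input: one banned word per line (sample data for illustration).
printf '%s\n' word1 word2 word3 > banned_words.txt

mkdir -p analysis

# Turn every non-empty line into an explicit "word => BADWORD" synonym rule.
awk 'NF { print $1 " => BADWORD" }' banned_words.txt > analysis/badwords.txt

cat analysis/badwords.txt
```

The replacement token (BADWORD here) must be the same for every rule so that a single term can later identify all flagged documents.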

Then your index settings need to use the synonym token filter:

curl -XPUT localhost:9200/my_index -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "badwords" : {
                    "tokenizer" : "whitespace",
                    "filter" : ["synonym"]
                }
            },
            "filter" : {
                "synonym" : {
                    "type" : "synonym",
                    "synonyms_path" : "analysis/badwords.txt"
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "content": {
                    "type": "string",
                    "index_analyzer": "badwords"
                }
            }
        }
    }
}'

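Then, when you index a document whose content field contains bad words matching those in the badwords.txt file, each one is replaced with the replacement word you chose in the synonyms file.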

curl -XPOST 'localhost:9200/my_index/_analyze?analyzer=badwords&pretty' -d 'you are a word2'
{
  "tokens" : [ {
    "token" : "you",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "are",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "a",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "BADWORD",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "SYNONYM",
    "position" : 4
  } ]
}
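
Once documents are indexed this way, finding the flagged ones should take a single term instead of thousands of should clauses. The following is a sketch rather than part of the original answer: it assumes the my_index index and the BADWORD replacement token shown above, find_flagged is a hypothetical helper name, and the Content-Type header is only required by newer Elasticsearch versions.

```shell
# Query body: a term query is not analyzed, so it matches the indexed
# token "BADWORD" exactly as the synonym filter emitted it.
QUERY='{
  "query": {
    "term": { "content": "BADWORD" }
  }
}'

# Hypothetical helper; assumes a node on localhost:9200 holding my_index.
find_flagged() {
  curl -s -XPOST 'localhost:9200/my_index/_search?pretty' \
       -H 'Content-Type: application/json' \
       -d "$QUERY"
}
```

Calling find_flagged against a running node would return the documents whose content contained at least one banned word. A term query is used rather than a match query because the search-time analyzer could lowercase BADWORD, and the lowercased term would no longer match the indexed token.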
