Question

Is it possible to upload a stopwords.txt file to AWS Elasticsearch and reference it by path from the stop token filter?

Thanks.

Answer 1

If you are using AWS Elasticsearch, the only way to do this is through the Elasticsearch REST API.

To import a large dataset, you can use the Bulk API.
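As a rough illustration of the request format the Bulk API expects, the sketch below builds a newline-delimited JSON body: one action line followed by one source line per document, ending with a newline. The index name `my_index` and the sample documents are hypothetical. Sending the body to `POST _bulk` (with `Content-Type: application/x-ndjson`, plus request signing if your AWS domain requires it) is left out, and older Elasticsearch versions may also expect a `_type` in the action line.

```python
import json


def build_bulk_body(index_name, docs):
    """Return an NDJSON string suitable as a _bulk request body.

    Each document becomes two lines: an "index" action line and the
    document source. The body ends with a newline.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Hypothetical index name and documents, for illustration only.
body = build_bulk_body("my_index", [{"title": "first"}, {"title": "second"}])
print(body, end="")
```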

Answer 2

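Yes, it is possible by setting stopwords_path when defining the stop token filter.

stopwords_path => A path (either relative to the config location, or absolute) to a stopwords file configuration. Each stop word should be on its own "line" (separated by a line break). The file must be UTF-8 encoded.

This is how I did it.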
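To make the file format concrete, here is a minimal sketch that reads such a file locally: UTF-8 encoded, one stop word per line. The helper name `load_stopwords` is hypothetical, and skipping blank lines and trimming whitespace are choices made for this sketch, not documented Elasticsearch behavior.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Read a stop word file: UTF-8, one word per line."""
    text = Path(path).read_text(encoding="utf-8")
    words = []
    for line in text.splitlines():
        word = line.strip()
        if word:  # skip blank lines (an assumption of this sketch)
            words.append(word)
    return words


# Demo with a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "stopwords.txt"
    sample.write_text("a\nan\nthe\n\nto\nis\n", encoding="utf-8")
    print(load_stopwords(sample))
```

A list loaded this way can also be pasted inline into an analyzer definition when the file itself cannot be placed on the server.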

1. Copied the stopwords.txt file into the config folder of my Elasticsearch home path.

2. Created a custom token filter with the path set in stopwords_path:

    PUT /testindex
    {
      "settings": {
        "analysis": {
          "filter": {
            "teststopper": {
              "type": "stop",
              "stopwords_path": "stopwords.txt"
            }
          }
        }
      }
    }

3. Verified that the filter works as expected using the _analyze API:

    GET testindex/_analyze
    {
      "tokenizer" : "standard",
      "token_filters" : ["teststopper"],
      "text" : "this is a text to test the stop filter",
      "explain" : true,
      "attributes" : ["keyword"]
    }

Because I had added these tokens to the config/stopwords.txt file, the tokens 'a', 'an', 'the', 'to', and 'is' were filtered out.

For more information:

Answer 3

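No, it is not possible to upload a stopwords.txt file to the managed AWS Elasticsearch service.

What you need to do instead is specify the stop words in a custom analyzer. For more details on how to do this, see the official documentation.

The official documentation then says to "close and reopen" the index, but again, AWS Elasticsearch does not allow that, so you have to reindex.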

Example:

1. Create an index with the stop words listed inline in a custom analyzer, e.g.

    PUT /my_new_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "english_analyzer": {
              "type": "english",
              "stopwords": ["a", "the", "they", "and"]
            }
          }
        }
      }
    }

2. Reindex

    POST _reindex
    {
      "source": {
        "index": "my_index"
      },
      "dest": {
        "index": "my_new_index"
      }
    }
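For a long stop word list, writing the request body by hand gets tedious. The sketch below generates both request bodies from a Python list, with the stop words passed as a JSON array. The helper names are hypothetical, the index and analyzer names are taken from the example above, and actually sending the requests to your AWS domain is not shown.

```python
import json


def build_index_settings(stopwords, analyzer_name="english_analyzer"):
    """Create-index body with the stop words listed inline as an array."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "english",
                        "stopwords": list(stopwords),
                    }
                }
            }
        }
    }


def build_reindex_body(source_index, dest_index):
    """Body for POST _reindex, copying documents into the new index."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


# In practice this list would be read from a local stopwords.txt file.
stopwords = ["a", "the", "they", "and"]

print("PUT /my_new_index")
print(json.dumps(build_index_settings(stopwords), indent=2))
print("POST _reindex")
print(json.dumps(build_reindex_body("my_index", "my_new_index"), indent=2))
```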

How to upload a large number of stop words to AWS Elasticsearch
