Question

I have the following mapping for a field called name, which holds e-commerce product names.

    'properties': {
        'name': {
            'type': 'text',
            'analyzer': 'standard',
            'fields': {
                'english': {
                    'type': 'text',
                    'analyzer': 'english'
                }
            }
        }
    }

Suppose I have the following string to index/search.

pack of 3 t-shirts

The two analyzers produce the terms [t, shirt] and [t, shirt], respectively.

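The problem is that when a user types "men's tshirts" (without the hyphen), no results are returned.

How can I get terms like [t, shirt, shirts, tshirt, tshirts] into the inverted index?

I looked at stemmer exclusions, but I could not find anything that deals with hyphens. A more general solution, rather than manual exclusions, would also be very helpful, because there may be many such cases, for example emails vs. e-mails.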

Answer 1

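I found a solution that I think gets me the desired result. However, I would still like to know whether there is a good, recommended way to handle this.

Basically, I will use multi-fields to solve this: the first field uses the standard analyzer and the second uses my custom analyzer.

According to the Elasticsearch documentation, char filters run before the tokenizer. So the idea is to replace - with an empty string, which turns t-shirts into tshirts before tokenization. The tokenizer then emits the whole term tshirts as a single token in the inverted index.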

The following request yields the single token tshirts (the stop filter drops "these" and "are"):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {"type": "stop", "stopwords": "_english_"}
  ],
  "char_filter": [
    "html_strip",
    {"type": "mapping", "mappings": ["- => "]}
  ],
  "text": "these are t-shirts <table>"
}
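
To use this at index time rather than only through _analyze, the same char filter can be wired into a custom analyzer on a sub-field. The following is a minimal sketch, not a definitive setup: the index name products, the char filter name hyphen_strip, the analyzer name name_joined and the sub-field name.joined are all hypothetical, and the typeless mappings syntax assumes Elasticsearch 7 or later.

```json
PUT products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "hyphen_strip": {
          "type": "mapping",
          "mappings": ["- => "]
        }
      },
      "analyzer": {
        "name_joined": {
          "type": "custom",
          "char_filter": ["html_strip", "hyphen_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "joined": {
            "type": "text",
            "analyzer": "name_joined"
          }
        }
      }
    }
  }
}

GET products/_search
{
  "query": {
    "multi_match": {
      "query": "tshirts for men",
      "fields": ["name", "name.joined"]
    }
  }
}
```

Searching both name and name.joined lets a query for t-shirts match through the standard field and a query for tshirts match through the sub-field. This sketch does no stemming, so the singular tshirt would still need a stemmer added to the custom analyzer.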

Answer 2

The whitespace tokenizer can do this job.

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-tokenizer.html

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

will produce

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

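This lets hyphenated words be kept as single tokens in Elasticsearch. Note that the tokenizer splits only on whitespace, so case and trailing punctuation (e.g. bone.) are left as-is unless token filters are added.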
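
Keeping t-shirts as one token does not by itself make it match tshirts. One possible extension, which is a different technique from what this answer proposes, is to follow the whitespace tokenizer with the built-in word_delimiter_graph token filter and enable catenate_words. A hedged sketch follows; the sample text is only an illustration:

```json
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    {"type": "word_delimiter_graph", "catenate_words": true}
  ],
  "text": "pack of 3 t-shirts"
}
```

With catenate_words enabled, the filter should emit the joined form tshirts alongside the parts t and shirts, which is close to the set of terms the question asks for. The same approach would cover pairs such as e-mails and emails without listing them by hand.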
