Applying html_strip and a lowercase filter to a keyword-analyzed field

Date: 2016-08-20 13:33:33

Tags: elasticsearch

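I am trying to apply the html_strip char filter and the lowercase token filter to a field that is analyzed with the keyword tokenizer. When searching, I noticed that the results are not what I expected.

This is the index we are trying to create: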

PUT /test_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "ExportPrimaryAnalyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": "lowercase",
          "char_filter": "html_strip"
        },
        "ExportRawAnalyzer": {
          "type": "custom",
          "buffer_size": "1000",
          "tokenizer": "keyword",
          "filter": "lowercase",
          "char_filter": "html_strip"
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "city": {
          "type": "string",
          "analyzer" : "ExportPrimaryAnalyzer"
        },
        "city_raw":{
          "type": "string",
          "analyzer" : "ExportRawAnalyzer"
        }
      }
    }
  }
}

Here is a sample document:

PUT test_index/test_type/4
{
  "city": "<p>I am from Pune</p>",
  "city_raw": "<p>I am from Pune</p>"
}

When we run a wildcard query against city_raw, we get no results. Here is the query we tried:

{
  "query": {
    "wildcard": {
      "city_raw": "i am*"
    }
  }
}

Any help is appreciated.

1 answer:

Answer 0 (score: 0)

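The html_strip char filter replaces HTML block elements such as <p> with newlines. With the keyword tokenizer the whole field value becomes a single token, so the indexed token starts with a newline and a wildcard such as "i am*" cannot match it. If you use the keyword tokenizer, you therefore need an additional char filter that replaces the newlines with an empty string.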
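To make the effect easier to see, here is a rough Python sketch of the ExportRawAnalyzer chain. It is not Elasticsearch code: the helper names are hypothetical, the regex only handles <p> tags, and fnmatch is used as an approximation of a wildcard query (it also treats [ ] as special, which the wildcard query does not).

```python
import fnmatch
import re

# Simplifying assumption: only <p> and </p> are handled here; the real
# html_strip char filter understands far more HTML than this regex does.
BLOCK_TAG = re.compile(r"</?p>", re.IGNORECASE)


def html_strip(text: str) -> str:
    """Stand-in for html_strip: swap each block-level tag for a newline."""
    return BLOCK_TAG.sub("\n", text)


def remove_new_line(text: str) -> str:
    """Stand-in for a mapping char filter that deletes newlines."""
    return text.replace("\n", "")


def raw_token(text: str, strip_newlines: bool) -> str:
    """Char filters run first, then the keyword tokenizer keeps the whole
    input as one token, then the lowercase token filter is applied."""
    filtered = html_strip(text)
    if strip_newlines:
        filtered = remove_new_line(filtered)
    return filtered.lower()


def wildcard_matches(token: str, pattern: str) -> bool:
    """Approximate a wildcard query: the pattern must cover the entire token."""
    return fnmatch.fnmatchcase(token, pattern)


doc = "<p>I am from Pune</p>"
without_fix = raw_token(doc, strip_newlines=False)
with_fix = raw_token(doc, strip_newlines=True)

# repr() makes the leading and trailing newlines visible.
print(repr(without_fix), wildcard_matches(without_fix, "i am*"))
print(repr(with_fix), wildcard_matches(with_fix, "i am*"))
```

Under these assumptions, the token produced without the extra char filter keeps a leading newline and does not match "i am*", while the token with newlines removed does.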

Example:

PUT test
{
   "settings": {
      "number_of_shards": 5,
      "number_of_replicas": 0,
      "analysis": {
         "char_filter": {
            "remove_new_line": {
               "type": "mapping",
               "mappings": [
                  "\\n =>"
               ]
            }
         },
         "analyzer": {
            "ExportPrimaryAnalyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase"
               ],
               "char_filter": [
                  "html_strip"
               ]
            },
            "ExportRawAnalyzer": {
               "type": "custom",
               "buffer_size": "1000",
               "tokenizer": "keyword",
               "filter": [
                  "lowercase"
               ],
               "char_filter": [
                  "html_strip",
                  "remove_new_line"
               ]
            }
         }
      }
   },
   "mappings": {
      "test_type": {
         "properties": {
            "city": {
               "type": "string",
               "analyzer": "ExportPrimaryAnalyzer"
            },
            "city_raw": {
               "type": "string",
               "analyzer": "ExportRawAnalyzer"
            }
         }
      }
   }
}

PUT test/test_type/4
{
  "city": "<p>I am from Bangalore I like Pune too</p>",
  "city_raw": "<p>I am from Bangalore I like Pune too</p>"
}

POST test/_search
{
  "query": {
    "wildcard": {
      "city_raw": "i am *"
    }
  }
}

Result:

"hits": [
   {
      "_index": "test",
      "_type": "test_type",
      "_id": "4",
      "_score": 1,
      "_source": {
         "city": "<p>I am from Bangalore I like Pune too</p>",
         "city_raw": "<p>I am from Bangalore I like Pune too</p>"
      }
   }
]