How to modify the standard analyzer to include #?

Date: 2016-01-12 21:25:15

Tags: elasticsearch analyzer

Certain characters, such as #, are treated as delimiters, so they can never be matched in a query. What would be the closest-to-standard custom analyzer configuration that allows these characters to match?
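
For example, running the sample text through the default standard analyzer (a quick illustration against a local node; the same text is used in the answer below) shows the # being stripped:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'new year #celebration vegas'

This returns only the tokens new, year, celebration and vegas, so a query for #celebration can never match the hashtag literally.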

1 answer:

Answer 0: (score: 2)

1) The simplest way is to use the whitespace tokenizer together with the lowercase filter:

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase&pretty' -d 'new year #celebration vegas'

will give you

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "#celebration",
    "start_offset" : 9,
    "end_offset" : 21,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "word",
    "position" : 4
  } ]
}
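
To use this combination on real documents rather than only through the _analyze API, it would have to be registered as a custom analyzer in the index settings. A minimal sketch, assuming a made-up index my_index2 with the same my_type/tweet field used in the answer's second example:

PUT my_index2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_lowercase": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "whitespace_lowercase"
        }
      }
    }
  }
}

The trade-off is that the whitespace tokenizer keeps any punctuation attached to a word (a trailing comma, for instance), which is why the second approach below stays closer to the standard analyzer.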

2) If you only want to keep a few special characters, you can use a char filter to map them, so that the text is converted into something else before tokenization happens. This is closer to the standard analyzer. For example, you could create the index like this:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "special_analyzer": {
          "char_filter": [ "special_mapping" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      },
      "char_filter": {
        "special_mapping": {
          "type": "mapping",
          "mappings": [ "#=>hashtag\\u0020" ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "special_analyzer"
        }
      }
    }
  }
}

Now

curl -XPOST 'localhost:9200/my_index/_analyze?analyzer=special_analyzer&pretty' -d 'new year #celebration vegas'

the custom analyzer will generate the following tokens:

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "hashtag",
    "start_offset" : 9,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "celebration",
    "start_offset" : 10,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 5
  } ]
}

so you can search like this:

GET my_index/_search
{
  "query": {
    "match": {
      "tweet": "#celebration"
    }
  }
}

You could also search for just celebration, since I used the unicode space \\u0020 in the mapping; otherwise we would always have to search with hashtag.
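
To illustrate that last point, a match query without the # (hypothetical, against the same index) also finds the document, because celebration was indexed as its own token:

GET my_index/_search
{
  "query": {
    "match": {
      "tweet": "celebration"
    }
  }
}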

Hope this helps!!