We have been using the word_delimiter filter to tokenize Twitter-style hashtags. For the hashtag '#SomeHashtag', we want users to be able to search for either #SomeHashtag or SomeHashtag.
For #SomeHashtag, our analyzer generates the following tokens: #SomeHashtag, Some, SomeHashtag, Hashtag.
For #Some_Hashtag, a form that is common in some other languages, our analyzer removes the underscore from every token except the original: #Some_Hashtag, Some, SomeHashtag, Hashtag.
Here is our analyzer:
"analysis": {
  "analyzer": {
    "tweet_test": {
      "type": "custom",
      "char_filter": ["html_strip", "quotes"],
      "tokenizer": "standard_custom",
      "filter": ["custom_text_word_delimiter_query"]
    }
  },
  "filter": {
    "custom_text_word_delimiter_query": {
      "type": "word_delimiter",
      "generate_word_parts": "0",
      "generate_number_parts": "0",
      "catenate_words": "1",
      "catenate_numbers": "1",
      "catenate_all": "0",
      "split_on_case_change": "0",
      "split_on_numerics": "0",
      "preserve_original": "0",
      "type_table": [
        "# => ALPHA",
        "@ => ALPHA",
        "& => ALPHA",
        "- => ALPHA",
        ". => ALPHA",
        "/ => ALPHA",
        "_ => ALPHA"
      ]
    }
  }
}
One solution we are considering is an additional pattern_capture filter, for example:
"hashtag_filter": {
  "type": "pattern_capture",
  "preserve_original": 1,
  "patterns": ["#([^\\s]*)"]
}
The actual regular expression Twitter uses is much longer than this (see https://github.com/twitter/twitter-text/blob/master/java/src/com/twitter/Regex.java).
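To make the effect of the simplified pattern concrete, here is a rough Python emulation of what a pattern_capture filter with that single pattern and preserve_original enabled would be expected to emit for one incoming token. This is only a sketch: the function name emulate_pattern_capture is hypothetical, it runs outside Elasticsearch, and it assumes the tokenizer hands the filter the hashtag as one token with the leading # still attached.

```python
import re

# Same pattern as the proposed hashtag_filter ("#([^\\s]*)" once JSON-unescaped).
HASHTAG_PATTERN = re.compile(r"#([^\s]*)")


def emulate_pattern_capture(token, preserve_original=True):
    """Approximate pattern_capture for a single token (hypothetical helper).

    Emits the original token first when preserve_original is set, then every
    non-empty capture group. A capture identical to the whole token is skipped
    when the original is already being kept.
    """
    emitted = [token] if preserve_original else []
    for match in HASHTAG_PATTERN.finditer(token):
        for group in match.groups():
            if not group:
                continue  # empty capture, e.g. a bare "#"
            if preserve_original and group == token:
                continue  # would only duplicate the original token
            emitted.append(group)
    return emitted


if __name__ == "__main__":
    for tag in ("#SomeHashtag", "#Some_Hashtag"):
        print(tag, "->", emulate_pattern_capture(tag))
```

Under these assumptions the pattern yields the hashtag with and without the leading #, but it leaves the underscore in Some_Hashtag untouched, so matching a query for SomeHashtag against #Some_Hashtag would still depend on the rest of the filter chain.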
Our application may index hundreds of messages per second, so our question is: