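The text I'm analyzing in Elasticsearch 6 contains a lot of numbers that I'm not interested in, and I can't work out how to remove them. At the moment, my searches for tokens bring back postcodes, times, or years. There are just about few enough distinct years that I could add them to the stopwords, but there are far too many of the others to filter them out that way.
I did try writing a custom filter: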
"char_filter": {
  "number_filter": {
    "type": "pattern_replace",
    "pattern": "\\d+",
    "replacement": " "
  }
}
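But when I try to add it to the settings, I get the following error:
Failed to get setting group for [index.analysis.analyzer.] setting prefix and setting [index.analysis.analyzer.char_filter] because of a missing '.'
Here is the entire settings section of my configuration (note: it worked before I added the number replacer):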
"settings": {
  "analysis": {
    "analyzer": {
      "t_analyzer": {
        "tokenizer": "t_tokenizer"
      },
      "major_words_analyzer": {
        "type": "standard",
        "stopwords": "_english_"
      },
      "char_filter": [
        "number_filter"
      ]
    },
    "tokenizer": {
      "t_tokenizer": {
        "type": "standard"
      }
    },
    "char_filter": {
      "number_filter": {
        "type": "pattern_replace",
        "pattern": "\\d+",
        "replacement": " "
      }
    }
  }
}
Edit: here are the relevant field mappings:
},
"narrative": {
  "type": "text",
  "store": "true",
  "analyzer": "t_analyzer",
  "fielddata": "true",
  "fields": {
    "raw": {
      "type": "text"
    }
  }
},
"narrativePhrases": {
  "type": "text",
  "analyzer": "major_words_analyzer",
  "fielddata": "true",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  }
},
Edit: what I'm ultimately trying to do is this:
POST /test_narrative/_search?size=0
{
  "aggs": {
    "incidents_by_month": {
      "date_histogram": {
        "field": "eventDate",
        "interval": "month",
        "min_doc_count": 5
      },
      "aggs": {
        "top_phrases": {
          "significant_text": {
            "field": "narrative",
            "size": 10
          }
        }
      }
    }
  }
}
And I still get numbers in the results:
{
  "key": "personally",
  "doc_count": 3,
  "score": 5.22625236294896,
  "bg_count": 36
},
{
  "key": "2011",
  "doc_count": 4,
  "score": 2.4786045712321703,
  "bg_count": 132
}
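Answer 0 (score: 0)
It looks like char_filter is in the wrong place in the settings above. You have it directly under analyzer, where every key is treated as the name of an analyzer, which is most likely why the error complains about index.analysis.analyzer.char_filter.
According to the documentation, char_filter is one of the parameters of a custom analyzer, so it has to go inside t_analyzer and/or major_words_analyzer, depending on your requirements. For example: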
"t_analyzer": {
  "tokenizer": "t_tokenizer",
  "char_filter": [
    "number_filter"
  ]
}
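To see what the character filter does before touching any index, the same pieces can be passed inline to the _analyze API. This is only a sketch: the sample text is made up, and the filter definition is copied from the question (with a space as the replacement).

```
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "\\d+",
      "replacement": " "
    }
  ],
  "text": "Reported at 10:30 on 12 March 2011"
}
```

If the filter is doing its job, none of the tokens in the response should contain digits.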
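If you want to apply the char_filter in both analyzers, your settings would need to look like the following. Note that, as far as I know, the built-in standard analyzer type only reads stopwords and max_token_length and ignores a char_filter entry, so major_words_analyzer is written here as a custom analyzer instead: the standard tokenizer plus the lowercase and stop token filters (stop defaults to the English stop word list).
PUT numberindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "t_analyzer": {
          "tokenizer": "t_tokenizer",
          "char_filter": [
            "number_filter"
          ]
        },
        "major_words_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "number_filter"
          ],
          "filter": [
            "lowercase",
            "stop"
          ]
        }
      },
      "tokenizer": {
        "t_tokenizer": {
          "type": "standard"
        }
      },
      "char_filter": {
        "number_filter": {
          "type": "pattern_replace",
          "pattern": "\\d+",
          "replacement": ""
        }
      }
    }
  }
}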
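One thing to keep in mind: analyzers run at index time, so documents that were indexed before the change still carry their number tokens, which may be one reason numbers keep showing up in significant_text. A rough sketch of checking the new analyzer and then copying the data across follows. It assumes numberindex was created with the settings above and with the same field mappings as test_narrative; the sample text is made up.

```
POST numberindex/_analyze
{
  "analyzer": "t_analyzer",
  "text": "Reported at 10:30 on 12 March 2011"
}

POST _reindex
{
  "source": {
    "index": "test_narrative"
  },
  "dest": {
    "index": "numberindex"
  }
}
```

After the reindex finishes, point the aggregation at numberindex instead of test_narrative.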
Hope this helps!