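My document fields contain a lot of plain text, including some symbols used for currencies. How can I change these to their corresponding names, e.g. $ to dollar?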
Answer 0 (score: 1)
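You can achieve this by creating a custom analyzer with a mapping char filter, in which you specify which characters should be replaced and what to replace them with. Define your currency mappings in the "mappings" array of the char filter: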
curl -XPUT localhost:9200/my_index -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "currencies": {
          "type": "mapping",
          "mappings": [
            "$=>USD"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "currencies"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'
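If you have many currency symbols, the rule list can be generated instead of typed by hand. The following is only a minimal sketch: the CURRENCY_NAMES table and the build_index_body helper are hypothetical names introduced for illustration, and the extra symbols (€, £) are assumptions, not part of the original setup. It builds the same request body as above and prints it as JSON so it can be passed to curl.

```python
import json

# Hypothetical symbol-to-name table (an assumption for illustration);
# extend it with whatever currencies you need.
CURRENCY_NAMES = {"$": "USD", "€": "EUR", "£": "GBP"}


def build_index_body(currency_names):
    """Return the index-creation body with one "symbol=>name" rule per entry."""
    rules = [f"{symbol}=>{name}" for symbol, name in currency_names.items()]
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    "currencies": {"type": "mapping", "mappings": rules}
                },
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "standard",
                        "char_filter": ["currencies"],
                    }
                },
            }
        },
        "mappings": {
            "my_type": {
                "properties": {
                    "text": {"type": "string", "analyzer": "my_analyzer"}
                }
            }
        },
    }


if __name__ == "__main__":
    # ensure_ascii=False keeps symbols such as € readable in the output.
    print(json.dumps(build_index_body(CURRENCY_NAMES), indent=2, ensure_ascii=False))
```

The printed JSON can be saved to a file and sent with curl's -d @file option in place of the inline body.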
Then, if you index a sentence such as "You owe me $ 100", the following tokens are produced:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_analyzer&pretty' -d 'You owe me $ 100'
{
  "tokens" : [ {
    "token" : "You",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "owe",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "me",
    "start_offset" : 8,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "USD",
    "start_offset" : 11,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "100",
    "start_offset" : 13,
    "end_offset" : 16,
    "type" : "<NUM>",
    "position" : 5
  } ]
}
As you can see, the $ symbol has been replaced by the string USD.
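To get a feel for what the char filter does before the tokenizer runs, here is a rough offline emulation. It is a sketch under stated assumptions, not Elasticsearch code: apply_mappings, tokenize and analyze are hypothetical helpers, the € and £ rules are made up for illustration, and the regex tokenizer is only a crude stand-in for the standard tokenizer. The one behaviour it does try to mirror is that the longest matching key wins at a given position.

```python
import re

# Hypothetical rules in the same "key=>value" form the char filter uses.
RULES = ["$=>USD", "€=>EUR", "£=>GBP"]


def apply_mappings(text, rules):
    """Replace every mapped key in text, preferring the longest key at each position."""
    table = dict(rule.split("=>", 1) for rule in rules)
    if not table:
        return text
    # Longest keys first so the regex alternation picks the longest match.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: table[match.group(0)], text)


def tokenize(text):
    """Crude stand-in for the standard tokenizer: runs of word characters."""
    return re.findall(r"\w+", text)


def analyze(text, rules=RULES):
    """Apply the character mappings first, then tokenize, like an analyzer chain."""
    return tokenize(apply_mappings(text, rules))


if __name__ == "__main__":
    print(analyze("You owe me $ 100"))
```

In this sketch, "$100" written without a space becomes the single token USD100, so it is worth running your real analyzer through _analyze to check how such input is tokenized.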