I'm trying to add a custom analyzer to an index and, at the same time, map that analyzer to a property on a type. Here is my JSON object:
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "test_analyzer" : {
          "type" : "custom",
          "tokenizer": "standard",
          "filter" : ["lowercase", "asciifolding"],
          "char_filter": ["html_strip"]
        }
      }
    }
  },
  "mappings" : {
    "test" : {
      "properties" : {
        "checkanalyzer" : {
          "type" : "string",
          "analyzer" : "test_analyzer"
        }
      }
    }
  }
}
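I know the analyzer works because I tested it with /wp2/_analyze?analyzer=test_analyzer -d '<p>Testing analyzer.</p>', and when I check /wp2/test/_mapping it also shows up as the analyzer for the checkanalyzer property. However, if I add a document such as {"checkanalyzer": "<p>The tags should not show up</p>"}, the HTML tags are not stripped when I retrieve the document through the _search endpoint. Am I misunderstanding how mappings work, or is something wrong with my JSON object? I'm creating the wp2 index and the test type dynamically when I make the call to Elasticsearch; I'm not sure whether that matters.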
Answer 0 (score: 0)
The HTML isn't removed from the source; it is removed from the terms generated from that source. You can see this if you run a terms aggregation on the field:
POST /test_index/_search
{
  "aggs": {
    "checkanalyzer_field_terms": {
      "terms": {
        "field": "checkanalyzer"
      }
    }
  }
}
{
  "took": 77,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test",
        "_id": "1",
        "_score": 1,
        "_source": {
          "checkanalyzer": "<p>The tags should not show up</p>"
        }
      }
    ]
  },
  "aggregations": {
    "checkanalyzer_field_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "not",
          "doc_count": 1
        },
        {
          "key": "should",
          "doc_count": 1
        },
        {
          "key": "show",
          "doc_count": 1
        },
        {
          "key": "tags",
          "doc_count": 1
        },
        {
          "key": "the",
          "doc_count": 1
        },
        {
          "key": "up",
          "doc_count": 1
        }
      ]
    }
  }
}
Here is the code I used to test it:
http://sense.qbox.io/gist/2971767aa0f5949510fa0669dad6729bbcdf8570
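To make the source-versus-terms distinction concrete, here is a rough Python sketch of what an analyzer chain like test_analyzer does conceptually. It is only a simulation under simplifying assumptions, not Elasticsearch's implementation: the helper names (html_strip, ascii_fold, analyze) are made up for illustration, and a plain \w+ regex stands in for the standard tokenizer.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects text nodes and drops tags (rough stand-in for html_strip)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_strip(text):
    stripper = _TagStripper()
    stripper.feed(text)
    stripper.close()  # flush any text buffered after the last tag
    return " ".join(stripper.parts)


def ascii_fold(token):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Approximate chain: html_strip -> tokenizer -> lowercase -> asciifolding."""
    stripped = html_strip(text)
    tokens = re.findall(r"\w+", stripped)  # crude stand-in for the standard tokenizer
    return [ascii_fold(tok.lower()) for tok in tokens]


# The document exactly as it was sent to the index.
source = {"checkanalyzer": "<p>The tags should not show up</p>"}

# Only the derived terms go through the analyzer; the source dict is untouched.
terms = analyze(source["checkanalyzer"])

print(terms)
print(source["checkanalyzer"])
```

The terms list contains no markup, which matches the aggregation buckets above, while the source value still has its <p> tags. That is the same split you see between the aggregations and _source parts of the search response.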
Answer 1 (score: 0)
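Now, if you want to strip the HTML completely before the content is indexed and stored, you can use the mapper attachments plugin. When you build the document, you can set the content_type to "html".
Mapper attachments is useful for a lot of things, especially if you are handling multiple document types, but most notably I believe it is worth using just for stripping out the HTML tags (which is something you can't do to the stored content with the html_strip char filter).
Just a forewarning: none of the HTML tags will be stored. So if you do need those tags, I'd suggest defining another field to store the original content. Another note: you cannot specify multi-fields for mapper attachment documents, so you'll need to store the original content outside of the mapper attachment document. See the working example below.
You'll need to end up with this mapping: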
{
  "html5-es" : {
    "aliases" : { },
    "mappings" : {
      "document" : {
        "properties" : {
          "delete" : {
            "type" : "boolean"
          },
          "file" : {
            "type" : "attachment",
            "fields" : {
              "content" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "author" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets"
              },
              "title" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "name" : {
                "type" : "string"
              },
              "date" : {
                "type" : "date",
                "format" : "strict_date_optional_time||epoch_millis"
              },
              "keywords" : {
                "type" : "string"
              },
              "content_type" : {
                "type" : "string"
              },
              "content_length" : {
                "type" : "integer"
              },
              "language" : {
                "type" : "string"
              }
            }
          },
          "hash_id" : {
            "type" : "string"
          },
          "path" : {
            "type" : "string"
          },
          "raw_content" : {
            "type" : "string",
            "store" : true,
            "term_vector" : "with_positions_offsets",
            "analyzer" : "raw"
          },
          "title" : {
            "type" : "string"
          }
        }
      }
    },
    "settings" : { /* insert your own settings here */ },
    "warmers" : { }
  }
}
Then, in NEST, I assemble the content like this:
Attachment attachment = new Attachment();
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document"));
attachment.ContentType = "html";
Document document = new Document();
document.File = attachment;
document.RawContent = InsertRawContentFromString(originalText);
I tested this in Sense; the results are as follows:
"file": {
"_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=",
"_content_length": 0,
"_content_type": "html",
"_date": "0001-01-01T00:00:00",
"_title": "Topic10"
},
"delete": false,
"raw_content": "<h1>Topic10</h1><p>Delete this text and replace it with your own content. Check your mailbox.</p><p> </p><p>asdf</p><p> </p><p>10</p><p> </p><p>Lavender.</p><p> </p><p>10/6 12:03</p><p> </p><p>5 09</p><p> </p><p>11 47</p><p> </p><p>Halloween is in October.</p><p> </p><p>jog</p>"
},
"highlight": {
"file.content": [
"\n <em>Topic10</em>\n\n Delete this text and replace it with your own content. Check your mailbox.\n\n \n\n asdf\n\n \n\n 10\n\n \n\n Lavender.\n\n \n\n 10/6 12:03\n\n \n\n 5 09\n\n \n\n 11 47\n\n \n\n Halloween is in October.\n\n \n\n jog\n\n "
]
}
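For anyone not using NEST, the document assembled above can be sketched in plain Python. This is only an illustration of the shape of the request body, assuming the _content and _content_type keys seen in the result above. build_document and the sample text are hypothetical, and no request is actually sent to Elasticsearch.

```python
import base64
import json


def build_document(html_bytes, raw_content):
    """Build an index request body shaped like the NEST example (hypothetical helper)."""
    return {
        "file": {
            # The attachment field carries the file bytes as base64 text.
            "_content": base64.b64encode(html_bytes).decode("ascii"),
            "_content_type": "html",
        },
        # Kept outside the attachment so the original markup stays available.
        "raw_content": raw_content,
    }


# Sample input made up for illustration; in practice this would be the file's contents.
original_text = "<h1>Topic10</h1><p>Halloween is in October.</p>"
body = build_document(original_text.encode("utf-8"), original_text)

print(json.dumps(body, indent=2))
```

Keeping raw_content as a sibling of file, rather than inside it, follows the note above about not being able to define multi-fields on the attachment itself.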