PUT /new_index/
{
  "settings": {
    "index": {
      "type": "default"
    },
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "analysis": {
      "filter": {
        "ap_stop": {
          "type": "stop",
          "stopwords_path": "stoplist.txt"
        },
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "aplyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "ap_stop", "lowercase", "shingle_filter", "snowball"]
        }
      }
    }
  }
}
PUT /new_index/document/_mapping/
{
  "document": {
    "properties": {
      "text": {
        "type": "string",
        "store": true,
        "index": "analyzed",
        "term_vector": "with_positions_offsets_payloads",
        "search_analyzer": "aplyzer",
        "index_analyzer": "aplyzer"
      },
      "original_text": {
        "include_in_all": false,
        "type": "string",
        "store": false,
        "index": "not_analyzed"
      },
      "docid": {
        "include_in_all": false,
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      }
    }
  }
}
I need to convert the index settings and mappings above into the types accepted by elastic4s. I am using the latest elastic4s with elasticsearch 1.5.2.
I looked at some of the examples given in the docs, but I can't figure out how to do this. I tried creating it this way:
client.execute {
  create index "new_index" mappings {
    "documents" as (
      "text" typed StringType analyzer ...
    )
  }
}
I can't figure out how to specify the store, index, term_vectors, etc. that are given in the PUT request.
Update: Based on the answer, I was able to come up with something like this:
create index "new_index" shards 5 replicas 1 refreshInterval "90s" mappings {
  "documents" as (
    id typed StringType analyzer KeywordAnalyzer store true includeInAll false,
    "docid" typed StringType index "not_analyzed" store true includeInAll false,
    "original_text" typed StringType index "not_analyzed" includeInAll false,
    "text" typed StringType analyzer CustomAnalyzer("aplyzer") indexAnalyzer "aplyzer" searchAnalyzer "aplyzer" store true termVector WithPositionsOffsetsPayloads
  )
} analysis (
  CustomAnalyzerDefinition(
    "aplyzer",
    StandardTokenizer,
    LowercaseTokenFilter,
    shingle tokenfilter "shingle_filter" minShingleSize 2 maxShingleSize 5 outputUnigrams true
  )
)
What I can't figure out now is how to add the snowball stemmer and the stopwords path to the aplyzer analyzer. How should I do this?
Answer 0 (score: 1)
Your title asks about custom filters, but your question is asking about store, index, and term_vectors. I'll explain the latter.
client.execute {
  create index "myindex" mappings {
    "mytype" as (
      "myfield" typed StringType store true termVector WithOffsets index "not_analyzed"
    )
  }
}
Update:
Regarding your updated question: the elasticsearch docs are unclear on whether you can set stopwords on a snowball token filter. You can on a snowball analyzer.
所以,
SnowballAnalyzerDefinition("mysnowball", "English", stopwords = Set("I", "he", "the"))
or
CustomAnalyzerDefinition("mysnowball",
StandardTokenizer,
LowercaseTokenFilter,
snowball tokenfilter "snowball1" language "German"
)
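Either definition plugs into the same analysis block as any other analyzer. A minimal sketch, assuming the elastic4s 1.x DSL used elsewhere in this thread (the index, type, and field names here are illustrative, not from the question):

```scala
// Sketch: register the snowball analyzer with the index and reference it
// from a field mapping. "myindex"/"mytype"/"myfield" are placeholder names.
client.execute {
  create index "myindex" mappings {
    "mytype" as (
      "myfield" typed StringType analyzer CustomAnalyzer("mysnowball")
    )
  } analysis (
    SnowballAnalyzerDefinition("mysnowball", "English", stopwords = Set("I", "he", "the"))
  )
}
```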
Answer 1 (score: 1)
Based on @monkjack's suggestion and what I read in the elastic4s docs, I finally figured out index settings and mappings that work with elastic4s. Also browse the tests the author wrote for the API.
create index "new_index" shards 5 replicas 1 refreshInterval "90s" mappings {
"documents" as(
id
typed StringType
analyzer KeywordAnalyzer
store true
includeInAll false,
"docid"
typed StringType
index "not_analyzed"
store true
includeInAll false,
"original_text"
typed StringType
index "not_analyzed"
includeInAll false,
"text"
typed StringType
analyzer CustomAnalyzer("aplyzer")
indexAnalyzer "aplyzer"
searchAnalyzer "aplyzer"
store true
termVector WithPositionsOffsetsPayloads
)
} analysis (
CustomAnalyzerDefinition(
"aplyzer",
StandardTokenizer,
LowercaseTokenFilter,
NamedStopTokenFilter("ap_stop", "_english_", true, true),
shingle
tokenfilter "shingle_filter"
minShingleSize 2
maxShingleSize 5
outputUnigrams true
outputUnigramsIfNoShingles true,
snowball
tokenfilter "ap_snowball"
lang "English"
)
)
If you want to provide your own list of stopwords, use StopTokenFilter("ap_stop", stopwords = Set("a", "an", "the")) instead of NamedStopTokenFilter.
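With that substitution, the analyzer definition would look something like the following. This is a sketch based on the working CustomAnalyzerDefinition above, with the named stop filter swapped for an inline stopword set; the example stopwords are placeholders:

```scala
CustomAnalyzerDefinition(
  "aplyzer",
  StandardTokenizer,
  LowercaseTokenFilter,
  // inline stopword list instead of the built-in "_english_" set
  StopTokenFilter("ap_stop", stopwords = Set("a", "an", "the")),
  shingle tokenfilter "shingle_filter" minShingleSize 2 maxShingleSize 5 outputUnigrams true,
  snowball tokenfilter "ap_snowball" lang "English"
)
```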
When I run GET new_index in Sense, I get the following settings/mappings:
{
  "new_index": {
    "aliases": {},
    "mappings": {
      "documents": {
        "properties": {
          "docid": {
            "type": "string",
            "index": "not_analyzed",
            "store": true,
            "include_in_all": false
          },
          "original_text": {
            "type": "string",
            "index": "not_analyzed",
            "include_in_all": false
          },
          "text": {
            "type": "string",
            "store": true,
            "term_vector": "with_positions_offsets_payloads",
            "analyzer": "aplyzer"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1433383476240",
        "uuid": "6PmqlY6FRPanGtVSsGy3Jw",
        "analysis": {
          "analyzer": {
            "aplyzer": {
              "type": "custom",
              "filter": [
                "lowercase",
                "ap_stop",
                "shingle_filter",
                "ap_snowball"
              ],
              "tokenizer": "standard"
            }
          },
          "filter": {
            "ap_stop": {
              "enable_position_increments": "true",
              "ignore_case": "true",
              "type": "stop",
              "stopwords": "_english_"
            },
            "shingle_filter": {
              "output_unigrams_if_no_shingles": "true",
              "token_separator": " ",
              "max_shingle_size": "5",
              "type": "shingle",
              "min_shingle_size": "2",
              "filler_token": "_",
              "output_unigrams": "true"
            },
            "ap_snowball": {
              "type": "snowball",
              "language": "English"
            }
          }
        },
        "number_of_replicas": "1",
        "number_of_shards": "5",
        "refresh_interval": "90s",
        "version": {
          "created": "1050299"
        }
      }
    },
    "warmers": {}
  }
}
If you want the stopwords and stemmer as separate analyzers, @monkjack suggests just adding a SnowballAnalyzerDefinition and a StopAnalyzerDefinition, like:
....outputUnigramsIfNoShingles true,
),
SnowballAnalyzerDefinition("ap_snowball", "English"),
StopAnalyzerDefinition("ap_stop", stopwords = Set("a", "an", "the"))
)