How do I write mappings/settings for an index using elastic4s?

Asked: 2015-06-03 13:51:08

Tags: scala elasticsearch elastic4s

PUT /new_index/
{
    "settings": {
        "index": {
            "type": "default"
        },
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "analysis": {
            "filter": {
                "ap_stop": {
                    "type": "stop",
                    "stopwords_path": "stoplist.txt"
                },
                "shingle_filter" : {
                    "type" : "shingle",
                    "min_shingle_size" : 2,
                    "max_shingle_size" : 5,
                    "output_unigrams": true
                }
            },
        "analyzer": {
             "aplyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["standard",
                           "ap_stop",
                           "lowercase",
                           "shingle_filter",
                           "snowball"]
                }
            }
        }
    }
}

PUT /new_index/document/_mapping/
{
    "document": {
        "properties": {
            "text": {
                "type": "string",
                "store": true,
                "index": "analyzed",
                "term_vector": "with_positions_offsets_payloads",
                "search_analyzer": "aplyzer",
                "index_analyzer": "aplyzer"
            },
            "original_text": {
                "include_in_all": false,
                "type": "string",
                "store": false,
                "index": "not_analyzed"
            },
            "docid": {
                "include_in_all": false,
                "type": "string",
                "store": true,
                "index": "not_analyzed"  
            }
        }
    }
}

I need to convert the above index settings and mappings into something elastic4s accepts. I'm using the latest elastic4s with Elasticsearch 1.5.2.

I looked at some of the examples given in the documentation, but I can't figure out how to do this. For instance, I tried creating it this way:

client.execute {
    create index "new_index" mappings {
      "documents" as (
        "text" typed StringType analyzer ...
        )
    }
  }

I can't figure out how to use store, index, term_vector and the other options provided in the PUT request above.

Update: based on the answer, I was able to come up with something like this:

create index "new_index" shards 5 replicas 1 refreshInterval "90s"  mappings {
    "documents" as(
      id typed StringType analyzer KeywordAnalyzer store true includeInAll false,
      "docid" typed StringType index "not_analyzed" store true includeInAll false,
      "original_text" typed StringType index "not_analyzed" includeInAll false,
      "text" typed StringType analyzer CustomAnalyzer("aplyzer") indexAnalyzer "aplyzer" searchAnalyzer "aplyzer" store true termVector WithPositionsOffsetsPayloads
      )
  } analysis (
    CustomAnalyzerDefinition(
      "aplyzer",
      StandardTokenizer,
      LowercaseTokenFilter,
      shingle tokenfilter "shingle_filter" minShingleSize 2 maxShingleSize 5 outputUnigrams true
    )
  )

What I can't figure out now is how to add the snowball stemmer and the stopwords path to the aplyzer analyzer.

How should I do that?

2 Answers:

Answer 0 (score: 1)

Your title asks about custom filters, but your question asks about store, index and term_vectors. I'll explain the latter.

  client.execute {
    create index "myindex" mappings {
      "mytype" as (
        "myfield" typed StringType store true termVector WithOffsets index "not_analyzed"
      )
    }
  }
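For reference, the DSL above should produce roughly the following mapping JSON when the index is created (a sketch of the expected output, not verified against a live cluster):

```json
"mytype": {
    "properties": {
        "myfield": {
            "type": "string",
            "store": true,
            "term_vector": "with_offsets",
            "index": "not_analyzed"
        }
    }
}
```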

Update

Regarding your updated question: the Elasticsearch docs are not clear on whether you can set stopwords on the snowball token filter. You can on the snowball analyzer.

So, either:

SnowballAnalyzerDefinition("mysnowball", "English", stopwords = Set("I", "he", "the"))

CustomAnalyzerDefinition("mysnowball",
  StandardTokenizer,
  LowercaseTokenFilter,
  snowball tokenfilter "snowball1" language "German"
)
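For comparison, here is roughly the JSON each of these two approaches would generate in the index settings (a sketch based on the Elasticsearch 1.x analysis module; the names mysnowball and snowball1 come from the snippets above):

```json
"analysis": {
    "analyzer": {
        "mysnowball": {
            "type": "snowball",
            "language": "English",
            "stopwords": ["I", "he", "the"]
        }
    },
    "filter": {
        "snowball1": {
            "type": "snowball",
            "language": "German"
        }
    }
}
```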

Answer 1 (score: 1)

Based on @monkjack's suggestions, what I read in the elastic4s docs, and browsing the tests the author wrote for the API, I finally worked out how to write the index settings and mappings with elastic4s.

create index "new_index" shards 5 replicas 1 refreshInterval "90s" mappings {
    "documents" as(
      id
        typed StringType
        analyzer KeywordAnalyzer
        store true
        includeInAll false,
      "docid"
        typed StringType
        index "not_analyzed"
        store true
        includeInAll false,
      "original_text"
        typed StringType
        index "not_analyzed"
        includeInAll false,
      "text"
        typed StringType
        analyzer CustomAnalyzer("aplyzer")
        indexAnalyzer "aplyzer"
        searchAnalyzer "aplyzer"
        store true
        termVector WithPositionsOffsetsPayloads
      )
  } analysis (
    CustomAnalyzerDefinition(
      "aplyzer",
      StandardTokenizer,
      LowercaseTokenFilter,
      NamedStopTokenFilter("ap_stop", "_english_", true, true),
      shingle
        tokenfilter "shingle_filter"
        minShingleSize 2
        maxShingleSize 5
        outputUnigrams true
        outputUnigramsIfNoShingles true,
      snowball
        tokenfilter "ap_snowball"
        lang "English"
    )
  )

If you want to supply your own list of stopwords, use StopTokenFilter("ap_stop", stopwords = Set("a", "an", "the")) instead of NamedStopTokenFilter.
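With an explicit stopword set, the generated filter should look roughly like this instead of the "_english_" named list (a sketch, not verified against a live cluster):

```json
"ap_stop": {
    "type": "stop",
    "stopwords": ["a", "an", "the"]
}
```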

When I run GET new_index in Sense, I get the following settings/mappings:

{
   "new_index": {
      "aliases": {},
      "mappings": {
         "documents": {
            "properties": {
               "docid": {
                  "type": "string",
                  "index": "not_analyzed",
                  "store": true,
                  "include_in_all": false
               },
               "original_text": {
                  "type": "string",
                  "index": "not_analyzed",
                  "include_in_all": false
               },
               "text": {
                  "type": "string",
                  "store": true,
                  "term_vector": "with_positions_offsets_payloads",
                  "analyzer": "aplyzer"
               }
            }
         }
      },
      "settings": {
         "index": {
            "creation_date": "1433383476240",
            "uuid": "6PmqlY6FRPanGtVSsGy3Jw",
            "analysis": {
               "analyzer": {
                  "aplyzer": {
                     "type": "custom",
                     "filter": [
                        "lowercase",
                        "ap_stop",
                        "shingle_filter",
                        "ap_snowball"
                     ],
                     "tokenizer": "standard"
                  }
               },
               "filter": {
                  "ap_stop": {
                     "enable_position_increments": "true",
                     "ignore_case": "true",
                     "type": "stop",
                     "stopwords": "_english_"
                  },
                  "shingle_filter": {
                     "output_unigrams_if_no_shingles": "true",
                     "token_separator": " ",
                     "max_shingle_size": "5",
                     "type": "shingle",
                     "min_shingle_size": "2",
                     "filler_token": "_",
                     "output_unigrams": "true"
                  },
                  "ap_snowball": {
                     "type": "snowball",
                     "language": "English"
                  }
               }
            },
            "number_of_replicas": "1",
            "number_of_shards": "5",
            "refresh_interval": "90s",
            "version": {
               "created": "1050299"
            }
         }
      },
      "warmers": {}
   }
}

If you want the stop words and stemmers as separate analyzers, then as @monkjack suggested, just add a SnowballAnalyzerDefinition and a StopAnalyzerDefinition, like:

....outputUnigramsIfNoShingles true,
    ),
    SnowballAnalyzerDefinition("ap_snowball", "English"),
    StopAnalyzerDefinition("ap_stop", stopwords = Set("a", "an", "the"))
  )
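Assuming the standard Elasticsearch 1.x analyzer types, those two definitions should show up in the index settings roughly as (a sketch):

```json
"analyzer": {
    "ap_snowball": {
        "type": "snowball",
        "language": "English"
    },
    "ap_stop": {
        "type": "stop",
        "stopwords": ["a", "an", "the"]
    }
}
```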