Elasticsearch custom analyzer on specific characters

Date: 2015-10-05 15:03:08

Tags: elasticsearch

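How can I create a custom analyzer that tokenizes a field on the '/' character only?

My field contains a URL string such as "http://stackoverflow.com/questions/ask", and I want it tokenized as "http", "stackoverflow.com", "questions" and "ask".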

2 answers:

Answer 0: (score: 1)

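Use the pattern analyzer (which is built on the pattern tokenizer); it seems to do what you want. The pattern "[/:]+" treats any run of slashes and colons as a single separator: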
PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "analyzer": {
            "slash_analyzer": {
               "type": "pattern",
               "pattern": "[/:]+",
               "lowercase": true
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "url": {
               "type": "string",
               "index_analyzer": "slash_analyzer",
               "search_analyzer": "standard",
               "term_vector": "yes"
            }
         }
      }
   }
}

PUT /test_index/doc/1
{
   "url": "http://stackoverflow.com/questions/ask"
}

I added term vectors to the mapping (you probably don't want to do this in production) so that we can see which terms were generated:

GET /test_index/doc/1/_termvector
...
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "took": 1,
   "term_vectors": {
      "url": {
         "field_statistics": {
            "sum_doc_freq": 4,
            "doc_count": 1,
            "sum_ttf": 4
         },
         "terms": {
            "ask": {
               "term_freq": 1
            },
            "http": {
               "term_freq": 1
            },
            "questions": {
               "term_freq": 1
            },
            "stackoverflow.com": {
               "term_freq": 1
            }
         }
      }
   }
}
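
To sanity-check a separator pattern before creating an index, the splitting can be approximated locally. The following is only a minimal sketch in Python, not Elasticsearch's own implementation: the helper name `slash_tokens` is hypothetical, and it assumes the pattern behaves the same in Python's `re` module as in the Java regex engine that Elasticsearch uses (which should hold for a simple character class like this one).

```python
import re


def slash_tokens(text, pattern=r"[/:]+"):
    """Approximate the analyzer above: split on the separator pattern,
    drop empty pieces, and lowercase each remaining token."""
    return [part.lower() for part in re.split(pattern, text) if part]


print(slash_tokens("http://stackoverflow.com/questions/ask"))
# ['http', 'stackoverflow.com', 'questions', 'ask']
```

Because the pattern ends in `+`, the whole "://" run counts as one separator, so no empty token appears between "http" and "stackoverflow.com".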

Here is the code I used:

http://sense.qbox.io/gist/669fbdd681895d7e9f8db13799865c6e8be75b11

Answer 1: (score: 0)

The standard analyzer already does this for you.

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'http://stackoverflow.com/questions/ask'

You get:

{
  "tokens" : [ {
    "token" : "http",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "stackoverflow.com",
    "start_offset" : 7,
    "end_offset" : 24,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "questions",
    "start_offset" : 25,
    "end_offset" : 34,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "ask",
    "start_offset" : 35,
    "end_offset" : 38,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
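
The `start_offset` and `end_offset` values in this response are character positions in the input string, with the end offset exclusive. As a small illustration (a sketch only; the `reported` list is copied by hand from the response above, and the helper name `slice_by_offsets` is hypothetical), slicing the original URL with those offsets gives back each token. This assumes ASCII input, where Python string indices line up with the offsets Elasticsearch reports.

```python
url = "http://stackoverflow.com/questions/ask"

# (token, start_offset, end_offset) as reported by the _analyze response above
reported = [
    ("http", 0, 4),
    ("stackoverflow.com", 7, 24),
    ("questions", 25, 34),
    ("ask", 35, 38),
]


def slice_by_offsets(text, tokens):
    """Return the substring of `text` covered by each (token, start, end) triple."""
    return [text[start:end] for _, start, end in tokens]


slices = slice_by_offsets(url, reported)
for (token, start, end), piece in zip(reported, slices):
    # Each slice should equal the token the analyzer emitted
    print(f"{start:>2}-{end:<2} {piece!r} matches={piece == token}")
```

The gaps between consecutive offsets (4 to 7, 24 to 25, 34 to 35) are the "://" and "/" characters that the analyzer discarded.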