Elasticsearch: unexpected indexing results when using an analyzer

Date: 2016-05-13 20:04:23

Tags: elasticsearch indexing lucene

I ran a very simple test. I created a student index with one type, then defined a mapping:

POST student
{

    "mappings" : {
        "ing3" : {
            "properties" : {
                "quote": {
                  "type": "string",
                  "analyzer": "english"
                }
            }
        }
    }
}

After that I added 3 students to this index:

POST /student/ing3/1
{
  "name": "Smith",
  "first_name" : "John",
  "quote" : "Learning is so cool!!"
}

POST /student/ing3/2
{
  "name": "Roosevelt",
  "first_name" : "Franklin",
  "quote" : "I learn everyday"
}

POST /student/ing3/3
{
  "name": "Black",
  "first_name" : "Mike",
  "quote" : "I learned a lot at school"
}

At this point I expected the english analyzer to tokenize and stem every word in my quotes, so that if I run this search:

GET /student/ing3/_search
{
    "query" : {
        "term" : { "quote" : "learn" }
    }
}

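I would get all documents as results, because my analyzer would treat "learn", "learning", and "learned" as equal. And I was right. But when I try this request: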

GET /student/ing3/_search
{
    "query" : {
        "term" : { "quote" : "learned" }
    }
}

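I get no hits, whereas in my opinion I should get the third document (at least?). To me, Elasticsearch should also index learned and learning, not only learn. Am I wrong? Is my request wrong?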

2 answers:

Answer 0: (score: 1)

If you check:

curl -XGET 'localhost:9200/student/_analyze?field=quote' -d 'I learned a lot at school'

you will see that your sentence is analyzed as:

{
   "tokens":[
      {
         "token":"i",
         "start_offset":0,
         "end_offset":1,
         "type":"<ALPHANUM>",
         "position":0
      },
      {
         "token":"learn",
         "start_offset":2,
         "end_offset":9,
         "type":"<ALPHANUM>",
         "position":1
      },
      {
         "token":"lot",
         "start_offset":12,
         "end_offset":15,
         "type":"<ALPHANUM>",
         "position":3
      },
      {
         "token":"school",
         "start_offset":19,
         "end_offset":25,
         "type":"<ALPHANUM>",
         "position":5
      }
   ]
}

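So the english analyzer removes punctuation and stop words and indexes each word in its root form. A term query is not analyzed, so it looks up the exact term learned, which does not exist in the index; only learn does.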

https://www.elastic.co/guide/en/elasticsearch/guide/current/using-language-analyzers.html

You can use a match query, which also analyzes your search text, so it will match:

GET /student/ing3/_search
{
    "query" : {
        "match" : { "quote" : "learned" }
    }
}
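
The difference between the two query types can be pictured with a small toy model in Python. This is only a sketch under loud assumptions: STOP_WORDS and toy_stem are hypothetical stand-ins, far cruder than the stop list and stemmer the real english analyzer uses, and the dictionary-based index only imitates an inverted index. It shows that a term lookup uses the search string verbatim, while a match lookup runs it through the same analysis as the documents.

```python
import re

# Assumption: a tiny stop list and suffix stripper, NOT the real english analyzer.
STOP_WORDS = {"a", "at", "is"}
SUFFIXES = ("ing", "ed")


def toy_stem(word):
    """Strip one known suffix; a crude imitation of stemming."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, split on non-letters, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Map each analyzed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, term):
    """Look the term up exactly as given, with no analysis."""
    return sorted(index.get(term, set()))


def match_query(index, text):
    """Analyze the search text first, then look up each resulting token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return sorted(hits)


DOCS = {
    1: "Learning is so cool!!",
    2: "I learn everyday",
    3: "I learned a lot at school",
}
INDEX = build_index(DOCS)

print(term_query(INDEX, "learn"))     # [1, 2, 3]
print(term_query(INDEX, "learned"))   # []
print(match_query(INDEX, "learned"))  # [1, 2, 3]
```

In this model the term lookup for learned fails for the same reason as in the question: only the stemmed form learn was ever written to the index.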

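Answer 1: (score: 1)

There is another way. You can both stem the terms (the english analyzer does include a stemmer) and keep the original terms, by using the keyword_repeat token filter and then the unique token filter with "only_on_same_position": true to remove unnecessary duplicates: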

PUT student
{
  "settings": {
    "analysis": {
      "analyzer": {
        "myAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "keyword_repeat",
            "english_stemmer",
            "unique_stem"
          ]
        }
      },
      "filter": {
        "unique_stem": {
          "type": "unique",
          "only_on_same_position": true
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      }
    }
  },
  "mappings": {
    "ing3": {
      "properties": {
        "quote": {
          "type": "string",
          "analyzer": "myAnalyzer"
        }
      }
    }
  }
}

In this case a term query will work as well. If you look at the terms that were actually indexed:

GET /student/_search
{
  "fielddata_fields": ["quote"]
}

Now it is clear why it matches:

  "hits": [
     {
        "_index": "student",
        "_type": "ing3",
        "_id": "2",
        "_score": 1,
        "_source": {
           "name": "Roosevelt",
           "first_name": "Franklin",
           "quote": "I learn everyday"
        },
        "fields": {
           "quote": [
              "everydai",
              "everyday",
              "i",
              "learn"
           ]
        }
     },
     {
        "_index": "student",
        "_type": "ing3",
        "_id": "1",
        "_score": 1,
        "_source": {
           "name": "Smith",
           "first_name": "John",
           "quote": "Learning is so cool!!"
        },
        "fields": {
           "quote": [
              "cool",
              "learn",
              "learning",
              "so"
           ]
        }
     },
     {
        "_index": "student",
        "_type": "ing3",
        "_id": "3",
        "_score": 1,
        "_source": {
           "name": "Black",
           "first_name": "Mike",
           "quote": "I learned a lot at school"
        },
        "fields": {
           "quote": [
              "i",
              "learn",
              "learned",
              "lot",
              "school"
           ]
        }
     }
  ]
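
The indexed terms above can be reproduced with a toy model of the keyword_repeat, stemmer, and unique steps. This is a rough sketch, not Elasticsearch's implementation: toy_stem is a hypothetical suffix stripper, and the input is assumed to be the (position, term) pairs left after lowercasing and stop word removal.

```python
def toy_stem(word):
    """Assumption: crude suffix stripping, not the real english stemmer."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def keyword_repeat(tokens):
    """Emit every token twice: once flagged as keyword, once not."""
    out = []
    for position, term in tokens:
        out.append((position, term, True))   # keyword copy, protected from stemming
        out.append((position, term, False))  # regular copy, will be stemmed
    return out


def stem_non_keywords(tokens):
    """Stem only the copies that are not flagged as keywords."""
    return [(pos, term if is_kw else toy_stem(term)) for pos, term, is_kw in tokens]


def unique_same_position(tokens):
    """Drop a token only if the same term was already seen at the same position."""
    seen = set()
    out = []
    for pos, term in tokens:
        if (pos, term) not in seen:
            seen.add((pos, term))
            out.append((pos, term))
    return out


def index_terms(tokens):
    """Run the three steps and return the distinct terms, sorted."""
    chain = unique_same_position(stem_non_keywords(keyword_repeat(tokens)))
    return sorted({term for _, term in chain})


# Positions follow the _analyze output for document 3 (stop words leave gaps).
doc3 = [(0, "i"), (1, "learned"), (3, "lot"), (5, "school")]
print(index_terms(doc3))  # ['i', 'learn', 'learned', 'lot', 'school']
```

Words that the stemmer leaves unchanged, such as lot, produce two identical tokens at the same position, which is exactly the duplication that the unique step removes.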