Elasticsearch does not return the expected results for the term at the end of a query

Asked: 2016-03-23 18:48:13

Tags: elasticsearch highlighting n-gram

This is my index:

    [
        'index' => 'proof',
        'body' => [
            'settings' => [
                'analysis' => [
                    'tokenizer' => [
                        'ngram_tokenizer' => [
                            'type' => 'nGram',
                            'min_gram' => 1,
                            'max_gram' => 20,
                            'token_chars' => ['letter', 'digit'],
                        ],
                    ],
                    'analyzer' => [
                        'ngram_tokenizer_analyzer' => [
                            'type' => 'custom',
                            'tokenizer' => 'ngram_tokenizer',
                            'filter' => ['lowercase'],
                        ]
                    ]
                ]
            ],
            'mappings' => [
                'proof_page' => [
                    'properties' => [
                        'content' => [
                            'type' => 'multi_field',
                            'path' => 'just_name',
                            'fields' => [
                                'content' => [
                                    'type' => 'string',
                                    'analyzer' => 'ngram_tokenizer_analyzer',
                                ],
                                'untouched' => [
                                    'type' => 'string'
                                ]
                            ]
                        ],
                        'proof_name' => [
                            'type' => 'string',
                        ],
                        'project_name' => [
                            'type' => 'string',
                        ],
                        'page_number' => [
                            'type' => 'integer',
                            'index' => 'not_analyzed',
                        ],
                        'proof_id' => [
                            'type' => 'string',
                            'index' => 'not_analyzed',
                        ],
                        'project_id' => [
                            'type' => 'string',
                            'index' => 'not_analyzed',
                        ]
                    ]
                ]
            ]
        ]
    ]
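
One way to check what this analyzer actually emits is the _analyze API. The following is a sketch of the request parameters, assuming the elasticsearch-php client and a version of Elasticsearch that accepts analyzer and text in the request body; older versions may expect them as top-level parameters instead. The variable name $analyzeParams is only illustrative.

```php
<?php
// Parameters for the indices analyze endpoint.
// Assumption: this request shape suits the Elasticsearch version in use.
$analyzeParams = [
    'index' => 'proof',
    'body' => [
        'analyzer' => 'ngram_tokenizer_analyzer', // analyzer defined in the index settings
        'text' => 'dummy',
    ],
];

// With a configured client this would be sent as:
// $response = $client->indices()->analyze($analyzeParams);
```

The response lists each token together with its position and offsets, which can be compared with the tokens produced for the query string.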

This is an example query:

    [
        'index' => 'proof',
        'type' => 'proof_page',
        'body' => [
            'query' => [
                'filtered' => [
                    'query' => [
                        'match_phrase' => [
                            'content' => [
                                'query' => 'Lorem Ipsum is simply dum',
                                'slop' => 0,
                            ],
                        ],
                    ],
                    'filter' => [
                        'term' => [
                            'proof_id' => '56ebea535f5e8841038b4569',
                        ],
                    ],
                ],
            ],
            '_source' => false,
            'fields' => [
                'proof_id',
                'proof_name',
                'project_id',
                'project_name',
                'page_number',
            ],
            'highlight' => [
                'fields' => [
                    'content' => [
                        'type' => 'plain',
                        'fragment_size' => 100,
                        'number_of_fragments' => 100,
                        'fragmenter' => 'simple',
                    ]
                ]
            ],
            'from' => 0,
            'size' => 10,
            'sort' => [
                'page_number' => [
                    'order' => 'asc',
                ]
            ]
        ]
    ]

Now suppose one of my documents matching proof_id 56ebea535f5e8841038b4569 has content like this:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.

The result I expect is a returned fragment with the following highlighted:

Lorem Ipsum is simply dum

But it returns no match at all, and the same is true for:

Lorem Ipsum is simply du
Lorem Ipsum is simply dumm

It does, however, return matches for the following:

Lorem Ipsum is simply d
Lorem Ipsum is simply dummy

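This makes no sense to me, because I can see every variation of "dummy" among the term vector terms (the n-gram range is wide enough to cover all variations).

It is worth pointing out that this only happens with the term at the end of the search string. For example: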
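
To make the claim above concrete, here is a minimal sketch that lists the grams of a single word. The ngrams() function is a hypothetical helper, not part of Elasticsearch or elasticsearch-php, and it assumes grams are emitted ordered by start offset and then by length; the real tokenizer output should be checked against the index itself. It handles ASCII input only.

```php
<?php
// Hypothetical helper: lists the grams of one word, ordered by start offset, then length.
// Assumption: this ordering mirrors the nGram tokenizer; verify against the real index.
function ngrams(string $word, int $minGram, int $maxGram): array
{
    $word = strtolower($word); // stands in for the 'lowercase' filter
    $length = strlen($word);   // byte length, so ASCII only
    $grams = [];
    for ($start = 0; $start < $length; $start++) {
        for ($size = $minGram; $size <= $maxGram && $start + $size <= $length; $size++) {
            $grams[] = substr($word, $start, $size);
        }
    }
    return $grams;
}

// Grams of the indexed word, using min_gram = 1 and max_gram = 20 from the index settings.
$indexed = ngrams('dummy', 1, 20);

// Report whether each prefix of "dummy" exists as an individual term.
foreach (['d', 'du', 'dum', 'dumm', 'dummy'] as $prefix) {
    printf("%-5s present: %s\n", $prefix, in_array($prefix, $indexed, true) ? 'yes' : 'no');
}

// Print both gram sequences so their order can be compared side by side.
echo implode(' ', $indexed), "\n";
echo implode(' ', ngrams('dum', 1, 20)), "\n";
```

Under these assumptions every prefix of "dummy" is present as a term, which agrees with what the term vectors show. The two printed sequences diverge after the third gram, which may be relevant here, since a match_phrase query compares token positions as well as the terms themselves.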

m Ipsum is simply d
em Ipsum is simply d
rem Ipsum is simply d
orem Ipsum is simply d
Lorem Ipsum is simply d

all highlight as expected.

Any help is greatly appreciated :)

Thanks, all!

0 answers:

No answers yet