How to configure Haystack / Elasticsearch to handle contractions and apostrophes near the start of a word

Date: 2014-09-04 14:20:00

Tags: django elasticsearch django-haystack

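I've been struggling for a while to handle the apostrophe character at the start or in the middle of a word. I can handle English possessives, but I'm also trying to cater for French and handle words like "d'action", where the apostrophe appears near the start of the word rather than near the end, as in "her's".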

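Searching for "d action" through Haystack's auto_query returns results, but "d'action" returns nothing. If I query the Elasticsearch _search API directly (_search?q=D%27ACTION), I do get results for "d'action". So I'm wondering whether this is a Haystack engine issue.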
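For reference, the %27 in the direct query is simply the percent-encoded apostrophe. A minimal sketch using only the Python standard library (the helper name build_search_path is hypothetical, not part of Haystack or Elasticsearch):

```python
from urllib.parse import quote, unquote


def build_search_path(query: str) -> str:
    """Return a relative _search path with the query percent-encoded."""
    # safe="" forces every reserved character, including the apostrophe,
    # to be percent-encoded.
    return "_search?q=" + quote(query, safe="")


path = build_search_path("D'ACTION")
print(path)
print(unquote(path))
```

This only shows how the query string is encoded; it says nothing about how Haystack builds its own query.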

My configuration:

'settings': {
    "analysis": {
        "char_filter": {
            "quotes": {
                "type": "mapping",
                "mappings": [
                    "\\u0091=>\\u0027",
                    "\\u0092=>\\u0027",
                    "\\u2018=>\\u0027",
                    "\\u2019=>\\u0027",
                    "\\u201B=>\\u0027"
                ]
            }
        },
        "analyzer": {
            "ch_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ['ch_en_possessive_word_delimiter', 'ch_fr_stemmer'],
                "char_filter": ['html_strip', 'quotes'],
            },
        },

        "filter": {
            "ch_fr_stemmer" : {
                "type": "snowball",
                "language": "French"
            },
            "ch_en_possessive_word_delimiter": {
                "type": "word_delimiter",
                "stem_english_possessive": True
            }
        }
    }
}
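To sanity-check what the "quotes" char filter is meant to do, its mapping can be imitated locally in plain Python. This is only a sketch of the intended character replacement, not a reproduction of Elasticsearch's analysis chain; the names QUOTE_MAPPINGS and normalize_quotes are made up for illustration.

```python
# Local imitation of the "quotes" mapping char filter:
# every listed quote-like character becomes a plain apostrophe (U+0027).
QUOTE_MAPPINGS = {
    "\u0091": "'",
    "\u0092": "'",
    "\u2018": "'",
    "\u2019": "'",
    "\u201B": "'",
}

_TRANSLATION_TABLE = str.maketrans(QUOTE_MAPPINGS)


def normalize_quotes(text: str) -> str:
    """Replace typographic quote characters with a plain apostrophe."""
    return text.translate(_TRANSLATION_TABLE)


print(normalize_quotes("plan d\u2019action"))
```

If the indexed text and the query text both pass through the same normalization, a curly apostrophe in either one should no longer prevent a match on that character alone.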

I've also subclassed ElasticsearchSearchBackend and BaseEngine so that I can add the configuration above:

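import haystack
from django.conf import settings
from haystack.backends import BaseEngine
from haystack.backends.elasticsearch_backend import (
    ElasticsearchSearchBackend,
    ElasticsearchSearchQuery,
)


class ConfigurableESBackend(ElasticsearchSearchBackend):
    # Words reserved by Elasticsearch for special use.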
    RESERVED_WORDS = (
        'AND',
        'NOT',
        'OR',
        'TO',
    )

    # Characters reserved by Elasticsearch for special use.
    # The '\\' must come first, so as not to overwrite the other slash replacements.
    RESERVED_CHARACTERS = (
        '\\', '+', '-', '&&', '||', '!', '(', ')', '{', '}',
        '[', ']', '^', '"', '~', '*', '?', ':',
    )

    def setup(self):
        """
        Defers loading until needed.
        """
        # Get the existing mapping & cache it. We'll compare it
        # during the ``update`` & if it doesn't match, we'll put the new
        # mapping.
        try:
            self.existing_mapping = self.conn.get_mapping(index=self.index_name)
        except Exception:
            if not self.silently_fail:
                raise

        unified_index = haystack.connections[self.connection_alias].get_unified_index()
        self.content_field_name, field_mapping = self.build_schema(unified_index.all_searchfields())
        current_mapping = {
            'modelresult': {
                'properties': field_mapping,
                '_boost': {
                    'name': 'boost',
                    'null_value': 1.0
                }
            }
        }

        if current_mapping != self.existing_mapping:
            try:
                # Make sure the index is there first.
                self.conn.create_index(self.index_name, settings.ELASTICSEARCH_INDEX_SETTINGS)
                self.conn.put_mapping(self.index_name, 'modelresult', mapping=current_mapping)
                self.existing_mapping = current_mapping
            except Exception:
                if not self.silently_fail:
                    raise

        self.setup_complete = True

class CHElasticsearchSearchEngine(BaseEngine):
    backend = ConfigurableESBackend
    query = ElasticsearchSearchQuery
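For context, a custom engine like this is normally selected through HAYSTACK_CONNECTIONS in the Django settings, and the backend above reads the index settings from settings.ELASTICSEARCH_INDEX_SETTINGS. A minimal sketch follows; the module path myproject.search_backends, the URL, and the index name are assumptions for illustration, and the placement of the 'settings' key inside ELASTICSEARCH_INDEX_SETTINGS is inferred from the fragment in the question.

```python
# settings.py (sketch; module path, URL, and index name are hypothetical)
HAYSTACK_CONNECTIONS = {
    "default": {
        # Dotted path to the BaseEngine subclass defined above.
        "ENGINE": "myproject.search_backends.CHElasticsearchSearchEngine",
        "URL": "http://127.0.0.1:9200/",
        "INDEX_NAME": "haystack",
    },
}

# Passed to create_index() by ConfigurableESBackend.setup().
ELASTICSEARCH_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            # The char_filter / analyzer / filter definitions
            # from the question's configuration go here.
        },
    },
}
```

Note that changing analysis settings generally requires recreating the index and reindexing before the new analyzer takes effect.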

1 answer:

Answer 0 (score: 6)

OK, this had nothing to do with the configuration. The problem was in the .txt template used for Haystack indexing.

I had:

{{ object.some_model.name_en }}
{{ object.some_model.name_fr }}

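This caused characters such as ' to be converted to HTML entities (&#39;), so the indexed text never matched the search and no results were found. Using the "safe" filter solved the problem: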

{{ object.some_model.name_en|safe }}
{{ object.some_model.name_fr|safe }}
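The effect can be illustrated without Django at all. The sketch below uses the standard library's html.escape as a stand-in for template autoescaping (an assumption: Django's own entity spelling may differ from html.escape's &#x27;), and the function names are hypothetical.

```python
import html


def render_escaped(value: str) -> str:
    """Stand-in for an autoescaped template variable: {{ value }}."""
    return html.escape(value, quote=True)


def render_safe(value: str) -> str:
    """Stand-in for {{ value|safe }}: the text is emitted unchanged."""
    return value


name_fr = "plan d'action"

# The escaped rendering no longer contains a literal apostrophe,
# so this is the text that would end up in the search index.
print(render_escaped(name_fr))

# The unescaped rendering keeps the apostrophe intact.
print(render_safe(name_fr))
```

Since the index template produces plain text for the search engine rather than HTML for a browser, skipping the escaping there is reasonable. An alternative worth considering is wrapping the template body in {% autoescape off %} ... {% endautoescape %} instead of marking each variable individually.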