Haystack search on a very short character field

Time: 2014-10-09 00:52:31

Tags: python django django-haystack

I am building a search engine with Haystack, and one feature I'm working on is letting people filter by a version field, defined like this:

version = indexes.CharField(model_attr="version")

Versions are short strings and are not restricted to the semantic "x.y.z" style; one may be as simple as "1".

Unfortunately, after some experimenting, it looks like Haystack ignores filters shorter than 3 characters. So this:

SearchQuerySet().filter(version="1")

will actually return nothing, while this:

SearchQuerySet().filter(content="foo").filter(version="1")

will return everything that matches the first filter.

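After more experimenting, I found that this depends on the length of the string, not on whether the value is numeric. So these all behave the same way: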

SearchQuerySet().filter(version="1")
SearchQuerySet().filter(version="a")
SearchQuerySet().filter(version="1a")

And these work (provided some item has its version set to "100"):

SearchQuerySet().filter(version=100)
SearchQuerySet().filter(version="100")

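Obviously I don't want this level of granularity on every field, but is there any way to say that, for a particular field, I want filtering to work even on a single character?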

1 answer:

Answer 0: (score: 4)

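I wrote this answer with the whoosh backend in mind, but the same approach can be applied to other backends by studying their rules.

django-haystack uses StemmingAnalyzer, imported from whoosh.analysis, for Text (char) fields in the build_schema method of WhooshSearchBackend. If you look at whoosh.analysis.StemmingAnalyzer, you can see that its minsize parameter defaults to 2, which is why you can't filter on a single character. We need to override the build_schema method of WhooshSearchBackend and set the minsize parameter of StemmingAnalyzer to 1.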
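To make the effect of minsize concrete, here is a minimal pure-Python sketch of the kind of length filtering described above. It is not Whoosh's actual implementation: the function name surviving_tokens, the simplified tokenizer, and the tiny SAMPLE_STOP_WORDS set are assumptions made for illustration only.

```python
import re

# Tiny illustrative stop list (an assumption); Whoosh's real STOP_WORDS is larger.
SAMPLE_STOP_WORDS = frozenset(["a", "an", "and", "the", "of"])


def surviving_tokens(text, minsize=2, stoplist=SAMPLE_STOP_WORDS):
    """Return the tokens that remain searchable after length and stop-word filtering.

    Hypothetical stand-in for an analyzer: tokenization here is a plain
    word split, which is simpler than what Whoosh really does.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if len(t) >= minsize and t not in stoplist]


print(surviving_tokens("1"))             # [] : dropped, shorter than minsize=2
print(surviving_tokens("100"))           # ['100']
print(surviving_tokens("1", minsize=1))  # ['1'] : kept once minsize is lowered
```

With the default minsize of 2 the one-character term never reaches the index or the query, so there is nothing left to match on; lowering minsize to 1 lets it through.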

Put this code in search_backends.py:

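from haystack.backends.whoosh_backend import WhooshEngine, WhooshSearchBackend, WHOOSH_ID, ID, DJANGO_CT, DJANGO_ID, Schema, IDLIST, TEXT, KEYWORD, NUMERIC, BOOLEAN, DATETIME, NGRAM, NGRAMWORDS
# SearchBackendError is raised at the end of build_schema, so it must be imported too.
from haystack.exceptions import SearchBackendError

from whoosh.analysis import StemmingAnalyzer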

class CustomSearchBackend(WhooshSearchBackend):
    def build_schema(self, fields):
        schema_fields = {
            ID: WHOOSH_ID(stored=True, unique=True),
            DJANGO_CT: WHOOSH_ID(stored=True),
            DJANGO_ID: WHOOSH_ID(stored=True),
        }
        # Grab the number of keys that are hard-coded into Haystack.
        # We'll use this to (possibly) fail slightly more gracefully later.
        initial_key_count = len(schema_fields)
        content_field_name = ''

        for field_name, field_class in fields.items():
            if field_class.is_multivalued:
                if field_class.indexed is False:
                    schema_fields[field_class.index_fieldname] = IDLIST(stored=True, field_boost=field_class.boost)
                else:
                    schema_fields[field_class.index_fieldname] = KEYWORD(stored=True, commas=True, scorable=True, field_boost=field_class.boost)
            elif field_class.field_type in ['date', 'datetime']:
                schema_fields[field_class.index_fieldname] = DATETIME(stored=field_class.stored)
            elif field_class.field_type == 'integer':
                schema_fields[field_class.index_fieldname] = NUMERIC(stored=field_class.stored, type=int, field_boost=field_class.boost)
            elif field_class.field_type == 'float':
                schema_fields[field_class.index_fieldname] = NUMERIC(stored=field_class.stored, type=float, field_boost=field_class.boost)
            elif field_class.field_type == 'boolean':
                # Field boost isn't supported on BOOLEAN as of 1.8.2.
                schema_fields[field_class.index_fieldname] = BOOLEAN(stored=field_class.stored)
            elif field_class.field_type == 'ngram':
                schema_fields[field_class.index_fieldname] = NGRAM(minsize=3, maxsize=15, stored=field_class.stored, field_boost=field_class.boost)
            elif field_class.field_type == 'edge_ngram':
                schema_fields[field_class.index_fieldname] = NGRAMWORDS(minsize=2, maxsize=15, at='start', stored=field_class.stored, field_boost=field_class.boost)
            else:
                schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=StemmingAnalyzer(minsize=1), field_boost=field_class.boost)

            if field_class.document is True:
                content_field_name = field_class.index_fieldname

        # Fail more gracefully than relying on the backend to die if no fields
        # are found.
        if len(schema_fields) <= initial_key_count:
            raise SearchBackendError("No fields were found in any search_indexes. Please correct this before attempting to search.")

        return (content_field_name, Schema(**schema_fields))

class CustomWhooshEngine(WhooshEngine):
    backend = CustomSearchBackend

Now we need to tell haystack to use our CustomSearchBackend:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'search_backends.CustomWhooshEngine',
        'PATH': os.path.join(os.path.dirname(__file__), 'whoosh_index'),
    },
}

After doing this, run the rebuild_index or update_index command. You should then be able to filter on a single character, except for the letter a, because a is also in STOP_WORDS. If you also want to allow the single character a, you need to pass your own STOP_WORDS, with the letter a removed, in build_schema, like this:

from whoosh.analysis import STOP_WORDS
STOP_WORDS = frozenset([el for el in STOP_WORDS if len(el) > 1])  # remove all single letter stop words

class CustomSearchBackend(WhooshSearchBackend):
    def build_schema(self, fields):
        # rest of code
        # ------
        else:
            schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=StemmingAnalyzer(minsize=1, stoplist=STOP_WORDS), field_boost=field_class.boost)

Note: the build_schema code may differ depending on the haystack version. The code above was tested with haystack==2.0.0 and whoosh=2.4.
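The stop-word step in the answer can be tried in isolation. The sketch below uses a small hand-written SAMPLE_STOP_WORDS set as a stand-in for whoosh.analysis.STOP_WORDS, and the helper names drop_single_letter_stop_words and is_searchable are hypothetical, introduced only for this illustration.

```python
# Small stand-in for whoosh.analysis.STOP_WORDS (an assumption; the real set is larger).
SAMPLE_STOP_WORDS = frozenset(["a", "an", "and", "the", "of", "to"])


def drop_single_letter_stop_words(stop_words):
    """Same comprehension as in the answer: keep only stop words longer than one character."""
    return frozenset(word for word in stop_words if len(word) > 1)


def is_searchable(term, stoplist, minsize=1):
    """Hypothetical check: a term survives if it is long enough and not a stop word."""
    return len(term) >= minsize and term not in stoplist


pruned = drop_single_letter_stop_words(SAMPLE_STOP_WORDS)

print(is_searchable("a", SAMPLE_STOP_WORDS))  # False: "a" is still a stop word
print(is_searchable("a", pruned))             # True: single-letter stop words were removed
print(is_searchable("the", pruned))           # False: longer stop words are still filtered
```

Only single-letter entries are removed, so ordinary stop words keep being filtered while a version such as "a" becomes searchable.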