I'm building a search engine with Haystack, and one feature I'm working on lets people filter by a version field, defined like this:
version = indexes.CharField(model_attr="version")
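Versions are short strings and are not restricted to semantic versioning. They may follow the "x.y.z" style, or be as simple as "1".
Unfortunately, after some experimentation, it looks like Haystack ignores filters shorter than 3 characters. So this: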
SearchQuerySet().filter(version="1")
returns nothing at all, while this:
SearchQuerySet().filter(content="foo").filter(version="1")
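returns everything that matches the first filter.
After some more experimentation, I found that this depends on the string length, not on whether the value is numeric. So these all behave the same way: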
SearchQuerySet().filter(version="1")
SearchQuerySet().filter(version="a")
SearchQuerySet().filter(version="1a")
These work (provided some item has version set to "100"):
SearchQuerySet().filter(version=100)
SearchQuerySet().filter(version="100")
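Obviously I don't need this level of granularity for every field, but is there any way to say that, for a specific field, I want filtering to work even on a single character?
Answer 0 (score: 4)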
My answer assumes the whoosh backend, but the same approach can be applied to other backends by studying their rules.
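The minsize parameter of StemmingAnalyzer defaults to 2, so you cannot filter on a single character. We need to override the build_schema method of WhooshSearchBackend and set the minsize parameter of StemmingAnalyzer to 1.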
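To see why minsize matters, here is a minimal pure-Python sketch of a length-based token filter. It is only a simplified stand-in for what an analyzer with a minimum token size does, not Whoosh's actual implementation; the function name simple_analyze and the tokenizing regex are assumptions made for illustration.

```python
import re


def simple_analyze(text, minsize=2, stoplist=frozenset()):
    """Toy analyzer (hypothetical): lowercase the text, split it into
    word tokens, then drop tokens that are shorter than minsize or
    that appear in stoplist."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if len(t) >= minsize and t not in stoplist]


# With minsize=2 the single-character query term is discarded,
# so there is nothing left to match against.
print(simple_analyze("1"))             # []

# With minsize=1 the same term survives analysis.
print(simple_analyze("1", minsize=1))  # ['1']

# Longer values are unaffected either way.
print(simple_analyze("100"))           # ['100']
```

If the query term is removed during analysis, the filter has nothing to search for, which is consistent with the empty results described in the question.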
Put this code in search_backends.py:
from haystack.backends.whoosh_backend import (
    WhooshEngine, WhooshSearchBackend, WHOOSH_ID, ID, DJANGO_CT, DJANGO_ID,
    Schema, IDLIST, TEXT, KEYWORD, NUMERIC, BOOLEAN, DATETIME, NGRAM, NGRAMWORDS,
)
from haystack.exceptions import SearchBackendError  # raised in build_schema when no fields are found
from whoosh.analysis import StemmingAnalyzer
class CustomSearchBackend(WhooshSearchBackend):
    def build_schema(self, fields):
        schema_fields = {
            ID: WHOOSH_ID(stored=True, unique=True),
            DJANGO_CT: WHOOSH_ID(stored=True),
            DJANGO_ID: WHOOSH_ID(stored=True),
        }
        # Grab the number of keys that are hard-coded into Haystack.
        # We'll use this to (possibly) fail slightly more gracefully later.
        initial_key_count = len(schema_fields)
        content_field_name = ''

        for field_name, field_class in fields.items():
            if field_class.is_multivalued:
                if field_class.indexed is False:
                    schema_fields[field_class.index_fieldname] = IDLIST(stored=True, field_boost=field_class.boost)
                else:
                    schema_fields[field_class.index_fieldname] = KEYWORD(stored=True, commas=True, scorable=True, field_boost=field_class.boost)
            elif field_class.field_type in ['date', 'datetime']:
                schema_fields[field_class.index_fieldname] = DATETIME(stored=field_class.stored)
            elif field_class.field_type == 'integer':
                schema_fields[field_class.index_fieldname] = NUMERIC(stored=field_class.stored, type=int, field_boost=field_class.boost)
            elif field_class.field_type == 'float':
                schema_fields[field_class.index_fieldname] = NUMERIC(stored=field_class.stored, type=float, field_boost=field_class.boost)
            elif field_class.field_type == 'boolean':
                # Field boost isn't supported on BOOLEAN as of 1.8.2.
                schema_fields[field_class.index_fieldname] = BOOLEAN(stored=field_class.stored)
            elif field_class.field_type == 'ngram':
                schema_fields[field_class.index_fieldname] = NGRAM(minsize=3, maxsize=15, stored=field_class.stored, field_boost=field_class.boost)
            elif field_class.field_type == 'edge_ngram':
                schema_fields[field_class.index_fieldname] = NGRAMWORDS(minsize=2, maxsize=15, at='start', stored=field_class.stored, field_boost=field_class.boost)
            else:
                # The only change from the stock backend: minsize=1 keeps single-character tokens.
                schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=StemmingAnalyzer(minsize=1), field_boost=field_class.boost)

            if field_class.document is True:
                content_field_name = field_class.index_fieldname

        # Fail more gracefully than relying on the backend to die if no fields
        # are found.
        if len(schema_fields) <= initial_key_count:
            raise SearchBackendError("No fields were found in any search_indexes. Please correct this before attempting to search.")

        return (content_field_name, Schema(**schema_fields))


class CustomWhooshEngine(WhooshEngine):
    backend = CustomSearchBackend
Now we need to tell Haystack to use our CustomSearchBackend, through CustomWhooshEngine, in the Django settings:
import os

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'search_backends.CustomWhooshEngine',
        'PATH': os.path.join(os.path.dirname(__file__), 'whoosh_index'),
    },
}
After doing this, run the rebuild_index and update_index commands, and you should be able to filter on a single character, except for the letter a, because a is also in STOP_WORDS. If you also want to allow the single character a, you need to pass your own STOP_WORDS, with single letters such as a removed, in build_schema like this:
from whoosh.analysis import STOP_WORDS

STOP_WORDS = frozenset([el for el in STOP_WORDS if len(el) > 1])  # remove all single letter stop words


class CustomSearchBackend(WhooshSearchBackend):
    def build_schema(self, fields):
        # rest of code
        # ------
            else:
                schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=StemmingAnalyzer(minsize=1, stoplist=STOP_WORDS), field_boost=field_class.boost)
Note: the code may vary depending on the Haystack version. The code above was tested with whoosh=2.4 and haystack==2.0.0.
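As a quick check of the stop-word step, the filtering expression can be tried on its own. This is a minimal sketch that uses a small hypothetical sample list (SAMPLE_STOP_WORDS) in place of whoosh.analysis.STOP_WORDS, so it runs without Whoosh installed; the helper name drop_single_letter_stop_words is also an assumption for illustration.

```python
# Hypothetical sample for illustration; the real list is
# whoosh.analysis.STOP_WORDS and contains more entries.
SAMPLE_STOP_WORDS = frozenset(["a", "an", "and", "the", "of"])


def drop_single_letter_stop_words(stop_words):
    """Return a new frozenset without one-character stop words."""
    return frozenset(word for word in stop_words if len(word) > 1)


custom_stop_words = drop_single_letter_stop_words(SAMPLE_STOP_WORDS)

print("a" in SAMPLE_STOP_WORDS)   # True: "a" would be dropped as a stop word
print("a" in custom_stop_words)   # False: "a" now survives analysis
print(sorted(custom_stop_words))  # ['an', 'and', 'of', 'the']
```

Multi-character stop words stay in the list, so only single-character filtering is affected.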