Question

I have an index schema like the following:

from whoosh.fields import Schema, TEXT, ID, NUMERIC

schema = Schema(
    title=TEXT(stored=True),
    content=TEXT,
    id=ID,
    topicID=NUMERIC,
)

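I first fetch the documents for a topic t using searcher.documents(topicID=t). This returns hits. I would like to get a bag-of-words representation of each hit's content field, for example [(u'This',1),(u'is',1),(u'a',1),(u'document',1)] when content=u'This is a document'.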

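If there is a more general way to get a bag-of-words (or TF-IDF) representation in Whoosh, perhaps for documents rather than hits, that would also be acceptable.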

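Edit: I would prefer a solution that precomputes the bag-of-words/TF-IDF at indexing time, so that getting the representation is a one-line function call or variable lookup, rather than processing on the spot every time I want the representation.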

Answer 1

This is already implemented in whoosh.reading.IndexReader:

whoosh.reading.IndexReader.frequency(fieldname, text)

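Returns the total number of instances of the given term in the collection.

whoosh.reading.IndexReader.doc_frequency(fieldname, text)

Returns the number of documents in which the given term occurs.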

To iterate over all indexed terms, use:

whoosh.reading.IndexReader.all_terms()

Yields (fieldname, text) tuples for every term in the index.
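To make the difference between the two counts concrete, here is a minimal pure-Python sketch over a toy corpus. It does not call Whoosh: docs, frequency, doc_frequency and tf_idf are hypothetical stand-ins that mimic what the reader methods above report, assuming naive lowercased whitespace tokenization. The TF-IDF formula shown (raw count times log(N / df)) is one common variant, not necessarily the weighting Whoosh applies.

```python
import math
from collections import Counter

# Toy corpus standing in for the indexed "content" field (hypothetical data).
docs = [
    "this is a document",
    "this is another document about a document",
    "something else entirely",
]

# Per-document term counts, using naive whitespace tokenization.
term_counts = [Counter(text.lower().split()) for text in docs]


def frequency(term):
    """Total occurrences of `term` across the whole collection."""
    return sum(counts[term] for counts in term_counts)


def doc_frequency(term):
    """Number of documents that contain `term` at least once."""
    return sum(1 for counts in term_counts if term in counts)


def tf_idf(docnum):
    """TF-IDF weights for one document: raw count * log(N / df)."""
    n_docs = len(docs)
    return {
        term: count * math.log(n_docs / doc_frequency(term))
        for term, count in term_counts[docnum].items()
    }


print(frequency("document"))      # prints 3 (1 in doc 0, 2 in doc 1)
print(doc_frequency("document"))  # prints 2 (docs 0 and 1)
print(tf_idf(0))
```

Keep in mind that a real index reports counts for the terms its analyzer produced, which may be lowercased or filtered and so differ from the raw words in the text.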

Answer 2

You can use a Counter:

from collections import Counter

content = u'This is a document'
bow = Counter(content.split())

which gives

Counter({'This': 1, 'a': 1, 'is': 1, 'document': 1})
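The question asks for a list of (word, count) tuples rather than a Counter. A minimal sketch of that conversion, assuming plain whitespace tokenization is good enough (no lowercasing, stemming, or stop-word removal); bag_of_words is a hypothetical helper name, not part of Whoosh:

```python
from collections import Counter


def bag_of_words(content):
    """Return (word, count) pairs in order of first appearance."""
    # Counter is a dict subclass, so it keeps insertion order on Python 3.7+.
    return list(Counter(content.split()).items())


print(bag_of_words(u"This is a document"))
# [('This', 1), ('is', 1), ('a', 1), ('document', 1)]
```

This works on the raw text at query time, so it does not cover the request for an index-time representation. It also needs the original text to be available; in the schema from the question, content is not declared with stored=True, so the text would have to come from somewhere other than the index.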

See the Python documentation for collections.Counter.

Edit: forgot some brackets.

How do I get a bag-of-words representation of document content using Whoosh?
