Question

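I'm new to Python. I'm trying to implement a positional index using nested dictionaries, but I'm not sure whether that is a workable approach. The index should contain the term, the term frequency, the document ID, and the term positions.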

Example:

dict = {term: {termfreq: {docid: {[pos1,pos2,...]}}}}

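My question is: am I on the right track, or is there a better solution to my problem? If nested dictionaries are the way to go, I have a further question: how do I get a single item out of the dictionary, for example the term frequency of a term (without all the other information about that term)? Any help is much appreciated.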

Answer 1

It seems each term has a term frequency, a doc id, and a list of positions. Is that right? If so, you could use a dict of dicts:

dct = { 'wassup' : {
            'termfreq' : 'daily',
            'docid' : 1,
            'pos' : [3,4] }}

Then, given a term such as 'wassup', you could look up the term frequency with

dct['wassup']['termfreq']
# 'daily'
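
For a positional index specifically, one possible layout (a sketch, not the only option) is term -> doc id -> list of positions, with the term frequency derived from the length of the position list instead of being stored under its own key. The helper names build_positional_index and term_frequency and the sample documents are made up for illustration, and the tokenization is a naive lowercase whitespace split:

```python
from collections import defaultdict


def build_positional_index(docs):
    """Return a mapping term -> {doc_id -> [positions]}.

    `docs` is assumed to be a dict of doc_id -> text (hypothetical input shape).
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def term_frequency(index, term, doc_id):
    """Count occurrences of `term` in `doc_id`; 0 if either is absent."""
    # .get() avoids creating empty entries in the defaultdict.
    return len(index.get(term, {}).get(doc_id, []))


docs = {
    1: 'to be or not to be',
    2: 'to do is to be',
}
index = build_positional_index(docs)

print(dict(index['be']))               # {1: [1, 5], 2: [4]}
print(term_frequency(index, 'to', 1))  # 2
```

Because the frequency is computed from the positions, the two can never disagree, and asking for just the frequency does not require reading the rest of the entry.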

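Think of a dict as a phone book. It is great for looking up a value (phone number) given a key (name). It is not so good at looking up a key given a value. Use a dict when you know you only need to look things up in one direction. If your lookup patterns are more complex, you may need some other data structure (a database, perhaps?).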
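
To make the phone book analogy concrete, here is a small sketch with made-up names and numbers. The forward lookup is a single hash lookup, the reverse lookup has to scan every entry, and if reverse lookups are frequent one option is to build a second dict once:

```python
# Hypothetical sample data for illustration only.
phone_book = {'alice': '555-0100', 'bob': '555-0199', 'carol': '555-0100'}

# Forward lookup (key -> value): what a dict is built for.
print(phone_book['alice'])


def names_for(number, book):
    """Reverse lookup (value -> keys): scans every item in the dict."""
    return [name for name, num in book.items() if num == number]


print(names_for('555-0100', phone_book))

# If reverse lookups are common, pay the scan once and keep a second dict.
# Values are lists because several names may share one number.
reverse_book = {}
for name, num in phone_book.items():
    reverse_book.setdefault(num, []).append(name)

print(reverse_book['555-0199'])
```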

You might also want to check out the Natural Language Toolkit (nltk). It has a built-in method for calculating tf_idf:

import nltk

# Given a corpus of texts
text1 = 'Lorem ipsum FOO dolor BAR sit amet'
text2 = 'Ut enim ad FOO minim veniam, '
text3 = 'Duis aute irure dolor BAR in reprehenderit '
text4 = 'Excepteur sint occaecat BAR cupidatat non proident'

# We split the texts into tokens, and form a TextCollection
mytexts = (
    [nltk.word_tokenize(text) for text in [text1, text2, text3, text4]])
mycollection = nltk.TextCollection(mytexts)

# Given a new text
text = 'et FOO tu BAR Brute'
tokens = nltk.word_tokenize(text)

# for each token (roughly, word) in the new text, we compute the tf_idf
for word in tokens:
    print('{w}: {s}'.format(w = word,
                            s = mycollection.tf_idf(word,tokens)))

yields

et: 0.0
FOO: 0.138629436112
tu: 0.0
BAR: 0.0575364144904
Brute: 0.0
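
If you want to see where those numbers come from without installing nltk, here is a pure-Python sketch. It assumes tf is the term count divided by the text length and idf is the natural log of (number of texts / number of texts containing the term), which is my understanding of what TextCollection does; the function names are my own, and str.split() stands in for nltk.word_tokenize:

```python
import math


def tf(term, tokens):
    """Term frequency: share of tokens in this text equal to `term`."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Inverse document frequency over a list of token lists."""
    matches = sum(1 for doc in corpus if term in doc)
    # A term that appears in no text gets idf 0.0 instead of dividing by zero.
    return math.log(len(corpus) / matches) if matches else 0.0


def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


corpus = [text.split() for text in [
    'Lorem ipsum FOO dolor BAR sit amet',
    'Ut enim ad FOO minim veniam, ',
    'Duis aute irure dolor BAR in reprehenderit ',
    'Excepteur sint occaecat BAR cupidatat non proident',
]]
tokens = 'et FOO tu BAR Brute'.split()

for word in tokens:
    print('{w}: {s}'.format(w=word, s=tf_idf(word, tokens, corpus)))
```

FOO appears in 2 of the 4 texts and BAR in 3 of the 4, and each makes up 1 of the 5 tokens in the new text, so the scores are 0.2 * ln(4/2) and 0.2 * ln(4/3), which agree with the FOO and BAR values above up to rounding.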
