I want a data structure of the form {(Document_name, term): (count of the term in the document)}, so I created a dictionary keyed by a namedtuple:
from collections import namedtuple
Doc_term = namedtuple("Doc_term", ["Doc", "term"])
Doc_term_count = {}
...
Doc_term_count[k] = {Doc_term(Doc_names[start_index + i], vocab[j]): row[j]}
k = k + 1
print Doc_term_count
It gives me the following data structure:
{0: {Doc_term(Doc='book1.txt', term='be'): 1},
1: {Doc_term(Doc='book1.txt', term='script'): 1},
2: {Doc_term(Doc='book1.txt', term='this'): 1},
3: {Doc_term(Doc='book1.txt', term='is'): 1},
4: {Doc_term(Doc='book1.txt', term='there'): 1},
5: {Doc_term(Doc='book1.txt', term='wordcount'): 1},
6: {Doc_term(Doc='book2.txt', term='hello'): 2},
7: {Doc_term(Doc='book2.txt', term='to'): 1},
8: {Doc_term(Doc='book2.txt', term='book'): 1},
9: {Doc_term(Doc='book3.txt', term='read'): 1},
10: {Doc_term(Doc='book3.txt', term='by'): 1},
11: {Doc_term(Doc='book3.txt', term='first'): 1}}
I want to find the number of documents that contain a given term, using filter or some other search, something like:
Dtn = filter( lambda ndoc: Doc_term.term=='be', Doc_term_count)
print Dtn
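It gives me an empty list. Please suggest where I am going wrong. As I understand it, I am creating an indexed array while filter's lambda function expects a list, but when I try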
Doc_term_count[(booknames[start_index + i], vocab[j])].append(row[j])
it gives me the error: KeyError: ('book1.txt', 'be'). I think it does not accept a tuple as a key.
Answer 0 (score: 1)
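I believe you are building Doc_term_count incorrectly: you just want to map your namedtuple to the count. Without knowing how you compute Doc_names and the row indices, my guess is that what you want is: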
Doc_term_count[Doc_term(Doc_names[start_index + i], vocab[j])] = row[j]
instead of
Doc_term_count[k] = {Doc_term(Doc_names[start_index + i], vocab[j]): row[j]}
The first approach should produce a dictionary like this:
Doc_term_count = {
    Doc_term(Doc='book1.txt', term='be'): 1,
    Doc_term(Doc='book1.txt', term='script'): 1,
    Doc_term(Doc='book1.txt', term='this'): 1,
    Doc_term(Doc='book1.txt', term='is'): 1,
    Doc_term(Doc='book1.txt', term='there'): 1,
    Doc_term(Doc='book1.txt', term='wordcount'): 1,
    Doc_term(Doc='book2.txt', term='hello'): 2,
    Doc_term(Doc='book2.txt', term='to'): 1,
    Doc_term(Doc='book2.txt', term='book'): 1,
    Doc_term(Doc='book3.txt', term='read'): 1,
    Doc_term(Doc='book3.txt', term='by'): 1,
    Doc_term(Doc='book3.txt', term='first'): 1
}
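For reference, here is a minimal sketch of how such a flat dictionary could be built. The loop in the question is elided, so the sample data below (doc_names, vocab, rows) is hypothetical and exists only to make the snippet runnable. It uses print() calls so that it also runs on Python 3.

```python
from collections import namedtuple

Doc_term = namedtuple("Doc_term", ["Doc", "term"])

# Hypothetical sample data standing in for the question's Doc_names, vocab and rows.
doc_names = ["book1.txt", "book2.txt"]
vocab = ["be", "hello", "to"]
rows = [
    [1, 0, 0],  # term counts for book1.txt
    [0, 2, 1],  # term counts for book2.txt
]

Doc_term_count = {}
for doc_name, row in zip(doc_names, rows):
    for term, count in zip(vocab, row):
        if count:  # skip terms that do not occur in this document
            # Plain assignment creates the key; no prior entry is needed.
            Doc_term_count[Doc_term(doc_name, term)] = count

print(Doc_term_count[Doc_term("book2.txt", "hello")])  # 2
print(len(Doc_term_count))  # 3
```

Note that the KeyError in the question does not mean tuples are rejected as keys. It is raised because .append is called on the value of a key that has not been stored yet; assigning with = as above avoids that.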
You can then use a plain tuple to look up values:
print Doc_term_count[('book1.txt', 'be')] # prints 1
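As for the filter attempt in the question: Doc_term.term == 'be' inspects an attribute of the namedtuple class itself rather than of any key, so the test is never true, and filter over a dictionary iterates over its keys. With the flat dictionary the keys are the namedtuples, so counting the documents that contain a term could be sketched as follows. The helper name docs_containing is hypothetical, and the sample entries are a few of those shown earlier.

```python
from collections import namedtuple

Doc_term = namedtuple("Doc_term", ["Doc", "term"])

# A few entries taken from the dictionary shown earlier.
Doc_term_count = {
    Doc_term("book1.txt", "be"): 1,
    Doc_term("book1.txt", "is"): 1,
    Doc_term("book2.txt", "hello"): 2,
    Doc_term("book3.txt", "read"): 1,
}

def docs_containing(term, counts):
    """Return the set of document names that have an entry for `term`."""
    return {key.Doc for key in counts if key.term == term}

matches = docs_containing("be", Doc_term_count)
print(len(matches))     # 1
print(sorted(matches))  # ['book1.txt']

# The filter-based equivalent: the lambda must inspect the key it receives.
hits = list(filter(lambda key: key.term == "be", Doc_term_count))
print(hits)
```

A set is used so that each document is counted once, which matches the "number of documents" the question asks for.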