I'm new to Python and I'm trying to read a text file into two dictionaries whose values are lists.
The file contains the following:
term1 doc1 doc3 doc4
term2 doc5 doc1
term3 doc6 doc2
I'm trying to build two dictionaries from the same file: one with terms as keys and lists of documents as values, and the other the reverse.
inverted_index = {}
forward_index = {}
with open('term_sample.txt') as file:
    for line in file:
        items = line.split()
        term, doc = items[0], items[1:]
        for doc in items[1:]:
            # Each assignment replaces the previous value with a new one-element list
            inverted_index[term] = [doc]
            forward_index[doc] = [term]
print(inverted_index)
print(forward_index)
So far, this is the output I get:
{'term2': ['doc1'], 'term1': ['doc4'], 'term3': ['doc2']}
{'doc3': ['term1'], 'doc6': ['term3'], 'doc4': ['term1'], 'doc5': ['term2'], 'doc1': ['term2'], 'doc2': ['term3']}
But this is the output I'm looking for:
{'term1': ['doc1','doc3','doc4'], 'term2': ['doc5','doc1'], 'term3': ['doc6','doc2']}
{'doc1': ['term1','term2'], 'doc3': ['term1'], 'doc4': ['term1'], 'doc5': ['term2'], 'doc6': ['term3'], 'doc2': ['term3']}
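Please help me solve this!
Answer 0 (score: 3)
You don't need to add to inverted_index inside the inner loop; just assign it once per line.
In the inner loop, if the dictionary entry already exists, you need to append to it rather than overwrite it.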
inverted_index = {}
forward_index = {}
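with open('term_sample.txt') as file:
    for line in file:
        items = line.split()
        term, docs = items[0], items[1:]
        # One assignment per line: the term maps to all docs on that line
        inverted_index[term] = docs
        for doc in docs:
            # Create the list on first sight of doc, then append to it
            forward_index.setdefault(doc, []).append(term)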
print(inverted_index)
print(forward_index)
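To try this approach without a file on disk, here is a minimal sketch. The function name `build_indexes` and the use of `io.StringIO` as a stand-in for `term_sample.txt` are assumptions made for illustration; it also uses `setdefault(...).extend(...)` for inverted_index so that a term repeated on a later line would keep its earlier docs, which goes slightly beyond the answer above.

```python
import io

# In-memory stand-in for term_sample.txt (same three lines as in the question)
SAMPLE = """term1 doc1 doc3 doc4
term2 doc5 doc1
term3 doc6 doc2
"""

def build_indexes(lines):
    """Return (inverted_index, forward_index) built from 'term doc doc ...' lines."""
    inverted_index = {}
    forward_index = {}
    for line in lines:
        items = line.split()
        if not items:  # skip blank lines
            continue
        term, docs = items[0], items[1:]
        # Extend rather than assign, so a repeated term accumulates its docs
        inverted_index.setdefault(term, []).extend(docs)
        for doc in docs:
            forward_index.setdefault(doc, []).append(term)
    return inverted_index, forward_index

inverted_index, forward_index = build_indexes(io.StringIO(SAMPLE))
print(inverted_index)
print(forward_index)
```

With the sample data this should give 'doc1' the list ['term1', 'term2'] in forward_index, which is the part the original code was losing.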
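Answer 1 (score: 1)
You can use defaultdict(list) from the collections module. In your solution, the stored list is replaced every time a key is updated; with a defaultdict you append to it instead: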
#!/usr/bin/env python
from collections import defaultdict
inverted_index = defaultdict(list)
forward_index = defaultdict(list)
with open('term_sample.txt') as file:
    for line in file:
        items = line.split()
        term, docs = items[0], items[1:]
        for doc in docs:
            # Missing keys start out as an empty list automatically
            inverted_index[term].append(doc)
            forward_index[doc].append(term)
print(inverted_index)
print(forward_index)
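Two details of defaultdict are worth knowing when using it this way; the sketch below uses made-up keys purely for illustration. Printing a defaultdict shows the factory along with the contents, so wrapping it in `dict()` gives output in the plain form the question asks for. Also, indexing a missing key inserts an empty list as a side effect, whereas `.get()` does not.

```python
from collections import defaultdict

forward_index = defaultdict(list)
forward_index["doc1"].append("term1")
forward_index["doc1"].append("term2")

# Convert to a plain dict so the printed form has no defaultdict wrapper
plain = dict(forward_index)
print(plain)  # {'doc1': ['term1', 'term2']}

# Lookups with [] create missing keys; .get() leaves the mapping untouched
missing_before = "doc9" in forward_index
_ = forward_index.get("doc9")
still_missing = "doc9" in forward_index
_ = forward_index["doc9"]
present_after = "doc9" in forward_index
print(missing_before, still_missing, present_after)  # False False True
```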
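Answer 2 (score: 1)
inverted_index should not be assigned inside the inner for loop, and for forward_index you were replacing the previous value on every iteration of the inner for loop. Try the following code: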
inverted_index = {}
forward_index = {}
with open('test') as f:
    for line in f:
        items = line.split()
        term, docs = items[0], items[1:]
        inverted_index[term] = docs
        for doc in docs:
            # Fetch the existing list (or a new one), append, and store it back
            terms = forward_index.get(doc, [])
            terms.append(term)
            forward_index[doc] = terms
print(inverted_index)
print(forward_index)
Answer 3 (score: 1)
As 'coder' suggested, I would also use defaultdict here. Since a doc may occur more than once for a term, you should use a set to avoid duplicates:
from collections import defaultdict
inverted_index = defaultdict(set)
forward_index = defaultdict(list)
with open('term_sample.txt') as file:
    for line in file:
        items = line.split()
        term, docs = items[0], items[1:]
        # set.update adds every doc on the line, ignoring ones already present
        inverted_index[term].update(docs)
        for doc in docs:
            forward_index[doc].append(term)
print(inverted_index)
print(forward_index)
(As Barmar suggested, you only need to update inverted_index once per line, in the outer loop.)
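One trade-off of a set is that it does not keep the docs in file order. If order matters, a possible alternative (not part of the answer above) is to collect the docs as dictionary keys, which are unique and keep insertion order on Python 3.7+. The sample lines below are made up to include repeats.

```python
from collections import defaultdict

# Hypothetical input in which docs repeat within and across lines
lines = ["term1 doc1 doc3 doc1", "term1 doc4 doc3"]

inverted_index = defaultdict(dict)
for line in lines:
    term, *docs = line.split()
    for doc in docs:
        # Dict keys act as an ordered set; the value is unused
        inverted_index[term][doc] = None

# Turn each inner dict back into a list of unique docs in first-seen order
result = {term: list(docs) for term, docs in inverted_index.items()}
print(result)  # {'term1': ['doc1', 'doc3', 'doc4']}
```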