I am working on a vector space model; the dataset consists of 50 text files. I loop over them, split them into words, and store the words in a dictionary. Now I want to use a nested dictionary, like:
dictionary = {
    someword: {Doc1: 23, Doc21: 2, Doc34: 3},
    someword: {Doc1: 23, Doc21: 2, Doc34: 3},
    someword: {Doc1: 23, Doc21: 2, Doc34: 3}
}
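(Written as actual Python with string keys, that structure would look something like the snippet below; the words and numbers are just placeholders, not real data:)

# Desired shape: word -> {document name -> frequency in that document}
dictionary = {
    "someword": {"Doc1": 23, "Doc21": 2, "Doc34": 3},
    "another":  {"Doc5": 7},
}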
But when I run the program, it only keeps replacing the document entry; it never computes the frequency by adding up the number of times "someword" occurs in a particular document.
for iterator in range(1, 51):
    f = open(directory + str(iterator) + ext, "r")
    for line in f.read().lower().split():
        line = getwords(line)
        for word in line:
            if check(word, stopwords) == 0:
                if existence(word, terms, iterator) != 1:
                    terms[word] = {}
                    terms[word]["Doc"+str(iterator)] = 1
                else:
                    terms[word]["Doc"+str(iterator)] = int(terms[word]["Doc"+str(iterator)]) + 1
    f.close()
The existence function is:
def existence(tok, diction, iteration):
    if tok in diction:
        temp = "Doc"+str(iteration)
        if temp in diction:
            return 1
        else:
            return 0
    else:
        return 0
The result looks something like this:
{'blunder': {'Doc1': 1}, 'by': {'Doc50': 1}, 'anton': {'Doc27': 1}, 'chekhov': {'Doc27': 1}, 'an': {'Doc50': 1}, 'illustration': {'Doc48': 1}, 'story': {'Doc48': 1}, 'author': {'Doc48': 1}, 'portrait'...
Answer 0 (score: 0)
Do you want to know how many times each word occurs in each file? All you need is a defaultdict of Counters, both provided by the collections module.

I think you have the right idea: iterate over the files, read them line by line, and split into words. Here is the important part you need help with:

from collections import defaultdict, Counter
from string import punctuation
fnames = ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']
word_counter = defaultdict(Counter)
for fname in fnames:
    with open(fname, 'r') as txt:
        for line in txt:
            words = line.lower().strip().split()
            for word in words:
                word = word.strip(punctuation)
                if word:
                    word_counter[word][fname] += 1
The data in word_counter then looks like this:
{
    'within': {
        '1.txt': 2,
    },
    'we': {
        '1.txt': 3,
        '2.txt': 2,
        '3.txt': 2,
        '4.txt': 2,
        '5.txt': 4,
    },
    'do': {
        '1.txt': 7,
        '2.txt': 8,
        '3.txt': 8,
        '4.txt': 6,
        '5.txt': 5,
    },
    ...
}
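Counter behaves like a regular dict for lookups, and missing entries simply count as zero, so querying the result is straightforward. A quick usage sketch based on the sample counts above:

word_counter['do']['3.txt']        # 8  -- occurrences of 'do' in 3.txt
word_counter['within']['2.txt']    # 0  -- missing files simply count as zero
sum(word_counter['we'].values())   # 13 -- total occurrences of 'we' across all files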
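For reference, the reason the original code never gets past a count of 1: existence checks temp in diction, i.e. whether the "DocN" key exists among the words themselves, instead of inside diction[tok]. That check always fails, so the terms[word] = {} branch runs every time and resets the count. A minimal sketch of a fix, keeping the names from the question, could be:

def existence(tok, diction, iteration):
    # Look for the "DocN" key inside the per-word dictionary,
    # not in the outer dictionary of words.
    return 1 if tok in diction and "Doc" + str(iteration) in diction[tok] else 0

The update loop should also only create the inner dictionary when the word is new, so counts from earlier documents are not wiped out:

if word not in terms:
    terms[word] = {}
doc = "Doc" + str(iterator)
terms[word][doc] = terms[word].get(doc, 0) + 1

With the .get(doc, 0) pattern, the separate existence check is no longer needed at all.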