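I built a dictionary of lists whose keys are strings. I read 10 documents and used each unique term (word) as a key, storing its per-document counts in a list. For example, word_tokens["abc"] = ["1:4","5:2","8:5"]
means the word "abc" appears 4 times in document 1, 2 times in document 5, and 5 times in document 8.
My code: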
import nltk
from nltk.tokenize import word_tokenize

stop_words_file = open("englishST.txt", 'r')
stop_words = []
for st in stop_words_file:
    st = st.strip()
    stop_words.append(st)
stop_words_file.close()

fileName = "docs-1/doc-"
word_tokens = {}  # dictionary object
cnt = 0
for i in range(1, 10):
    file_name = fileName + str(i)
    file = open(file_name, 'r')
    for sentence in file:
        word = []
        word = word_tokenize(sentence)
        for w in word:
            w = w.lower()
            if w not in stop_words:
                if word_tokens.get(w) == None:
                    # first time this word is seen in any document
                    dummy = []
                    dummy.append(str(i) + ":1")
                    word_tokens[w] = dummy
                else:
                    dummy = []
                    dummy = word_tokens[w]
                    tempStr = dummy[-1]
                    temp = tempStr.split(':')
                    if temp[0] == str(i):
                        # word already counted for document i: increment
                        temp[1] = str(int(temp[1]) + 1)
                        dummy[-1] = temp[0] + ':' + temp[1]
                        word_tokens[w] = dummy
                    else:
                        # first occurrence of the word in document i
                        dummy = word_tokens[w]
                        dummy.append(str(i) + ":1")
                        word_tokens[w] = dummy
                cnt = cnt + 1
    file.close()
    # dict_count and dictFileName are defined elsewhere (not shown here)
    if len(word_tokens) != 0:
        print(dict_count)
        fname = dictFileName + str(dict_count)
        f = open(fname, "w+")
        f.write(str(word_tokens))
        f.close()
    j = 1
    for key, val in word_tokens.items():
        print(j, key, val)
        j = j + 1
print(word_tokens)
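Printing the dictionary directly shows no duplicate keys, but when I iterate over it with a for loop I get the same key several times. It looks as if I would have to remove the duplicate keys and merge all of their values under a single key.
Output of print(word_tokens):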
{'neurobeachin': ['1:1'], '(': ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1'], 'nbea': ['1:6'], ')': ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1'], 'regulates': ['1:1'], 'neuronal': ['1:1'], 'membrane': ['1:1'], 'protein': ['1:1', '8:2'], 'trafficking': ['1:1'], 'required': ['1:1'], 'development': ['1:1', '2:1', '6:1', '7:1', '9:2'],...... }
Output of for key,val in word_tokens.items():
1 neurobeachin ['1:1']
2 ( ['1:5']
3 nbea ['1:6']
4 ) ['1:5']
.....
102 obesity ['1:1']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7']
3 nbea ['1:6']
4 ) ['1:5', '2:7']
......
220 investigation ['2:1']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3']
......
296 products ['3:1']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19']
...............
I want to iterate over each (key, value) pair once, but I get the output above. Can someone suggest the correct way to do this?
Answer 0 (score: 1)
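I'm not familiar with the nltk library. However, I think the reason you're seeing "duplicates" is that your for key,val in word_tokens.items() loop is nested under for i in range(1,10), so the whole dictionary is printed once per document.
Have you tried moving for key,val in word_tokens.items() out of the outer loop, so that it runs only after all documents have been processed?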
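A minimal sketch of that change, assuming the documents are already in memory as a hypothetical docs mapping (document number to text) and using str.split() as a stand-in for word_tokenize; stop-word filtering is left out. The index is built inside the outer loop and iterated only once, after that loop has finished:

```python
def build_index(docs):
    """Return {word: ["doc:count", ...]} for a {doc_id: text} mapping."""
    word_tokens = {}
    for i, text in docs.items():
        for w in text.lower().split():  # stand-in for word_tokenize
            postings = word_tokens.setdefault(w, [])
            if postings and postings[-1].split(":")[0] == str(i):
                # Same document as the last entry: bump its count.
                doc_id, count = postings[-1].split(":")
                postings[-1] = doc_id + ":" + str(int(count) + 1)
            else:
                # First occurrence of this word in document i.
                postings.append(str(i) + ":1")
    return word_tokens


# Hypothetical sample documents, not the asker's data.
docs = {1: "protein membrane protein", 2: "membrane development"}
word_tokens = build_index(docs)

# Iterate once, after the outer loop has finished.
for j, (key, val) in enumerate(word_tokens.items(), start=1):
    print(j, key, val)
```

With the print loop outside, each key is listed exactly once, with its complete list of counts.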
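The code block below is fairly large, but it is there to illustrate why I think you're running into this problem. Besides fixing the nested loop, you should also get into the habit of using with open() rather than calling open() and close() yourself, so that the file is handled by a context manager.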
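For the context-management point, here is a small sketch of reading the stop-word file with with open(). The helper name load_stop_words is hypothetical; the file name englishST.txt comes from the question. The file is closed automatically when the block exits, even if an exception is raised:

```python
def load_stop_words(path):
    """Read one stop word per line and return them as a set."""
    with open(path, "r") as stop_words_file:
        # The file is closed when this block exits, even on error.
        return {line.strip() for line in stop_words_file if line.strip()}
```

Calling load_stop_words("englishST.txt") would replace the manual open/close pair in the question. Returning a set rather than a list is a design choice of this sketch: it makes the w not in stop_words check cheaper.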
I took your word_tokens dictionary and simply ran your loop over it (without the token parsing, of course) and got the result you want:
>>> word_tokens = {'neurobeachin': ['1:1'], '(': ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1'], 'nbea': ['1:6'], ')': ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1'], 'regulates': ['1:1'], 'neuronal': ['1:1'], 'membrane': ['1:1'], 'protein': ['1:1', '8:2'], 'trafficking': ['1:1'], 'required': ['1:1'], 'development': ['1:1', '2:1', '6:1', '7:1', '9:2']}
>>> j = 1
>>> for key, value in word_tokens.items():
...     print(j, key, value)
...     j = j + 1
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
>>>
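Now to test my hypothesis (sort of... because in your code the dictionary is technically still growing while it is being looped over inside the nested loop):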
>>> for _ in range(1, 10):
...     j = 1
...     for key, value in word_tokens.items():
...         print(j, key, value)
...         j = j + 1
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
>>>
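The session above uses a fixed dictionary. To mirror the growing case more closely, here is a sketch with hypothetical in-memory documents (docs) and a simplified index that only records which documents contain each word. It counts how often each key shows up when the inner loop stays nested: no single pass contains a duplicate key, but keys added early appear again on every later pass.

```python
from collections import Counter

# Hypothetical sample documents, not the asker's data.
docs = {1: "alpha beta", 2: "beta gamma", 3: "gamma delta"}

word_tokens = {}
times_printed = Counter()
pass_sizes = []

for i, text in docs.items():
    for w in text.split():
        doc_ids = word_tokens.setdefault(w, [])
        if i not in doc_ids:
            doc_ids.append(i)
    # Nested iteration: runs once per document, over everything seen so far.
    keys_this_pass = list(word_tokens)
    # Within a single pass, every key is unique.
    assert len(keys_this_pass) == len(set(keys_this_pass))
    pass_sizes.append(len(keys_this_pass))
    times_printed.update(keys_this_pass)

print(pass_sizes)           # size of the dictionary at each pass
print(dict(times_printed))  # number of passes in which each key appeared
```

The pass sizes grow from one document to the next, which matches the shape of the output in the question, where the numbered listing restarts at 1 and gets longer each time.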