Question

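I built a dictionary of lists whose keys are strings. I went through 10 documents, used each unique term (word) as a key, and stored its occurrences in a list. For example, word_tokens["abc"] = ["1:4","5:2","8:5"] means the word "abc" appears 4 times in document 1, 2 times in document 5, and 5 times in document 8.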

My code:

import nltk
from nltk.tokenize import word_tokenize


stop_words_file = open("englishST.txt",'r')

stop_words = []
for st in stop_words_file:
    st = st.strip()
    stop_words.append(st)

stop_words_file.close()


fileName = "docs-1/doc-"
word_tokens = {}          # dictionary object
cnt = 0
for i in range(1,10):
    file_name = fileName + str(i)

    file = open(file_name,'r')

    for sentence in file:
        word = []
        word = word_tokenize(sentence)
        for w in word:
            w = w.lower()
            if w not in stop_words:

                if word_tokens.get(w) == None:
                    dummy = []
                    dummy.append(str(i)+":1")
                    word_tokens[w] = dummy
                else:
                    dummy = []
                    dummy = word_tokens[w]
                    tempStr = dummy[-1]
                    temp = tempStr.split(':')
                    if temp[0] == str(i):
                        temp[1] = str(int(temp[1])+1)
                        dummy[-1] = temp[0]+':'+temp[1]
                        word_tokens[w] = dummy
                    else:  
                        dummy = word_tokens[w]
                        dummy.append(str(i)+":1")
                        word_tokens[w] = dummy
                cnt = cnt+1

    file.close()
    if len(word_tokens) != 0:
        print(dict_count)
        fname = dictFileName + str(dict_count)
        f = open(fname, "w+")
        f.write(str(word_tokens))
        f.close()

    j = 1
    for key,val in word_tokens.items():
        print(j,key,val)
        j = j + 1


print(word_tokens)

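When I print the dictionary directly, no key appears more than once. But when I iterate over the dictionary with a for loop, I get the same key multiple times, so it looks as if I have to remove the duplicate keys and merge all of their values into a single key.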

When I write print(word_tokens), I get:

{'neurobeachin': ['1:1'], '(': ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1'], 'nbea': ['1:6'], ')': ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1'], 'regulates': ['1:1'], 'neuronal': ['1:1'], 'membrane': ['1:1'], 'protein': ['1:1', '8:2'], 'trafficking': ['1:1'], 'required': ['1:1'], 'development': ['1:1', '2:1', '6:1', '7:1', '9:2'],...... }

When I write for key,val in word_tokens.items():, I get:

1 neurobeachin ['1:1']
2 ( ['1:5']
3 nbea ['1:6']
4 ) ['1:5']
.....
102 obesity ['1:1']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7']
3 nbea ['1:6']
4 ) ['1:5', '2:7']
......
220 investigation ['2:1']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3']
......
296 products ['3:1']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19']
...............

I want to iterate over each (key, value) pair, but I get the output above. Can someone suggest the correct way to do this?

Answer 1

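I'm not familiar with the nltk library. However, I think the reason you are seeing "duplicates" is that your for key,val in word_tokens.items() loop is nested under for i in range(1,10), so the whole dictionary is printed once per document.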

Have you tried moving for key,val in word_tokens.items() out of the inner level, so that it runs after the for i in range(1,10) loop has finished?

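The code blocks below are fairly large, but they are there to illustrate why I think you are running into this problem. Besides fixing the nested loop, you should also try to use with open() for context management instead of calling open() and close() by hand.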

I took your word_tokens dictionary and simply ran your loop on it (without parsing any tokens, of course) and got the result you want:

>>> word_tokens = {'neurobeachin': ['1:1'], '(': ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1'], 'nbea': ['1:6'], ')': ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1'], 'regulates': ['1:1'], 'neuronal': ['1:1'], 'membrane': ['1:1'], 'protein': ['1:1', '8:2'], 'trafficking': ['1:1'], 'required': ['1:1'], 'development': ['1:1', '2:1', '6:1', '7:1', '9:2']}
>>> j = 1
>>> for key, value in word_tokens.items():
        print (j, key, value)
        j = j + 1


1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
>>>

Now to test my hypothesis (sort of... since in your code the dictionary would actually keep growing between passes of the nested loop):

>>> for _ in range(1, 10):
        j = 1
        for key, value in word_tokens.items():
            print (j, key, value)
            j = j + 1


1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
1 neurobeachin ['1:1']
2 ( ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
3 nbea ['1:6']
4 ) ['1:5', '2:7', '3:3', '4:19', '5:5', '7:1', '8:2', '9:1']
5 regulates ['1:1']
6 neuronal ['1:1']
7 membrane ['1:1']
8 protein ['1:1', '8:2']
9 trafficking ['1:1']
10 required ['1:1']
11 development ['1:1', '2:1', '6:1', '7:1', '9:2']
>>>
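To make the suggested restructuring concrete, here is a minimal sketch rather than a drop-in replacement for your script. It builds the whole index first and only then iterates over it once. The names build_index and load_docs are hypothetical helpers I made up for illustration, and str.split() is used as a stand-in for nltk's word_tokenize so the example runs without nltk or your files; the sample documents are invented.

```python
from collections import defaultdict


def load_docs(prefix, doc_ids):
    """Hypothetical helper: read each file inside a `with` block so it is always closed."""
    docs = {}
    for doc_id in doc_ids:
        with open(prefix + str(doc_id), "r") as handle:
            docs[doc_id] = handle.read()
    return docs


def build_index(docs, stop_words):
    """Map each term to a list of "doc_id:count" strings, in order of first appearance."""
    counts = defaultdict(dict)  # term -> {doc_id: count}
    for doc_id, text in docs.items():
        # Assumption: whitespace splitting stands in for word_tokenize(sentence).
        for token in text.lower().split():
            if token in stop_words:
                continue
            counts[token][doc_id] = counts[token].get(doc_id, 0) + 1
    return {
        term: [f"{doc_id}:{n}" for doc_id, n in per_doc.items()]
        for term, per_doc in counts.items()
    }


# With real files this might be: docs = load_docs("docs-1/doc-", range(1, 10))
docs = {1: "nbea regulates protein trafficking nbea", 2: "the protein nbea"}
stop_words = {"the"}  # a set makes the membership test cheap
word_tokens = build_index(docs, stop_words)

# Iterate exactly once, after every document has been processed.
for j, (key, val) in enumerate(word_tokens.items(), start=1):
    print(j, key, val)
```

Because the printing loop sits outside the per-document loop, each key is printed a single time with its complete list. As a side note, range(1, 10) only covers documents 1 to 9; if there really are ten files, range(1, 11) would be needed.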
