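I am taking a text file as input and writing a function that finds the most frequently occurring word. If two or more words are tied for the highest count, I want to print all of them.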
def wordOccurance(userFile):
    userFile.seek(0)
    line = userFile.readline()
    lines = []
    while line != "":
        if line != "\n":
            line = line.lower()  # make lower case
            line = line.rstrip("\n")  # strip the trailing newline
            line = line.rstrip("?")  # strip a trailing "?" from the line
            line = line.rstrip("!")  # strip a trailing "!" from the line
            line = line.rstrip(".")  # strip a trailing "." from the line
            line = line.split(" ")  # split the line on spaces
            lines.append(line)
        line = userFile.readline()  # keep reading lines from the document
    words = lines
    wordDict = {}  # word counts, built from the cleaned lines above
    for word in words:
        if word in wordDict.keys():
            wordDict[word] = wordDict[word] + 1
        else:
            wordDict[word] = 1
    largest_value = max(wordDict.values())
    for k in wordDict.keys():
        if wordDict[k] == largest_value:
            print(k)
    return wordDict
Please help me with this function.
Answer 0 (score: 0)
On this line you create a list of strings:
line = line.split(" ") #splits the texts into space
You then append it to a list, so you end up with a list of lists:
lines.append(line)
Later you iterate over that list of lists and try to use each sublist as a dictionary key:
for word in words:
    if word in wordDict.keys():  # fails here: `word` is a list, and a list is unhashable, so it cannot be looked up or used as a key
        wordDict[word] = wordDict[word] + 1
    else:
        wordDict[word] = 1
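To see the failure in isolation, here is a minimal sketch. The helper name try_list_key and the sample sublist are made up for illustration; the exact wording of the error message can vary between Python versions.

```python
def try_list_key():
    """Return the error message raised when a list is used as a dict key."""
    word_dict = {}
    sublist = ["words", "on", "line", "one"]  # what one entry of `lines` looks like
    try:
        # The membership test has to hash `sublist`, so the TypeError is raised here
        if sublist in word_dict.keys():
            word_dict[sublist] += 1
        else:
            word_dict[sublist] = 1
    except TypeError as exc:
        return str(exc)
    return None


print(try_list_key())
```

The message mentions an unhashable type, which is Python's way of saying that a mutable list cannot serve as a dictionary key.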
A simple fix is to flatten the list of lists first:
words = [item for sublist in lines for item in sublist]
for word in words:
    if word in wordDict.keys():
        wordDict[word] = wordDict[word] + 1
    else:
        wordDict[word] = 1
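The list comprehension [item for sublist in lines for item in sublist] iterates over the lines list, then loops over each sublist created by line.split(" "), and returns a new list containing the items from every sublist. In your case, lines might look like this: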
[['words', 'on', 'line', 'one'], ['words', 'on', 'line', 'two']]
The list comprehension turns it into this:
['words', 'on', 'line', 'one', 'words', 'on', 'line', 'two']
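To check this yourself, here is a small runnable sketch using the sample data above. The itertools.chain.from_iterable line is an alternative from the standard library that is not part of the original suggestion; it is included only to show that it produces the same flat list.

```python
from itertools import chain

lines = [['words', 'on', 'line', 'one'], ['words', 'on', 'line', 'two']]

# The comprehension from the answer: outer loop over sublists, inner loop over items
words = [item for sublist in lines for item in sublist]

# Standard-library alternative that builds the same flat list
words_chained = list(chain.from_iterable(lines))

print(words)
print(words == words_chained)
```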
If you would rather use something less complicated, you can use nested loops:
# words = lines
# just use `lines` in your for loop instead of creating an identical list
wordDict = {}  # word counts, built from the cleaned lines above
for line in lines:
    for word in line:
        if word in wordDict.keys():
            wordDict[word] = wordDict[word] + 1
        else:
            wordDict[word] = 1
largest_value = max(wordDict.values())
This may be a little less efficient and/or less "Pythonic", but it may be easier to wrap your head around.
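For comparison, a more compact version could lean on collections.Counter from the standard library. This is only a sketch: it assumes lines is a list of word lists like the one built in the question, and most_common_words is a hypothetical helper name, not something from the original code.

```python
from collections import Counter


def most_common_words(lines):
    """Return (words tied for the highest count, that count) for a list of word lists."""
    # Counter tallies each word as the generator walks every sublist
    counts = Counter(word for line in lines for word in line)
    if not counts:
        return [], 0  # empty input: nothing to report
    largest_value = max(counts.values())
    # Keep every word whose count equals the maximum, so ties are all reported
    tied = [word for word, n in counts.items() if n == largest_value]
    return tied, largest_value


lines = [["words", "on", "line", "one"], ["words", "on", "line", "two"]]
tied, largest_value = most_common_words(lines)
print(tied, largest_value)
```

Returning the tied words instead of printing them inside the function keeps the counting logic easy to test; the caller can still loop over the result and print each word.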
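Also, you may want to split each line into words before cleaning the data, because if you clean the line first, you only remove punctuation at the end of the line rather than at the end of each word. Depending on the nature of your data, though, this may not be necessary.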
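A minimal sketch of that split-then-clean order follows. The helper name clean_words and the punctuation set "?!.," are assumptions for illustration; it also uses strip rather than rstrip, so punctuation is removed from both ends of each word.

```python
def clean_words(line):
    """Split a line into lowercase words, stripping punctuation from each word."""
    cleaned = []
    # split() with no argument splits on any whitespace, including the trailing newline
    for raw in line.lower().split():
        word = raw.strip("?!.,")  # assumed punctuation set; extend as needed
        if word:  # skip tokens that consisted only of punctuation
            cleaned.append(word)
    return cleaned


print(clean_words("Hello, world! Is this line one? Yes.\n"))
```

Because each word is cleaned individually, punctuation in the middle of a line ("Hello," and "world!") is handled the same way as punctuation at the end of it.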