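I have three text files opened and read into my program. Each contains a speech from which I want to extract the most frequent words. I have converted the words to lowercase, and I am trying to remove the unwanted ("dirty") words, which I have put in a list, before exporting the words to an Excel spreadsheet for further analysis.
I have tried several options from several websites, but I am stuck.
This is what I have: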
hitList = ["am", "as", "is", "of", "the", "it", "or", "and", "to", "I", "a", "have", "you", "we", "they", "It's", "don't", "our", "so", "for", "-", ".", "but", "out"]
txt = file.read().lower()
words = txt.split()
x = {}
sumChars = len(words)
sumLines = txt.count("\n")
# Iterate through the words and append the list with new words
for i in words:
    if i in x:
        try:
            if x not in hitList:
                x[i] += 1
        except:
            print("Oops... A word could not be added to the list.")
            break
            killCall()
    else: x[i] = 1
lst = [(x[i], i) for i in x]
lst.sort()
lst.reverse()
sumWords = sum(x[i] for i in x)
strsumChars = str(sumChars)
strsumLines = str(sumLines)
strsumWords = str(sumWords)
# Convert the final list 'x' into lowercase values to ensure proper sorting
print(filename + " contains " + strsumChars + " characters.")
print(filename + " contains " + strsumLines + " lines.")
print(filename + " contains " + strsumWords + " words. \n\n")
print("The 30 most frequent words in " + filename + " are: ")
g = 1
for count, word in lst[:50]:
    op = print('%2s. %4s %s' % (g, count, word))
    g+=1
if savesheet == "Cleveland_Aug62015":
    workbook = xlwt.Workbook()
    col2 = "Word Count"
    col3 = "Words"
    worksheet = workbook.add_sheet("Cleveland_Aug62015", cell_overwrite_ok = True)
    worksheet.write(0,0, col2)
    worksheet.write(0,1, col3)
    try:
        for h, l in enumerate(lst[:50], start = 1):
            for j, col in enumerate(l):
                worksheet.write(h, j, col)
        print("\n" + savesheet + " exported to Excel...")
    except: print("\n" + savesheet + " unable to be saved to Excel...")
    workbook.save(xlsfile + "_" + savesheet + ".xls")
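The rest of the program (the variables that open the text files and so on) is omitted; I have only posted the problem area here. I am still annotating the code, so I have not added error trapping everywhere yet.
The main problem I am having is with this part: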
# Iterate through the words and append the list with new words
for i in words:
    if i in x:
        try:
            if x not in hitList:
                x[i] += 1
        except:
            print("Oops... A word could not be added to the list.")
            break
            killCall()
    else: x[i] = 1
lst = [(x[i], i) for i in x]
lst.sort()
lst.reverse()
I am trying to remove the dirty words before the output list is created, but they still show up.
Any help is greatly appreciated.
Brandon
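Answer 0 (score: 0)
In the if-else statement in the lower code block you posted, you test whether the word is in hitList in the if branch but not in the else branch. As a result, every word you want to exclude still gets added once, with a count of 1.
Applying the same guard in the else branch as in the if branch should help. Better still, wrap the whole if-else in a single membership test. Note that the test has to be made on the current word (if i not in hitList:) rather than on the dictionary (if x not in hitList:), because the dictionary x is never an element of hitList, so that condition is always true and filters nothing.
Also, as Andrea Corbellini pointed out, using Counter can greatly simplify your code.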
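To make the "guard before counting" idea concrete, here is a minimal sketch that keeps the plain-dictionary approach from the question. The function name count_words, the variable blocked, and the sample sentence are illustrative assumptions, not part of the original program.

```python
def count_words(text, hit_list):
    """Count each word in text, skipping any word found in hit_list."""
    blocked = set(hit_list)  # a set gives fast membership tests
    counts = {}
    for word in text.lower().split():
        if word in blocked:
            # Filter before counting, so neither the first nor any later
            # occurrence of a blocked word reaches the dictionary.
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample text, used only to show the behaviour.
sample = "the cat and the dog and the bird"
counts = count_words(sample, ["the", "and"])
print(sorted(counts.items()))
```

Because the membership test runs before either branch of the counting logic, there is only one place where a blocked word could slip through, which is the weakness of the original if-else.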
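Answer 1 (score: 0)
Your program contains several errors, is inefficient, and misses several Python idioms. Here is a rework of the first part of the program (assuming the Python 3 print function):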
from collections import Counter
from operator import itemgetter
# ...
MOST_FREQUENT = 30
hitList = {"am", "as", "is", "of", "the", "it", "or", "and", "to", "i", "a", "have", "you", "we", "they", "it's", "don't", "our", "so", "for", "-", ".", "but", "out"}
rawText = file.read().lower()
sumLines = rawText.count("\n")
words = rawText.split()
sumChars = sum(len(word) for word in words) # includes hitList
# sumChars = sum(len(word) for word in words if word not in hitList) # excludes hitList
wordCounts = Counter([word for word in words if word not in hitList])
sumWords = sum(wordCounts.values())
print(filename, "contains", sumChars, "characters.")
print(filename, "contains", sumLines, "lines.")
print(filename, "contains", sumWords, "words.", end="\n\n\n")
wordHistogram = wordCounts.most_common(MOST_FREQUENT)
print("The", MOST_FREQUENT, "most frequent words in", filename, "are:")
for g, (word, count) in enumerate(wordHistogram[:MOST_FREQUENT]):
    print('%2s. %4s %s' % (g + 1, count, word))
# ...
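One detail in the rework is easy to miss: the text is lowercased before it is split, so entries such as "I" and "It's" in the question's hitList can never match, which is why they appear as "i" and "it's" in the set above. The following self-contained sketch lowercases the hit list itself so the two cannot drift apart. The function name top_words and the sample sentence are assumptions made for illustration only.

```python
from collections import Counter


def top_words(text, hit_list, limit):
    """Return the `limit` most common words in text, ignoring hit_list."""
    # Lowercase the hit list too, since the text is lowercased below.
    blocked = {entry.lower() for entry in hit_list}
    words = text.lower().split()
    filtered = Counter(word for word in words if word not in blocked)
    return filtered.most_common(limit)


# Hypothetical sample text; mixed-case hit list entries still match.
sample = "I think it's fine and I think we agree"
result = top_words(sample, ["I", "It's", "and", "we"], 2)
print(result)
```

Note that most_common returns (word, count) pairs, the reverse of the (count, word) tuples built in the question, so any export code that writes the tuples column by column would need its column order adjusted to match.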