How to remove a list of unwanted words from the output

Time: 2016-04-17 16:41:30

Tags: python

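I have three text files that are opened and read into my program. Each contains a speech from which I extract the most frequently used words. I have converted the words to lowercase, and I am trying to get rid of the unwanted words, for which I created a list (hitList), before exporting the words to an Excel spreadsheet for further analysis.

I have tried multiple options from multiple websites, but I am stuck.

This is what I have: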

hitList = ["am", "as", "is", "of", "the", "it", "or", "and", "to", "I", "a", "have", "you", "we", "they", "It's", "don't", "our", "so", "for", "-", ".", "but", "out"]

txt = file.read().lower()
words = txt.split()
x = {}
sumChars = len(words)
sumLines = txt.count("\n")
# Iterate through the words and append the list with new words
for i in words:
    if i in x:
        try:
            if x not in hitList:
                x[i] += 1
        except:
            print("Oops... A word could not be added to the list.")
            break
            killCall()
    else: x[i] = 1
lst = [(x[i], i) for i in x]
lst.sort()
lst.reverse()
sumWords = sum(x[i] for i in x)
strsumChars = str(sumChars)
strsumLines = str(sumLines)
strsumWords = str(sumWords)
# Convert the final list 'x' into lowercase values to ensure proper sorting
print(filename + " contains " + strsumChars + " characters.")
print(filename + " contains " + strsumLines + " lines.")
print(filename + " contains " + strsumWords + " words. \n\n")
print("The 30 most frequent words in " + filename + " are: ")
g = 1
for count, word in lst[:50]:
    op = print('%2s.  %4s %s' % (g, count, word))
    g+=1

if savesheet == "Cleveland_Aug62015":
    workbook = xlwt.Workbook()
    col2 = "Word Count"
    col3 = "Words"
    worksheet = workbook.add_sheet("Cleveland_Aug62015", cell_overwrite_ok = True)
    worksheet.write(0,0, col2)
    worksheet.write(0,1, col3)
    try:
        for h, l in enumerate(lst[:50], start = 1):
            for j, col in enumerate(l):
                worksheet.write(h, j, col)
        print("\n" + savesheet + " exported to Excel...")
    except: print("\n" + savesheet + " unable to be saved to Excel...")
    workbook.save(xlsfile + "_" + savesheet + ".xls")

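All the other variables, the code that opens the text files, and so on are omitted; I have posted only the problem area here. I am still working on it, so I do not have error trapping around everything yet.

The main problem I am having is here: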

# Iterate through the words and append the list with new words
for i in words:
    if i in x:
        try:
            if x not in hitList:
                x[i] += 1
        except:
            print("Oops... A word could not be added to the list.")
            break
            killCall()
    else: x[i] = 1
lst = [(x[i], i) for i in x]
lst.sort()
lst.reverse()

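I am trying to remove the unwanted words before the output list is built, but they still show up.

Any help is greatly appreciated.

Brandon

2 answers:

Answer 0: (score: 0)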

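In the if-else statement in the lower block of code you posted, you test whether the word in question is in hitList in the if branch but not in the else branch. As a result, every word you do not want is still added at least once.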
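
A minimal sketch of that behaviour, using a made-up sentence and a shortened hitList (both are assumptions for illustration, not the asker's data). It also shows a second problem in the guard: `x not in hitList` tests the dictionary `x` rather than the word `i`, and a dict never compares equal to any string in the list, so the test is always true.

```python
# Hypothetical sample data, not taken from the original speeches
hitList = ["the", "and", "of"]
words = "the cat and the dog and the bird".split()

x = {}
for i in words:
    if i in x:
        # x is the dict, not the current word, so this test always passes
        if x not in hitList:
            x[i] += 1
    else:
        # No guard in this branch: every word gets in at least once
        x[i] = 1

print(x)
```

With this input, "the" and "and" end up in `x` with their full counts, which matches the symptom described in the question.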

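Giving the else branch the same guard as the if branch should help. Or, better still, wrap the whole if-else in if i not in hitList: (testing the word i, not the dictionary x).

Also, as Andrea Corbellini pointed out, using Counter would simplify your code considerably.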
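
Both suggestions can be sketched as follows. This is an illustration under assumptions (the sample sentence and the shortened hitList are invented), not a drop-in replacement for the full program.

```python
from collections import Counter

# Hypothetical sample data; a set makes membership tests fast
hitList = {"the", "and", "of"}
words = "the cat and the dog and the bird".split()

# Option 1: keep the manual loop, but guard the whole if-else
x = {}
for i in words:
    if i not in hitList:
        if i in x:
            x[i] += 1
        else:
            x[i] = 1

# Option 2: let Counter do the counting on the filtered words
counts = Counter(word for word in words if word not in hitList)

# Same (count, word) list as in the question, highest count first
lst = sorted(((count, word) for word, count in x.items()), reverse=True)
print(lst)
```

Either way the unwanted words never enter the dictionary, so nothing downstream (the sorted list or the Excel export) needs to filter them again.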

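Answer 1: (score: 0)

Your program contains several errors, is inefficient, and misses several Python idioms. Here is a rework of the first part of the program (assuming Python 3's print function):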

from collections import Counter
from operator import itemgetter

# ...

MOST_FREQUENT = 30

hitList = {"am", "as", "is", "of", "the", "it", "or", "and", "to", "i", "a", "have", "you", "we", "they", "it's", "don't", "our", "so", "for", "-", ".", "but", "out"}

rawText = file.read().lower()

sumLines = rawText.count("\n")

words = rawText.split()

sumChars = sum(len(word) for word in words)  # includes hitList
# sumChars = sum(len(word) for word in words if word not in hitList)  # excludes hitList

wordCounts = Counter([word for word in words if word not in hitList])

sumWords = sum(wordCounts.values())

print(filename, "contains", sumChars, "characters.")
print(filename, "contains", sumLines, "lines.")
print(filename, "contains", sumWords, "words.", end="\n\n\n")

wordHistogram = wordCounts.most_common(MOST_FREQUENT)

print("The", MOST_FREQUENT, "most frequent words in", filename, "are:")

for g, (word, count) in enumerate(wordHistogram, start=1):
    print('%2s.  %4s %s' % (g, count, word))

# ...
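
One detail in the rework above is easy to miss: the entries "I" and "It's" from the original list appear there as "i" and "it's". Because the text is lowercased before it is split, a mixed-case entry can never match. A small sketch of normalizing the list automatically, assuming a made-up sample sentence and a shortened list (rawHitList is a hypothetical name):

```python
from collections import Counter

# Hypothetical mixed-case list, as in the original question
rawHitList = ["I", "It's", "the", "and"]
# Lowercase every entry once and store the result in a set
hitList = {word.lower() for word in rawHitList}

rawText = "I think it's the best and I know it".lower()
words = rawText.split()

# Filtering against the raw list misses "i" and "it's"
unfiltered = Counter(w for w in words if w not in rawHitList)
# Filtering against the normalized set removes them
filtered = Counter(w for w in words if w not in hitList)

print(unfiltered)
print(filtered)
```

Note that split() only separates on whitespace, so a word followed directly by punctuation (for example "it.") would still not match an entry such as "it".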