Question

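I'm trying to write a program that reads all the words in a text document named "GlassDog.txt". As it reads the words, it needs to strip all punctuation and convert every letter to lowercase. Once that is done, I want it to print each word it found along with the number of times that word is used in the document.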

Here is my code so far:

def run():
    count = {} 
    for w in open('GlassDog.txt').read().split(): 
        if w in count: 
            count[w] += 1 
        else: 
            count[w] = 1

    for word, times in count.items(): 
        print ("%s was found %d times" % (word, times)) 

run()

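This code reads the file and displays each word with its frequency. However, I can't figure out how to add code that removes punctuation and replaces uppercase letters with lowercase ones. This has probably been asked several times, but I couldn't find anything that matches what I need. Apologies if this is a duplicate.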

Answer 1

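You can call .lower() on the string before the if block to convert it to lowercase. To match only alphanumeric characters, try a regular expression; look at \w in particular.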
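
A minimal sketch of that suggestion, assuming Python 3. The function name count_words and the sample sentence are made up for illustration; in the real program the text would come from reading GlassDog.txt. Note that \w also matches digits and underscores, and that an apostrophe splits a word such as "don't" into "don" and "t".

```python
import re
from collections import Counter

def count_words(text):
    """Count lowercase words in text, ignoring punctuation (hypothetical helper)."""
    # \w+ matches runs of letters, digits and underscores,
    # so punctuation and whitespace act as separators.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

# Sample input for illustration only; replace with the file's contents.
counts = count_words("The dog barked. The DOG ran!")
for word, times in counts.items():
    print("%s was found %d times" % (word, times))
```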

Answer 2

from collections import Counter

def just_alnum(s):
    return ''.join(c for c in s if c.isalnum())

with open('GlassDog.txt', 'r') as f:
    counts = Counter(just_alnum(w.lower()) for w in f.read().split())

Answer 3

>>> msg = "Hello,World!"
>>> msg = msg.lower()  # convert to all lowercase
>>> print(msg)
hello,world!
>>> # remove every character that isn't a letter; in Python 3, filter() returns an iterator, so join it back into a string
>>> msg = ''.join(filter(lambda x: x.isalpha(), msg))
>>> print(msg)
helloworld

Answer 4

This approach is definitely not the most optimized, but I think it is robust:

>>> msg = "A   very42 dirty__ string ©."
# Replace every non-alphabetic character with a space (maybe you want isalnum() instead)
>>> msg = map(lambda x: x if x.isalpha() else ' ', msg)
# Join the characters back into a single string
>>> msg = ''.join(msg)
# Collapse runs of whitespace into single spaces
>>> msg = ' '.join(msg.split())
>>> msg
'A very dirty string'

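On large, heterogeneous input it will consume a lot of resources, so if you want something more optimized you should tune it based on what you know about your input file (for example: is punctuation always surrounded by spaces?).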

Also, you could do all of this in one line, but that might be hard for the next reader of your code to understand...
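
For reference, the steps above could be collapsed into a single expression and combined with the counting from the question, roughly like this. This is only a sketch assuming Python 3; the name word_counts and the sample string are hypothetical, and lowercasing is added here because the question asks for it.

```python
from collections import Counter

def word_counts(text):
    # Replace every non-letter with a space, lowercase the result,
    # split on whitespace, then count each word (hypothetical helper).
    return Counter(''.join(c if c.isalpha() else ' ' for c in text).lower().split())

# Sample input for illustration only.
sample = "A   very42 dirty__ string ©. A STRING!"
for word, times in word_counts(sample).items():
    print("%s was found %d times" % (word, times))
```

To use it on the file from the question, pass in the text read from GlassDog.txt instead of the sample string.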

Word frequency in a text document, excluding punctuation
