Question

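I'm trying to read a text file, strip the punctuation, make everything lowercase, and then print the total number of words, the total number of unique words (meaning, for example, that "a" is counted only once even if it appears 20 times in the text), and finally the most frequently occurring words along with their frequencies (i.e. a: 20).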

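I realize there are similar questions on StackOverflow, but I'm a beginner trying to solve this with as few imports as possible, and I'd like to know whether there is a way to code this without importing something like collections.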

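My code is below, but I don't understand why I'm not getting the answer I need. The code prints the entire text file (each word on a new line, with all punctuation removed) and then prints: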

e 1
n 1
N 1
o 1

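It looks to me as though "None" is being split into characters and each character is being counted. Why does my code give me this answer, and what can I do to change it?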

Code below:

file=open("C:\\Users\\Documents\\AllSonnets.txt", "r")


def strip_sonnets():
    import string
    new_file=file.read().split()
    for words in new_file:
        data=words.translate(string.punctuation)
        data=data.lower()
        data=data.strip(".")
        data=data.strip(",")
        data=data.strip("?")
        data=data.strip(";")
        data=data.strip("!")
        data=data.replace("'","")
        data=data.replace('"',"")
        data=data.strip(":")
        print(data)

new_file=strip_sonnets()
new_file=str(new_file)

count={}
for w in new_file:
    if w in count:
        count[w] += 1
    else:
        count[w] = 1
for word, times in count.items():
    print (word, times)

Answer 1

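If you only want to remove punctuation from the end of each word, translate is not the right tool. A collections.Counter dict will also do the word counting for you: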

from collections import Counter
from string import punctuation


with open("in.txt") as f:
    c = Counter(word.rstrip(punctuation) for line in f for word in line.lower().split())

# print each word and how many times it appears
for k, freq in c.items():
    print(k, freq)

To see the words ordered from most to least common, you can use .most_common():

for k,v in c.most_common():
    print(k,v)
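Putting those pieces together for the three numbers the question asks about, a minimal sketch might look like the following. The sample text is made up for illustration, io.StringIO stands in for the opened file so the snippet runs without AllSonnets.txt, and summarize is a hypothetical helper name, not something from the question.

```python
from collections import Counter
from io import StringIO
from string import punctuation


def summarize(lines):
    """Return (total words, unique words, Counter) for an iterable of lines."""
    counts = Counter(
        word.rstrip(punctuation)
        for line in lines
        for word in line.lower().split()
    )
    total = sum(counts.values())  # every occurrence of every word
    unique = len(counts)          # each distinct word counted once
    return total, unique, counts


# Made-up sample text standing in for the real file (an assumption).
sample = StringIO("A rose is a rose.\nIs a rose red?\n")
total, unique, counts = summarize(sample)
print(total, unique)
for word, freq in counts.most_common(2):
    print(word, freq)
```

Passing a number to most_common limits the output to that many of the top entries, which is handy when only the most frequent words are wanted.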

Without importing Counter, use dict.get (punctuation here is still the name imported from string above):

c = {}
with open("in.txt") as f:
    for line in f:
        for word in line.lower().split():
            key = word.rstrip(punctuation)
            c[key] = c.get(key, 0) + 1

Then sort by frequency:

from operator import itemgetter

for k,v in sorted(c.items(),key=itemgetter(1),reverse=True):
    print(k,v)
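If the goal is to avoid imports altogether, the same idea can be sketched with a hand-written punctuation string and a lambda as the sort key. PUNCT, count_words and the sample lines are all assumptions made for this illustration; skipping empty keys is an addition that the code above does not have.

```python
# Characters to trim from the end of each word, spelled out by hand so the
# sketch needs no imports at all (string.punctuation would also work).
PUNCT = ".,;:!?\"'()-"


def count_words(lines):
    counts = {}
    for line in lines:
        for word in line.lower().split():
            key = word.rstrip(PUNCT)
            if key:  # skip tokens that were nothing but punctuation
                counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up sample lines; in practice this would be the open file object.
lines = ["To be, or not to be:", "that is the question."]
counts = count_words(lines)

total = sum(counts.values())  # total number of words
unique = len(counts)          # number of distinct words
# Sort (word, frequency) pairs by frequency, highest first.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(total, unique)
print(ranked[:2])
```

Because sorted is stable, words with equal counts keep the order in which they first appeared in the text.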

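The reason you see None is that you set new_file=strip_sonnets() and your function doesn't return anything; any function that doesn't specify a return value returns None by default.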

You then set new_file=str(new_file), so when you loop with for w in new_file you are iterating over each character of the string "None".
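The effect is easy to reproduce in isolation. In this small sketch, no_return is a hypothetical stand-in for a function that, like strip_sonnets in the question, does some work but never returns a value:

```python
def no_return():
    words = ["shall", "i", "compare"]
    for w in words:
        w.lower()  # the result is discarded and nothing is returned


result = no_return()
as_text = str(result)  # the four-character string "None"

count = {}
for ch in as_text:  # iterating over a string yields single characters
    count[ch] = count.get(ch, 0) + 1

print(result is None, as_text, count)
```

Each of the four letters is counted once, which matches the output shown in the question.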

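You need to return the data. A return placed after the loop would hand back only the last word, so collect each cleaned word in a list and return the list:

def strip_sonnets():
    import string  # kept from the question; no longer needed once translate() is gone
    cleaned = []  # collect every cleaned word, not just the last one
    for words in file.read().split():
        # translate() call dropped: given a plain string it does not delete punctuation
        data = words.lower()
        data = data.strip(".,?;!:")  # same characters as the separate strip() calls
        data = data.replace("'", "")
        data = data.replace('"', "")
        cleaned.append(data)
    return cleaned  # return the whole list

Also drop the new_file=str(new_file) line, so that the counting loop iterates over words rather than over the characters of a string.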
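As a side note on the translate call in the question: assuming Python 3, str.translate expects a translation table, so passing string.punctuation directly does not delete punctuation. If the intent is to remove punctuation everywhere in a word, not only at the ends, a table built with str.maketrans is one way to do it. This is a sketch of an alternative, not what the code above does; delete_punct and the sample word are made up for illustration.

```python
from string import punctuation

# The third argument of str.maketrans lists the characters to delete.
delete_punct = str.maketrans("", "", punctuation)

word = "Don't,"
unchanged = word.translate(punctuation)       # a plain string is not a deletion table
cleaned = word.translate(delete_punct).lower()

print(unchanged)
print(cleaned)
```

Note that this also removes apostrophes inside words, which is what the replace("'", "") call in the question does as well.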

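I would simplify your function into a generator that yields every word lowercased and with its trailing punctuation stripped. Yielding from inside the with block keeps the file open while the words are being consumed; returning a bare generator expression from inside the with block would close the file before the first word is read:

from string import punctuation

path = "C:\\Users\\Documents\\AllSonnets.txt"

def strip_sonnets():
    with open(path, "r") as f:
        # yield from keeps f open until the caller has consumed every word
        yield from (word.lower().rstrip(punctuation) for line in f for word in line.split())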

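.rstrip(punctuation) is basically doing, for punctuation at the end of a word, what your code was trying to do with the repeated strip and replace calls.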
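To make the difference between the variants concrete, here is a small comparison of rstrip and strip on a few made-up words. rstrip only trims the right-hand end, strip trims both ends, and neither touches characters in the middle of a word:

```python
from string import punctuation

# Made-up sample words covering leading, trailing and interior punctuation.
words = ['"Hello,', "world!!", "don't", "(thee)"]

right_only = [w.rstrip(punctuation) for w in words]
both_ends = [w.strip(punctuation) for w in words]

print(right_only)
print(both_ends)
```

If leading quotes or brackets matter for your text, strip(punctuation) is probably the closer match to what the question's code was aiming for.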

Counting the total number of words and unique words in a text file - Python
