Question

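I'm writing a program that fetches a txt file from the internet and reads it. It then displays a set of statistics about that file. All of this works fine until the very end. The last thing I want to do is display the ten most frequently used words in the txt file. My current code only prints the single most common word ten times. Could someone take a look and tell me what the problem is? The only part you need to look at is the last part.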

import urllib
open = urllib.urlopen("http://www.textfiles.com/etext/FICTION/alice30.txt").read()

v = str(open)                 # this variable makes the file a string
strip = v.replace(" ", "")        # this removes all spaces
char = len(strip)    # this variable counts the number of characters in the string
ch = v.splitlines()    # this variable separates the lines

line = len(ch)         # this counts the number of lines


print "Here's the number of lines in your file:", line

wordz = v.split()
print wordz

print "Here's the number of characters in your file:", char

spaces = v.count(' ')

words = ''.join(c if c.isalnum() else ' ' for c in v).split()

words = len(words)

print "Here's the number of words in your file:", words

topten = map(lambda x:filter(str.isalpha,x.lower()),v.split())
print "\n".join(sorted(words,key=words.count)[-10:][::-1])

Answer 1

Use collections.Counter to count all the words; Counter.most_common(10) returns the ten most common words together with their counts:

wordz = v.split()
from collections import Counter
c = Counter(wordz)
print(c.most_common(10))
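
As a rough illustration of why the question's last line repeats one word, here is a small sketch on a made-up sentence (the sample text and variable names are assumptions, not taken from the question). Sorting the full word list by frequency keeps every duplicate, so the tail of the sorted list is the most frequent word over and over, while Counter keeps each distinct word once:

```python
from collections import Counter

text = "the cat and the dog and the bird"
words = text.split()

# Sorting the whole list by frequency keeps duplicates, so the last
# few items are simply the most frequent word repeated.
by_count = sorted(words, key=words.count)
repeated = by_count[-3:]

# Counter stores each distinct word once, so most_common() yields unique words.
top_two = Counter(words).most_common(2)

print(repeated)
print(top_two)
```

Note that words.count is called once per word, so the sorting approach is also quadratic in the number of words, whereas Counter makes a single pass.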

Use with to open the file and count every word in the txt file:

from collections import Counter
# The built-in open() cannot read a URL; this assumes the text was saved locally as alice30.txt
with open("alice30.txt") as f:
    c = Counter()
    for line in f:
        c.update(line.split())  # Counter.update adds to the existing counts
print(c.most_common(10))
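
To read the text straight from the URL rather than from a local file, one option is urllib.request.urlopen. The following is a minimal sketch for Python 3 (the question's code is Python 2). count_words and fetch_lines are hypothetical helper names, and the lowercasing and punctuation stripping are assumptions about what should count as a word:

```python
import string
from collections import Counter
from urllib.request import urlopen


def count_words(lines):
    """Count lowercase words in an iterable of lines, dropping punctuation at word edges."""
    counts = Counter()
    for line in lines:
        for token in line.lower().split():
            word = token.strip(string.punctuation)
            if word:
                counts[word] += 1
    return counts


def fetch_lines(url):
    """Download a text resource and return its lines (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace").splitlines()


# Offline demonstration on two made-up lines.
sample = ["Alice was beginning to get very tired,", "and Alice said: 'Oh dear!'"]
sample_counts = count_words(sample)
print(sample_counts.most_common(1))
```

With network access, count_words(fetch_lines("http://www.textfiles.com/etext/FICTION/alice30.txt")).most_common(10) would give the top ten under these assumptions.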

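To get the total number of characters in the file, sum the length of each key multiplied by the number of times it occurs (whitespace is not counted, because split() discards it):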

print(sum(len(k)*v for k,v in c.items()))
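
A quick check of that formula on a made-up string (the sample text and variable names are assumptions for illustration): the Counter-based total matches the number of non-whitespace characters, not the full length of the text.

```python
from collections import Counter

text = "a bb  a\nccc a"
counts = Counter(text.split())

# Length of each distinct word times the number of times it occurs.
from_counter = sum(len(word) * n for word, n in counts.items())

# The same figure computed directly by dropping all whitespace.
non_whitespace = len("".join(text.split()))

print(from_counter, non_whitespace, len(text))
```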

To get the word count:

print(sum(c.values()))
