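I'm writing a program that fetches a txt file from the internet and reads it. It then displays a set of statistics about that file. All of this works fine until the very end. The last thing I want to do is display the 10 most common words used in the txt file. The code I have now just displays the most common word 10 times. Can someone look at this and tell me what the problem is? The only part you need to look at is the last part.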
import urllib
open = urllib.urlopen("http://www.textfiles.com/etext/FICTION/alice30.txt").read()
v = str(open) # this variable makes the file a string
strip = v.replace(" ", "") # this trims spaces
char = len(strip) # this variable counts the number of characters in the string
ch = v.splitlines() # this variable separates the lines
line = len(ch) # this counts the number of lines
print "Here's the number of lines in your file:", line
wordz = v.split()
print wordz
print "Here's the number of characters in your file:", char
spaces = v.count(' ')
words = ''.join(c if c.isalnum() else ' ' for c in v).split()
words = len(words)
print "Here's the number of words in your file:", words
topten = map(lambda x:filter(str.isalpha,x.lower()),v.split())
print "\n".join(sorted(words,key=words.count)[-10:][::-1])
Answer 0 (score: 2)
Use collections.Counter to count all the words; Counter.most_common(10) will return the ten most common words together with their counts:
wordz = v.split()
from collections import Counter
c = Counter(wordz)
print(c.most_common(10))
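To see why the line in the question repeats one word: sorted(words, key=words.count) keeps every occurrence of every word, so the tail of the sorted list tends to be many copies of the single most frequent word. Below is a minimal sketch on a made-up sample sentence. The sample text and the names `repeated` and `top_three` are illustrative and not from the question, and the code assumes Python 3.

```python
from collections import Counter

# Hypothetical sample text, standing in for the downloaded file.
sample = "the cat and the hat and the bat saw the cat"
words = sample.split()

# The approach from the question: sorting the full list by count keeps
# duplicates, so the end of the list is the same word over and over.
repeated = sorted(words, key=words.count)[-3:][::-1]
print(repeated)  # ['the', 'the', 'the']

# Counter collapses duplicates first, then ranks the distinct words.
top_three = [word for word, count in Counter(words).most_common(3)]
print(top_three)
```

Calling words.count once per word also rescans the whole list each time, which gets slow on a full book; Counter makes a single pass.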
Use with to open the file and get the count of every word in the txt file:
from collections import Counter
# open() reads local paths only, not URLs; this assumes the text was saved locally as alice30.txt
with open("alice30.txt") as f:
    c = Counter()
    for line in f:
        c.update(line.split())  # Counter.update adds to the existing counts
print(c.most_common(10))
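The built-in open() only accepts local paths, so counting straight from the URL needs a network call. One possible sketch for Python 3 uses urllib.request.urlopen, whose response object can be iterated line by line. The helper names `count_words` and `top_words_from_url` are hypothetical, and the latin-1 decoding is an assumption about the file's encoding.

```python
from collections import Counter
from urllib.request import urlopen

URL = "http://www.textfiles.com/etext/FICTION/alice30.txt"


def count_words(lines):
    """Count whitespace-separated words from an iterable of str or bytes lines."""
    counts = Counter()
    for line in lines:
        if isinstance(line, bytes):
            # Assumption: the file is plain ASCII/Latin-1 text.
            line = line.decode("latin-1")
        counts.update(line.split())
    return counts


def top_words_from_url(url, n=10):
    """Fetch the URL and return the n most common words with their counts."""
    with urlopen(url) as response:  # the response yields one bytes line at a time
        return count_words(response).most_common(n)


# Example call (needs network access):
# print(top_words_from_url(URL))
```

Keeping the counting in its own function means it can be tried on a plain list of lines without touching the network.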
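To get the total number of characters in the file (not counting whitespace), sum the length of each key multiplied by the number of times it appears: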
print(sum(len(k)*v for k,v in c.items()))
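As a sanity check on that formula, note that it only counts characters inside words, so all whitespace is left out. The question's v.replace(" ", "") removes spaces only, so its total still includes newline characters and can differ. A small sketch on a made-up sample string (Python 3 assumed; `sample` and the two result names are illustrative):

```python
from collections import Counter

# Hypothetical sample text with a newline in the middle.
sample = "one fish two fish\nred fish blue fish"
c = Counter(sample.split())

# Sum of (word length * occurrences) over the distinct words.
chars_from_counter = sum(len(word) * count for word, count in c.items())

# The same number computed directly: every character that is not whitespace.
chars_direct = len("".join(sample.split()))

print(chars_from_counter, chars_direct)
```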
To get the word count:
print(sum(c.values()))
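For illustration, the sum of the counts is the total number of word occurrences, while len(c) gives the number of distinct words. A short sketch on a made-up sample (Python 3 assumed; the variable names are illustrative):

```python
from collections import Counter

# Hypothetical sample text.
sample = "one fish two fish\nred fish blue fish"
c = Counter(sample.split())

total_words = sum(c.values())  # every occurrence of every word
distinct_words = len(c)        # each different word counted once

print(total_words, distinct_words)
```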