Hello, this is a follow-up to an earlier post.
Given the following list:
['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
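I am trying to count how many times each word appears and display the top 3.
I am only interested in words that start with a capital letter; words that do not should be ignored.
If a word appears several times, sometimes capitalized and sometimes not, only the capitalized occurrences should be counted.
This is my code so far: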
words = ""
for word in open('novel.txt', 'rU'):
    words += word
words = words.split(' ')
words= list(words)
words = ('\n'.join(words)).split('\n')
word_counter = {}
for word in words:
    if word in word_counter:
        word_counter[word] += 1
    else:
        word_counter[word] = 1
popular_words = sorted(word_counter, key = word_counter.get, reverse = True)
top_3 = popular_words[:3]
matches = []
for i in range(3):
    print word_counter[top_3[i]], top_3[i]
Answer 0: (score: 7)
#uncomment to produce the word file
##words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
##open('novel.txt','w').write('\n'.join(words))
import string
cap_words = [word.strip(string.punctuation) for word in open('novel.txt').read().split() if word.istitle()]
##print(cap_words) # debug
try:
    from collections import Counter # Python >= 2.7
    print('Counter')
    print(Counter(cap_words).most_common(3))
except ImportError:
    print('Normal dict')
    wordcount= dict()
    for word in cap_words:
        wordcount[word] = (wordcount[word] + 1
                           if word in wordcount
                           else 1)
    print(sorted(wordcount.items(), key = lambda x: x[1], reverse = True)[:3])
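I don't see why you would need 'rU' mode to deal with different kinds of line terminators. I would just use a plain open, as in the edited code above.
Edit: your words have punctuation attached, so clean that off with strip().
Answer 1: (score: 3)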
# lst is the list of words from the question
print "\n".join(sorted(["%d %s" % (lst.count(i), i)
                        for i in set(lst) if i.istitle()])[-3:])
2 And
5 Cats
6 Jellicle
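One caveat with the one-liner above: it sorts the formatted strings, so "10 Cats" would sort before "2 And" once a count reaches two digits. A minimal sketch that sorts (count, word) tuples instead; the helper name `top_title_words` and the sample list are made up for illustration and are not part of the original answer:

```python
def top_title_words(lst, n=3):
    """Return the n most frequent title-case words as (count, word) tuples."""
    # Tuples compare on the integer count first, so 10 ranks above 2.
    pairs = sorted((lst.count(w), w) for w in set(lst) if w.istitle())
    return pairs[-n:]

# Hypothetical sample that includes a two-digit count
sample = ['Cats'] * 10 + ['And'] * 2 + ['Moon', 'they', 'are']
for count, word in top_title_words(sample):
    print("%d %s" % (count, word))
```

Like the original, this calls `lst.count` once per distinct word, which is fine for a short list but slow for a whole novel.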
Answer 2: (score: 2)
Here are some further comments:
words = ""
for word in open('novel.txt', 'rU'):
    words += word
words = words.split(' ')
words= list(words)
words = ('\n'.join(words)).split('\n')
can be replaced with:
text = open('novel.txt', 'rU').read() # read everything
wordlist = text.split() # split on all whitespace
But you haven't yet applied the "must start with a capital letter" requirement. Time to add it:
capwordlist = (word for word in wordlist if word.istitle())
For a single plain word, istitle() is roughly equivalent to word[0].isupper() and word[1:].islower(). This means 'SO'.istitle() -> False. That may suit you, but perhaps you only want word[0].isupper().
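To make the difference concrete, here is a small sketch comparing the two tests on a few strings. The helper name `starts_upper` and the sample words are made up for illustration:

```python
def starts_upper(word):
    """True if the first character is an uppercase letter (safe for '')."""
    return word[:1].isupper()

samples = ['Jellicle', 'SO', 'McCavity', 'cats', '']
for w in samples:
    print("%r istitle=%s starts_upper=%s" % (w, w.istitle(), starts_upper(w)))
```

Words such as 'SO' and 'McCavity' pass the first-letter test but fail istitle(), so the choice decides whether acronyms and mixed-case names are counted.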
If you can't use collections.Counter (new in 2.7):
word_counter = {}
for word in capwordlist:
    if word in word_counter:
        word_counter[word] += 1
    else:
        word_counter[word] = 1
popular_words = sorted(word_counter, key = word_counter.get, reverse = True)
top_3 = popular_words[:3]
Otherwise this simply becomes:
from collections import Counter
word_counter = Counter(capwordlist)
top_3 = word_counter.most_common(3) # gives `word, count` pairs!
And:
for i in range(3):
    print word_counter[top_3[i]], top_3[i]
can be written like this (with top_3 being the list of words from the dictionary version):
for word in top_3:
    print word_counter[word], word
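Note that `most_common` returns `(word, count)` pairs, so with the `Counter` version the loop should unpack each pair rather than look the word up again. A minimal sketch, assuming Python 2.7 or later and a short made-up word list:

```python
from collections import Counter

# Hypothetical sample data, shortened from the list in the question
wordlist = ['Jellicle', 'Cats', 'are', 'Jellicle', 'Cats', 'and', 'Jellicle', 'Moon']
capwordlist = (word for word in wordlist if word.istitle())

word_counter = Counter(capwordlist)
top_3 = word_counter.most_common(3)   # list of (word, count) pairs

for word, count in top_3:
    print("%d %s" % (count, word))
```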
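Answer 3: (score: 2)
One thing I would avoid is reading all the words in before processing them. It works, but IMHO it's better not to if you don't need to, and you don't. Here is my solution (with elements heavily stolen from the earlier answers!), tested on 2.6.2: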
import sys
# a generator function which iterates over the words in a file
def words(f):
    for line in f:
        for word in line.split():
            yield word
# returns a generator expression filtering an iterator down to titlecase words
def titles(s):
    return (word for word in s if word.istitle())
# count the titlecase words in the file
count = {}
for word in titles(words(file(sys.argv[1]))):
    count[word] = count.get(word, 0) + 1
# build a list of tuples with the count for each word
countsAndWords = [(kv[1], kv[0]) for kv in count.iteritems()]
# put them in decreasing order
countsAndWords.sort()
countsAndWords.reverse()
# print the top three (n rather than count, so the dict above is not shadowed)
for n, word in countsAndWords[:3]:
    print word, n
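I did a kind of decorate-sort-undecorate on the counts, rather than sorting with a comparator that looks each word up in the count dictionary; it is less elegant, but I believe it is faster. That may be a sinful thing to do.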
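For comparison, the two sorting approaches mentioned above can be sketched side by side. This is only an illustration on a made-up dictionary; it makes no claim about which one is faster, which would need measuring on real data.

```python
# Hypothetical counts, as the counting loop might produce them
count = {'Jellicle': 6, 'Cats': 5, 'And': 2, 'Moon': 1}

# Decorate-sort-undecorate: build (count, word) tuples and sort those.
decorated = sorted(((n, w) for w, n in count.items()), reverse=True)
top_dsu = [(w, n) for n, w in decorated[:3]]

# Key function: sort the words, looking each count up in the dictionary.
top_key = [(w, count[w]) for w in sorted(count, key=count.get, reverse=True)[:3]]

print(top_dsu)
print(top_key)
```

The two can order ties differently: the tuples fall back to comparing the words themselves, while the key-based sort is stable and leaves words with equal counts in the order the dictionary yields them.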
Answer 4: (score: 1)
In general, word[0].isupper() tells you whether a word starts with a capital letter. Put that into a list comprehension (or your loop):
[x for x in my_list if x[0].isupper()]
(assuming there are no empty strings)
and you get all the words that start with a capital letter.
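If the list may contain empty strings (the sample list in the question ends with one), `x[0]` raises an IndexError. A small sketch of one way around that, using a slice instead of an index; `my_list` here is a made-up sample:

```python
my_list = ['Jellicle', 'cats', '', 'Moon', 'to', 'rise.']

# x[:1] is '' for an empty string, and ''.isupper() is False, so no IndexError.
cap_words = [x for x in my_list if x[:1].isupper()]
print(cap_words)
```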
Answer 5: (score: 0)
Since you aren't using Python 2.7 and don't have Counter:
from collections import defaultdict
counter = defaultdict(int)
words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
# word[:1] instead of word[0], so the empty string at the end of the list is skipped
for word in (word for word in words if word[:1].isupper()):
    counter[word]+=1
print counter
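The snippet above prints the whole counter. To get just the top three, one option is `heapq.nlargest` from the standard library. A sketch with a shortened, made-up word list:

```python
import heapq
from collections import defaultdict

# Hypothetical sample data, including an empty string
words = ['Jellicle', 'Cats', 'are', 'Jellicle', 'Cats', 'And', 'Jellicle', '']

counter = defaultdict(int)
for word in words:
    if word[:1].isupper():      # slice, so the empty string is skipped safely
        counter[word] += 1

# Pick the three (word, count) pairs with the highest counts.
top_3 = heapq.nlargest(3, counter.items(), key=lambda kv: kv[1])
print(top_3)
```

`sorted(counter.items(), key=..., reverse=True)[:3]` gives the same result; `nlargest` just avoids sorting every entry when only a few are needed.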
Answer 6: (score: 0)
You can use itertools:
import itertools
words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
capwords = (word for word in words if len(word) > 1 and word[0].isupper())
capwordssorted = sorted(capwords)
wordswithcounts = ((k,len(list(g))) for (k,g) in itertools.groupby(capwordssorted))
print sorted(wordswithcounts,key=lambda x:x[1],reverse=True)[:3]
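The `sorted` call before `groupby` matters: `itertools.groupby` only groups consecutive equal items, so unsorted input yields the same word in several groups. A short sketch on a made-up list:

```python
import itertools

# Hypothetical sample data
words = ['Cats', 'Jellicle', 'Cats', 'Jellicle', 'Jellicle']

# Without sorting, each run of equal neighbours becomes its own group.
unsorted_counts = [(k, len(list(g))) for k, g in itertools.groupby(words)]

# With sorting, equal words are adjacent and each word gets exactly one group.
sorted_counts = [(k, len(list(g))) for k, g in itertools.groupby(sorted(words))]

print(unsorted_counts)
print(sorted_counts)
```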