I need to display the 10 most frequently used words in a text file, from most to least frequent, along with how many times each one is used. I can't use the dictionary or Counter functions. So far I have this:
import urllib

cnt = 0
i = 0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i < len(uniques):
        i += 1
        if word in uniques:
            cnt += 1
print cnt
Now I think I should look up each word in the list 'uniques', see how many times it appears in the file, and record that in another list that counts the instances of each word. But this is where I'm stuck. I don't know how to proceed.
Any help would be appreciated. Thanks.
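For reference, a minimal sketch of the plan described above (a counts list kept parallel to uniques) might look like the following; reading a locally saved copy of alice30.txt and the sort at the end are assumptions, not part of the original question:

# A sketch of the plan above: a 'counts' list kept parallel to 'uniques'.
words = open('alice30.txt').read().lower().split()

uniques = []
counts = []  # counts[i] holds how many times uniques[i] has been seen
for word in words:
    if word in uniques:
        counts[uniques.index(word)] += 1
    else:
        uniques.append(word)
        counts.append(1)

# Pair counts with words, sort descending, and show the top 10.
for count, word in sorted(zip(counts, uniques), reverse=True)[:10]:
    print('%s %d' % (word, count))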
Answer 0 (score: 2)
You're on the right track. Note that this algorithm is quite slow, because for each unique word it iterates over all of the words. A faster approach that still avoids hashing would involve building a trie (a rough sketch of that idea follows the code below).
# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()

# Get the set of unique words.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0               # Initialize the count to zero.
    for word in words:      # Iterate over the words.
        if word == unique:  # Is this word equal to the current unique?
            count += 1      # If so, increment the count.
    counts.append((count, unique))

counts.sort()     # Sorting the list puts the lowest counts first.
counts.reverse()  # Reverse it, putting the highest counts first.

# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))
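For completeness, here is a rough sketch of the trie idea mentioned above; the node layout (a [children, count] list, with children stored as [char, child] pairs so no dict or hashing is needed) and the helper names are illustrative assumptions, not part of the original answer:

# A minimal dict-free trie counter (node layout is an illustrative assumption).
def new_node():
    return [[], 0]  # [list of [char, child] pairs, count of words ending here]

def trie_add(root, word):
    node = root
    for ch in word:
        for entry in node[0]:
            if entry[0] == ch:   # follow the existing child for this character
                node = entry[1]
                break
        else:                    # no child for this character yet, create one
            child = new_node()
            node[0].append([ch, child])
            node = child
    node[1] += 1                 # one more occurrence of this word

def trie_counts(node, prefix=''):
    results = []
    if node[1]:
        results.append((node[1], prefix))
    for ch, child in node[0]:
        results.extend(trie_counts(child, prefix + ch))
    return results

root = new_node()
for word in open('alice30.txt').read().lower().split():
    trie_add(root, word)
for count, word in sorted(trie_counts(root), reverse=True)[:10]:
    print('%s %d' % (word, count))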
Answer 1 (score: 1)
The problem above can be solved easily using Python's collections module. The solution is below.
from collections import Counter

data_set = "Welcome to the world of Geeks " \
           "This portal has been created to provide well written well " \
           "thought and well explained solutions for selected questions " \
           "If you like Geeks for Geeks and would like to contribute " \
           "here is your chance You can write article and mail your article " \
           "to contribute at geeksforgeeks org See your article appearing on " \
           "the Geeks for Geeks main page and help thousands of other Geeks."

# split() returns a list of all the words in the string.
split_it = data_set.split()

# Pass the split_it list to an instance of the Counter class.
counter = Counter(split_it)
# print(counter)

# most_common() produces the k most frequently encountered
# input values and their respective counts.
most_occur = counter.most_common(4)
print(most_occur)
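Adapted to the question's file, the same pattern might look roughly like this (downloading the file to a local alice30.txt first is an assumption, since the snippet above uses an inline string):

# Hypothetical adaptation to the question's text file (assumes a local copy).
from collections import Counter

words = open('alice30.txt').read().lower().split()
for word, count in Counter(words).most_common(10):
    print('%s %d' % (word, count))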
Answer 2 (score: 0)
Personally, I would implement collections.Counter myself. I assume you know how that object works, but in case you don't, here is a summary:
import collections

text = "some words that are mostly different but are not all different not at all"
words = text.split()
resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}
We could of course sort the frequencies using the key keyword argument of sorted and then take the first 10 items of that list. However, since you haven't implemented Counter, that doesn't help you much. I'll leave that part as an exercise for you, and instead show how you might implement Counter as a function rather than an object.

A counter really isn't difficult: iterate over every element of the iterable; if the element is not yet in d, add it to d with a value of 1; if it is already in d, increment that value.

def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d

It can be expressed even more compactly as:

def counter(iterable):
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d

Note that for your use case you will probably want to strip out punctuation and lowercase everything (so that someword is counted as the same word as Someword rather than as two separate words). I'll leave that to you as well, but I will point out that str.strip takes an argument specifying which characters to strip, and string.punctuation contains all the punctuation you are likely to need.
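A rough sketch of the part left as an exercise, assuming the counter() function defined above and a locally saved alice30.txt, might look like this:

# Sketch of the exercise: sort the hand-rolled counter's result by frequency.
from string import punctuation

words = [w.strip(punctuation) for w in open('alice30.txt').read().lower().split()]
d = counter(words)  # the counter() function defined above
for word in sorted(d, key=lambda w: d[w], reverse=True)[:10]:
    print('%s %d' % (word, d[word]))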
Answer 3 (score: 0)
from string import punctuation  # you will need it to strip the punctuation
import urllib

txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")

counter = {}
for line in txtFile:
    words = line.split()
    for word in words:
        k = word.strip(punctuation).lower()  # so "the"/"The" or "you"/"You" are counted only once
        # you still have words like I've, you're, Alice's
        # you could change 're to are, 've to have, etc...
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k, ]
        # now the tally
        for k in ks:
            counter[k] = counter.get(k, 0) + 1

# and sorting the counter by the value which holds the tally
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print word, "\t", counter[word]
Answer 4 (score: 0)
import urllib
import operator

txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile)  # this with .readlines() replaces new lines with spaces
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace())  # removes everything that's not alphanumeric or a space

word_counter = {}
for word in txtFile.split(" "):  # split on every space
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter:  # if 'word' is not in word_counter, add it and set its value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1  # if 'word' is already in word_counter, increment it by 1

for i, word in enumerate(sorted(word_counter, key=word_counter.get, reverse=True)[:10]):
    # sorts the dict by its values, from top to bottom, and takes the 10 top items
    print "%s: %s - %s" % (i + 1, word, word_counter[word])
Output:
1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338
This approach makes sure that only alphanumeric characters and spaces end up in the counter. That detail doesn't matter much here.
Answer 5 (score: 0)
You can also do this with a pandas DataFrame and get the result as a table, ordered by "word - its frequency".
import pandas as pn  # the original snippet assumes pandas imported under the alias 'pn'

def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    words_df_unique = pn.DataFrame(pn.unique(words_list))
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in pn.Series.tolist(words_df_unique.unique):
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i += 1
    res = words_df_unique.sort_values('count', ascending=False)
    return res
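A hypothetical usage example, assuming the file has already been read into a list of words as in the other answers:

# Hypothetical usage of count_words(); assumes a local copy of alice30.txt.
words_list = open('alice30.txt').read().lower().split()
print(count_words(words_list).head(10))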