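The 10 most common words in a text, in Python

Time: 2014-12-06 01:30:40

Tags: python

I need to display the 10 most frequent words in a text file, from most to least common, along with the number of times each one is used. I can't use a dictionary or the Counter function. So far I have this: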

import urllib
cnt = 0
i=0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i<len(uniques):
        i+=1
        if word in uniques:
             cnt += 1
print cnt

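Now I think I should look up each word in the array 'uniques', see how many times it is repeated in the file, and then add that to another array that holds the count for each word. But this is where I'm stuck. I don't know how to proceed.

Any help would be appreciated. Thanks.

6 answers:

Answer 0 (score: 2)

You're on the right track. Note that this algorithm is quite slow, because for every unique word it iterates over all the words. A faster approach that still avoids hashing would involve building a trie.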

# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()

# Get the set of unique words.
uniques = []
for word in words:
  if word not in uniques:
    uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
  count = 0              # Initialize the count to zero.
  for word in words:     # Iterate over the words.
    if word == unique:   # Is this word equal to the current unique?
      count += 1         # If so, increment the count
  counts.append((count, unique))

counts.sort()            # Sorting the list puts the lowest counts first.
counts.reverse()         # Reverse it, putting the highest counts first.
# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
  count, word = counts[i]
  print('%s %d' % (word, count))
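
The trie mentioned at the top of this answer could be sketched roughly as follows. This is only an illustration, not a tuned implementation: the names make_node, insert, collect and top_words are made up for this example, and each node is a plain list so that no dictionary is involved. Children are scanned linearly, so the cost per word depends on the word's length and on how many children each node has, rather than on the total number of words.

```python
def make_node():
    # A node is [count, children]; children is a list of (char, node) pairs.
    return [0, []]

def insert(root, word):
    # Walk down the trie one character at a time, creating nodes as needed.
    node = root
    for ch in word:
        for child_ch, child in node[1]:
            if child_ch == ch:
                node = child
                break
        else:
            # No child for this character yet: create one.
            child = make_node()
            node[1].append((ch, child))
            node = child
    node[0] += 1  # One more occurrence of the word ending at this node.

def collect(node, prefix, out):
    # Gather (count, word) tuples for every word stored in the trie.
    if node[0] > 0:
        out.append((node[0], prefix))
    for ch, child in node[1]:
        collect(child, prefix + ch, out)

def top_words(words, n=10):
    root = make_node()
    for word in words:
        insert(root, word)
    out = []
    collect(root, "", out)
    # Highest count first; ties are broken alphabetically.
    out.sort(key=lambda pair: (-pair[0], pair[1]))
    return out[:n]

for count, word in top_words("a b a c b a".split(), 2):
    print('%s %d' % (word, count))
```

With the words list from the code above, top_words(words) would give the same kind of (count, word) tuples as the counts list.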

Answer 1 (score: 1)

The problem above can be solved easily with Python's collections module. Here is the solution.

from collections import Counter

data_set = "Welcome to the world of Geeks " \
"This portal has been created to provide well written well " \
"thought and well explained solutions for selected questions " \
"If you like Geeks for Geeks and would like to contribute " \
"here is your chance You can write article and mail your article " \
" to contribute at geeksforgeeks org See your article appearing on " \
"the Geeks for Geeks main page and help thousands of other Geeks. "

# split() returns list of all the words in the string
split_it = data_set.split()

# Pass the split_it list to an instance of the Counter class.
# A separate name keeps the Counter class itself from being shadowed.
word_counts = Counter(split_it)
#print(word_counts)

# most_common(k) returns the k most frequently encountered
# input values and their respective counts.
most_occur = word_counts.most_common(4)
print(most_occur)

Answer 2 (score: 0)

Personally, I would write my own implementation of collections.Counter. I assume you know how that object works, but in case you don't, I'll summarize:

import collections

text = "some words that are mostly different but are not all different not at all"

words = text.split()

resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}

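We can certainly sort that by frequency using the key keyword argument of sorted, and then return the first 10 items of that list. However, that doesn't help you much, because you don't have Counter implemented. I'll leave that part as an exercise for you, and show you how you might implement Counter as a function rather than an object.

def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d

It's actually not difficult. Go through each element of the iterable. If that element is not in d, add it to d with a value of 1. If it is in d, increment that value. It can be expressed more compactly:

def counter(iterable):
    d = {}
    for element in iterable:
        # dict.get returns 0 for elements that have not been seen yet.
        d[element] = d.get(element, 0) + 1
    return d

Note that in your use case, you probably want to strip out the punctuation and possibly casefold everything (so that someword counts the same as Someword rather than as two separate words). I'll leave that to you as well, but I will point out that str.strip takes an argument saying what to strip, and string.punctuation contains all the punctuation you're likely to need.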
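
The parts left as an exercise in this answer (normalizing each word, then sorting by frequency and keeping the top entries) could look something like the sketch below. The helper names normalize and most_common and the sample sentence are made up for illustration, and str.lower is used here as a simple stand-in for case folding.

```python
import string

def counter(iterable):
    # Same idea as the function in the answer: tally each element in a dict.
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d

def normalize(word):
    # Strip leading/trailing punctuation and ignore case.
    return word.strip(string.punctuation).lower()

def most_common(text, n=10):
    words = [normalize(w) for w in text.split()]
    words = [w for w in words if w]  # Drop tokens that were pure punctuation.
    counts = counter(words)
    # Sort by descending count; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

sample = "Some words, some WORDS; other words."
print(most_common(sample, 2))
```

Note that str.strip only removes punctuation at the ends of a token, so an apostrophe inside a word such as "Alice's" is kept.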

Answer 3 (score: 0)

from string import punctuation #you will need it to strip the punctuation

import urllib
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")

counter = {}

for line in txtFile:
    words = line.split()
    for word in words:
        k = word.strip(punctuation).lower() #the The or you You counted only once
        # you still have words like I've, you're, Alice's
        # you could change re to are, ve to have, etc...
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k,]
        #now the tally
        for k in ks:
            if k:  # skip empty strings left by tokens that were pure punctuation
                counter[k] = counter.get(k, 0) + 1
#and sorting the counter by the value which holds the tally
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print word, "\t", counter[word]

Answer 4 (score: 0)

import urllib
import operator
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile) # joins the lines from .readlines() with spaces (each line keeps its own line ending)
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace()) # removes everything that's not alphanumeric or spaces.

word_counter = {}
for word in txtFile.split(" "): # split in every space.
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter: # if 'word' not in word_counter, add it, and set value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1 # if 'word' already in word_counter, increment it by 1

for i,word in enumerate(sorted(word_counter,key=word_counter.get,reverse=True)[:10]):
    # sorts the dict keys by their counts, from top to bottom, and takes the top 10 items
    print "%s: %s - %s"%(i+1,word,word_counter[word])

Output:

1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338

This approach ensures that only alphanumeric characters and whitespace end up in the counter, not that it matters much.
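
As a small illustration of what that character filter does, here is a sketch run on a made-up sample string (the function name keep_alnum_and_space and the sample are assumptions for this example, not part of the answer). It also shows that split() with no argument and split(" ") tokenize the filtered text differently, which can affect the counts for words at line ends.

```python
def keep_alnum_and_space(text):
    # Drop every character that is neither alphanumeric nor whitespace.
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

sample = "Alice's cat -- 'Dinah' -- purred.\r\nAlice smiled."
cleaned = keep_alnum_and_space(sample)

# split() with no argument splits on any run of whitespace,
# including "\r\n", and never yields empty strings.
tokens_any_whitespace = cleaned.split()

# split(" ") splits on single spaces only, so empty strings and
# embedded line endings can remain in the tokens.
tokens_single_space = cleaned.split(" ")

print(tokens_any_whitespace)
print(tokens_single_space)
```

Apostrophes are removed by the filter, so "Alice's" is counted as "Alices" under this approach.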

Answer 5 (score: 0)

You can also do this with a pandas DataFrame, which gives you the result as a table of each word and its frequency, sorted by frequency.

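import pandas as pn  # the snippet refers to pandas under the alias pn

def count_words(words_list):
    # One row per word occurrence.
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    # One row per distinct word, with a count column initialized to zero.
    words_df_unique = pn.DataFrame(pn.unique(words_df["word"]))
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    for i, word in enumerate(words_df_unique["unique"].tolist()):
        # Count the rows of words_df that match the current unique word.
        words_df_unique.iloc[i, 1] = len(words_df[words_df["word"] == word])
    res = words_df_unique.sort_values('count', ascending=False)
    return res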