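Analyzing word frequency

Date: 2018-08-26 10:36:16

Tags: python python-3.x

I need to implement a feature that analyzes how often each word occurs. I tried the code below, but the output is scattered and repetitive. Is there a way to group the data together so that each word's count is shown once rather than many times?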

file = "PartA"
f = open(file, 'r')
wordstring = f.read()

wordlist = wordstring.split()

wordfreq = []

for w in wordlist:
    wordfreq.append(wordlist.count(w))

print("String\n" + wordstring +"\n")
print("list\n"+ str(wordlist) + "\n")
print("Frequencies\n" + str(wordfreq) + "\n")

My output:

String
hey there
This is Joey
how is it going
it it it it it it it it it it it it
is is is is is is


list
['hey', 'there', 'This', 'is', 'Joey', 'how', 'is', 'it', 'going', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it',
 'it', 'it', 'it', 'is', 'is', 'is', 'is', 'is', 'is']

Frequencies
[1, 1, 1, 8, 1, 1, 8, 13, 1, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 8, 8, 8, 8, 8, 8]

3 answers:

Answer 0 (score: 1)

Counter is what you are looking for.

from collections import Counter

s = "hey there This is Joey how is it going it it it it it it it it it it it it is is is is is is"
counter = Counter(s.split())
print(counter)

Output:

Counter({'it': 13, 'is': 8, 'hey': 1, 'there': 1, 'This': 1, 'Joey': 1, 'how': 1, 'going': 1})

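Also note that you call the count method for every element. This gives O(n^2) complexity, where n is the length of the list, because the same word is counted again each time it appears. If you call count() only on the distinct words, you can do it in O(n * k), where k is the number of distinct elements (still O(n^2) in the worst case, when every word is different).

Simply by using a dictionary, you can solve your problem in linear time.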
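As a rough sketch of how the grouped output asked for in the question could be printed, the snippet below uses Counter.most_common() on the sample text from the question. The variable names are only illustrative.

```python
from collections import Counter

text = ("hey there This is Joey how is it going "
        "it it it it it it it it it it it it is is is is is is")

# Counter tallies the words in a single pass over the list
counts = Counter(text.split())

# most_common() returns (word, count) pairs, highest count first,
# so every word is printed exactly once
for word, count in counts.most_common():
    print(word, count)
```

Passing a number, for example counts.most_common(2), limits the result to that many of the most frequent words.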

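Answer 1 (score: 0)

I agree that the best approach is to use the implementation in collections. If you need to implement it yourself (homework, perhaps?), you can use a dictionary: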

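# `file` is the file name defined in the question (file = "PartA")
word_freq = {}
with open(file, 'r') as f:
    word_list = f.read().split()
    for word in word_list:
        # start each new word at 0, then add one for this occurrence
        word_freq.setdefault(word, 0)
        word_freq[word] += 1

print(word_freq)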

If you want to put it in a function (as suggested in the comments), you can do the following:

def word_count(filename, n_words=-1):
    '''Return a list of (word, count) tuples for the n most frequent words in the given file.'''
    word_freq = {}
    with open(filename, 'r') as f:
        word_list = f.read().split()
        for word in word_list:
            word_freq.setdefault(word, 0)
            word_freq[word] += 1
    # a negative n_words means "return every word"
    if n_words < 0:
        n_words = len(word_freq)
    # generate pairs of words and frequencies, most frequent first
    pairs = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)
    # return the first n of them as a list of tuples
    return pairs[:n_words]

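I hope this is what you wanted, but please look at the basics on dictionary sorting and read up on the topics related to your question, so that you can state the specific problem more precisely in your original question.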
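For reference, a minimal sketch of sorting a dictionary by its values is shown below. The dictionary contents are made up for illustration, and breaking ties alphabetically is just one possible choice.

```python
word_freq = {'hey': 1, 'is': 8, 'there': 1, 'it': 13}

# sort (word, count) pairs by count, highest first;
# words with equal counts are ordered alphabetically
by_count = sorted(word_freq.items(), key=lambda item: (-item[1], item[0]))
print(by_count)
```

sorted() returns a new list of tuples and leaves the dictionary itself unchanged.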

Answer 2 (score: 0)

Try the code below, so that you don't count the same word more than once:

# wordlist is the list of words built in the question
wordSet = set(wordlist)
wordfreq = []
for word in wordSet:
    # count each distinct word once
    wordfreq.append(wordlist.count(word))
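Note that the loop above keeps only the counts, and a set has no guaranteed order, so it is hard to tell which count belongs to which word. One possible variation, sketched here with a shortened sample list and a hypothetical name freq_by_word, stores each word together with its count:

```python
wordlist = "hey there This is Joey how is it going it it is".split()

# map each distinct word to the number of times it occurs;
# count() still scans the whole list once per distinct word
freq_by_word = {word: wordlist.count(word) for word in set(wordlist)}
print(freq_by_word)
```

This is still the O(n * k) approach, so for large inputs a dictionary or Counter built in a single pass is likely the better choice.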