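How to count the frequency of single words and double words (word pairs) in input text in Python?

Time: 2012-10-10 16:36:12

Tags: python regex

Hello, I want to count the single words and the double words (pairs of consecutive words) in an input text in Python. Example: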

"what is your name ? what you want from me ?
 You know best way to earn money is Hardwork 
 what is your aim ?"

Output:

single W.C. :
what   3
is     3
your   2
you    2

and so on...

Double W.C. :
what is 2
is your 2
your name 1
what you 1

and so on... Please post a way to do this. I used the following code for the single word count:

ws = {}

for line in text:
    # split each line into words; iterating over the line itself yields characters
    for wrd in line.split():
        if wrd not in ws:
            ws[wrd] = 1
        else:
            ws[wrd] += 1

4 answers:

Answer 0 (score: 3)

from collections import Counter

s = "..."

words = s.split()
pairs = zip(words, words[1:])

single_words, double_words = Counter(words), Counter(pairs)

To output the results:

print("single W.C.")
for word, count in sorted(single_words.items(), key=lambda x: -x[1]):
    print(word, count)

print("double W.C.")
for pair, count in sorted(double_words.items(), key=lambda x: -x[1]):
    print(pair, count)
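
To see how this behaves on the sample text from the question, here is a minimal sketch. It assumes that case should be ignored and that tokens without letters (the "?") should be dropped; neither choice is stated in the answer, they are only made here so that the counts line up with the expected output in the question.

```python
from collections import Counter

text = """what is your name ? what you want from me ?
 You know best way to earn money is Hardwork
 what is your aim ?"""

# Assumption: lower-case everything and keep purely alphabetic tokens only,
# which drops the standalone "?" tokens.
words = [w.lower() for w in text.split() if w.isalpha()]

# Pair each word with its successor to get the double words.
pairs = zip(words, words[1:])

single_words = Counter(words)
double_words = Counter(pairs)

print("single W.C.")
for word, count in single_words.most_common(4):
    print(word, count)

print("double W.C.")
for (first, second), count in double_words.most_common(4):
    print(first, second, count)
```

Note that once the "?" tokens are dropped, pairs can span sentence boundaries (for example "name what"). If that is not wanted, the pairs would have to be built one sentence at a time.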

Answer 1 (score: 2)

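import nltk
from nltk import bigrams

# text is the input string from the question;
# word_tokenize needs NLTK's tokenizer data to be downloaded first
tokens = nltk.word_tokenize(text)
tokens = [token.lower() for token in tokens if len(token) > 1]
# bigrams() returns a generator in current NLTK versions, so build a list before counting
bi_tokens = list(bigrams(tokens))

print([(item, tokens.count(item)) for item in sorted(set(tokens))])
print([(item, bi_tokens.count(item)) for item in sorted(set(bi_tokens))])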
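
If NLTK is not available, roughly the same result can be sketched with the standard library. The `tokenize` helper below is hypothetical (it is not part of NLTK) and only approximates `word_tokenize` with a simple regular expression; it keeps the answer's rule of discarding tokens of length 1. Counting with `Counter` also avoids calling `list.count` once per distinct item.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical helper: lower-case the text and keep alphabetic
    tokens that are longer than one character."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) > 1]


text = "what is your name ? what you want from me ?"
tokens = tokenize(text)
bi_tokens = list(zip(tokens, tokens[1:]))

# One pass over each list instead of one list.count() call per item.
token_counts = Counter(tokens)
bigram_counts = Counter(bi_tokens)

print(sorted(token_counts.items()))
print(sorted(bigram_counts.items()))
```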

Answer 2 (score: 0)

This works. It uses defaultdict. Python 2.6

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> string = ("what is your name ? what you want from me ?\n"
...     " You know best way to earn money is Hardwork\n what is your aim ?")
>>> l = string.split()
>>> for i in l:
    d[i]+=1

>>> d
defaultdict(<type 'int'>, {'me': 1, 'aim': 1, 'what': 3, 'from': 1, 'name': 1, 
    'You': 1, 'money': 1, 'is': 3, 'earn': 1, 'best': 1, 'Hardwork': 1, 'to': 1, 
    'way': 1, 'know': 1, 'want': 1, 'you': 1, 'your': 2, '?': 3})
>>> d2 = defaultdict(int)
>>> for i in zip(l[:-1], l[1:]):
    d2[i]+=1

>>> d2
defaultdict(<type 'int'>, {('You', 'know'): 1, ('earn', 'money'): 1, 
    ('is', 'Hardwork'): 1, ('you', 'want'): 1, ('know', 'best'): 1, 
    ('what', 'is'): 2, ('your', 'name'): 1, ('from', 'me'): 1, 
    ('name', '?'): 1, ('?', 'You'): 1, ('?', 'what'): 1, ('to', 'earn'): 1, 
    ('aim', '?'): 1, ('way', 'to'): 1, ('Hardwork', 'what'): 1, 
    ('money', 'is'): 1, ('me', '?'): 1, ('what', 'you'): 1, ('best', 'way'): 1,
    ('want', 'from'): 1, ('is', 'your'): 2, ('your', 'aim'): 1})
>>> 
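
The `zip(l[:-1], l[1:])` trick extends to runs of any length. The following is only a sketch: `ngram_counts` is a name made up for this example, and it uses `Counter` in place of `defaultdict(int)`.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count every run of n consecutive words (hypothetical helper)."""
    # zip stops at the shortest slice, so incomplete runs at the end are dropped.
    return Counter(zip(*(words[i:] for i in range(n))))


words = "what is your name ? what is your aim ?".split()

print(ngram_counts(words, 2).most_common(3))
print(ngram_counts(words, 3).most_common(3))
```

With `n=2` this gives the same pairs as the `d2` dictionary above; with `n=1` the keys are 1-tuples such as `('what',)` rather than plain strings.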

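Answer 3 (score: 0)

I realize this question is a few years old. Today I wrote a small program to count the individual words in a Word document (docx). I use docx2txt to get the text from the Word document, and my first regular expression to remove every character other than letters, digits, or whitespace and to switch all characters to upper case. I am posting this because this question had not been answered.

Here is my little test routine in case it might help anyone.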

import re
from collections import Counter

import docx2txt

mydoc = 'I:/flashdrive/pmw/pmw_py.docx'

words_all = {}

my_text = docx2txt.process(mydoc)
print(my_text)

my_text_org = my_text

# replace every run of non-alphanumeric characters with one space, in upper case
my_text = re.sub(r'[\W_]+', ' ', my_text.upper(), flags=re.UNICODE)
print(my_text)

words = my_text.split()

words_org = words  # just in case I may need the original version later

# added this code for the double words (it has to run after words is built)
pairs = zip(words, words[1:])
pair_list = Counter(pairs)

print('before pair listing')

for pair, count in sorted(pair_list.items(), key=lambda x: -x[1]):
    # print(''.join('{} {}'.format(*pair)), count)  # worked
    # print(' '.join(pair), '', count)  # worked
    my_pair = "{} {}".format(pair[0], pair[1])
    print(my_pair, ": ", count)
# end of added code

# count each distinct word once
for i in words:
    if i not in words_all:
        words_all[i] = words.count(i)

for k, v in sorted(words_all.items()):
    print(k, v)

print("Number of items in word list: {}".format(len(words_all)))
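
The cleaning step can be tried without a .docx file. This is a small sketch on a made-up sample string; the `clean` function name and the sample text are assumptions for illustration, and `Counter` stands in for the `words.count` loop so the words are counted in a single pass.

```python
import re
from collections import Counter


def clean(text):
    """Upper-case the text and replace every run of characters that are
    not letters or digits with a single space (hypothetical helper)."""
    return re.sub(r'[\W_]+', ' ', text.upper()).strip()


sample = "Hello, world! hello_again... (WORLD)"  # made-up sample text
cleaned = clean(sample)
print(cleaned)

word_counts = Counter(cleaned.split())
for word, count in sorted(word_counts.items()):
    print(word, count)
```

The underscore is listed explicitly in the pattern because `\W` treats it as a word character and would otherwise leave it in place.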