Hello, I want to count the single words and the two-word pairs in an input text in Python. Example:
"what is your name ? what you want from me ?
You know best way to earn money is Hardwork
what is your aim ?"
Output:
single W.C. :
what 3
is 3
your 2
you 2
and so on...
Double W.C. :
what is 2
is your 2
your name 1
what you 1
and so on... Could someone please post a way to do this? I use the following code for the single-word count:
ws = {}
for line in text:
    for wrd in line.split():  # split each line into words, not characters
        if wrd not in ws:
            ws[wrd] = 1
        else:
            ws[wrd] += 1
Answer 0 (score: 3)
from collections import Counter
s = "..."
words = s.split()
pairs = zip(words, words[1:])
single_words, double_words = Counter(words), Counter(pairs)
Output:
print("single W.C.")
# sort by count, highest first
for word, count in sorted(single_words.items(), key=lambda x: -x[1]):
    print(word, count)
print("double W.C.")
for pair, count in sorted(double_words.items(), key=lambda x: -x[1]):
    print(pair, count)
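For reference, here is a self-contained sketch of the same approach run on the sample text from the question. It assumes Python 3 and uses `Counter.most_common()` in place of the explicit `sorted(...)` call; splitting on whitespace means "?" is counted as a word and "You" and "you" are counted separately.

```python
from collections import Counter

# Sample text from the question. Whitespace splitting keeps "?" as a token
# and treats "You" and "you" as different words.
s = ("what is your name ? what you want from me ?\n"
     "You know best way to earn money is Hardwork\n"
     "what is your aim ?")

words = s.split()
single_words = Counter(words)
# Pair each word with its successor to get the two-word sequences
double_words = Counter(zip(words, words[1:]))

print("single W.C.")
for word, count in single_words.most_common():
    print(word, count)

print("double W.C.")
for (first, second), count in double_words.most_common():
    print(first, second, count)
```

If the counts should be case-insensitive, lower-casing the text before splitting (`s.lower().split()`) is one way to get there.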
Answer 1 (score: 2)
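import nltk
from nltk import bigrams
from nltk import trigrams

# `text` is the input string; word_tokenize needs NLTK's tokenizer data to be downloaded
tokens = nltk.word_tokenize(text)
# lower-case the tokens and drop single-character ones such as "?"
tokens = [token.lower() for token in tokens if len(token) > 1]
# materialize the bigrams in a list so they can be counted and iterated more than once
bi_tokens = list(bigrams(tokens))
print([(item, tokens.count(item)) for item in sorted(set(tokens))])
print([(item, bi_tokens.count(item)) for item in sorted(set(bi_tokens))])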
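If installing NLTK is not an option, the same kind of count can be sketched with the standard library alone. The helper name `ngram_counts` below is made up for illustration, and the tokenization is a plain whitespace split rather than `nltk.word_tokenize`, so the results may differ from NLTK's on text with attached punctuation.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # zip stops at the shortest slice, so only complete n-grams are produced
    return Counter(zip(*(tokens[i:] for i in range(n))))


text = ("what is your name ? what you want from me ?\n"
        "You know best way to earn money is Hardwork\n"
        "what is your aim ?")

# Same normalization as the answer above: lower-case, drop one-character tokens
tokens = [w.lower() for w in text.split() if len(w) > 1]

singles = ngram_counts(tokens, 1)
doubles = ngram_counts(tokens, 2)
triples = ngram_counts(tokens, 3)

print(singles[("you",)])                 # keys are tuples, even for n = 1
print(doubles[("what", "is")])
print(triples[("what", "is", "your")])
```

Note that dropping the "?" tokens before pairing means words on either side of a removed token become neighbours, for example ("name", "what").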
Answer 2 (score: 0)
>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> string = ("what is your name ? what you want from me ?\n"
...           "You know best way to earn money is Hardwork\n what is your aim ?")
>>> l = string.split()
>>> for i in l:
...     d[i] += 1
...
>>> d
defaultdict(<type 'int'>, {'me': 1, 'aim': 1, 'what': 3, 'from': 1, 'name': 1,
'You': 1, 'money': 1, 'is': 3, 'earn': 1, 'best': 1, 'Hardwork': 1, 'to': 1,
'way': 1, 'know': 1, 'want': 1, 'you': 1, 'your': 2, '?': 3})
>>> d2 = defaultdict(int)
>>> for i in zip(l[:-1], l[1:]):
...     d2[i] += 1
...
>>> d2
defaultdict(<type 'int'>, {('You', 'know'): 1, ('earn', 'money'): 1,
('is', 'Hardwork'): 1, ('you', 'want'): 1, ('know', 'best'): 1,
('what', 'is'): 2, ('your', 'name'): 1, ('from', 'me'): 1,
('name', '?'): 1, ('?', 'You'): 1, ('?', 'what'): 1, ('to', 'earn'): 1,
('aim', '?'): 1, ('way', 'to'): 1, ('Hardwork', 'what'): 1,
('money', 'is'): 1, ('me', '?'): 1, ('what', 'you'): 1, ('best', 'way'): 1,
('want', 'from'): 1, ('is', 'your'): 2, ('your', 'aim'): 1})
>>>
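One thing to notice in the output above is that pairs such as ('Hardwork', 'what') and ('?', 'You') span two different lines of the input. If that is not wanted, a possible variation (a sketch, not part of the original answer) is to build the pairs line by line:

```python
from collections import defaultdict

string = ("what is your name ? what you want from me ?\n"
          "You know best way to earn money is Hardwork\n what is your aim ?")

d2 = defaultdict(int)
for line in string.splitlines():
    words = line.split()
    # Pair each word only with the next word on the same line
    for pair in zip(words[:-1], words[1:]):
        d2[pair] += 1

print(d2[("what", "is")])
print(("Hardwork", "what") in d2)
```

Membership tests with `in` do not create new keys in a `defaultdict`, so checking for a missing pair this way leaves the counts untouched.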
Answer 3 (score: 0)
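I realize this question is a few years old. Today I wrote a small program to count the individual words in a Word document (docx). I use docx2txt to get the text out of the Word document, and my first regular expression to replace every character that is not a letter, a digit, or whitespace; the text is also converted to upper case. I am posting this because the question had not been answered.
Here is my little test routine, in case it helps anyone.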
import re
from collections import Counter

import docx2txt

mydoc = 'I:/flashdrive/pmw/pmw_py.docx'
words_all = {}
#####
my_text = docx2txt.process(mydoc)
print(my_text)
my_text_org = my_text

# replace every run of non-alphanumeric characters with one space, in upper case
my_text = re.sub(r'[\W_]+', ' ', my_text.upper(), flags=re.UNICODE)
print(my_text)
words = my_text.split()
words_org = words  # just in case I may need the original version later

# added this code for the double words (it has to run after `words` is defined)
pairs = zip(words, words[1:])
pair_list = Counter(pairs)
print('before pair listing')
for pair, count in sorted(pair_list.items(), key=lambda x: -x[1]):
    #print (''.join('{} {}'.format(*pair)), count) #worked
    #print(' '.join(pair), '', count) #worked
    new_pair = "{} {}"
    my_pair = new_pair.format(pair[0], pair[1])
    print(my_pair, ": ", count)
# end of added code

# single-word counts
for i in words:
    if i not in words_all:
        words_all[i] = words.count(i)
for k, v in sorted(words_all.items()):
    print(k, v)
print("Number of items in word list: {}".format(len(words_all)))
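To try the cleaning step without a .docx file, here is a small sketch that applies the same regular expression to an in-memory string. The sample sentence is made up for illustration, and `Counter` stands in for the `words.count(i)` loop, which rescans the whole list for every distinct word.

```python
import re
from collections import Counter

# Made-up sample text standing in for the output of docx2txt.process()
sample = "What is your name? What_you want, from me?"

# Upper-case first, then replace each run of non-word characters or
# underscores with a single space
cleaned = re.sub(r'[\W_]+', ' ', sample.upper())
words = cleaned.split()

word_counts = Counter(words)
pair_counts = Counter(zip(words, words[1:]))

for word, count in sorted(word_counts.items()):
    print(word, count)
for (first, second), count in pair_counts.most_common():
    print("{} {}: {}".format(first, second, count))
```

Because the punctuation is removed before the split, pairs here can join words that were separated by a "?" or a comma in the original text.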