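I'd like to know how to use Counter() to count the unigram, bigram, cooc and wordcount from the list training_data.
I'm new to Python, so please bear with me. Thanks!
You need to implement two parts of the HMM postagger.
Viterbi decoding
Here is the code: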
from collections import Counter
from math import log


class HMM(object):
    def __init__(self, epsilon=1e-5, training_data=None):
        self.epsilon = epsilon
        if training_data is not None:
            self.fit(training_data)

    def fit(self, training_data):
        '''
        Counting the number of unigram, bigram, cooc and wordcount from the training
        data.

        Parameters
        ----------
        training_data: list
            A list of training data, each element is a tuple with words and postags.
        '''
        self.unigram = Counter()    # The count of postag unigram, e.g. unigram['NN']=5
        self.bigram = Counter()     # The count of postag bigram, e.g. bigram[('PRP', 'VV')]=1
        self.cooc = Counter()       # The count of word, postag, e.g. cooc[('I', 'PRP')]=1
        self.wordcount = Counter()  # The count of word, e.g. wordcount['I']=1

        print('building HMM model ...')
        for words, tags in training_data:
            # Your code here! You need to implement the ngram counting part. Please count
            # - unigram
            # - bigram
            # - cooc
            # - wordcount

        print('HMM model is built.')
        self.postags = [k for k in self.unigram]
Here is the training_dataset, with the expected results below:
# The tiny example.
training_dataset = [(['dog', 'chase', 'cat'], ['NN', 'VV', 'NN']),
                    (['I', 'chase', 'dog'], ['PRP', 'VV', 'NN']),
                    (['cat', 'chase', 'mouse'], ['NN', 'VV', 'NN'])
                    ]
hmm = HMM(training_data=training_dataset)
# Test whether the parameters are correctly estimated.
assert hmm.unigram['NN'] == 5
assert hmm.bigram['VV', 'NN'] == 3
assert hmm.bigram['NN', 'VV'] == 2
assert hmm.cooc['dog', 'NN'] == 2
Answer 0 (score: 0)
Using Counter() with lists is straightforward. Counter.update() is exactly what you need.
from nltk.util import bigrams
...
for words, tags in training_data:
    self.unigram.update(tags)
    self.bigram.update(bigrams(tags))
    self.cooc.update(zip(words, tags))
    self.wordcount.update(words)
...
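If you would rather not depend on NLTK, the same counting can be sketched with the standard library alone. This is a minimal sketch, assuming bigrams are simply adjacent tag pairs within one sentence (so `zip(tags, tags[1:])` stands in for `nltk.util.bigrams`) and that no sentence-boundary markers are counted. The helper name `count_ngrams` is hypothetical and is not part of the HMM class in the question.

```python
from collections import Counter


def count_ngrams(training_data):
    """Hypothetical helper: count tag unigrams, tag bigrams,
    (word, tag) co-occurrences and words over all sentences."""
    unigram, bigram, cooc, wordcount = Counter(), Counter(), Counter(), Counter()
    for words, tags in training_data:
        unigram.update(tags)                # one count per tag
        bigram.update(zip(tags, tags[1:]))  # adjacent tag pairs within a sentence
        cooc.update(zip(words, tags))       # (word, tag) pairs
        wordcount.update(words)             # one count per word
    return unigram, bigram, cooc, wordcount


# The tiny example from the question.
training_dataset = [
    (['dog', 'chase', 'cat'], ['NN', 'VV', 'NN']),
    (['I', 'chase', 'dog'], ['PRP', 'VV', 'NN']),
    (['cat', 'chase', 'mouse'], ['NN', 'VV', 'NN']),
]

unigram, bigram, cooc, wordcount = count_ngrams(training_dataset)
assert unigram['NN'] == 5
assert bigram['VV', 'NN'] == 3
assert bigram['NN', 'VV'] == 2
assert cooc['dog', 'NN'] == 2
```

Counter.update() adds to the existing counts instead of replacing them, which is why calling it once per sentence accumulates totals over the whole training set.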