How to count n-gram occurrences in many lists

Time: 2017-04-05 14:52:57

Tags: python n-gram

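Does anyone know whether it is possible, given a vocabulary of n-grams, to count how many times each of those n-grams occurs in several different lists of tokens? The vocabulary is built from the n-grams in the lists, with each unique n-gram listed once. If I have: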

Lists

['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay'] //1

['hello', 'I', 'am', 'doing', 'okay', 'are', 'you', 'okay'] //2

<type = list>

N-gram Vocabulary

('hello','I')
('I', 'am')
('am', 'doing')
('doing', 'okay')
('okay','are')
('hello', 'how')
('how', 'are')
('are','you')
('you', 'doing')
('doing', 'today')
('today', 'are')
('you', 'okay')
<type = tuples>

Then I would like the output to look something like this:

List 1:

('hello', 'how')1
('how', 'are')1
('are','you')2
('you', 'doing')1
('doing', 'today')1
('today', 'are')1
('you', 'okay')1

List 2:

('hello','I')1
('I', 'am')1
('am', 'doing')1
('doing', 'okay')1
('okay','are')1
('are','you')1
('you', 'okay')1

I have the following code:

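import nltk
from nltk import bigrams, word_tokenize
from nltk.corpus import stopwords

# test_lower holds the lowercased input strings (defined elsewhere)
test_tokenized = [word_tokenize(i) for i in test_lower]
# Build the stopword set once instead of reloading it for every word
stop_words = set(stopwords.words('english'))

for test_toke in test_tokenized:
    filtered_words = [word for word in test_toke if word not in stop_words]
    bigram = bigrams(filtered_words)
    # FreqDist counts how often each bigram occurs in this token list
    fdist = nltk.FreqDist(bigram)
    for k, v in fdist.items():
        #print (k,v)
        occur = (k, v)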

2 answers:

Answer 0: (score: 3)

Use a list comprehension to generate the n-grams and collections.Counter to count the duplicates:

from collections import Counter
l = ['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay']
ngrams = [(l[i],l[i+1]) for i in range(len(l)-1)]
print(Counter(ngrams))
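
The same idea extends beyond bigrams by zipping shifted slices of the token list. The following is only a sketch: the helper name `count_ngrams` and its parameter `n` are illustrative and do not come from the answer above.

```python
from collections import Counter

def count_ngrams(tokens, n=2):
    """Count every run of n consecutive tokens in the list."""
    # Slice i starts at position i; zip stops at the shortest slice,
    # so incomplete n-grams at the end of the list are dropped.
    shifted = [tokens[i:] for i in range(n)]
    return Counter(zip(*shifted))

l = ['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay']
counts = count_ngrams(l)
print(counts[('are', 'you')])
```

Calling `count_ngrams(l, 3)` would count trigrams in the same way.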

Answer 1: (score: 1)

I suggest using a for loop with range:

from collections import Counter
list1 = ['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay']
list2 = ['hello', 'I', 'am', 'doing', 'okay', 'are', 'you', 'okay'] 

def ngram(li):
    result = []
    for i in range(len(li)-1):
        result.append((li[i], li[i+1]))
    return Counter(result)

print(ngram(list1))
print(ngram(list2))
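
To get output shaped like the per-list listing in the question, the counts can be looked up against the n-gram vocabulary for each list in turn. This is a rough sketch under the assumption that the vocabulary is a list of bigram tuples; the function names `bigram_counts` and `counts_per_list` are made up for illustration.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

def counts_per_list(token_lists, vocabulary):
    """Return one dict per token list with the vocabulary bigrams that occur in it."""
    results = []
    for tokens in token_lists:
        counts = bigram_counts(tokens)
        # A Counter returns 0 for missing keys, so absent bigrams are skipped
        results.append({gram: counts[gram] for gram in vocabulary if counts[gram] > 0})
    return results

list1 = ['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay']
list2 = ['hello', 'I', 'am', 'doing', 'okay', 'are', 'you', 'okay']
vocabulary = [
    ('hello', 'I'), ('I', 'am'), ('am', 'doing'), ('doing', 'okay'),
    ('okay', 'are'), ('hello', 'how'), ('how', 'are'), ('are', 'you'),
    ('you', 'doing'), ('doing', 'today'), ('today', 'are'), ('you', 'okay'),
]

result = counts_per_list([list1, list2], vocabulary)
for number, counts in enumerate(result, start=1):
    print("List %d:" % number)
    for gram, count in counts.items():
        print(gram, count)
```

Bigrams are reported in vocabulary order, so the order within each list may differ from the listing in the question.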