How do I extract word associations from text (forward and backward collocations)?

Time: 2019-01-22 03:28:39

Tags: python nlp nltk n-gram collocation

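I am using nltk.collocations to find collocated words in a given text. The code below scores every bigram by likelihood ratio and groups the results by the first word of each bigram, so that I can then look up the collocated words (and their likelihood ratios) for a given word: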

import nltk.collocations
import collections

text = 'I like the customer service. The service personnel were good. So, I would recommend the customer service of XYZ company.'

word_list = []
for sent in nltk.sent_tokenize(text):
    for word in nltk.word_tokenize(sent):
        word_list.append(word)

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(word_list)
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort keyed bigrams by strongest association.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

Printing the entries for "customer" and "service" gives:

print('Words collocated with "customer" are:', prefix_keys['customer'])
>>> Words collocated with "customer" are: [('service', 9.949042176926831)]

print('Words collocated with "service" are:', prefix_keys['service'])
>>> Words collocated with "service" are: [('of', 4.4947649141916255), ('personnel', 4.4947649141916255), ('.', 1.0572102767208427)]

As can be seen, "service" is shown as a collocate of "customer", but "customer" is not shown as a collocate of "service". So it seems that when NLTK says "collocated", it actually means "following word".

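But collocation should mean both forward and backward collocation; that is, it should not matter whether "customer" is followed by "service" or "service" is followed by "customer". In either case, each word should be shown as a collocate of the other.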
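To make the desired behaviour concrete, here is a minimal sketch that indexes every scored bigram under both of its words rather than only the first. The helper name symmetric_collocates and the 'before'/'after' direction labels are my own, not part of NLTK. The sample list is hand-copied from the scores printed earlier and assumes the same ((w1, w2), score) shape that finder.score_ngrams returns. I am not sure this is the proper way to do it; it only shows the output I am after.

```python
import collections


def symmetric_collocates(scored):
    """Index each scored bigram under both of its words.

    `scored` is assumed to be a list of ((w1, w2), score) pairs, the same
    shape as the result of finder.score_ngrams(...).
    """
    neighbours = collections.defaultdict(list)
    for (first, second), score in scored:
        # `second` occurs after `first` in the text.
        neighbours[first].append((second, score, 'after'))
        if second != first:
            # `first` occurs before `second` in the text.
            neighbours[second].append((first, score, 'before'))
    # Strongest association first, as in the original grouping.
    for word in neighbours:
        neighbours[word].sort(key=lambda item: -item[1])
    return neighbours


# Hand-copied subset of the scores printed earlier (illustrative only).
scored_sample = [
    (('customer', 'service'), 9.949042176926831),
    (('service', 'of'), 4.4947649141916255),
    (('service', 'personnel'), 4.4947649141916255),
    (('service', '.'), 1.0572102767208427),
]

both_ways = symmetric_collocates(scored_sample)
print('Words collocated with "service" are:', both_ways['service'])
```

With this grouping, "customer" appears among the collocates of "service" (marked 'before'), alongside the words that follow it. Note that if a text contains both (a, b) and (b, a), they stay separate entries with separate scores; this sketch does not merge them.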

So, how can I find actual collocations, and not just "following word" collocations?

0 answers:

No answers