I am using nltk.collocations to find collocated words in a given text. I can then print out the collocated words (and their likelihood ratios) of a given word, as follows:
import nltk.collocations
import collections

text = 'I like the customer service. The service personnel were good. So, I would recommend the customer service of XYZ company.'

# Flatten the text into a single list of word tokens.
word_list = []
for sent in nltk.sent_tokenize(text):
    for word in nltk.word_tokenize(sent):
        word_list.append(word)

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(word_list)
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort keyed bigrams by strongest association.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])
Printing the entries for "customer" and "service" gives:

print('Words collocated with "customer" are:', prefix_keys['customer'])
>>> Words collocated with "customer" are: [('service', 9.949042176926831)]
print('Words collocated with "service" are:', prefix_keys['service'])
>>> Words collocated with "service" are: [('of', 4.4947649141916255), ('personnel', 4.4947649141916255), ('.', 1.0572102767208427)]

As can be seen, "service" is displayed as a collocated word of "customer", but "customer" is not displayed as a collocated word of "service". So it appears that when NLTK says "collocated", it actually means "the following word".
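But collocation should cover both forward and reverse order; that is, it should not matter whether "service" follows "customer" or "customer" follows "service". Both should be displayed as collocations.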
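To make the idea concrete, here is a rough sketch of the symmetric grouping I have in mind: each scored bigram is indexed under both of its words instead of only the first. The names group_both_directions and sample_scored are hypothetical, and the sample pairs are copied from the printed output above. This is only a workaround under the assumption that reusing the directional bigram score for both words is acceptable; it does not compute an order-independent association measure, and if both (a, b) and (b, a) occur in the text, each word will list the other twice with separate scores.

```python
import collections

def group_both_directions(scored):
    """Index each ((w1, w2), score) pair under both w1 and w2."""
    neighbours = collections.defaultdict(list)
    for (first, second), score in scored:
        neighbours[first].append((second, score))
        if second != first:
            # Also record the pair from the second word's point of view.
            neighbours[second].append((first, score))
    # Sort each word's neighbours by strongest association first.
    for word in neighbours:
        neighbours[word].sort(key=lambda pair: -pair[1])
    return neighbours

# (bigram, score) pairs taken from the printed output above.
sample_scored = [
    (('customer', 'service'), 9.949042176926831),
    (('service', 'of'), 4.4947649141916255),
    (('service', 'personnel'), 4.4947649141916255),
    (('service', '.'), 1.0572102767208427),
]

both = group_both_directions(sample_scored)
print('Words collocated with "service" are:', both['service'])
```

With the real data, the same function could be called on the scored list returned by finder.score_ngrams, in place of the prefix_keys loop.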
So, how do I find the actual collocations, not just the "following word" collocations?