Question

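I'm using NLTK to run some analysis on a number of separate documents. Because of their content, these documents all tend to start and end with the same tokens.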

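I tokenize the documents into a list of lists and then build a finder with BigramCollocationFinder.from_documents. When I score the ngrams by raw frequency, I notice that one of the most frequent bigrams is the end token followed by the start token. This suggests the finder is merging all the documents into one and finding ngrams across the whole batch, which I don't want.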

Code sample:

import nltk
from nltk.collocations import BigramCollocationFinder

# Tokens are "{", "}", or any run of characters other than comma, quote, and "}"
line_tokenizer = nltk.RegexpTokenizer(r'\{|\}|[^,"}]+')
seqs = ["{B,C}", "{B,A}", "{A,B,C}"]
documents = [line_tokenizer.tokenize(s) for s in seqs]
finder = BigramCollocationFinder.from_documents(documents)
bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.score_ngrams(bigram_measures.raw_freq))

This produces the following output:

[(('B', 'C'), 0.15384615384615385), 
 (('C', '}'), 0.15384615384615385), 
 (('{', 'B'), 0.15384615384615385), 
 (('}', '{'), 0.15384615384615385), 
 (('A', 'B'), 0.07692307692307693), 
 (('A', '}'), 0.07692307692307693), 
 (('B', 'A'), 0.07692307692307693), 
 (('{', 'A'), 0.07692307692307693)]

The bigram } { appears in the list, which it shouldn't, because } and { never occur next to each other within a document.

Is there another way to approach this so that } { doesn't show up in the list?
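
For reference, the } { entry can be reproduced without NLTK. The following is a minimal sketch in plain Python, assuming that the documents are concatenated into one token stream and that raw frequency is the bigram count divided by the total number of tokens. Both assumptions are inferred from the output above (2/13 ≈ 0.1538), not taken from the NLTK source.

```python
import re
from collections import Counter
from itertools import chain

# Same token pattern as in the question: "{", "}", or a run of other characters.
TOKEN_RE = re.compile(r'\{|\}|[^,"}]+')

seqs = ["{B,C}", "{B,A}", "{A,B,C}"]
documents = [TOKEN_RE.findall(s) for s in seqs]

# Concatenate every document into a single token stream.
merged = list(chain.from_iterable(documents))

# Count adjacent pairs in the merged stream; pairs can now span two documents.
pair_counts = Counter(zip(merged, merged[1:]))

# Assumption: raw frequency = pair count / total number of tokens.
total_tokens = len(merged)
merged_scores = {pair: count / total_tokens for pair, count in pair_counts.items()}

print(total_tokens)                # 13 tokens across the three documents
print(pair_counts[("}", "{")])     # 2 boundary-spanning pairs
print(merged_scores[("}", "{")])   # 2/13
```

The two ('}', '{') pairs come from the two joins between the three documents, which is why they only appear once the token lists are merged.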

Answer 1

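I believe you want to keep bigrams such as {A and C}, because it is sometimes useful to know that certain words always occur at the start or end of a sentence. So here is a hack: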

Remove the }{ bigram from the bigram_measures scores, then recompute the probabilities of the remaining bigrams by dividing each one by 1-prob('}{').

import nltk

line_tokenizer = nltk.RegexpTokenizer(r'\{|\}|[^,"}]+')
seqs = ["{B,C}", "{B,A}", "{A,B,C}"]
documents = [line_tokenizer.tokenize(s) for s in seqs]
finder = nltk.collocations.BigramCollocationFinder.from_documents(documents)
bigram_measures = nltk.collocations.BigramAssocMeasures()
# Put the bigram scores into a dict for easy access
x = dict(finder.score_ngrams(bigram_measures.raw_freq))

# Remove "}{" from the bigrams and keep the remaining
# probability mass, 1 - prob('}{').
# pop() with a default avoids a KeyError if the finder did not emit "}{".
newmax = 1 - x.pop(('}', '{'), 0.0)

# Recalculate the score of each remaining bigram by dividing it by newmax
y = [(bigram, score / newmax) for bigram, score in x.items()]
print(y)

Output (the order of the pairs can differ between Python versions):

[(('B', 'C'), 0.18181818181818182),
 (('C', '}'), 0.18181818181818182),
 (('B', 'A'), 0.09090909090909091),
 (('{', 'A'), 0.09090909090909091),
 (('{', 'B'), 0.18181818181818182),
 (('A', 'B'), 0.09090909090909091),
 (('A', '}'), 0.09090909090909091)]
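
As an alternative to renormalizing after the fact, the bigrams can be counted inside each document separately, so that a pair spanning two documents is never formed. This is only a sketch in plain Python, not an NLTK API: the helper name bigram_freqs_per_document is hypothetical, and dividing by the total number of within-document bigrams is an assumed normalization, so the scores differ slightly from the renormalized ones above (2/10 rather than 2/11 for the most frequent pairs).

```python
from collections import Counter


def bigram_freqs_per_document(documents):
    """Count adjacent token pairs inside each document only, then normalize."""
    counts = Counter()
    for tokens in documents:
        # zip pairs each token with its successor in the same document,
        # so no pair ever spans a document boundary.
        counts.update(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumption: normalize by the number of within-document bigrams.
    return {pair: n / total for pair, n in counts.items()}


documents = [
    ["{", "B", "C", "}"],
    ["{", "B", "A", "}"],
    ["{", "A", "B", "C", "}"],
]
freqs = bigram_freqs_per_document(documents)

# Print the bigrams from most to least frequent, ties broken alphabetically.
for pair, score in sorted(freqs.items(), key=lambda item: (-item[1], item[0])):
    print(pair, score)
```

With this approach ('}', '{') never appears in the result, while start and end bigrams such as ('{', 'B') and ('C', '}') are kept.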
