Python NLTK Collocations for tagging text

Time: 2013-11-02 17:24:54

Tags: python nltk

I'm not sure whether this is feasible, but I figured I'd ask just in case. Suppose you have a sample dataset of the form "body | tags", for example:

"I went to the store and bought some bread" | shopping food

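I would like to know whether there is a way to use NLTK Collocations to count how many times body words and tag words co-occur in the dataset. An example might be ("bread", "food", 598), where "bread" is a body word, "food" is a tag word, and 598 is the number of times they co-occur in the dataset.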

1 Answer:

Answer 0: (score: 0)

Without using NLTK, you can do it like this:

from collections import Counter
from itertools import product

documents = '''"foo bar is not a sentence" | tag1
"bar bar black sheep is not a real sheep" | tag2
"what the bar foo is not a foo bar" | tag1'''

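# Keep only the body text (left of '|') and strip the surrounding quotes;
# the tags are discarded here.
documents = [line.split('|')[0].strip('" ') for line in documents.split('\n')]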

collocations = Counter()

for doc in documents:
    tokens = doc.split()
    # Get all the possible word collocations with product.
    # NOTE: this includes each token paired with itself, so we need
    #       to remove the count for the token with itself.
    pairs = Counter(product(tokens, tokens)) \
            - Counter((tok, tok) for tok in tokens)
    collocations += pairs


for pair in collocations:
    print(pair, collocations[pair])
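The snippet above pairs body words with each other and drops the tags. For the (body word, tag word, count) triples the question asks about, a minimal sketch might look like the following. It assumes every row follows the "body | tags" layout with space-separated tags, and that a word repeated within one body should count once per row; the sample `rows` and the helper name `body_tag_counts` are hypothetical, made up for illustration.

```python
from collections import Counter

# Hypothetical sample rows in the question's "body | tags" layout.
rows = [
    '"I went to the store and bought some bread" | shopping food',
    '"fresh bread and cheese from the market" | food',
    '"bought new shoes at the store" | shopping',
]

def body_tag_counts(rows):
    """Count, for each (body word, tag) pair, how many rows contain both."""
    counts = Counter()
    for row in rows:
        # Split once on '|' into the quoted body and the tag list.
        body, _, tags = row.partition('|')
        # Lowercase and de-duplicate so a repeated word counts once per row.
        words = set(body.strip().strip('"').lower().split())
        for word in words:
            for tag in set(tags.split()):
                counts[(word, tag)] += 1
    return counts

counts = body_tag_counts(rows)
for (word, tag), n in counts.most_common(5):
    print(word, tag, n)
```

Using sets is a design choice: it counts rows in which the pair appears, not token occurrences. Dropping the `set(...)` calls would count every occurrence instead.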

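You will run into the problem of how to count collocations of the same word within a sentence, for example:

  bar bar black sheep is not a real sheep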

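What is the collocation count for ('bar', 'bar')? Is it 2 or 1? The product-based code above gives 2, because the first bar is collocated with the second bar, and the second bar is collocated with the first bar.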
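Which answer is right depends on the counting convention. As a rough sketch, and not the only reasonable choice, two alternatives to ordered pairs are shown below: counting each unordered pair of token positions once, or counting each pair of distinct word types once per sentence. The variable names `by_position` and `by_type` are made up for illustration.

```python
from collections import Counter
from itertools import combinations

tokens = "bar bar black sheep is not a real sheep".split()

# Convention A: every unordered pair of token positions counts once,
# so the two "bar" tokens form exactly one ('bar', 'bar') pair.
by_position = Counter(tuple(sorted(p)) for p in combinations(tokens, 2))

# Convention B: every unordered pair of distinct word types counts once
# per sentence, so a word is never paired with itself.
by_type = Counter(combinations(sorted(set(tokens)), 2))

print(by_position[('bar', 'bar')])    # 1
print(by_position[('bar', 'sheep')])  # 4 (2 "bar" tokens x 2 "sheep" tokens)
print(by_type[('bar', 'bar')])        # 0
print(by_type[('bar', 'sheep')])      # 1
```

Sorting each pair makes ('bar', 'sheep') and ('sheep', 'bar') share one key, which is what removes the double counting seen with ordered pairs.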