Question

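How do you find collocations in text? A collocation is a sequence of words that occur together unusually often. NLTK provides a bigrams function that returns word pairs.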

>>> from nltk import bigrams
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>

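What's left is to find the bigrams that occur more often than the frequencies of their individual words would suggest. Any ideas how to put that into code?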

Answer 1

Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:

>>> import nltk
>>> def tokenize(sentences):
...     for sent in nltk.sent_tokenize(sentences.lower()):
...         for word in nltk.word_tokenize(sent):
...             yield word
... 

>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))

There are no collocations in this small segment, but here goes:

>>> text.collocations(num=20)
Building collocations list
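
For the BigramCollocationFinder route mentioned at the top of this answer, a minimal sketch might look like the following. The toy token list and the frequency threshold of 2 are assumptions made for illustration; on a real corpus you would tokenize as above and tune the filter and the association measure.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy input (assumption): a real corpus should be far larger.
tokens = "the red herring was a red herring and not a fish".split()

finder = BigramCollocationFinder.from_words(tokens)
# Drop bigrams seen fewer than 2 times; rare pairs get misleadingly high scores.
finder.apply_freq_filter(2)
# Rank the remaining bigrams by pointwise mutual information and keep up to 5.
best = finder.nbest(BigramAssocMeasures.pmi, 5)
print(best)
```

Other measures on BigramAssocMeasures, such as likelihood_ratio or raw_freq, can be passed to nbest in the same way.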

Answer 2

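Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.

words = ["more", "is", "said", "than", "done", "is", "said"]
words_iter = iter(words)
next(words_iter, None)  # advance one step so each word is paired with its successor
count = {}
for bigram in zip(words, words_iter):
    count[bigram] = count.get(bigram, 0) + 1
print(sorted(((c, b) for b, c in count.items()), reverse=True))

(words_iter is introduced to avoid copying the whole list of words, as you would with zip(words, words[1:]).)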
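
The same copy-free pairing can probably be written more compactly with itertools.pairwise, which exists in Python 3.10 and later. The sketch below falls back to a small tee-based helper on older versions; that helper is my own addition, not part of the code above.

```python
from collections import Counter

try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        # Fallback: two iterators over the same data, the second one step ahead.
        first, second = tee(iterable)
        next(second, None)
        return zip(first, second)

words = ["more", "is", "said", "than", "done", "is", "said"]
# Count each adjacent pair without building a shifted copy of the list.
counts = Counter(pairwise(words))
print(counts.most_common())
```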

Answer 3

from collections import Counter

words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword)  # skip the first word so zip pairs each word with the next one
freq = Counter(zip(words, nextword))
print(freq)

Answer 4

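A collocation is a sequence of tokens that is better treated as a single token when parsing; for example, "red herring" has a meaning that can't be derived from its components. Deriving a useful set of collocations from a corpus involves ranking the n-grams by some statistic (n-gram frequency, mutual information, log-likelihood, etc.) followed by judicious manual editing.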
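
As a rough illustration of the ranking step, here is a sketch that scores bigrams by pointwise mutual information using only the standard library. The function name pmi_scores, the min_count cutoff, and the toy sentence are all assumptions of mine; real use needs a much larger corpus and the manual editing mentioned above.

```python
import math
from collections import Counter

def pmi_scores(words, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_words = len(words)
    n_bigrams = max(n_words - 1, 1)
    scores = {}
    for (w1, w2), pair_count in bigrams.items():
        if pair_count < min_count:
            continue  # very rare pairs get inflated PMI, so skip them
        p_pair = pair_count / n_bigrams
        p_w1 = unigrams[w1] / n_words
        p_w2 = unigrams[w2] / n_words
        # PMI compares the observed pair probability with what independence predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy input (assumption), far too small for meaningful statistics.
words = "the red herring was not a fish and the red herring was not red".split()
ranked = pmi_scores(words)
for pair, score in ranked:
    print(pair, round(score, 3))
```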

Points that you appear to be ignoring:

(1) The corpus must be rather large ... attempting to get collocations from a single sentence, as you appear to suggest, is pointless.

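(2) n can be greater than 2 ... for example, analysing texts about 20th-century Chinese history will throw up "significant" bigrams like "Mao Tse" and "Tse Tung".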
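
To make point (2) concrete, counting can be generalized from bigrams to n-grams along these lines. This is only a sketch: the helper name ngram_counts and the toy sentence are made up for illustration.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count every run of n consecutive words."""
    # zip over n staggered slices: words[0:], words[1:], ..., words[n-1:]
    return Counter(zip(*(words[i:] for i in range(n))))

# Toy input (assumption).
words = ("new york city is bigger than new york state "
         "but new york city is smaller").split()

bigram_counts = ngram_counts(words, 2)
trigram_counts = ngram_counts(words, 3)
print(bigram_counts.most_common(3))
print(trigram_counts.most_common(2))
```

Here the frequent bigram ("york", "city") is really a fragment of the trigram ("new", "york", "city"), which is why looking only at n = 2 can mislead.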

What exactly are you trying to achieve? What code have you written so far?

Answer 5

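I agree with Tim McNamara about using nltk and about the problems with unicode. However, I like the Text class a lot, and there is a hack you can use to get the collocations as a list; I found it by looking at the source code. Apparently, whenever you call the collocations method, it saves the result as an attribute (_collocations) on the Text object!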

    import nltk
    def tokenize(sentences):
        for sent in nltk.sent_tokenize(sentences.lower()):
            for word in nltk.word_tokenize(sent):
                yield word


    text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
    text.collocations(num=20)
    collocations = [" ".join(el) for el in list(text._collocations)]

Enjoy!
