Question

The problem

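I have a first word and some restrictions on what the second word can be (for example, "I _o__"). What I want is a list of candidate words such as "rode", "love", and "most", together with how common each one is after "I".

I would like to get a list of 2-tuples (nextword, probability), where nextword is a word that satisfies the regex and probability is the probability that nextword follows the first word, given by (the number of times nextword is seen after the first word in a text corpus) / (the number of times the first word appears).

Something like this: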

[(nextword, follow_probability("I", nextword)) for nextword in findwords('.o..')]

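My approach is to first generate a list of possible words that satisfy the regex, and then look up the probability of each one. The first part is easy, but I don't know how to do the second part. Ideally, I would have a function that takes the two words as arguments and returns the probability that the second word follows the first.

What I've tried

Using the markovify library to generate chains and sentences with a specific starting word and a state size of 1
Using nltk's BigramCollocationFinder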
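For reference, the first step described above (collecting candidate words that match the pattern) could be sketched as follows. The `vocabulary` argument and the sample word list are assumptions made for illustration; in practice the candidates would come from the corpus itself or from a dictionary file.

```python
import re

def findwords(pattern, vocabulary):
    """Return the words in `vocabulary` that match `pattern` in full."""
    regex = re.compile(pattern)
    # fullmatch makes '.o..' accept only four-letter words, not longer ones.
    return [word for word in vocabulary if regex.fullmatch(word)]

# Hypothetical vocabulary, for illustration only.
sample_vocabulary = ["rode", "love", "most", "like", "other", "go"]
candidates = findwords(".o..", sample_vocabulary)
print(candidates)
```

With this sample list, only "rode", "love", and "most" pass the filter; "other" and "go" have the wrong length and "like" has no "o" in second position.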

Answer 1

Try something like this:

from collections import Counter, deque
from nltk.tokenize import regexp_tokenize
import pandas as pd

def grouper(iterable, length=2):
    i = iter(iterable)
    q = deque(map(next, [i] * length))
    while True:
        yield tuple(q)
        try:
            q.append(next(i))
            q.popleft()
        except StopIteration:
            break

def tokenize(text):
    return [word.lower() for word in regexp_tokenize(text, r'\w+')]

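def follow_probability(word1, word2, vec):
    # vec holds bigram counts indexed by (first word, second word).
    try:
        subvec = vec.loc[word1]
    except KeyError:
        # word1 never starts a bigram in the corpus.
        return 0.0
    # Count of (word1, word2); 0 if that pair was never seen.
    ct = subvec.get(word2, 0)
    return float(ct) / (subvec.sum() or 1)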

text = 'This is some training text this this'
tokens = tokenize(text)
markov = list(grouper(tokens))
vec = pd.Series(Counter(markov))

follow_probability('this', 'is', vec)

Output:

0.5
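Since the question asks about nltk specifically, the same counts can also be held in an `nltk.ConditionalFreqDist` built from `nltk.bigrams`. The following is a minimal sketch under stated assumptions, not a definitive implementation: `next_word_probabilities` is a hypothetical helper name, the sample text is made up, and the tokenizer is a plain regex. Like the code above, it divides by the number of bigrams that start with the first word, which can differ from the word's total occurrence count when the word ends the corpus.

```python
import re
import nltk

def next_word_probabilities(first, pattern, cfd):
    """List (nextword, probability) for followers of `first` matching `pattern`."""
    followers = cfd[first]  # FreqDist of the words seen right after `first`
    regex = re.compile(pattern)
    return [(word, followers.freq(word))
            for word in sorted(followers)
            if regex.fullmatch(word)]

# Made-up sample text, for illustration only.
text = "I love you. I rode home. I love most of it. I most like home."
tokens = re.findall(r"\w+", text.lower())

# Each bigram (w1, w2) is treated as (condition, sample).
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))

result = next_word_probabilities("i", ".o..", cfd)
print(result)
```

Here "i" is followed by "love" twice and by "rode" and "most" once each, so the three candidates get 0.5, 0.25, and 0.25. A first word that never occurs yields an empty list.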

How can I use nltk to get the probability of the next word?
