So I have the following list of tokens: tokens = ["<S>", "Hello", "World", "How", "Are", "You", "Hello", "Goop", "I", "Are", "Good"]
I want to train an N-gram model with backoff, which means my output should look something like this:
output["token1 token2"] = {"token0": probability of the sequence <token1 token2 token0>, "token1": probability of the sequence <token1 token2 token1>, and so on}
For example: output["token1 token2"] = {"token0": 0.02, "token1": 0.3, and so on}
It should also contain the shorter-history keys output["token1"]
and output[""]
(the unigrams). My method definition looks like this:
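from collections import Counter, defaultdict

def model(tokens, n):
    # n is taken to be the number of history tokens in each key
    output = defaultdict(Counter)
    for i in range(len(tokens) - n):
        # join the history into a string so it can be used as a dict key
        history, word = " ".join(tokens[i:i+n]), tokens[i+n]
        # <How to compute probabilities here & store it in output?>
    return output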
Currently I am initializing my output as a Counter, but I am completely stuck on how to handle backoff and generate the probabilities within a single loop.
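
To make the target structure concrete, here is a minimal sketch of one way it could be built. It is not a definitive implementation. The names `train_ngram_tables` and `backoff_score` are hypothetical, and `order` is assumed to mean the maximum number of history tokens. Counts for every history length (including the empty history `""`) are collected in a single pass over the tokens. The probabilities are relative frequencies, so they need each history's total and are normalised in a second step. The lookup uses a simple "stupid backoff" style fallback with an assumed factor `alpha = 0.4`, not Katz backoff with discounting, so backed-off values are scores rather than true probabilities.

```python
from collections import Counter, defaultdict


def train_ngram_tables(tokens, order):
    """Return {history string: {next word: relative frequency}} for every
    history length from 0 (key "") up to `order` (assumed: max history size)."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        # Record the word under each history length that fits before position i.
        for k in range(min(order, i) + 1):
            history = " ".join(tokens[i - k:i])  # k == 0 gives the empty key ""
            counts[history][word] += 1
    # Normalise each history's counts into relative frequencies.
    output = {}
    for history, counter in counts.items():
        total = sum(counter.values())
        output[history] = {w: c / total for w, c in counter.items()}
    return output


def backoff_score(output, history_tokens, word, alpha=0.4):
    """Score `word` after `history_tokens`. Whenever the current history has
    no entry for `word`, drop the oldest history token and multiply by `alpha`
    (an assumed penalty in the style of "stupid backoff")."""
    history = list(history_tokens)
    penalty = 1.0
    while True:
        key = " ".join(history)
        if word in output.get(key, {}):
            return penalty * output[key][word]
        if not history:
            return 0.0  # word never seen in training
        history = history[1:]
        penalty *= alpha


tokens = ["<S>", "Hello", "World", "How", "Are", "You",
          "Hello", "Goop", "I", "Are", "Good"]
output = train_ngram_tables(tokens, order=2)

print(output["Hello World"])  # {'How': 1.0}
print(output["Hello"])        # {'World': 0.5, 'Goop': 0.5}
print(backoff_score(output, ["You", "Hello"], "World"))  # backs off to "Hello"
```

Keeping the tables for all history lengths in one dictionary means the lookup only has to shorten the key. Swapping in a different backoff scheme would change `backoff_score` but not the counting loop.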