Question

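I'm working on this homework assignment and I'm stuck. I can't figure out how to compute bigram frequencies in the English language, i.e. the 'conditional probability', in Python.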

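That is, the probability $P(W_n \mid W_{n-1})$ of a token $W_n$ given the preceding token $W_{n-1}$ is equal to the probability of their bigram, or the co-occurrence of the two tokens $P(W_{n-1}, W_n)$, divided by the probability of the preceding token $P(W_{n-1})$.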
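Written out, and assuming the usual estimate from raw counts (the count notation $C(\cdot)$ is introduced here for illustration only), the relation looks roughly like this:

```latex
P(W_n \mid W_{n-1})
  = \frac{P(W_{n-1}, W_n)}{P(W_{n-1})}
  \approx \frac{C(W_{n-1} W_n)}{\sum_{w} C(W_{n-1} w)}
```

For example, if 'a' is followed by another letter 12 times in the text and 3 of those times the next letter is 'b', the estimate is $P(b \mid a) \approx 3/12 = 0.25$.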

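I have a text made up of many letters, and I have computed the probability of each letter in it, so that, for example, the letter 'a' comes out at 0.015% relative to all the letters in the text.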

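The letters come from ^a-zA-Z, and what I want to know is:
How do I build a table of size (letters) x (letters), and how do I compute the conditional probability for each cell?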

Like this:

[[(a|a),(b|a),(c|a),...,(z|a),...(Z|a)]
 [(a|b),(b|b),(c|b),...,(z|b),...(Z|b)]
                    ...       ...
 [(a|Z),(b|Z),(c|Z),...,(z|Z),...(Z|Z)]]

For this I need to compute probabilities such as: if the current letter is 'a', what is the chance that the next letter is 'a', and so on.

I can't get started. I hope you can point me in the right direction, and I hope the problem I'm trying to solve is clear.

Answer 1

Assuming your file contains no punctuation or other non-letter characters (these are easy to strip out):

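import itertools
import string

LETTERS = string.ascii_letters  # 'a'..'z' followed by 'A'..'Z', 52 letters in total
INDEX = {letter: i for i, letter in enumerate(LETTERS)}  # letter -> row/column index

def pairwise(s):
    # yield consecutive pairs: (s0, s1), (s1, s2), ...
    a, b = itertools.tee(s)
    next(b, None)  # advance the second iterator by one; the default avoids an error on empty input
    return zip(a, b)

counts = [[0 for _ in range(52)] for _ in range(52)]  # nothing has occurred yet
with open('path/to/input') as infile:
    for a, b in pairwise(char for line in infile for word in line.split() for char in word):  # get pairwise characters from the text
        given = INDEX[a]  # index (in `counts`) of the "given" character; raises KeyError on a non-letter
        char = INDEX[b]   # index of the character that follows the "given" character
        counts[given][char] += 1

# now that we have the number of occurrences, let's divide by the totals to get conditional probabilities

totals = [sum(row) for row in counts]  # how often each "given" character was followed by anything
for given in range(52):
    if not totals[given]:
        continue  # this character never occurred as the first of a pair; leave its row at zero
    for i in range(len(counts[given])):
        counts[given][i] /= totals[given]  # true division (Python 3)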

I haven't tested this, but it should be a good start.
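The assumption that the text contains only letters can be met by filtering the input first. A minimal sketch, assuming a hypothetical helper named `letters_only` (not part of the code above) and the same a-zA-Z alphabet as in the question:

```python
import re

def letters_only(text):
    """Drop every character outside a-zA-Z (hypothetical helper for illustration)."""
    return re.sub(r'[^a-zA-Z]', '', text)

sample = "Hello, World! 123"
cleaned = letters_only(sample)
print(cleaned)  # HelloWorld
```

Note that removing spaces this way joins the last letter of one word to the first letter of the next, which is also what the pair generator above does.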

Here is a dictionary version, which should be easier to read and debug:

counts = {}
with open('path/to/input') as infile:
    for a, b in pairwise(char for line in infile for word in line.split() for char in word):
        given = a  # use the characters themselves as keys, so lookups read naturally
        char = b
        if given not in counts:
            counts[given] = {}
        if char not in counts[given]:
            counts[given][char] = 0
        counts[given][char] += 1

answer = {}
for given, chardict in counts.items():
    total = sum(chardict.values())  # how often `given` was followed by anything
    answer[given] = {}
    for char, count in chardict.items():
        answer[given][char] = count / total

Now answer contains the probabilities you are after. If you want the probability of 'a' given 'b', look at answer['b']['a'].
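As a possible alternative, the same counting can be written more compactly with `collections.Counter` and `defaultdict`. This is only a sketch on an in-memory string rather than a file, and the function name `bigram_probabilities` is made up for illustration:

```python
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Return {given: {char: P(char | given)}} for consecutive characters in text."""
    counts = defaultdict(Counter)
    for given, char in zip(text, text[1:]):  # consecutive character pairs
        counts[given][char] += 1
    answer = {}
    for given, followers in counts.items():
        total = sum(followers.values())  # how often `given` was followed by anything
        answer[given] = {char: n / total for char, n in followers.items()}
    return answer

answer = bigram_probabilities("abac")
print(answer['a']['b'])  # 0.5: 'a' is followed once by 'b' and once by 'c'
print(answer['b']['a'])  # 1.0: 'b' is always followed by 'a'
```

Pairs that never occur are simply absent from the dictionaries, so `answer[given].get(char, 0.0)` is a convenient way to read a cell when filling in the full table.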

How do I program a bigram as a table in Python?
