How do I program bigrams as a table in Python?

Asked: 2015-01-14 20:14:52

Tags: python list dictionary markov-chains

I'm working on this homework and I'm stuck. I can't work out how to program the bigram frequency in the English language, the "conditional probability", in Python.

  

     

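That is, the probability P(W_n | W_{n-1}) of a token W_n given the preceding token W_{n-1} is equal to the probability of their bigram, or the co-occurrence of the two tokens P(W_{n-1}, W_n), divided by the probability of the preceding token:

P(W_n | W_{n-1}) = P(W_{n-1}, W_n) / P(W_{n-1})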
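To make the formula concrete, here is a minimal sketch that estimates P(cur | prev) from raw counts on a short string. The function name `conditional_probability` and the sample text are made up for illustration. It assumes the usual count-based estimate, where the two probabilities in the formula are replaced by the bigram count and by the number of times the preceding letter is followed by anything.

```python
from collections import Counter

def conditional_probability(text, prev, cur):
    """Estimate P(cur | prev) from adjacent character pairs in text."""
    # Count every adjacent pair (bigram) of characters.
    pairs = Counter(zip(text, text[1:]))
    # Count prev only where it has a successor, so the probabilities
    # over all possible cur values sum to 1 for a given prev.
    prev_count = sum(n for (a, _), n in pairs.items() if a == prev)
    if prev_count == 0:
        return 0.0  # prev never appears with a following character
    return pairs[(prev, cur)] / prev_count

# "abaab" contains the bigrams ab, ba, aa, ab.
sample = "abaab"
p_b_given_a = conditional_probability(sample, "a", "b")  # 2 of the 3 pairs starting with 'a'
p_a_given_a = conditional_probability(sample, "a", "a")  # 1 of the 3 pairs starting with 'a'
```

In this sample, 'a' is followed by another character three times, twice by 'b' and once by 'a', which is where the two estimates come from.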

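I have a text containing many letters, and I have already computed the probability of each letter in this text, so that, for example, the letter 'a' comes out at 0.015% of the letters in the text.

The letters are drawn from a-zA-Z, and what I want to know is:
how do I build a table of size (letters) x (letters), and how do I compute the conditional probability for each cell?

Like this: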

[[(a|a),(b|a),(c|a),...,(z|a),...(Z|a)]
 [(a|b),(b|b),(c|b),...,(z|b),...(Z|b)]
                    ...       ...
 [(a|Z),(b|Z),(c|Z),...,(z|Z),...(Z|Z)]]

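For this I need to compute probabilities such as: if the current letter is 'a', what is the chance that the next letter is 'a', and so on.

I can't get started. I hope you can point me in the right direction, and I hope the problem I'm trying to solve is clear.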

1 Answer:

Answer 0: (score: 0)

Assuming your file contains no other punctuation (which is easy to strip out):

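import itertools
import string

LETTERS = string.ascii_letters  # 'a'..'z' followed by 'A'..'Z', 52 letters
INDEX = {ch: i for i, ch in enumerate(LETTERS)}  # letter -> row/column index

def pairwise(s):
    # yield consecutive pairs: (s0, s1), (s1, s2), ...
    a, b = itertools.tee(s)
    next(b, None)  # advance the second iterator; the default avoids StopIteration on empty input
    return zip(a, b)

counts = [[0 for _ in range(52)] for _ in range(52)]  # nothing has occurred yet
with open('path/to/input') as infile:
    for a, b in pairwise(char for line in infile for word in line.split() for char in word):  # get pairwise characters from the text
        given = INDEX[a]  # index (in `counts`) of the "given" character; KeyError on a non-letter
        char = INDEX[b]   # index of the character that follows the "given" character
        counts[given][char] += 1

# now that we have the number of occurrences, let's divide by the totals to get conditional probabilities

totals = [sum(row) for row in counts]  # how often each "given" character was followed by anything
for given in range(52):
    if not totals[given]:
        continue  # this character never occurred as a "given"; leave its row at zero
    for i in range(len(counts[given])):
        counts[given][i] /= totals[given]  # true division in Python 3 (on Python 2, use `from __future__ import division`)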

I haven't tested this, but it should be a good start.
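The punctuation stripping mentioned at the start of this answer could look something like the sketch below. The helper name `letters_only` is made up for illustration, and it assumes you want to keep only a-z and A-Z and simply drop everything else, including digits and whitespace.

```python
import re

def letters_only(text):
    """Return text with every character outside a-z and A-Z removed."""
    return re.sub(r"[^a-zA-Z]", "", text)

cleaned = letters_only("Hello, world! 123")
```

Note that dropping whitespace joins the last letter of one word to the first letter of the next, which is also what the counting loop in this answer does when it flattens the words into one stream of characters.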

Here is a dictionary version, which should be easier to read and debug:

counts = {}
with open('path/to/input') as infile:
    for a, b in pairwise(char for line in infile for word in line.split() for char in word):
        # key by the characters themselves, so the result can be looked up by letter
        if a not in counts:
            counts[a] = {}
        if b not in counts[a]:
            counts[a][b] = 0
        counts[a][b] += 1

answer = {}
for given, chardict in counts.items():  # iterate over the counts, not the (still empty) answer
    total = sum(chardict.values())
    answer[given] = {}
    for char, count in chardict.items():
        answer[given][char] = count / total  # true division in Python 3

Now answer contains the probabilities you're after. If you want the probability of 'a' given 'b', look at answer['b']['a'].
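As a quick sanity check, every row of such a table should sum to 1. The sketch below rebuilds a small `answer` from an in-memory sample string instead of a file, so it can be run on its own. The function name `bigram_probabilities` and the sample text are made up for illustration, and Python 3 division is assumed.

```python
import math

def bigram_probabilities(text):
    """Return {given: {char: P(char | given)}} for adjacent characters in text."""
    counts = {}
    for a, b in zip(text, text[1:]):
        row = counts.setdefault(a, {})
        row[b] = row.get(b, 0) + 1
    answer = {}
    for given, row in counts.items():
        total = sum(row.values())
        answer[given] = {char: n / total for char, n in row.items()}
    return answer

answer = bigram_probabilities("abaab")

# Each row is a probability distribution over the next character.
for given, row in sorted(answer.items()):
    assert math.isclose(sum(row.values()), 1.0)
    print(given, {char: round(p, 3) for char, p in sorted(row.items())})

# Pairs that never occurred are simply absent, so use .get with a default.
p_b_given_b = answer.get("b", {}).get("b", 0.0)
```

Using `.get` with a default of 0.0 avoids a KeyError for letter pairs that never appear in the text.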