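I'm working on this homework and I'm stuck. I can't work out how to code the 'conditional probability' for bigram frequency in the English language in Python.
That is, the probability of a token given the preceding token equals the probability of their bigram (the co-occurrence of the two tokens) divided by the probability of the preceding token: P(b | a) = P(a, b) / P(a).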
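I have a text containing many letters, and I have computed the probability of each letter in it, so that the letter 'a', for example, comes out at 0.015% of the letters in the text.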
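The letters come from ^a-zA-Z, and what I want is:
How do I build a table of size (number of letters) x (number of letters), and how do I compute the conditional probability for each cell?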
Like this:
[[(a|a),(b|a),(c|a),...,(z|a),...(Z|a)]
[(a|b),(b|b),(c|b),...,(z|b),...(Z|b)]
... ...
[(a|Z),(b|Z),(c|Z),...,(z|Z),...(Z|Z)]]
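For this I need to compute probabilities such as: if the current letter is 'a', what is the chance that the next letter is 'a', and so on.
I can't get started. I hope you can point me in the right direction, and I hope the problem I'm trying to solve is clear.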
Answer 0 (score: 0)
Assuming your file contains no punctuation or other non-letter characters (these are easy to strip out):
import itertools
import string

LETTERS = string.ascii_letters  # 'a'-'z' followed by 'A'-'Z', 52 symbols
INDEX = {ch: i for i, ch in enumerate(LETTERS)}  # letter -> row/column index

def pairwise(s):
    # yield consecutive overlapping pairs: (s0, s1), (s1, s2), ...
    a, b = itertools.tee(s)
    next(b, None)
    return zip(a, b)

counts = [[0 for _ in range(52)] for _ in range(52)]  # nothing has occurred yet
with open('path/to/input') as infile:
    # get pairwise characters from the text (whitespace is skipped, so pairs also span word boundaries)
    for a, b in pairwise(char for line in infile for word in line.split() for char in word):
        given = INDEX[a]  # index (in `counts`) of the "given" character
        char = INDEX[b]   # index of the character that follows the "given" character
        counts[given][char] += 1

# now that we have the number of occurrences, divide by the row totals to get conditional probabilities
totals = [sum(row) for row in counts]
for given in range(52):
    if not totals[given]:
        continue  # this letter never appeared as a "given" character
    for i in range(len(counts[given])):
        counts[given][i] /= totals[given]
I haven't tested this, but it should be a good start.
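For a quick check without an input file, here is a minimal self-contained sketch of the same table-based idea run on an in-memory string. The helper name `conditional_table` and the sample text are made up for illustration, and dropping non-letters with a regular expression is just one possible way to do the punctuation removal assumed above. As in the file version, letters on either side of a removed character are treated as adjacent.

```python
import re
import string

LETTERS = string.ascii_letters                    # 'a'-'z' then 'A'-'Z' (52 symbols)
INDEX = {ch: i for i, ch in enumerate(LETTERS)}   # letter -> row/column index

def conditional_table(text):
    """Return table[given][nxt] = P(nxt | given), estimated from text."""
    letters = re.sub(r'[^a-zA-Z]', '', text)      # keep ASCII letters only
    n = len(LETTERS)
    table = [[0.0] * n for _ in range(n)]
    # count every consecutive pair of letters
    for given, nxt in zip(letters, letters[1:]):
        table[INDEX[given]][INDEX[nxt]] += 1
    # turn each row of counts into probabilities
    for row in table:
        total = sum(row)
        if total:                                 # rows for unseen letters stay all zero
            row[:] = [c / total for c in row]
    return table

table = conditional_table("abab, aa!")            # hypothetical sample text
p_b_given_a = table[INDEX['a']][INDEX['b']]       # row = given letter, column = next letter
```

Each row that has any counts sums to 1, which is a handy sanity check on real input.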
Here is a dictionary version, which should be easier to read and debug:
counts = {}
with open('path/to/input') as infile:
    # key the dictionaries by the characters themselves, so lookups read naturally
    for given, char in pairwise(c for line in infile for word in line.split() for c in word):
        if given not in counts:
            counts[given] = {}
        if char not in counts[given]:
            counts[given][char] = 0
        counts[given][char] += 1

answer = {}
for given, chardict in counts.items():
    total = sum(chardict.values())  # how often `given` was followed by anything
    answer[given] = {}
    for char, count in chardict.items():
        answer[given][char] = count / total
Now answer contains the probabilities you are after. If you want the probability of 'a' given 'b', look at answer['b']['a'].
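With plain nested dictionaries, looking up a pair that never occurred in the text raises a KeyError. As a hedged variation, the sketch below builds the same mapping with `collections.Counter` and `collections.defaultdict` from the standard library and adds a small helper, `prob`, that returns 0.0 for unseen pairs. The function names and the sample string are illustrative only and are not part of the code above.

```python
from collections import Counter, defaultdict

def bigram_probs(text):
    """Map given -> {next: P(next | given)} for the ASCII letters in text."""
    letters = [c for c in text if c.isascii() and c.isalpha()]
    counts = defaultdict(Counter)
    # count how often each letter is followed by each other letter
    for given, nxt in zip(letters, letters[1:]):
        counts[given][nxt] += 1
    # normalise each letter's follower counts by their total
    return {
        given: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for given, followers in counts.items()
    }

def prob(answer, nxt, given):
    """P(nxt | given), or 0.0 if that pair was never seen."""
    return answer.get(given, {}).get(nxt, 0.0)

answer = bigram_probs("banana band")   # hypothetical sample text
p_n_given_a = prob(answer, 'n', 'a')   # 'a' is followed by 'n' in 3 of 4 cases
p_z_given_a = prob(answer, 'z', 'a')   # unseen pair, falls back to 0.0
```

Storing only the pairs that occur keeps the structure small for sparse data, at the cost of needing a default for missing entries.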