I have a text file like the following.
A,B,C,D,E
A,B,C
A,B,C,E
C,D,E
C,D,E,B,A
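I need to find the probability of characters occurring consecutively. In this case, the probability of B occurring after A is
P(A->B) = (number of times B occurs after A) / (number of times A occurs)
So the probability is
3/4 = 0.75
Likewise, I need to compute all the pairwise probabilities: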
A->B
B->A
A->C
C->A
A->D ...etc.
I can't figure out how to start implementing this. Using a pandas DataFrame
is fine too. Any help with this?
Answer 0 (score: 0)
from collections import defaultdict

data = [['A','B','C','D','E'],
        ['A','B','C'],
        ['A','B','C','E'],
        ['C','D','E'],
        ['C','D','E','B','A']]

# Flatten the rows to count how often each character occurs overall.
characters = [i for j in data for i in j]
counts = {}
combinations = defaultdict(int)
for character in set(characters):
    counts[character] = characters.count(character)
    for character2 in set(characters):
        for entry in data:
            combination = [character, character2]
            # Substring test: true when character2 directly follows character
            # in this row (relies on every item being a single character).
            if "".join(combination) in "".join(entry):
                combinations[tuple(combination)] += 1

# P(x -> y) = rows where y directly follows x / total occurrences of x
probability = {i: combinations[i]/float(counts[i[0]]) for i in combinations}
probability
{('A', 'B'): 0.75,
('B', 'A'): 0.25,
('B', 'C'): 0.75,
('C', 'D'): 0.6,
('C', 'E'): 0.2,
('D', 'E'): 1.0,
('E', 'B'): 0.25}
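
As a possible variation, here is a minimal sketch that counts adjacent pairs with zip and collections.Counter instead of joining strings, so it should also work when items are longer than one character. It assumes, like the code above, that "B occurring after A" means B immediately follows A, and it assumes Python 3 division. The names rows and pair_counts are made up for illustration, and the rows are hard-coded rather than read from the file. Unlike the code above, it counts every occurrence of a pair (not at most one per row) and lists unseen pairs such as A->C with probability 0.0; for this data the non-zero values come out the same.

```python
from collections import Counter
from itertools import permutations

# Sample rows from the question (hard-coded here; reading them from the
# actual text file is left out of this sketch).
rows = [
    "A,B,C,D,E",
    "A,B,C",
    "A,B,C,E",
    "C,D,E",
    "C,D,E,B,A",
]
data = [row.split(",") for row in rows]

# Total occurrences of each character across all rows.
counts = Counter(ch for row in data for ch in row)

# Occurrences of each ordered pair (x immediately followed by y).
pair_counts = Counter(pair for row in data for pair in zip(row, row[1:]))

# P(x -> y) = count(x immediately followed by y) / count(x).
# permutations() enumerates every ordered pair, so unseen pairs get 0.0.
probability = {
    (x, y): pair_counts[(x, y)] / counts[x]
    for x, y in permutations(sorted(counts), 2)
}

print(probability[("A", "B")])
print(probability[("A", "C")])
```

If "after" is meant as "anywhere later in the same row" rather than "immediately after", the zip(row, row[1:]) step would need to be replaced with a different pairing.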