Question

I have a dataframe that lists every transition between an initial state in column x and the next state in column y.

test = pd.DataFrame([['a','b',1],['a','c',1],['b','c',2],['d','a',1],['d','e',3]], columns = ['x','y','counts'])

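I am trying to build a transition matrix (as a dataframe) that shows the probability of transitioning between all the states that appear in column x and column y. Something like this: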

     [a]  [b]  [c]  [d]  [e]
[a]  .1   .2   .3   .4   0
[b]  .0   .0   .25  .75  0
[c]  .0   .0   .0   .0   0
[d]  .25  .25  .25  .25  0
[e]  0    0    0    0    0

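Unfortunately, in my dataset, if no transition from or to a value such as 'a' was recorded, every attempt at grouping and unstacking leaves that value missing from the rows or the columns.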

What I get:

     a    b    c    e
a    0    .25  .25  0
b    0    0    1    0
d    .25  0    0    .75

How can I get all the values from a to e on both axes?

Answer 1

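You need reindex to add the missing categories, filling the missing values with 0. Note that test here is assumed to hold the already pivoted table (the one shown under "What I get"), not the raw list of transitions:

test = test.reindex(index=list('abcde'), columns=list('abcde'), fill_value=0)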
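
For completeness, here is a minimal end-to-end sketch of that idea, starting from the raw transitions in the question. The names raw, pair_counts, row_totals, probs, pivoted and full are illustrative and do not come from the original answer, and the state list 'abcde' is hard-coded on the assumption that the states are known in advance.

```python
import pandas as pd

# Raw transitions from the question: x = current state, y = next state.
raw = pd.DataFrame(
    [['a', 'b', 1], ['a', 'c', 1], ['b', 'c', 2], ['d', 'a', 1], ['d', 'e', 3]],
    columns=['x', 'y', 'counts'],
)

# Count per (x, y) pair divided by the total count leaving each x.
pair_counts = raw.groupby(['x', 'y'])['counts'].sum()
row_totals = raw.groupby('x')['counts'].sum()
probs = pair_counts.div(row_totals, level='x')

# Wide table: only observed states appear (rows a, b, d; columns a, b, c, e).
pivoted = probs.unstack(fill_value=0)

# Reindex both axes to the full state list, filling the gaps with 0.
states = list('abcde')  # assumption: the full set of states is known up front
full = pivoted.reindex(index=states, columns=states, fill_value=0)
print(full)
```

Hard-coding the state list is the simplest option when the states are fixed; when they are not, they can be derived from the data instead.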

Alternatively, reindex the stacked result by a MultiIndex built from all the unique values:

pivot = test.groupby(['x','y'])['counts'].sum() / test.groupby(['x'])['counts'].sum()

vals = np.unique(test[['x', 'y']].values)
print (vals)
['a' 'b' 'c' 'd' 'e']

mux = pd.MultiIndex.from_product([vals, vals])
final = pivot.reindex(mux, fill_value=0).unstack(fill_value=0)
print (final)
      a    b    c    d     e
a  0.00  0.5  0.5  0.0  0.00
b  0.00  0.0  1.0  0.0  0.00
c  0.00  0.0  0.0  0.0  0.00
d  0.25  0.0  0.0  0.0  0.75
e  0.00  0.0  0.0  0.0  0.00
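
A quick sanity check on a result like this, offered as a sketch rather than as part of the original answer: every state with at least one outgoing transition should have a row that sums to 1, and every state that never appears in x should have an all-zero row. The snippet rebuilds the matrix so it can run on its own. It uses a sorted set union for the state list and Series.div with level='x' for the division, both of which are substitutions for the np.unique call and the plain / shown above.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame(
    [['a', 'b', 1], ['a', 'c', 1], ['b', 'c', 2], ['d', 'a', 1], ['d', 'e', 3]],
    columns=['x', 'y', 'counts'],
)

# Transition probabilities indexed by (x, y).
pivot = test.groupby(['x', 'y'])['counts'].sum().div(
    test.groupby('x')['counts'].sum(), level='x'
)

# Every state seen in either column, in a deterministic order.
vals = sorted(set(test['x']) | set(test['y']))
mux = pd.MultiIndex.from_product([vals, vals])
final = pivot.reindex(mux, fill_value=0).unstack(fill_value=0)

# Rows for states with outgoing transitions must sum to 1; the rest must be 0.
row_sums = final.sum(axis=1)
has_outgoing = final.index.isin(test['x'])
assert np.allclose(row_sums[has_outgoing], 1.0)
assert np.allclose(row_sums[~has_outgoing], 0.0)
```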

Answer 2

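Thanks, jezrael. I just want to add to the answer a way to generate the list of missing index values for reindex: I used the union of the column and row values.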

import pandas as pd

## Generate the initial dataframe
test = pd.DataFrame([['a','b',1],['a','c',1],['b','c',2],['d','a',1],['d','e',3]], columns = ['x','y','counts'])

## Compute the transition probabilities: count per (x, y) pair divided by the total count per x.
pivot = test.groupby(['x','y'])['counts'].sum() / test.groupby(['x'])['counts'].sum()

## Unstack to get the values as a dataframe and fill unobserved transitions with zeroes.
temp = pivot.unstack().fillna(0)

## Fill in the missing rows and columns by reindexing on the union of the values in x and y.
## Sorting is added here so the axis order is deterministic (a plain set has no guaranteed order).
states = sorted(set(test['x']) | set(test['y']))
final = temp.reindex(index=states, columns=states, fill_value=0)
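
If this is needed in more than one place, the steps can be wrapped in a small function. The sketch below is one possible variant, not the method from either answer: the helper name transition_matrix and its parameters are hypothetical, and it reindexes the raw counts first and normalizes each row afterwards, guarding against division by zero for states with no outgoing transitions.

```python
import pandas as pd

def transition_matrix(df, src='x', dst='y', weight='counts'):
    """Hypothetical helper: row-normalized transition table over all states."""
    # Every state seen as a source or a destination, in sorted order.
    states = sorted(set(df[src]) | set(df[dst]))

    # Summed counts per (src, dst), expanded to the full square table.
    counts = df.pivot_table(
        index=src, columns=dst, values=weight, aggfunc='sum', fill_value=0
    ).reindex(index=states, columns=states, fill_value=0)

    # Divide each row by its total; rows with total 0 are divided by 1
    # so they stay all-zero instead of becoming NaN.
    totals = counts.sum(axis=1)
    return counts.div(totals.where(totals != 0, 1), axis=0)

test = pd.DataFrame(
    [['a', 'b', 1], ['a', 'c', 1], ['b', 'c', 2], ['d', 'a', 1], ['d', 'e', 3]],
    columns=['x', 'y', 'counts'],
)
matrix = transition_matrix(test)
print(matrix)
```

Normalizing after the reindex keeps the intermediate table as integer counts, which can be easier to inspect than probabilities when debugging.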

Create a dataframe of all combinations between column values (even those not observed)
