How do I perform calculations on specific rows and columns of a crosstab in pandas?

Asked: 2013-09-05 23:40:15

Tags: python pandas dataframe

import pandas as pd
import numpy as np

c1 = np.repeat(['a','b'], [50, 50], axis=0)
c2 = list('xy'*50)
c3 = np.repeat(['G1','G2'], [50, 50], axis=0)
np.random.shuffle(c3)
c4=np.repeat([1,2], [50,50],axis=0)
np.random.shuffle(c4)
val = np.random.rand(100)

df = pd.DataFrame({'c1':c1, 'c2':c2, 'c3':c3, 'c4':c4, 'val':val})

table = pd.crosstab([df.c1,df.c2],[df.c3,df.c4])
c3     G1      G2    
c4      1   2   1   2
c1 c2                
a  x    3  11   5   6
   y    9   5   7   4
b  x    5   7  11   2
   y    5   5   5  10

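For each group (G1, G2), is it possible to compute ax - bx and ay - by only for c4==2, and store the result in a DataFrame? For the table above, the expected result is: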

x G1  4
y G1  0
x G2  4
y G2 -6

Edit: how could I do this if df is in the following format?

c1 = np.repeat(['a','b'], [8, 8], axis=0)
c2 = list('xxxxyyyyxxxxyyyy')
c3 = ['G1','G1','G2','G2','G1','G1','G2','G2','G1','G1','G2','G2','G1','G1','G2','G2']
c4 = [1,2]*8
val = np.random.rand(16)
df = pd.DataFrame({'c1':c1,'c2':c2,'c3':c3,'c4':c4,'val':val})

1 Answer:

Answer 0 (score: 1)

You can do it like this:

In [6]: table
Out[6]:
c3     G1      G2
c4      1   2   1  2
c1 c2
a  x    6   5   8  6
   y    9   4   5  7
b  x    5  10   4  6
   y    7   4   6  8

In [7]: g = table.xs(2, level='c4', axis=1)

In [8]: g
Out[8]:
c3     G1  G2
c1 c2
a  x    5   6
   y    4   7
b  x   10   6
   y    4   8

In [9]: g.groupby(level='c2').apply(lambda x: x.iloc[0] - x.iloc[1])
Out[9]:
c3  G1  G2
c2
x   -5   0
y    0  -1
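For anyone who wants to rerun the steps above, here is a minimal standalone sketch. Because the question's data is random, the crosstab is rebuilt by hand from the values shown in Out[6]; the names rows, cols and diff are introduced only for this illustration.

```python
import pandas as pd

# Rebuild the crosstab from Out[6] by hand so the result is reproducible
# (the question's data is shuffled randomly, so its counts change per run).
rows = pd.MultiIndex.from_product([["a", "b"], ["x", "y"]], names=["c1", "c2"])
cols = pd.MultiIndex.from_product([["G1", "G2"], [1, 2]], names=["c3", "c4"])
table = pd.DataFrame(
    [[6, 5, 8, 6],
     [9, 4, 5, 7],
     [5, 10, 4, 6],
     [7, 4, 6, 8]],
    index=rows,
    columns=cols,
)

# Keep only the c4 == 2 columns; xs drops the c4 level from the columns.
g = table.xs(2, level="c4", axis=1)

# Within each c2 group the rows are ordered a, b,
# so iloc[0] - iloc[1] computes a - b for every c3 column.
diff = g.groupby(level="c2").apply(lambda x: x.iloc[0] - x.iloc[1])
print(diff)
```

Note that the positional subtraction depends on row a coming before row b inside each group, which holds here because the crosstab index is sorted.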

Alternatively, pass as_index=False to groupby and use loc in the lambda, which is a bit more meaningful IMHO, since you're indexing by name rather than by integer position:

In [11]: g.groupby(level='c2', as_index=False).apply(lambda x: x.loc['a'] - x.loc['b'])
Out[11]:
c3  G1  G2
c2
x   -5   0
y    0  -1

Using as_index with apply only works on pandas git master. If you're not using master, you'll get the following:

In [12]: r = g.groupby(level='c2').apply(lambda x: x.loc['a'] - x.loc['b'])

In [13]: r
Out[13]:
c3     G1  G2
c2 c2
x  x   -5   0
y  y    0  -1

You can remove the duplicated index level by reassigning the index attribute of r:

In [28]: r.index = r.index.droplevel(0)

In [29]: r
Out[29]:
c3  G1  G2
c2
x   -5   0
y    0  -1

Edit: if you have a "melted" DataFrame, do the following:

In [28]: df
Out[28]:
   c1 c2  c3  c4    val
0   a  x  G1   1  0.244
1   a  x  G1   2  0.572
2   a  x  G2   1  0.837
3   a  x  G2   2  0.893
4   a  y  G1   1  0.951
5   a  y  G1   2  0.400
6   a  y  G2   1  0.391
7   a  y  G2   2  0.237
8   b  x  G1   1  0.904
9   b  x  G1   2  0.811
10  b  x  G2   1  0.536
11  b  x  G2   2  0.736
12  b  y  G1   1  0.546
13  b  y  G1   2  0.159
14  b  y  G2   1  0.735
15  b  y  G2   2  0.772

In [29]: g2 = df[df.c4 == 2]

In [30]: g2
Out[30]:
   c1 c2  c3  c4    val
1   a  x  G1   2  0.572
3   a  x  G2   2  0.893
5   a  y  G1   2  0.400
7   a  y  G2   2  0.237
9   b  x  G1   2  0.811
11  b  x  G2   2  0.736
13  b  y  G1   2  0.159
15  b  y  G2   2  0.772

In [31]: gb = g2.groupby(['c2', 'c3'])

In [32]: sub = gb.apply(lambda x: x.val.iloc[0] - x.val.iloc[1])

In [33]: sub
Out[33]:
c2  c3
x   G1   -0.239
    G2    0.157
y   G1    0.241
    G2   -0.535
dtype: float64

In [34]: sub.unstack()
Out[34]:
c3     G1     G2
c2
x  -0.239  0.157
y   0.241 -0.535

Whenever I'm not sure what the groups look like in a groupby operation, I iterate over the groupby and print out its components:

In [40]: for _, x in g2.groupby(['c2', 'c3']):
   ....:     print(x)
   ....:     print()
   ....:
   c1 c2  c3  c4    val
1   a  x  G1   2  0.572
9   b  x  G1   2  0.811

   c1 c2  c3  c4    val
3   a  x  G2   2  0.893
11  b  x  G2   2  0.736

   c1 c2  c3  c4    val
5   a  y  G1   2  0.400
13  b  y  G1   2  0.159

   c1 c2  c3  c4    val
7   a  y  G2   2  0.237
15  b  y  G2   2  0.772

Each x in this loop is what gets passed to lambda x: ... in the apply call.
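The iloc-based subtraction on the melted frame relies on the a row appearing before the b row inside every group. As a sketch of an alternative that is not part of the original answer, the c1 values can be moved into columns so the subtraction is done by label. The val numbers are copied from Out[28]; wide is a name introduced only for this illustration.

```python
import pandas as pd

# The melted frame from Out[28], with the val column typed in by hand.
df = pd.DataFrame({
    "c1": ["a"] * 8 + ["b"] * 8,
    "c2": list("xxxxyyyyxxxxyyyy"),
    "c3": ["G1", "G1", "G2", "G2"] * 4,
    "c4": [1, 2] * 8,
    "val": [0.244, 0.572, 0.837, 0.893, 0.951, 0.400, 0.391, 0.237,
            0.904, 0.811, 0.536, 0.736, 0.546, 0.159, 0.735, 0.772],
})

# Keep only the c4 == 2 rows, as in the answer.
g2 = df[df.c4 == 2]

# One row per (c2, c3) pair and one column per c1 value ('a' and 'b').
wide = g2.set_index(["c2", "c3", "c1"])["val"].unstack("c1")

# Subtract by label, so the order of rows in df no longer matters.
sub = wide["a"] - wide["b"]
print(sub.unstack())
```

This assumes each (c2, c3, c1) combination occurs exactly once after filtering; with duplicates, unstack raises an error and the values would need to be aggregated first.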