Pandas sumif on duplicate column names

Time: 2017-04-20 21:32:10

Tags: python pandas dataframe

What is the best way to sum the columns of df2 by the column labels of df3 below?

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.random.rand(15).reshape((5,3)),index = ['A','B','C','D','E'])
df2 = pd.concat([df,df1],axis=1)
df3 =  pd.DataFrame(np.random.rand(25).reshape((5,5)),columns = np.arange(5),index = ['A','B','C','D','E'])

The answer should have the shape of df3.

Edit for clarity:

df = pd.DataFrame(np.ones(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.ones(15).reshape((5,3))*2,index = ['A','B','C','D','E'],columns = [1,3,4])
df2 = pd.concat([df,df1],axis=1)
df3 =  pd.DataFrame(np.empty((5,5)),columns = np.arange(5),index = ['A','B','C','D','E'])
print(df2)
     0    1    2    3    4    1    3    4
A  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0
B  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0
C  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0
D  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0
E  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0

The expected result is:

       0       1       2       3       4
A    1.0     3.0     1.0     3.0     3.0 
B    1.0     3.0     1.0     3.0     3.0 
C    1.0     3.0     1.0     3.0     3.0 
D    1.0     3.0     1.0     3.0     3.0 
E    1.0     3.0     1.0     3.0     3.0 

4 answers:

Answer 0 (score: 6)

You can group the DataFrame by its column labels:

In [57]: df2.groupby(axis=1, by=df2.columns).sum()
Out[57]:
     0    1    2    3    4
A  1.0  3.0  1.0  3.0  3.0
B  1.0  3.0  1.0  3.0  3.0
C  1.0  3.0  1.0  3.0  3.0
D  1.0  3.0  1.0  3.0  3.0
E  1.0  3.0  1.0  3.0  3.0

You can also specify the axis by name:

In [58]: df2.groupby(axis='columns', by=df2.columns).sum()
Out[58]:
     0    1    2    3    4
A  1.0  3.0  1.0  3.0  3.0
B  1.0  3.0  1.0  3.0  3.0
C  1.0  3.0  1.0  3.0  3.0
D  1.0  3.0  1.0  3.0  3.0
E  1.0  3.0  1.0  3.0  3.0

A shorter version from @piRSquared:

df2.groupby(df2.columns, 1).sum()
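To see what grouping by column labels computes, here is a minimal sketch that does the same sum by hand. It rebuilds df2 from the question's edited example; the name `manual` is only illustrative. Because it never passes an axis argument to groupby, it should not depend on how a given pandas version treats `axis=1`.

```python
import numpy as np
import pandas as pd

# Rebuild the question's edited example: ones, plus twos under labels 1, 3, 4.
idx = list("ABCDE")
df = pd.DataFrame(np.ones((5, 5)), index=idx)
df1 = pd.DataFrame(np.ones((5, 3)) * 2, index=idx, columns=[1, 3, 4])
df2 = pd.concat([df, df1], axis=1)

# For each distinct label, select every column carrying that label
# with a boolean mask and sum those columns row by row.
manual = pd.DataFrame(
    {
        label: df2.loc[:, df2.columns == label].sum(axis=1)
        for label in df2.columns.unique()
    }
)
print(manual)
```

Each entry of the dict is one group, so the result has one column per distinct label, in order of first appearance.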

Answer 1 (score: 2)

Let's transpose with T, then apply groupby and sum:

 df2.T.groupby(level=0).sum().T

Original df2:

          0         1         2         3         4         0         1  \
A  0.627278  0.008150  0.285077  0.931831  0.683035  0.691318  0.873139   
B  0.246861  0.108021  0.903743  0.030373  0.870753  0.143835  0.251623   
C  0.367309  0.551530  0.193623  0.704314  0.136061  0.102401  0.287334   
D  0.580771  0.592600  0.949666  0.806875  0.288331  0.794173  0.034380   
E  0.088984  0.838401  0.988919  0.636134  0.353484  0.584571  0.090235   

          2  
A  0.763687  
B  0.735570  
C  0.405304  
D  0.446789  
E  0.542930 

new_df2 = df2.T.groupby(level=0).sum().T
print(new_df2)

Output, the new df2:

          0         1         2         3         4
A  1.318595  0.881289  1.048764  0.931831  0.683035
B  0.390697  0.359644  1.639314  0.030373  0.870753
C  0.469710  0.838864  0.598927  0.704314  0.136061
D  1.374944  0.626980  1.396455  0.806875  0.288331
E  0.673555  0.928636  1.531849  0.636134  0.353484
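The transposed approach can be checked step by step on the question's deterministic example. This is a sketch under the assumption that df2 is built as in the question's edit; the intermediate names `transposed`, `summed` and `result` are illustrative. As far as I know, pandas 2.1 and later warn that `axis=1` in `groupby` is deprecated and suggest transposing instead, so this form is the one more likely to keep working on newer versions.

```python
import numpy as np
import pandas as pd

# Rebuild the question's edited example with duplicate labels 1, 3 and 4.
idx = list("ABCDE")
df2 = pd.concat(
    [
        pd.DataFrame(np.ones((5, 5)), index=idx),
        pd.DataFrame(np.ones((5, 3)) * 2, index=idx, columns=[1, 3, 4]),
    ],
    axis=1,
)

transposed = df2.T                          # column labels become the row index
summed = transposed.groupby(level=0).sum()  # one row per distinct label
result = summed.T                           # back to the original orientation
print(result)
```

Note that groupby sorts the group keys by default, so the output columns come back in sorted label order.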

Answer 2 (score: 1)

Solution 1
numpy.dot + pandas.get_dummies

cols = df2.columns.values
pd.DataFrame(
    df2.values.dot(pd.get_dummies(cols).values),
    df2.index, pd.unique(df2.columns.values)
)

   0  1  2  3  4
A  1  3  1  3  3
B  1  3  1  3  3
C  1  3  1  3  3
D  1  3  1  3  3
E  1  3  1  3  3

Solution 2
numpy.einsum + pandas.get_dummies

cols = df2.columns.values
pd.DataFrame(
    np.einsum('ij,jk->ik', df2.values, pd.get_dummies(cols).values),
    df2.index, pd.unique(df2.columns.values)
)

   0  1  2  3  4
A  1  3  1  3  3
B  1  3  1  3  3
C  1  3  1  3  3
D  1  3  1  3  3
E  1  3  1  3  3
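Both solutions rely on the same idea: `pd.get_dummies(cols)` yields an indicator matrix with one row per original column and one column per distinct label, so multiplying df2's values by it adds up the columns that share a label. The sketch below makes that matrix visible. The names `dummies`, `indicator` and `result` are illustrative, and the cast to int is an assumption made so the example behaves the same whether `get_dummies` returns integers or booleans.

```python
import numpy as np
import pandas as pd

cols = np.array([0, 1, 2, 3, 4, 1, 3, 4])
df2 = pd.DataFrame(
    np.tile([1, 1, 1, 1, 1, 2, 2, 2], (5, 1)),  # five identical rows
    index=list("ABCDE"),
    columns=cols,
)

# 8 x 5 indicator matrix: entry (j, k) is 1 when original column j has label k.
dummies = pd.get_dummies(cols)
indicator = dummies.to_numpy().astype(int)
print(indicator)

# (5 x 8) @ (8 x 5): each output column is the sum of the columns with that label.
result = pd.DataFrame(
    df2.to_numpy() @ indicator,
    index=df2.index,
    columns=dummies.columns,
)
print(result)
```

The sketch labels the output with `dummies.columns` rather than `pd.unique(cols)`. `get_dummies` orders its columns by sorted label while `pd.unique` keeps first-appearance order; the two agree for this data, but they would differ for labels such as [2, 0, 2].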

Naive timing

[Timing figure not preserved]

Setup

df2 = pd.DataFrame(
    [[1, 1, 1, 1, 1, 2, 2, 2]] * 5,  # one identical row per index label
    list('ABCDE'),
    [0, 1, 2, 3, 4, 1, 3, 4]
)

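Answer 3 (score: 0)

Is this what you mean?

new_df = pd.DataFrame(index=df2.index)
for c in df3.columns:
    try:
        # Duplicated label: df2[c] is a DataFrame, so sum each row.
        new_df[c] = [sum(x) for x in df2[c].values]
    except TypeError:
        # Unique label: df2[c] is a Series of scalars, so copy it as-is.
        new_df[c] = df2[c].values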