Summing rows in a Pandas DataFrame

Asked: 2014-03-27 04:28:57

Tags: python numpy pandas

I have the following rows:

    ColumnID  MenuID  QuestionID  ResponseCount       RowID  SourceColumnID  SourceRowID  SourceVariationID
22        -2      -2   319276487             28  3049400354      3049400356   3049400365         3049400365
23        -2      -2   319276487             31  3049400354      3049400356   3049400365         3049400365
24        -2      -2   319276487             37  3049400354      3049400356   3049400365         3049400365
25        -2      -2   319276487             28  3049400353      3049400357   3049400365         3049400365
26        -2      -2   319276487             45  3049400353      3049400357   3049400365         3049400365
27        -2      -2   319276487             46  3049400353      3049400357   3049400365         3049400365
28        -2      -2   319276487             26  3049400353      3049400358   3049400365         3049400365
29        -2      -2   319276487             33  3049400353      3049400358   3049400365         3049400365
30        -2      -2   319276487             39  3049400353      3049400358   3049400365         3049400365
31        -2      -2   319276487             26  3049400353      3049400359   3049400365         3049400365

I want to condense this DataFrame so that ResponseCount is summed for each combination of RowID and SourceVariationID.

For example:

    ColumnID  MenuID  QuestionID  ResponseCount       RowID  SourceColumnID  SourceRowID  SourceVariationID
22        -2      -2   319276487             96  3049400354      3049400356   3049400365         3049400365
23        -2      -2   319276487            243  3049400353      3049400356   3049400365

Here is what I have come up with so far:

(Pdb) new_df = df.groupby(['RowID', 'SourceVariationID', 'SourceRowID']).sum()
(Pdb) new_df['ColumnID'] = -2
(Pdb) new_df['MenuID'] = -2
(Pdb) pp new_df
                                          ColumnID  MenuID  QuestionID  ResponseCount  SourceColumnID
RowID      SourceVariationID SourceRowID                                                             
3031434948 3031434943        3031434943         -2      -2  3805083612            141     36377219262
           3031434945        3031434945         -2      -2  4439264214            237     42440089136

[2 rows x 5 columns]

2 Answers:

Answer 0 (score: 2)

You can do the following:

print(df)
   ColumnID  MenuID  QuestionID  ResponseCount       RowID  SourceVariationID
0        -2      -2   319276487             28  3049400354         3049400365
1        -2      -2   319276487             31  3049400354         3049400365
2        -2      -2   319276487             37  3049400354         3049400365
3        -2      -2   319276487             28  3049400353         3049400365
4        -2      -2   319276487             45  3049400353         3049400365
5        -2      -2   319276487             46  3049400353         3049400365
6        -2      -2   319276487             26  3049400353         3049400365
7        -2      -2   319276487             33  3049400353         3049400365
8        -2      -2   319276487             39  3049400353         3049400365
9        -2      -2   319276487             26  3049400353         3049400365


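def squash(group):
    # Use the first row of the group as a template; iloc[0] also works for
    # single-row groups, where iloc[1] would raise an IndexError.
    # errors='ignore' covers the case where the key columns are not in the group.
    x = group.iloc[0].drop(['RowID', 'SourceVariationID'], errors='ignore')
    # Replace the template's count with the group total
    x['ResponseCount'] = group['ResponseCount'].sum()
    return x

print(df.groupby(['RowID', 'SourceVariationID']).apply(squash))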

                             ColumnID  MenuID  QuestionID  ResponseCount
RowID      SourceVariationID                                             
3049400353 3049400365               -2      -2   319276487            243
3049400354 3049400365               -2      -2   319276487             96
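
When the non-key columns are constant within each group, as they are in the sample data, the same totals can probably be obtained without `apply` by passing `agg` one function per column. The following is a minimal sketch and not part of the original answer: the frame is rebuilt from the rows printed above, and `keys`, `agg_spec` and `result` are illustrative names.

```python
import pandas as pd

# Sample data rebuilt from the frame printed in this answer.
df = pd.DataFrame({
    "ColumnID": [-2] * 10,
    "MenuID": [-2] * 10,
    "QuestionID": [319276487] * 10,
    "ResponseCount": [28, 31, 37, 28, 45, 46, 26, 33, 39, 26],
    "RowID": [3049400354] * 3 + [3049400353] * 7,
    "SourceVariationID": [3049400365] * 10,
})

keys = ["RowID", "SourceVariationID"]

# Take the first value of every non-key column (assumption: those columns
# are constant within a group), then override ResponseCount with a sum.
agg_spec = {col: "first" for col in df.columns if col not in keys}
agg_spec["ResponseCount"] = "sum"

result = df.groupby(keys).agg(agg_spec)
print(result)
```

Calling `result.reset_index()` afterwards turns RowID and SourceVariationID back into ordinary columns if a flat frame is preferred.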

Answer 1 (score: 0)

Assuming your other columns are integers:

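# Every column except the one being summed and the grouping key
columns = df.columns.tolist()
columns.remove('ResponseCount')
columns.remove('RowID')
# Sum ResponseCount per RowID, then attach the minimum of each remaining column
tempDf = df.groupby(['RowID'])[['ResponseCount']].sum()
tempDf = tempDf.join(df.groupby(['RowID'])[columns].min())
# Copy the group key from the index back into a regular column
tempDf['RowID'] = tempDf.index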

It's a quick fix rather than a great solution! Hope this helps.
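
The question asks for totals per RowID and SourceVariationID, while the code in this answer groups by RowID alone. A minimal sketch of the same sum-then-join idea with both keys might look like the following. It assumes a small made-up frame with the question's column names, and it uses `reset_index()` in place of copying the index by hand; `keys`, `others` and `summed` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature frame using the question's column names.
df = pd.DataFrame({
    "ColumnID": [-2, -2, -2, -2],
    "QuestionID": [319276487] * 4,
    "ResponseCount": [28, 31, 45, 46],
    "RowID": [3049400354, 3049400354, 3049400353, 3049400353],
    "SourceVariationID": [3049400365] * 4,
})

keys = ["RowID", "SourceVariationID"]
# Columns that are neither grouping keys nor the column being summed
others = [c for c in df.columns if c not in keys + ["ResponseCount"]]

grouped = df.groupby(keys)
summed = grouped[["ResponseCount"]].sum()    # total per group
summed = summed.join(grouped[others].min())  # one representative value per other column
summed = summed.reset_index()                # move the group keys back into columns
print(summed)
```

As in the answer above, taking `min()` of the remaining columns is only meaningful when they hold the same value throughout a group.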