Most efficient way to group a DataFrame and then filter in pandas

Time: 2018-03-01 07:07:29

Tags: pandas dataframe group-by

Afternoon all,

I have a very large dataset that I have grouped. Here is an example:

df_ccy = (df.groupby(['currency_str', 'state'])['state']
            .count()
            .reset_index(name='count')
            .sort_values(['count'], ascending=False))

display(df_ccy)

Output:

currency_str    state           count
USD             Traded Away     148
AUD             Dealer Timeout  52
CAD             Done            44
USD             Covered         38
USD             Dealer Timeout  29
ZAR             Done            22

I only want to show:

CAD             Done            44
ZAR             Done            22

I achieved this by filtering the grouped result down to the rows where state is Done.
Should I use a lambda function in the original groupby statement, or filter afterwards as above? What is best practice?

1 answer:

Answer 0 (score: 1)

import pandas as pd

# Small sample frame standing in for the large dataset
df = pd.DataFrame({'currency_str': ['USD', 'AUD', 'CAD', 'CAD', 'ZAR', 
                                    'ZAR', 'USD', 'USD', 'ZAR'], 
                   'state': ['Traded Away', 'Dealer Timeout', 'Done', 'Done', 'Done',
                             'Done', 'Covered', 'Dealer Timeout', 'Done']})

print (df)
  currency_str           state
0          USD     Traded Away
1          AUD  Dealer Timeout
2          CAD            Done
3          CAD            Done
4          ZAR            Done
5          ZAR            Done
6          USD         Covered
7          USD  Dealer Timeout
8          ZAR            Done

I think you need to filter first:

df1 = df[df['state']=='Done']
#alternative
#df1 = df.query("state == 'Done'")

Then count:

df_ccy = (df1.groupby(['currency_str','state'])['state']
            .count()
            .reset_index(name='count')
            .sort_values(['count'], ascending=False))

print (df_ccy)
  currency_str state  count
1          ZAR  Done      3
0          CAD  Done      2
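
To check that filtering first gives the same counts as grouping everything and filtering afterwards, here is a minimal sketch on the sample data above. The names filter_first and group_first are made up for illustration, and size() is used in place of ['state'].count(), which gives the same numbers here only because state has no missing values.

```python
import pandas as pd

df = pd.DataFrame({'currency_str': ['USD', 'AUD', 'CAD', 'CAD', 'ZAR',
                                    'ZAR', 'USD', 'USD', 'ZAR'],
                   'state': ['Traded Away', 'Dealer Timeout', 'Done', 'Done', 'Done',
                             'Done', 'Covered', 'Dealer Timeout', 'Done']})

# Option 1: filter the rows first, then group the smaller frame.
filter_first = (df[df['state'] == 'Done']
                .groupby(['currency_str', 'state'])
                .size()
                .reset_index(name='count')
                .sort_values('count', ascending=False)
                .reset_index(drop=True))

# Option 2: group everything, then filter the aggregated result.
group_first = (df.groupby(['currency_str', 'state'])
                 .size()
                 .reset_index(name='count'))
group_first = (group_first[group_first['state'] == 'Done']
               .sort_values('count', ascending=False)
               .reset_index(drop=True))

# Both routes should produce the same table.
print(filter_first.equals(group_first))
```

With the filter applied first, the groupby only has to process the matching rows, which is usually the cheaper order on a large frame. Timing both variants on the real data is the only reliable way to confirm that.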

Or, if the state column is not important in the output (it holds the same value in every row after filtering):

df_ccy = (df1['currency_str'].value_counts()
                            .reset_index(name='count')
                            .rename(columns={'index':'currency_str'}))
print (df_ccy)
  currency_str  count
0          ZAR      3
1          CAD      2
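
The column name that reset_index produces from a value_counts result has differed between pandas versions, so the rename step may have nothing to rename on newer releases. A sketch that names the index explicitly with rename_axis instead, assuming the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'currency_str': ['USD', 'AUD', 'CAD', 'CAD', 'ZAR',
                                    'ZAR', 'USD', 'USD', 'ZAR'],
                   'state': ['Traded Away', 'Dealer Timeout', 'Done', 'Done', 'Done',
                             'Done', 'Covered', 'Dealer Timeout', 'Done']})

# Keep only the rows of interest before counting.
df1 = df[df['state'] == 'Done']

# Name the index before turning it into a column, so the result
# does not depend on the default label chosen by value_counts.
df_ccy = (df1['currency_str'].value_counts()
                            .rename_axis('currency_str')
                            .reset_index(name='count'))
print(df_ccy)
```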