Afternoon all,
I have a very large dataset, which I have grouped. Here is a sample:
Grouping code:
df_ccy = (df.groupby(['currency_str', 'state'])['state']
            .count()
            .reset_index(name='count')
            .sort_values(['count'], ascending=False))
display(df_ccy)
Output:
currency_str  state           count
USD           Traded Away       148
AUD           Dealer Timeout     52
CAD           Done               44
USD           Covered            38
USD           Dealer Timeout     29
ZAR           Done               22
I only want to show:
CAD           Done               44
ZAR           Done               22
I have achieved this with a filter. Should I instead use a lambda function in the original groupby statement, or keep the filter? What is best practice?
Answer 0 (score: 1)
import pandas as pd

# Small sample frame standing in for the real dataset
df = pd.DataFrame({'currency_str': ['USD', 'AUD', 'CAD', 'CAD', 'ZAR',
                                    'ZAR', 'USD', 'USD', 'ZAR'],
                   'state': ['Traded Away', 'Dealer Timeout', 'Done', 'Done', 'Done',
                             'Done', 'Covered', 'Dealer Timeout', 'Done']})
print(df)
currency_str state
0 USD Traded Away
1 AUD Dealer Timeout
2 CAD Done
3 CAD Done
4 ZAR Done
5 ZAR Done
6 USD Covered
7 USD Dealer Timeout
8 ZAR Done
I think you need to filter first:
df1 = df[df['state']=='Done']
#alternative
#df1 = df.query("state == 'Done'")
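As a quick check, the boolean mask and the query string should select exactly the same rows. A minimal sketch, rebuilding the sample frame so it runs on its own; the names mask_rows and query_rows are only illustrative:

```python
import pandas as pd

# Same sample data as in the answer
df = pd.DataFrame({'currency_str': ['USD', 'AUD', 'CAD', 'CAD', 'ZAR',
                                    'ZAR', 'USD', 'USD', 'ZAR'],
                   'state': ['Traded Away', 'Dealer Timeout', 'Done', 'Done', 'Done',
                             'Done', 'Covered', 'Dealer Timeout', 'Done']})

# Boolean-mask filter
mask_rows = df[df['state'] == 'Done']

# Equivalent filter written as a query string
query_rows = df.query("state == 'Done'")

# True when both results hold the same rows, index and values
print(mask_rows.equals(query_rows))
```

Which form to use is mostly a matter of taste; the mask is the more common idiom, while query can read better when several conditions are combined.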
Then count:
df_ccy = (df1.groupby(['currency_str','state'])['state']
.count()
.reset_index(name='count')
.sort_values(['count'], ascending=False))
print (df_ccy)
currency_str state count
1 ZAR Done 3
0 CAD Done 2
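On the question of filtering before or after grouping: for this sample both orders should give the same counts, so filtering first mainly avoids grouping rows that are thrown away afterwards. A hedged sketch comparing the two orders; count_groups, before_filter and after_filter are illustrative names, and size() is swapped in for the count() call above:

```python
import pandas as pd

df = pd.DataFrame({'currency_str': ['USD', 'AUD', 'CAD', 'CAD', 'ZAR',
                                    'ZAR', 'USD', 'USD', 'ZAR'],
                   'state': ['Traded Away', 'Dealer Timeout', 'Done', 'Done', 'Done',
                             'Done', 'Covered', 'Dealer Timeout', 'Done']})


def count_groups(frame):
    """Count rows per (currency_str, state) pair, largest count first."""
    return (frame.groupby(['currency_str', 'state'])
                 .size()
                 .reset_index(name='count')
                 .sort_values('count', ascending=False)
                 .reset_index(drop=True))


# Option 1: filter the raw rows, then group
before_filter = count_groups(df[df['state'] == 'Done'])

# Option 2: group everything, then filter the grouped result
all_counts = count_groups(df)
after_filter = all_counts[all_counts['state'] == 'Done'].reset_index(drop=True)

print(before_filter)
print(before_filter.equals(after_filter))
```

A lambda inside the groupby is not needed for either option; a plain boolean filter is usually easier to read.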
Or, if the state column is not important because it holds the same filtered value in every row:
df_ccy = (df1['currency_str'].value_counts()
.reset_index(name='count')
.rename(columns={'index':'currency_str'}))
print (df_ccy)
currency_str count
0 ZAR 3
1 CAD 2
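One caveat, as far as I know: the name that reset_index gives the first column here has varied between pandas versions (older releases call it 'index', while pandas 2.x names it after the original Series). A sketch that avoids depending on this by naming the index explicitly with rename_axis, again rebuilding the sample frame so it runs standalone:

```python
import pandas as pd

df = pd.DataFrame({'currency_str': ['USD', 'AUD', 'CAD', 'CAD', 'ZAR',
                                    'ZAR', 'USD', 'USD', 'ZAR'],
                   'state': ['Traded Away', 'Dealer Timeout', 'Done', 'Done', 'Done',
                             'Done', 'Covered', 'Dealer Timeout', 'Done']})

# Keep only the rows of interest
df1 = df[df['state'] == 'Done']

# Name the index before resetting it, so the column label is explicit
df_ccy = (df1['currency_str'].value_counts()
             .rename_axis('currency_str')
             .reset_index(name='count'))
print(df_ccy)
```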