I am trying to work out how to slice a DataFrame.
import pandas as pd

data = {'Date': ['08/20/10', '08/20/10', '08/20/10', '08/21/10', '08/22/10', '08/24/10', '08/25/10', '08/26/10'],
        'Receipt': [10001, 10001, 10002, 10002, 10003, 10004, 10004, 10004],
        'Product': ['xx1', 'xx2', 'yy1', 'fff4', 'gggg4', 'fsf4', 'gggh5', 'hhhg6']}
dfTest = pd.DataFrame(data)
dfTest
This produces:
Date Product Receipt
0 08/20/10 xx1 10001
1 08/20/10 xx2 10001
2 08/20/10 yy1 10002
3 08/21/10 fff4 10002
4 08/22/10 gggg4 10003
5 08/24/10 fsf4 10004
6 08/25/10 gggh5 10004
7 08/26/10 hhhg6 10004
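I want to create a new DataFrame that contains only unique receipts, meaning a receipt should be used on only one day (although it may appear several times within that day). If a receipt appears on more than one day, it needs to be removed. The dataset above should end up looking like this: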
Date Product Receipt
0 08/20/10 xx1 10001
1 08/20/10 xx2 10001
2 08/22/10 gggg4 10003
This is what I have done so far:
dfTest.groupby(['Receipt','Date']).count()
Product
Receipt Date
10001 08/20/10 2
10002 08/20/10 1
08/21/10 1
10003 08/22/10 1
10004 08/24/10 1
08/25/10 1
08/26/10 1
I did not know how to query the date in that structure, so I reset the index.
df1 = dfTest.groupby(['Receipt','Date']).count().reset_index()
Receipt Date Product
0 10001 08/20/10 2
1 10002 08/20/10 1
2 10002 08/21/10 1
3 10003 08/22/10 1
4 10004 08/24/10 1
5 10004 08/25/10 1
6 10004 08/26/10 1
Now I do not know what to do next. I am hoping someone here can lend a hand. This is probably easy and I am just a bit confused or inexperienced.
Answer 0 (score: 1)
You can use SeriesGroupBy.nunique together with boolean indexing, where the condition is built with Series.isin:
df1 = dfTest.groupby(['Receipt'])['Date'].nunique()
print (df1)
Receipt
10001 1
10002 2
10003 1
10004 3
Name: Date, dtype: int64
# get the receipts that appear on exactly one distinct date
print (df1[df1 == 1].index)
Int64Index([10001, 10003], dtype='int64', name='Receipt')
# keep all rows whose Receipt value is one of those receipts
print (dfTest[dfTest['Receipt'].isin(df1[df1 == 1].index)])
Date Product Receipt
0 08/20/10 xx1 10001
1 08/20/10 xx2 10001
4 08/22/10 gggg4 10003
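The two steps above can also be folded into a single boolean mask. The following is a minimal sketch, not part of the original answer, that assumes the same sample data; `dates_per_receipt` and `single_day` are illustrative names. It relies on `transform('nunique')`, which broadcasts the per-receipt count of distinct dates back onto every row.

```python
import pandas as pd

# Same sample data as in the question.
dfTest = pd.DataFrame({
    'Date': ['08/20/10', '08/20/10', '08/20/10', '08/21/10',
             '08/22/10', '08/24/10', '08/25/10', '08/26/10'],
    'Receipt': [10001, 10001, 10002, 10002, 10003, 10004, 10004, 10004],
    'Product': ['xx1', 'xx2', 'yy1', 'fff4', 'gggg4', 'fsf4', 'gggh5', 'hhhg6'],
})

# For every row, the number of distinct dates on which its receipt appears.
dates_per_receipt = dfTest.groupby('Receipt')['Date'].transform('nunique')

# Keep only the rows whose receipt was used on a single day.
single_day = dfTest[dates_per_receipt == 1]
print(single_day)
```

Because the mask is aligned with the original index, no `isin` lookup is needed; whether this reads more clearly than the two-step version is a matter of taste.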
Another solution is to find the index of the matching rows with filter and then select from the DataFrame with loc:
print (dfTest.groupby(['Receipt']).filter(lambda x: x.Date.nunique()==1).index)
Int64Index([0, 1, 4], dtype='int64')
df1 = dfTest.loc[dfTest.groupby(['Receipt']).filter(lambda x: x.Date.nunique()==1).index]
print (df1)
Date Product Receipt
0 08/20/10 xx1 10001
1 08/20/10 xx2 10001
4 08/22/10 gggg4 10003
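Since `filter` already returns the rows of the groups that pass the condition, the extra `loc` lookup is probably not needed. Here is a minimal sketch under that assumption, using the same sample data; `result` and `renumbered` are illustrative names. The `reset_index(drop=True)` call is optional and only renumbers the rows as 0, 1, 2, as in the desired output shown in the question.

```python
import pandas as pd

# Same sample data as in the question.
dfTest = pd.DataFrame({
    'Date': ['08/20/10', '08/20/10', '08/20/10', '08/21/10',
             '08/22/10', '08/24/10', '08/25/10', '08/26/10'],
    'Receipt': [10001, 10001, 10002, 10002, 10003, 10004, 10004, 10004],
    'Product': ['xx1', 'xx2', 'yy1', 'fff4', 'gggg4', 'fsf4', 'gggh5', 'hhhg6'],
})

# filter() keeps every row of each group for which the function returns True.
result = dfTest.groupby('Receipt').filter(lambda g: g['Date'].nunique() == 1)

# Optional: discard the original row labels and renumber from 0.
renumbered = result.reset_index(drop=True)
print(renumbered)
```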