Remove values from a DataFrame based on a groupby condition

Time: 2016-09-18 14:43:50

Tags: python pandas

I'm trying to work out how to slice a DataFrame.

import pandas as pd

data = {'Date': ['08/20/10', '08/20/10', '08/20/10', '08/21/10', '08/22/10', '08/24/10', '08/25/10', '08/26/10'],
        'Receipt': [10001, 10001, 10002, 10002, 10003, 10004, 10004, 10004],
        'Product': ['xx1', 'xx2', 'yy1', 'fff4', 'gggg4', 'fsf4', 'gggh5', 'hhhg6']}

dfTest = pd.DataFrame(data)
dfTest

This produces:

       Date Product  Receipt
0  08/20/10     xx1    10001
1  08/20/10     xx2    10001
2  08/20/10     yy1    10002
3  08/21/10    fff4    10002
4  08/22/10   gggg4    10003
5  08/24/10    fsf4    10004
6  08/25/10   gggh5    10004
7  08/26/10   hhhg6    10004

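I want to create a new DataFrame that contains only unique receipts, meaning a receipt should be used on only 1 day (though it may appear several times within that day). If a receipt appears on more than one day, it needs to be removed. The dataset above should end up looking like this: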

       Date Product  Receipt
0  08/20/10     xx1    10001
1  08/20/10     xx2    10001
2  08/22/10   gggg4    10003

What I have done so far:

dfTest.groupby(['Receipt','Date']).count()

              Product
Receipt Date    
10001   08/20/10    2
10002   08/20/10    1
        08/21/10    1
10003   08/22/10    1
10004   08/24/10    1
        08/25/10    1
        08/26/10    1

I didn't know how to query the date in that structure, so I reset the index.

df1 = dfTest.groupby(['Receipt','Date']).count().reset_index()


   Receipt      Date  Product
0    10001  08/20/10        2
1    10002  08/20/10        1
2    10002  08/21/10        1
3    10003  08/22/10        1
4    10004  08/24/10        1
5    10004  08/25/10        1
6    10004  08/26/10        1

Now I'm not sure how to proceed. I hope someone here can lend a hand. This is probably easy; I'm just a bit confused or inexperienced.

1 answer:

Answer 0 (score: 1)

You can use SeriesGroupBy.nunique together with boolean indexing, building the condition with Series.isin:

df1 = dfTest.groupby(['Receipt'])['Date'].nunique()
print (df1)
Receipt
10001    1
10002    2
10003    1
10004    3
Name: Date, dtype: int64

# get the receipts (index labels) whose number of unique dates is 1
print (df1[df1 == 1].index)
Int64Index([10001, 10003], dtype='int64', name='Receipt')

# select all rows whose Receipt is one of those single-date receipts
print (dfTest[dfTest['Receipt'].isin(df1[df1 == 1].index)])
       Date Product  Receipt
0  08/20/10     xx1    10001
1  08/20/10     xx2    10001
4  08/22/10   gggg4    10003
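
The same condition can probably also be written without the intermediate Series lookup, by broadcasting the per-receipt count back onto every row with transform. This is only a sketch of that variant, not part of the original answer; the names dates_per_receipt and single_day are made up for illustration.

```python
import pandas as pd

data = {'Date': ['08/20/10', '08/20/10', '08/20/10', '08/21/10',
                 '08/22/10', '08/24/10', '08/25/10', '08/26/10'],
        'Receipt': [10001, 10001, 10002, 10002, 10003, 10004, 10004, 10004],
        'Product': ['xx1', 'xx2', 'yy1', 'fff4', 'gggg4', 'fsf4', 'gggh5', 'hhhg6']}
dfTest = pd.DataFrame(data)

# For every row, the number of distinct dates its receipt appears on
# (dates_per_receipt is an illustrative name, not from the answer)
dates_per_receipt = dfTest.groupby('Receipt')['Date'].transform('nunique')

# Keep only rows whose receipt appears on exactly one date
single_day = dfTest[dates_per_receipt == 1]
print(single_day)
```

Because transform returns a Series aligned with dfTest, the comparison can be used directly as a boolean mask, so no isin step is needed.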

Another solution: find the index by condition, then select from the DataFrame with loc:

print (dfTest.groupby(['Receipt']).filter(lambda x: x.Date.nunique()==1).index)
Int64Index([0, 1, 4], dtype='int64')


df1 = dfTest.loc[dfTest.groupby(['Receipt']).filter(lambda x: x.Date.nunique()==1).index]
print (df1)
       Date Product  Receipt
0  08/20/10     xx1    10001
1  08/20/10     xx2    10001
4  08/22/10   gggg4    10003
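
As far as I can tell, filter already returns the matching rows of the original DataFrame, so the extra loc lookup may not be needed. The sketch below assumes that, and also resets the index so the result is numbered 0, 1, 2 as in the output the question asks for. The names single_day and result are illustrative only.

```python
import pandas as pd

data = {'Date': ['08/20/10', '08/20/10', '08/20/10', '08/21/10',
                 '08/22/10', '08/24/10', '08/25/10', '08/26/10'],
        'Receipt': [10001, 10001, 10002, 10002, 10003, 10004, 10004, 10004],
        'Product': ['xx1', 'xx2', 'yy1', 'fff4', 'gggg4', 'fsf4', 'gggh5', 'hhhg6']}
dfTest = pd.DataFrame(data)

# filter keeps every row of the groups for which the function returns True
single_day = dfTest.groupby('Receipt').filter(lambda g: g['Date'].nunique() == 1)

# Renumber the rows 0..n-1, discarding the old index labels (0, 1, 4)
result = single_day.reset_index(drop=True)
print(result)
```

Note that filter calls the Python function once per group, so on a large number of receipts the nunique-based mask is likely to be faster.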