Filter Pandas DataFrame by GroupBy Contents

Time: 2016-07-11 19:45:15

Tags: python pandas dataframe scipy

Consider the following DataFrame:

    import pandas as pd

    records = [{'item': 'Widget A', 'quantity': 50, 'revenue': 25.0, 'trandate': '2016-3-24'},
        {'item': 'Widget B', 'quantity': 6, 'revenue': 72.0, 'trandate': '2016-3-28'},
        {'item': 'Widget C', 'quantity': 5, 'revenue': 75.0, 'trandate': '2016-3-28'},
        {'item': 'Widget A', 'quantity': 168, 'revenue': 84.0, 'trandate': '2016-3-29'},
        {'item': 'Widget B', 'quantity': 6, 'revenue': 84.0, 'trandate': '2016-3-29'}]
    indices = [487, 488, 493, 495, 497]
    df = pd.DataFrame(records, index=indices)

yielding

    id  item       quantity  revenue   trandate
    487  Widget A        50     25.0  2016-3-24
    488  Widget B         6     72.0  2016-3-28
    493  Widget C         5     75.0  2016-3-28
    495  Widget A       168     84.0  2016-3-29
    497  Widget B         6     84.0  2016-3-29

I need to split this DataFrame into two complementary sets:

  1. A DataFrame that contains the first transactions for each item:

    id  item       quantity  revenue   trandate
    487  Widget A        50     25.0  2016-3-24
    488  Widget B         6     72.0  2016-3-28
    493  Widget C         5     75.0  2016-3-28
    
  2. A DataFrame that excludes the first transactions for each item:

    id  item       quantity  revenue   trandate
    495  Widget A       168     84.0  2016-3-29
    497  Widget B         6     84.0  2016-3-29
    

I would like to filter df using a GroupBy object, but I can't get df's index labels to show up in the result after I group:

    >>> gb = df.groupby('item')
    >>> gb.groups
    # {'Widget A': [487, 495], 'Widget B': [488, 497], 'Widget C': [493]}
    >>> gb['trandate'].min()
    item
    Widget A    2016-3-24
    Widget B    2016-3-28
    Widget C    2016-3-28

Can I use GroupBy to yield a DataFrame like this, with the original index preserved?

    id   item        trandate
    487  Widget A    2016-3-24
    488  Widget B    2016-3-28
    493  Widget C    2016-3-28

1 Answer:

Answer 0: (score: 3)

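I think you need to filter with a boolean mask built from cumcount, which numbers the rows within each group starting at 0 and keeps df's original index: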

    print(df.groupby('item').cumcount())
    487    0
    488    0
    493    0
    495    1
    497    1
    dtype: int64

    print(df[df.groupby('item').cumcount() == 0])
             item  quantity  revenue   trandate
    487  Widget A        50     25.0  2016-3-24
    488  Widget B         6     72.0  2016-3-28
    493  Widget C         5     75.0  2016-3-28

    print(df[df.groupby('item').cumcount() > 0])
             item  quantity  revenue   trandate
    495  Widget A       168     84.0  2016-3-29
    497  Widget B         6     84.0  2016-3-29
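
One caveat: cumcount follows row order, so the mask picks the earliest transaction per item only when df is already sorted by trandate, as it is in this example. The sketch below is one possible way to make that explicit. It assumes the date strings should be parsed with pd.to_datetime (as plain strings they compare lexicographically) and that rows sharing a date should keep their original order. The names dates, ordered, is_first, first_transactions, later_transactions and first_via_idxmin are illustrative and do not come from the original post.

```python
import pandas as pd

records = [
    {'item': 'Widget A', 'quantity': 50, 'revenue': 25.0, 'trandate': '2016-3-24'},
    {'item': 'Widget B', 'quantity': 6, 'revenue': 72.0, 'trandate': '2016-3-28'},
    {'item': 'Widget C', 'quantity': 5, 'revenue': 75.0, 'trandate': '2016-3-28'},
    {'item': 'Widget A', 'quantity': 168, 'revenue': 84.0, 'trandate': '2016-3-29'},
    {'item': 'Widget B', 'quantity': 6, 'revenue': 84.0, 'trandate': '2016-3-29'},
]
df = pd.DataFrame(records, index=[487, 488, 493, 495, 497])

# Parse the dates so ordering is chronological rather than lexicographic.
dates = pd.to_datetime(df['trandate'])

# Reorder df by date with a stable sort (ties keep their original row order),
# then flag the first row of each item group.
ordered = df.loc[dates.sort_values(kind='mergesort').index]
is_first = ordered.groupby('item').cumcount() == 0

# Build the mask once and use its complement for the second set.
first_transactions = ordered[is_first]
later_transactions = ordered[~is_first]

# Alternative for the "first" set only: idxmin returns the original index
# label of the earliest date within each item group.
first_idx = dates.groupby(df['item']).idxmin()
first_via_idxmin = df.loc[first_idx.tolist()]

print(first_transactions)
print(later_transactions)
print(first_via_idxmin)
```

Reusing one mask and its negation keeps the two results complementary by construction. The idxmin variant answers the question more directly, since it maps each group's minimum date back to a label of df's index, but it yields only the first-transaction set.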