Remove rows corresponding to the last subgroup in each group

Time: 2018-08-23 11:40:52

Tags: python pandas dataframe grouping pandas-groupby

Suppose I have the following DataFrame.

It is constructed as shown below:

import numpy as np
import pandas as pd
df = pd.DataFrame(['eggs', np.nan, 'ham', 'eggs', 'spam', 'spam',
                   'eggs', 'spam', np.nan], columns=['ingredients'])
df['customer'] = (['Badger']*3 + ['Shopkeeper']*3 + ['Pepperpots']*2
    + [np.nan])
df['ordered'] = [1, 1, 0, 0, 1, 0, 1, 0, np.nan]
df.sort_values(['customer', 'ingredients'], inplace=True)

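For each customer, I want to remove the rows corresponding to the last ingredient (in alphabetical order).

For example, the rows with index 4 and 5 should be removed, because they correspond to Shopkeeper's last ingredient.

Similarly, row 7 should be removed, because it corresponds to Pepperpots' last ingredient.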

  ingredients    customer  ordered
0        eggs      Badger      1.0
2         ham      Badger      0.0
1         NaN      Badger      1.0
6        eggs  Pepperpots      1.0
7        spam  Pepperpots      0.0
3        eggs  Shopkeeper      0.0
4        spam  Shopkeeper      1.0
5        spam  Shopkeeper      0.0
8         NaN         NaN      NaN

NaN values should be ignored.

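2 answers:

Answer 0: (score: 2)

You can create a series holding the "last" ingredient of each group and then filter those rows out. Note that NaN ingredients are not removed by this approach.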

s = df.sort_values('ingredients')\
      .groupby('customer')['ingredients']\
      .transform('last').sort_index()

df = df[df['ingredients'] != s]

print(df)

  ingredients    customer  ordered
0        eggs      Badger      1.0
1         NaN      Badger      1.0
3        eggs  Shopkeeper      0.0
6        eggs  Pepperpots      1.0
8         NaN         NaN      NaN

With this solution you can omit df.sort_values(['customer', 'ingredients'], inplace=True), because the GroupBy + transform used above aligns by index.
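To make the alignment point concrete, here is a minimal self-contained sketch of the same idea. It assumes the frame from the question is left unsorted (the in-place sort_values call is skipped) and a reasonably recent pandas version. The names `last` and `result` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Frame from the question, deliberately left in its original (unsorted) order.
df = pd.DataFrame({
    'ingredients': ['eggs', np.nan, 'ham', 'eggs', 'spam', 'spam',
                    'eggs', 'spam', np.nan],
    'customer': ['Badger'] * 3 + ['Shopkeeper'] * 3 + ['Pepperpots'] * 2
                + [np.nan],
    'ordered': [1, 1, 0, 0, 1, 0, 1, 0, np.nan],
})

# Sort a temporary copy alphabetically, take the last non-null ingredient of
# each customer, then restore the original row order with sort_index().
last = (df.sort_values('ingredients')
          .groupby('customer')['ingredients']
          .transform('last')
          .sort_index())

# A NaN ingredient never compares equal to anything, so those rows are kept.
result = df[df['ingredients'] != last]
print(result)
```

Because `last` ends up with the same index order as `df`, the element-wise comparison lines up row by row, and the kept rows stay in their original order.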

Answer 1: (score: 1)

Use GroupBy.transform with 'last', which skips NaN values by default, and then filter with boolean indexing:

s = df['ingredients'].groupby(df['customer']).transform('last')
df = df[df['ingredients'] != s]
print (df)
  ingredients    customer  ordered
0        eggs      Badger      1.0
1         NaN      Badger      1.0
6        eggs  Pepperpots      1.0
3        eggs  Shopkeeper      0.0
8         NaN         NaN      NaN
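This answer relies on the frame already being sorted by customer and ingredients. As a hedged sketch, the same technique can be wrapped in a helper that sorts internally, so the caller's row order does not matter. The function name `drop_last_per_group` and its parameters are hypothetical and not from either answer.

```python
import numpy as np
import pandas as pd


def drop_last_per_group(frame, group_col, value_col):
    """Drop rows whose value is the alphabetically last one in their group."""
    # Sort a temporary copy so that 'last' means "alphabetically last".
    ordered = frame.sort_values([group_col, value_col])
    last = ordered[value_col].groupby(ordered[group_col]).transform('last')
    # Put the per-row "last" values back into the caller's row order.
    keep = frame[value_col] != last.reindex(frame.index)
    return frame[keep]


df = pd.DataFrame({
    'ingredients': ['eggs', np.nan, 'ham', 'eggs', 'spam', 'spam',
                    'eggs', 'spam', np.nan],
    'customer': ['Badger'] * 3 + ['Shopkeeper'] * 3 + ['Pepperpots'] * 2
                + [np.nan],
    'ordered': [1, 1, 0, 0, 1, 0, 1, 0, np.nan],
})

filtered = drop_last_per_group(df, 'customer', 'ingredients')
print(filtered)
```

Rows with a NaN ingredient are kept here as well, since NaN never compares equal to the group's last value.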