Question

我在各种条件下都无法对多列进行分组：

我的数据框如下：

code    product brand   lvl1    lvl2    lvl3     lvl4   lvl5        price
8968653 ABC             Milk    Mother  Toddler         Porridge    69
8968653 ABC     AB              Baby                    Bayi        95

代码和产品名称是公共字段。所有其他列均应根据以下条件进行分组：

两个单元都是空的：显示NaN
一个单元格为空：显示其他值
两个单元都是非空的：通过管道组合这些单元
价格列应显示平均值

预期输出：

code    product brand   lvl1 lvl2        lvl3     lvl4  lvl5         price
8968653 ABC     AB      Milk Mother|Baby Toddler  NaN   Porridge|Bayi    82

Answer 1

我们可以分几个步骤进行操作：

首先，我们获得string类型和numeric的列的列表
第二，我们根据实际情况使用groupby.agg列还是groupby.mean列来使用string或numeric：
我们在不需要|的地方清理数据框。

# Step 1 get string and numeric columns
str_cols = df.iloc[:, 2:-1].columns
num_cols = df.iloc[:, -1:].columns

# Step 2 groupby on string and numeric columns
d1 = df.groupby(['code','product'])[str_cols].agg('|'.join)
d2 = df.groupby(['code', 'product'])[num_cols].mean()

# Join the dataframe back as 1
df = d1.join(d2).reset_index()

输出1：

      code product brand   lvl1         lvl2      lvl3 lvl4           lvl5  price
0  8968653     ABC   |AB  Milk|  Mother|Baby  Toddler|    |  Porridge|Bayi     82

现在，我们通过删除管道|来清理数据框。

df = df.replace('(^\||\b\|\b|\|$)', '', regex=True)

最终输出

      code product brand  lvl1         lvl2     lvl3 lvl4           lvl5  price
0  8968653     ABC    AB  Milk  Mother|Baby  Toddler       Porridge|Bayi     82

Answer 2

您需要定义一个函数：

def f(x):
    if x.isna().all():
        return np.nan
    x = x.dropna()
    if x.dtype == 'int64':
        return x.mean()
    x = x.drop_duplicates()
    if len(x)>1:
        return '|'.join(x)
    return x


df.replace('', np.nan).groupby(['code'], as_index=False).agg(f)

输出：

      code product brand  lvl1         lvl2     lvl3  lvl4           lvl5  price
0  8968653     ABC    AB  Milk  Mother|Baby  Toddler   NaN  Porridge|Bayi     82

Answer 3

类似于Erfan的，但是要建立一个汇总字典，因此只分组一次：

# dictate which column does what
str_cols = [col for col in df.columns if col not in ['code','product', 'price']]
agg = {col:'|'.join for col in str_cols}
agg['price'] = 'mean'

# aggregation
new_df = df.groupby(['code','product'],as_index=False).agg(agg)

# strip by columns
# replace would be a better choice, but that'll be copied from Efran's
new_df[str_cols] = new_df[str_cols].apply(lambda x: x.str.strip('\|'))

输出：

    code    product brand   lvl1    lvl2        lvl3    lvl4    lvl5            price
0   8968653 ABC     AB      Milk    Mother|Baby Toddler         Porridge|Bayi   82.0

熊猫-如何在各种条件下对多列进行“分组”？

3 个答案: