Question

I have a dataframe of the following form:

import pandas as pd

# Named `data` rather than `dict` to avoid shadowing the built-in type.
# Values match the table printed below.
data = {'id': ["1001", "1001", "1001", "1002", "1002", "1002", "1003", "1003", "1003"],
        'food': ["apple", "ham", "egg", "apple", "pear", "cherry", "cheese", "cherry", "cheese"],
        'fruit': [1, 0, 0, 1, 1, 1, 0, 1, 0],
        'score': [1, 0, 0, 1, 2, 3, 0, 3, 0]}
df = pd.DataFrame(data)

    id      food    fruit   score
0   1001    apple   1       1
1   1001    ham     0       0
2   1001    egg     0       0
3   1002    apple   1       1
4   1002    pear    1       2
5   1002    cherry  1       3
6   1003    cheese  0       0
7   1003    cherry  1       3
8   1003    cheese  0       0

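I want to create a new dataframe with one row per participant (i.e., per ID), followed by columns holding custom summaries of the data, for example:

number of unique foods
total number of fruits
total score
etc.

Example output: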

      id    unique  fruits  score
0   1001    3       1       1
1   1002    3       3       6
2   1003    2       1       3

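I could create a new, empty dataframe and then loop over the unique IDs in the old dataframe, using logical indexing to fill in the columns. But my dataframe has about 50 × 10^6 rows and ~200,000 unique IDs, so this would take a very long time. I have read that iterating over the rows of a dataframe is inefficient, but I don't know how to apply the alternative solutions to my dataset.

Thanks.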

Answer 1

Use groupby().agg():

df.groupby('id', as_index=False).agg({'food': 'nunique',
                                      'fruit': 'sum',
                                      'score': 'sum'})

Output:

     id  food  fruit  score
0  1001     3      1      1
1  1002     3      3      6
2  1003     2      1      3
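
The columns above keep their original names (food, fruit), whereas the question asks for unique and fruits. A minimal sketch of one way to get those names, assuming the sample data from the question and chaining a plain rename() call after the aggregation (the variable name summary is only illustrative):

```python
import pandas as pd

# Sample data as shown in the question's table.
df = pd.DataFrame({
    "id": ["1001", "1001", "1001", "1002", "1002", "1002", "1003", "1003", "1003"],
    "food": ["apple", "ham", "egg", "apple", "pear", "cherry", "cheese", "cherry", "cheese"],
    "fruit": [1, 0, 0, 1, 1, 1, 0, 1, 0],
    "score": [1, 0, 0, 1, 2, 3, 0, 3, 0],
})

# Aggregate per id, then rename the aggregated columns to the requested names.
summary = (
    df.groupby("id", as_index=False)
      .agg({"food": "nunique", "fruit": "sum", "score": "sum"})
      .rename(columns={"food": "unique", "fruit": "fruits"})
)
print(summary)
```

This stays fully vectorized, so no Python-level loop over the IDs is needed.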

Answer 2

Since pandas >= 0.25.0 we have named aggregations, which let us aggregate and give the resulting columns more meaningful names in the same step.

So in this example we can create the column unique in one go:

df.groupby('id').agg(
    unique=('food', 'nunique'),
    fruits=('fruit', 'sum'),
    score=('score', 'sum')
).reset_index()

     id  unique  fruits  score
0  1001       3       1      1
1  1002       3       3      6
2  1003       2       1      3
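
The question also asks for other custom summaries ("etc."). As a hedged sketch, the same named-aggregation syntax can carry more columns, including a custom function. The extra column names (n_rows, mean_score, score_range) are made up for illustration and are not part of the original question. Passing sort=False is optional; it skips sorting the group keys, which may save some time on a large frame.

```python
import pandas as pd

# Sample data as shown in the question's table.
df = pd.DataFrame({
    "id": ["1001", "1001", "1001", "1002", "1002", "1002", "1003", "1003", "1003"],
    "food": ["apple", "ham", "egg", "apple", "pear", "cherry", "cheese", "cherry", "cheese"],
    "fruit": [1, 0, 0, 1, 1, 1, 0, 1, 0],
    "score": [1, 0, 0, 1, 2, 3, 0, 3, 0],
})

summary = (
    df.groupby("id", sort=False)  # groups appear in order of first occurrence
      .agg(
          n_rows=("food", "count"),            # non-missing food entries per id
          unique=("food", "nunique"),          # number of distinct foods
          fruits=("fruit", "sum"),             # total number of fruits
          score=("score", "sum"),              # total score
          mean_score=("score", "mean"),        # hypothetical extra summary
          # Custom callable: flexible, but typically slower than built-in names.
          score_range=("score", lambda s: s.max() - s.min()),
      )
      .reset_index()
)
print(summary)
```

With ~200,000 groups, the string names ('sum', 'nunique', 'mean') are usually preferable to lambdas, because a lambda is called once per group in Python.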

Summarizing several rows of data in a pandas dataframe