How to use group by and return rows with null values

Time: 2015-12-28 07:10:52

Tags: python numpy pandas dataframe missing-data

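I have a dataset of emails and purchases, as follows.

Email        Purchaser    order_id  amount
a@gmail.com  a@gmail.com  1         5
b@gmail.com
c@gmail.com  c@gmail.com  2         10
c@gmail.com  c@gmail.com  3         5

I want to find the total number of people in the dataset, the number of people who purchased, the total number of orders, and the total revenue. I know how to do this in SQL using a left join and aggregate functions, but I don't know how to replicate it using Python/pandas.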

In Python, I attempted this using pandas and numpy:

table1 = table.groupby(['Email', 'Purchaser']).agg({'amount': np.sum, 'order_id': 'count'})

table1.agg({'Email': 'count', 'Purchaser': 'count', 'amount': np.sum, 'order_id': 'count'})

The problem is that it only returns the rows that have an order (the 1st and 3rd rows) and not the other one (the 2nd row):

Email        Purchaser    order_id  amount
a@gmail.com  a@gmail.com  1         5
c@gmail.com  c@gmail.com  2         15

The SQL query should look like this:

SELECT count(Email) as num_ind, count(Purchaser) as num_purchasers, sum(order) as orders , sum(amount) as revenue
FROM
(SELECT Email, Purchaser, count(order_id) as order, sum(amount) as amount
FROM table 1
GROUP BY Email, Purchaser) x

How can I replicate it in Python/pandas?

1 answer:

Answer 0 (score: 4)

This is not implemented in pandas yet - see
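The behaviour described in the question can be reproduced with its sample data. The following is a minimal sketch, assuming the empty cells in the b@gmail.com row are stored as NaN; the DataFrame construction is an assumption, since the question does not show how the table was loaded.

```python
import numpy as np
import pandas as pd

# Sample data from the question. The b@gmail.com row has no purchase,
# so its Purchaser, order_id and amount are assumed to be NaN.
table = pd.DataFrame({
    "Email": ["a@gmail.com", "b@gmail.com", "c@gmail.com", "c@gmail.com"],
    "Purchaser": ["a@gmail.com", np.nan, "c@gmail.com", "c@gmail.com"],
    "order_id": [1, np.nan, 2, 3],
    "amount": [5, np.nan, 10, 5],
})

# With default settings, groupby excludes any group whose key contains NaN,
# so the (b@gmail.com, NaN) group never reaches the aggregation.
grouped = table.groupby(["Email", "Purchaser"]).agg(
    {"amount": "sum", "order_id": "count"}
)

emails_in_result = sorted(grouped.index.get_level_values("Email"))
print(emails_in_result)  # ['a@gmail.com', 'c@gmail.com']
```

Only two of the three email addresses survive the grouping, which is why the outer counts come out too low.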

So one awful workaround is to replace NaN with some string and, after agg, replace it back with NaN:

table['Purchaser'] = table['Purchaser'].replace(np.nan, 'dummy')
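Only the first step of the workaround is shown above. A minimal sketch of all three steps (replace, aggregate, restore), followed by the outer aggregation from the SQL query, might look like the following. The DataFrame construction, the `table1` and `summary` names and the exact aggregation calls are assumptions made for illustration, and 'dummy' is an arbitrary sentinel that must not collide with a real Purchaser value.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the missing cells are assumed to be NaN.
table = pd.DataFrame({
    "Email": ["a@gmail.com", "b@gmail.com", "c@gmail.com", "c@gmail.com"],
    "Purchaser": ["a@gmail.com", np.nan, "c@gmail.com", "c@gmail.com"],
    "order_id": [1, np.nan, 2, 3],
    "amount": [5, np.nan, 10, 5],
})

# Step 1: replace NaN in the grouping column with a sentinel string so that
# groupby does not drop the row.
table["Purchaser"] = table["Purchaser"].replace(np.nan, "dummy")

# Step 2: inner aggregation, equivalent to the SQL subquery.
table1 = (
    table.groupby(["Email", "Purchaser"])
    .agg({"amount": "sum", "order_id": "count"})
    .reset_index()
)

# Step 3: turn the sentinel back into NaN so count() ignores it.
table1["Purchaser"] = table1["Purchaser"].replace("dummy", np.nan)

# Outer aggregation, equivalent to the outer SELECT.
summary = {
    "num_ind": int(table1["Email"].count()),
    "num_purchasers": int(table1["Purchaser"].count()),
    "orders": int(table1["order_id"].sum()),
    "revenue": float(table1["amount"].sum()),
}
print(summary)
# {'num_ind': 3, 'num_purchasers': 2, 'orders': 3, 'revenue': 20.0}
```

On pandas 1.1 and later, `groupby(..., dropna=False)` keeps NaN keys as their own group, so the sentinel round trip should no longer be necessary there.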