I want to get the sales total for every month, even for months in which a product has no sales records. Consider the following example:
import pandas as pd
import numpy as np

np.random.seed(42)
# One row per day of 2001 (ISO dates avoid day/month ambiguity)
dates = pd.date_range('2001-01-01', '2001-12-31', freq='D')
sales = [np.random.randint(100) for _ in range(len(dates))]
product = [['A', 'B', 'C'][np.random.randint(3)] for _ in range(len(dates))]
df = pd.DataFrame({'Dates': dates,
                   'Sales': sales,
                   'Product': product})
# Drop every row that falls in March
march = df.Dates.dt.month == 3
df = df[~march]
All of the March data has been removed. I would like those sales to show up as zero when printed:
monthly = pd.Grouper(key='Dates', freq='M')
sum_sales = df.groupby(['Product', monthly])['Sales'].sum()
Here is sum_sales for product A only (note that the March time step is missing):
Product Dates
A 2001-01-31 658
2001-02-28 460
2001-04-30 541
2001-05-31 701
2001-06-30 517
2001-07-31 596
2001-08-31 802
2001-09-30 654
2001-10-31 561
2001-11-30 473
2001-12-31 605
However, if I only do df.groupby(monthly)['Sales'].sum(), without grouping by product, I get the expected zero:
Dates
2001-01-31 1616
2001-02-28 1256
2001-03-31 0
2001-04-30 1555
2001-05-31 1384
2001-06-30 1451
2001-07-31 1677
2001-08-31 1472
2001-09-30 1535
2001-10-31 1316
2001-11-30 1573
2001-12-31 1403
So I would like to know how to make pandas show the missing dates as zero sales when groupby is given more than one key.
Answer 0 (score: 2)
I think your solution should work; this looks like a bug.
A possible solution is to use resample instead of Grouper and chain the two operations:
sum_sales = df.set_index('Dates').groupby('Product').resample('M')['Sales'].sum()
print (sum_sales)
Product Dates
A 2001-01-31 658
2001-02-28 460
2001-03-31 0
2001-04-30 541
2001-05-31 701
2001-06-30 517
2001-07-31 596
2001-08-31 802
2001-09-30 654
2001-10-31 561
2001-11-30 473
2001-12-31 605
B 2001-01-31 589
2001-02-28 344
2001-03-31 0
2001-04-30 571
2001-05-31 347
2001-06-30 528
2001-07-31 663
2001-08-31 294
2001-09-30 238
2001-10-31 487
2001-11-30 503
2001-12-31 303
C 2001-01-31 369
2001-02-28 452
2001-03-31 0
2001-04-30 443
2001-05-31 336
2001-06-30 406
2001-07-31 418
2001-08-31 376
2001-09-30 643
2001-10-31 268
2001-11-30 597
2001-12-31 495
Name: Sales, dtype: int64
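If you would rather keep a single groupby, another option is to build the full set of (product, month) pairs yourself and reindex with a fill value. The following is only a minimal sketch on a small made-up frame, not the data from the question; the 'Month' level name and the sample values are assumptions chosen for illustration. It groups by a monthly Period rather than a Grouper, so the months are labelled like 2001-03 instead of by month-end date.

```python
import pandas as pd

# Hypothetical sample data (not the question's data):
# product A has no rows in March, product B has no rows in February.
df = pd.DataFrame({
    'Dates': pd.to_datetime(['2001-01-10', '2001-02-15', '2001-04-20',
                             '2001-01-05', '2001-03-12', '2001-04-01']),
    'Sales': [10, 20, 30, 1, 2, 3],
    'Product': ['A', 'A', 'A', 'B', 'B', 'B'],
})

# Group by product and by calendar month; 'Month' is an illustrative name.
month = df['Dates'].dt.to_period('M').rename('Month')
sum_sales = df.groupby(['Product', month])['Sales'].sum()

# Build every (product, month) combination over the overall date range,
# then reindex so that missing combinations appear with a value of 0.
all_months = pd.period_range(month.min(), month.max(), freq='M')
full_index = pd.MultiIndex.from_product(
    [sum_sales.index.levels[0], all_months], names=['Product', 'Month'])
filled = sum_sales.reindex(full_index, fill_value=0)
print(filled)
```

One difference from the resample approach is worth keeping in mind: groupby followed by resample works on each product separately, so it only fills gaps between that product's own first and last dates, while the reindex above covers the same overall month range for every product.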