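Starting from this table, I'm trying to insert the missing dates, using the min/max weekly dates available in the DataFrame, and then count the occurrences of 0 sales for each category.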
import pandas as pd
df=pd.DataFrame({'category_id': ['aaa','aaa','aaa','aaa','bbb','bbb','bbb','ccc','ccc'],
'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26','2015-01-12', '2015-01-19', '2015-01-26','2015-01-05', '2015-01-12'],
'sales': [0,20,30,10,45,0,47,0,10]})
Step 1: add the missing weekly dates to every category and fill the missing dates with 0 (Q1: I'm not sure how to obtain this df_add_missing_dates result)
# expected dates interpolation output
df_add_missing_dates=pd.DataFrame({'category_id': ['aaa','aaa','aaa','aaa','bbb','bbb','bbb','bbb','ccc','ccc','ccc','ccc'],
'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
'2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
'2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26'],
'sales': [0,20,30,10,
0,45,0,47,
0,10,0,0]})
Step 2: count the weeks where sales are 0 (Q2: how do I aggregate, per category, the number of weeks with sales = 0?)
# expected final output
category_id | sales_0_count
aaa | 1
bbb | 2
ccc | 3
Current code and logic:
# convert string to datetime and set as index
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')
# find min/max weekly dates in the dataframe --> I couldn't add missing dates with 0 sales though
idx = pd.period_range(start=df.week.min(),end=df.week.max(),freq='W')
df = df.reindex(idx, fill_value=0).reset_index(drop=True)
df_add_missing_dates = df
# group by category to count how many times weekly sales is 0
Answer 0 (score: 1)
IIUC, you can use pd.MultiIndex.from_product together with reindex and fill_value=0, then use a boolean mask with groupby and sum:
idx = pd.MultiIndex.from_product([df['category_id'].unique(),
df['week'].unique()],
names=['category_id', 'week'])
df_missing = (df.set_index(['category_id', 'week'])
.reindex(idx, fill_value=0)
.reset_index())
df_missing
Output:
category_id week sales
0 aaa 2015-01-05 0
1 aaa 2015-01-12 20
2 aaa 2015-01-19 30
3 aaa 2015-01-26 10
4 bbb 2015-01-05 0
5 bbb 2015-01-12 45
6 bbb 2015-01-19 0
7 bbb 2015-01-26 47
8 ccc 2015-01-05 0
9 ccc 2015-01-12 10
10 ccc 2015-01-19 0
11 ccc 2015-01-26 0
Now, group and sum:
(df_missing == 0).groupby(df_missing['category_id'])['sales'].sum()
Output:
category_id
aaa 1.0
bbb 2.0
ccc 3.0
Name: sales, dtype: float64
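One caveat with the approach above: df['week'].unique() only contains weeks that appear for at least one category, so a week missing from every category would not be added. Here is a minimal sketch of a variant that builds the full weekly range from the min/max dates instead. It assumes the weeks are Mondays (true for the sample data), which is why freq='W-MON' is used; the names all_weeks and sales_0_count are mine, not from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'category_id': ['aaa', 'aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'bbb', 'ccc', 'ccc'],
    'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-05', '2015-01-12'],
    'sales': [0, 20, 30, 10, 45, 0, 47, 0, 10],
})
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')

# Every Monday between the earliest and latest week in the data
# (assumption: the weekly dates are Mondays, as in the sample).
all_weeks = pd.date_range(df['week'].min(), df['week'].max(), freq='W-MON')

# Full category x week grid; pairs absent from df get sales = 0.
idx = pd.MultiIndex.from_product([df['category_id'].unique(), all_weeks],
                                 names=['category_id', 'week'])
df_missing = (df.set_index(['category_id', 'week'])
                .reindex(idx, fill_value=0)
                .reset_index())

# Boolean mask on the sales column only, summed per category.
sales_0_count = (df_missing['sales'].eq(0)
                 .groupby(df_missing['category_id'])
                 .sum()
                 .astype(int)
                 .rename('sales_0_count'))
```

Masking only the sales column and casting to int should also keep the counts as integers rather than floats.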
Answer 1 (score: 0)
I'm not sure what the reindexing part is for, but after
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')
you can do:
groupedDf = df.groupby(['category_id', pd.Grouper(key='week', freq='W-MON')])['sales'].sum().reset_index().sort_values('week')
zeroSalesWeek = groupedDf[groupedDf.sales == 0]
Output:
zeroSalesWeek
category_id week sales
0 aaa 2015-01-05 0
4 bbb 2015-01-05 0
8 ccc 2015-01-05 0
6 bbb 2015-01-19 0
10 ccc 2015-01-19 0
11 ccc 2015-01-26 0
To select a specific category_id, you can try:
df[(df.sales == 0) & (df.category_id=='bbb')]
which gives you
category_id week sales
4 bbb 2015-01-05 0
6 bbb 2015-01-19 0
Also, if you think this could get time-consuming, you can always write a quick function to select a specific category_id, for example:
def zeroGroupedDf(df, category_id):
    category_id = str(category_id)
    tempDf = df[(df.sales == 0) & (df.category_id == category_id)]
    return tempDf
and call it with whichever category_id you want a new df for, e.g.:
test = zeroGroupedDf(df, 'bbb')
test
category_id week sales
4 bbb 2015-01-05 0
6 bbb 2015-01-19 0
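Note that the outputs in this answer show row labels such as 4 and 6 for bbb, including a bbb row for 2015-01-05. Those match df_add_missing_dates rather than the original df, so the answer appears to assume the missing weeks were already filled in. If they are not, one possible way to get both the filling and the count in one go is to pivot to a wide table. This is only a sketch: the names wide and sales_0_count are mine, and it only covers weeks that appear for at least one category.

```python
import pandas as pd

df = pd.DataFrame({
    'category_id': ['aaa', 'aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'bbb', 'ccc', 'ccc'],
    'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-05', '2015-01-12'],
    'sales': [0, 20, 30, 10, 45, 0, 47, 0, 10],
})
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')

# One row per week, one column per category.
# Category/week pairs missing from df become NaN, which is then filled with 0.
wide = df.pivot(index='week', columns='category_id', values='sales').fillna(0)

# Count the zero-sales weeks in each category column.
sales_0_count = wide.eq(0).sum().rename('sales_0_count')
```

The wide table also makes it easy to eyeball which weeks were filled in for each category.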
Answer 2 (score: 0)
This gives you the expected output in a rough way:
df_add_missing_dates[df_add_missing_dates.sales.eq(0)].groupby('category_id')['sales'].count()
If you want the actual DataFrame you expect (although this could be done more neatly):
expected_output = df_add_missing_dates[df_add_missing_dates.sales.eq(0)].\
groupby('category_id',as_index=False)['sales'].count().\
rename({'sales':'sales_0_count'},axis=1)
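One thing to watch with filtering before grouping: a category that never has a zero-sales week disappears from the result instead of showing a count of 0. The sketch below illustrates this on a small made-up frame (the category 'ddd' and the name df_filled are hypothetical, not from the question) and shows a variant that sums a boolean mask so every category is kept.

```python
import pandas as pd

# Hypothetical, already-filled data: 'ddd' has no zero-sales weeks.
df_filled = pd.DataFrame({
    'category_id': ['aaa', 'aaa', 'ddd', 'ddd'],
    'sales': [0, 20, 5, 7],
})

# Filter-then-count: 'ddd' is dropped because none of its rows survive the filter.
filtered = (df_filled[df_filled.sales.eq(0)]
            .groupby('category_id')['sales']
            .count())

# Sum a boolean mask instead: every category is kept, with 0 where nothing matches.
summed = (df_filled['sales'].eq(0)
          .groupby(df_filled['category_id'])
          .sum()
          .astype(int)
          .rename('sales_0_count')
          .reset_index())
```

Which behaviour is preferable depends on whether categories with no zero weeks should appear in the report.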
Answer 3 (score: 0)
Here is how I did it:
fitted