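I have a dataset (Product_ID, date_time, Sold) containing products sold on various dates. The dates are irregular: the data covers 9 months, with 13 or more randomly chosen days in each month. I need to split the data so that, for each product, I can see how many units were sold in days 1-3, 4-7, 8-15 and >16. How can I code this in Python using pandas and other packages?

PRODUCT_ID DATE_LOCATION Sold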
0E4234 01-08-16 0:00 2
0E4234 02-08-16 0:00 7
0E4234 04-08-16 0:00 3
0E4234 08-08-16 0:00 1
0E4234 09-08-16 0:00 2
.
. (same product for 9 months sold data)
.
0G2342 02-08-16 0:00 1
0G2342 03-08-16 0:00 2
0G2342 06-08-16 0:00 1
0G2342 09-08-16 0:00 1
0G2342 11-08-16 0:00 3
0G2342 15-08-16 0:00 3
.
.
.(goes for 64 products each with 9 months of data)
.
I don't even know how to start coding this in Python. The required output is:

PRODUCT_ID Days Sold
0E4234 1-3 9
4-7 3
8-15 16
>16 (sum of remaining values)
0G2342 1-3 3
4-7 1
8-15 7
>16 (sum of remaining values)
.
.(for 64 products)
.
It would be great if someone could at least post a link showing where to start.

Answer 0: (score: 2)
You can first convert the dates to datetimes and then get the day of the month with dt.day:
import pandas as pd

df['DATE_LOCATION'] = pd.to_datetime(df['DATE_LOCATION'], dayfirst=True)
days = df['DATE_LOCATION'].dt.day
Then bin the days with cut:
rng = pd.cut(days, bins=[0,3,7,15,31], labels=['1-3', '4-7','8-15', '>=16'])
print (rng)
0 1-3
1 1-3
2 4-7
3 8-15
4 8-15
5 1-3
6 1-3
7 4-7
8 8-15
9 8-15
10 8-15
Name: DATE_LOCATION, dtype: category
Categories (4, object): [1-3 < 4-7 < 8-15 < >=16]
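
To make the bin edges easier to check, here is a minimal sketch on a hand-picked Series of day numbers (the values are made up for illustration and are not from the question's data). With the default right=True, pd.cut treats each bin as open on the left and closed on the right, so day 3 lands in '1-3' and day 16 in '>=16'.

```python
import pandas as pd

# Day-of-month values chosen to sit on both sides of each bin edge.
days = pd.Series([1, 3, 4, 7, 8, 15, 16, 31])

# With the default right=True the bins are (0, 3], (3, 7], (7, 15], (15, 31].
labels = ['1-3', '4-7', '8-15', '>=16']
binned = pd.cut(days, bins=[0, 3, 7, 15, 31], labels=labels)

print(binned.tolist())
# ['1-3', '1-3', '4-7', '4-7', '8-15', '8-15', '>=16', '>=16']
```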
Then group by product and the binned Series, and aggregate with sum:
# Store the result under a new name so the original df stays available.
result = df.groupby(["PRODUCT_ID", rng])['Sold'].sum()
print(result)
PRODUCT_ID DATE_LOCATION
0E4234 1-3 9
4-7 3
8-15 3
0G2342 1-3 3
4-7 1
8-15 7
Name: Sold, dtype: int64
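
For anyone who wants to run the steps end to end, the following is a self-contained sketch built from the sample rows in the question. Two things here are my own assumptions rather than part of the answer: the explicit format string (instead of dayfirst=True) and observed=False, which I use so that empty buckets such as '>=16' still appear with a total of 0.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'PRODUCT_ID': ['0E4234'] * 5 + ['0G2342'] * 6,
    'DATE_LOCATION': ['01-08-16 0:00', '02-08-16 0:00', '04-08-16 0:00',
                      '08-08-16 0:00', '09-08-16 0:00', '02-08-16 0:00',
                      '03-08-16 0:00', '06-08-16 0:00', '09-08-16 0:00',
                      '11-08-16 0:00', '15-08-16 0:00'],
    'Sold': [2, 7, 3, 1, 2, 1, 2, 1, 1, 3, 3],
})

# Assumption: an explicit day-first format, so day and month cannot be swapped.
df['DATE_LOCATION'] = pd.to_datetime(df['DATE_LOCATION'], format='%d-%m-%y %H:%M')
days = df['DATE_LOCATION'].dt.day
rng = pd.cut(days, bins=[0, 3, 7, 15, 31], labels=['1-3', '4-7', '8-15', '>=16'])

# Assumption: observed=False keeps buckets with no sales; their sum is 0.
totals = df.groupby(['PRODUCT_ID', rng], observed=False)['Sold'].sum()
print(totals)
```

If the empty buckets are not wanted, passing observed=True instead should drop them.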
If you also need to count by years:
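# This needs the original DataFrame (with DATE_LOCATION), so keep the result under a new name.
by_year = df.groupby([df['DATE_LOCATION'].dt.year.rename('YEAR'), "PRODUCT_ID", rng])['Sold'].sum()
print(by_year)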
YEAR PRODUCT_ID DATE_LOCATION
2016 0E4234 1-3 9
4-7 3
8-15 3
0G2342 1-3 3
4-7 1
8-15 7
Name: Sold, dtype: int64
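
Since the question mentions nine months of data, the same pattern can probably be applied per month rather than per year. This is only a sketch under that assumption: the rows below are made up, product 'A' and the level name 'MONTH' are hypothetical, and dt.to_period('M') is my choice for deriving the month key.

```python
import pandas as pd

# Made-up rows for a hypothetical product 'A' spanning two months.
df = pd.DataFrame({
    'PRODUCT_ID': ['A', 'A', 'A', 'A'],
    'DATE_LOCATION': pd.to_datetime(
        ['2016-08-01', '2016-08-02', '2016-09-01', '2016-09-20']),
    'Sold': [2, 7, 5, 4],
})

days = df['DATE_LOCATION'].dt.day
rng = pd.cut(days, bins=[0, 3, 7, 15, 31], labels=['1-3', '4-7', '8-15', '>=16'])

# One key per calendar month, e.g. 2016-08; 'MONTH' is a hypothetical level name.
month = df['DATE_LOCATION'].dt.to_period('M').rename('MONTH')

# observed=True keeps only the buckets that actually occur.
by_month = df.groupby([month, 'PRODUCT_ID', rng], observed=True)['Sold'].sum()
print(by_month)
```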
Answer 1: (score: 0)
Assume your dataframe is named df.
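import pandas as pd

# The dates are day-first (e.g. 15-08-16), so parse them with dayfirst=True.
df["DATE_LOCATION"] = pd.to_datetime(df.DATE_LOCATION, dayfirst=True)
df["DAY"] = df.DATE_LOCATION.dt.day

def flag(x):
    # Map a day of the month to its bucket label.
    if 1 <= x <= 3:
        return '1-3'
    elif 4 <= x <= 7:
        return '4-7'
    elif 8 <= x <= 15:
        return '8-15'
    else:
        return '>16'  # maybe you mean '>=16'.

df["Days"] = df.DAY.apply(flag)
df.groupby(["PRODUCT_ID", "Days"]).Sold.sum()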
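
As a quick sanity check, the sketch below compares this flag function with the pd.cut bins from the other answer for every day from 1 to 31. I have renamed the last label to '>=16' in both so that the outputs are directly comparable; that renaming is my assumption, not something either answer states.

```python
import pandas as pd

def flag(x):
    # Same buckets as above, with the last label written as '>=16'.
    if 1 <= x <= 3:
        return '1-3'
    elif 4 <= x <= 7:
        return '4-7'
    elif 8 <= x <= 15:
        return '8-15'
    else:
        return '>=16'

days = pd.Series(range(1, 32))

via_apply = days.apply(flag)
via_cut = pd.cut(days, bins=[0, 3, 7, 15, 31],
                 labels=['1-3', '4-7', '8-15', '>=16']).astype(str)

# Both approaches should assign the same label to every day of the month.
same = (via_apply == via_cut).all()
print(same)
```

For a large frame, pd.cut is likely the faster of the two because it is vectorized, whereas apply calls the Python function once per row.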