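This is an extension of my previous question. I have a source DataFrame with three columns: Customer, Date and Item. I want to add a new column, "Item History", holding an array of all the items that customer bought in earlier rows (as defined by Date). If a customer made several purchases on the same date, none of those items should appear in the history of the others.
So, given this sample data: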
import numpy as np
import pandas as pd
df = pd.DataFrame({'Customer':['Bert', 'Bert', 'Bert', 'Bert', 'Bert', 'Ernie', 'Ernie', 'Ernie', 'Ernie', 'Steven', 'Steven'], 'Date':['01/01/2019', '15/01/2019', '20/01/2019', '20/01/2019', '22/01/2019', '01/01/2019', '15/01/2019', '20/01/2019', '22/01/2019', '01/01/2019' ,'15/01/2019'], 'Item':['Bread', 'Cheese', 'Apples', 'Pears', 'Toothbrush', 'Toys', 'Shellfish', 'Dog', 'Yoghurt', 'Toilet', 'Dominos']})
Customer Date Item
Bert 01/01/2019 Bread
Bert 15/01/2019 Cheese
Bert 20/01/2019 Apples
Bert 20/01/2019 Pears
Bert 22/01/2019 Toothbrush
Ernie 01/01/2019 Toys
Ernie 15/01/2019 Shellfish
Ernie 20/01/2019 Dog
Ernie 22/01/2019 Yoghurt
Steven 01/01/2019 Toilet
Steven 15/01/2019 Dominos
The output I would like to see is:
Customer Date Item Item History
Bert 01/01/2019 Bread NaN
Bert 15/01/2019 Cheese [Bread]
Bert 20/01/2019 Apples [Bread, Cheese]
Bert 20/01/2019 Pears [Bread, Cheese]
Bert 22/01/2019 Toothbrush [Bread, Cheese, Apples, Pears]
Ernie 01/01/2019 Toys NaN
Ernie 15/01/2019 Shellfish [Toys]
Ernie 20/01/2019 Dog [Toys, Shellfish]
Ernie 22/01/2019 Yoghurt [Toys, Shellfish, Dog]
Steven 01/01/2019 Toilet NaN
Steven 15/01/2019 Dominos [Toilet]
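Note that for Bert's two purchases on 20/01/2019, neither row's Item History contains the other row's item. For his purchase on 22/01/2019, both items from 20/01/2019 are included.
The answer to the previous question was a neat list comprehension, of the form: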
df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer')
for i in range(len(x))]
df.loc[~df['Item History'].astype(bool), 'Item History']= np.nan
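But obviously the i in x.Item[:i] needs to be computed as the position of the last row whose Date differs from the current row's. Any suggestions on how to achieve that would be appreciated.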
Answer 0 (score: 2)
Another way, using apply and np.cumsum():
import itertools
#aggregates Item as list per 'Customer' & 'Date'
m=df.groupby(['Customer','Date'])['Item'].apply(lambda x:
    [*itertools.chain.from_iterable([x])])
#groups each Customer and cumsum the list with shift
#(group_keys=False keeps the original index on newer pandas versions, which would otherwise prepend the group key)
n=m.groupby(level=0, group_keys=False).apply(lambda x:np.cumsum(x).shift())
df.set_index(['Customer','Date']).assign(Item=n).reset_index() #assign back
Customer Date Item
0 Bert 01/01/2019 NaN
1 Bert 15/01/2019 [Bread]
2 Bert 20/01/2019 [Bread, Cheese]
3 Bert 20/01/2019 [Bread, Cheese]
4 Bert 22/01/2019 [Bread, Cheese, Apples, Pears]
5 Ernie 01/01/2019 NaN
6 Ernie 15/01/2019 [Toys]
7 Ernie 20/01/2019 [Toys, Shellfish]
8 Ernie 22/01/2019 [Toys, Shellfish, Dog]
9 Steven 01/01/2019 NaN
10 Steven 15/01/2019 [Toilet]
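The grouping above works on the raw Date strings, which sort chronologically for this sample only because every date falls in the same month. Below is a minimal sketch of the same "one list per day, then accumulate and shift" idea with parsed dates. It assumes Python 3.8+ (for the initial argument of itertools.accumulate) and uses a trimmed copy of the sample data. The names per_day, history_before and history are illustrative and not part of the answer, and first purchases get an empty list here rather than NaN.

```python
import itertools

import pandas as pd

df = pd.DataFrame({
    'Customer': ['Bert', 'Bert', 'Bert', 'Bert', 'Ernie', 'Ernie'],
    'Date': ['01/01/2019', '15/01/2019', '20/01/2019', '20/01/2019',
             '01/01/2019', '15/01/2019'],
    'Item': ['Bread', 'Cheese', 'Apples', 'Pears', 'Toys', 'Shellfish'],
})

# Parse the day-first strings so that dates sort chronologically.
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# One list of items per (Customer, Date); groupby sorts the keys by default.
per_day = df.groupby(['Customer', 'Date'])['Item'].agg(list)


def history_before(day_lists):
    """For each day, concatenate the item lists of all earlier days."""
    running = itertools.accumulate(
        day_lists, lambda acc, items: acc + items, initial=[])
    # accumulate yields one extra trailing value (the full history); drop it.
    return list(running)[:-1]


# Map each (Customer, Date) key to the items bought on earlier dates.
history = {}
for customer, lists in per_day.groupby(level='Customer'):
    for key, past in zip(lists.index, history_before(lists.tolist())):
        history[key] = past

# Look the history up for every original row, so same-day rows share it.
df['Item History'] = [history[(c, d)]
                      for c, d in zip(df['Customer'], df['Date'])]
print(df)
```

Because the lookup is keyed on (Customer, Date), the original row order does not need to be sorted for this variant.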
Answer 1 (score: 2)
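The idea is to flag the repeated (Customer, Date) rows with DataFrame.duplicated, replace the history values on those rows with NaN, and then forward-fill the missing values.
The first value of each group is always an empty list, so there is no need to do the replacement per group: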
df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer')
for i in range(len(x))]
df['Item History'] = df['Item History'].mask(df.duplicated(['Customer','Date'])).ffill()
df.loc[~df['Item History'].astype(bool), 'Item History']= np.nan
print (df)
Customer Date Item Item History
0 Bert 01/01/2019 Bread NaN
1 Bert 15/01/2019 Cheese [Bread]
2 Bert 20/01/2019 Apples [Bread, Cheese]
3 Bert 20/01/2019 Pears [Bread, Cheese]
4 Bert 22/01/2019 Toothbrush [Bread, Cheese, Apples, Pears]
5 Ernie 01/01/2019 Toys NaN
6 Ernie 15/01/2019 Shellfish [Toys]
7 Ernie 20/01/2019 Dog [Toys, Shellfish]
8 Ernie 22/01/2019 Yoghurt [Toys, Shellfish, Dog]
9 Steven 01/01/2019 Toilet NaN
10 Steven 15/01/2019 Dominos [Toilet]
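To make the mask-and-forward-fill step easier to follow, here is a small sketch that keeps each intermediate result in its own variable. It uses only Bert's rows from the sample data and assumes the rows are already sorted by customer and date, as in the question. The names naive, same_day_repeat and fixed are illustrative, and .iloc is used for the positional slice.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer': ['Bert', 'Bert', 'Bert', 'Bert', 'Bert'],
    'Date': ['01/01/2019', '15/01/2019', '20/01/2019', '20/01/2019',
             '22/01/2019'],
    'Item': ['Bread', 'Cheese', 'Apples', 'Pears', 'Toothbrush'],
})

# Step 1: naive history, i.e. every earlier row of the same customer.
naive = pd.Series(
    [x['Item'].iloc[:i].tolist()
     for _, x in df.groupby('Customer')
     for i in range(len(x))],
    index=df.index,
)

# Step 2: flag the second and later rows of each (Customer, Date) pair.
same_day_repeat = df.duplicated(['Customer', 'Date'])

# Step 3: blank the flagged rows, then copy down the history of the
# first row of that day.
fixed = naive.mask(same_day_repeat).ffill()

print(pd.DataFrame({'naive': naive,
                    'repeat': same_day_repeat,
                    'fixed': fixed}))
```

Row 3 (Pears) is the only flagged row: its naive history wrongly includes Apples, and after the forward fill it carries the same history as row 2.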
Answer 2 (score: 1)
A possibly simpler answer that uses only apply, although it is likely to be slower than the other approaches:
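# Date is compared as a string here, which orders correctly only within a single month;
# for general data, convert it first with pd.to_datetime(df['Date'], dayfirst=True)
df['item history'] = df.apply(lambda x:
    df.loc[(df.Date<x.Date)&(df.Customer==x.Customer),'Item'].tolist(), axis=1)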
Result:
Customer ... item history
0 Bert ... []
1 Bert ... [Bread]
2 Bert ... [Bread, Cheese]
3 Bert ... [Bread, Cheese]
4 Bert ... [Bread, Cheese, Apples, Pears]
5 Ernie ... []
6 Ernie ... [Toys]
7 Ernie ... [Toys, Shellfish]
8 Ernie ... [Toys, Shellfish, Dog]
9 Steven ... []
10 Steven ... [Toilet]
If you want each history to list only unique items, you may want to wrap the result in list(set()).
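One caveat with list(set()) is that a set does not guarantee the original purchase order. A small sketch of an order-preserving alternative, assuming Python 3.7+ (where dict keys keep insertion order). The history list here is made up for illustration, since the sample data contains no repeated items.

```python
# Hypothetical history with repeated purchases.
history = ['Bread', 'Cheese', 'Bread', 'Apples', 'Cheese']

# list(set(...)) removes repeats but may return the items in any order.
unordered_unique = list(set(history))

# dict.fromkeys keeps the first occurrence of each item, in purchase order.
ordered_unique = list(dict.fromkeys(history))

print(ordered_unique)
```

Inside the apply call, the same wrapping could be placed around the list built for each row.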