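I have 6 million transaction records, so I need something that runs fast. Basically, I have unique customer IDs and the car classes each customer reserved and eventually drove. A customer may have one or more rental experiences. For each customer at each point in time, I want to combine the distinct car classes (reserved and driven) and count how many different car classes he or she has experienced so far.
In reality my data is not even in this order, meaning id and date are not sorted. The layout below is only for convenience. It would be great if you could also handle the unsorted case!
Thanks!
The data looks like this: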
id date reserved drove
1 2017 A B
1 2018 B A
1 2019 A C
2 2017 A B
2 2018 C D
3 2018 D D
I want this result:
id date experience
1 2017 2 #(A+B)
1 2018 2 #still the same as 2017 because this customer just experienced A and B (A+B)
1 2019 3 #one more experience because C is new car class (A+B+C)
2 2017 2 #(A+B)
2 2018 4 #(A+B+C+D)
3 2018 1 #(D)
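Answer 0 (score: 1)
How about this? It uses a list comprehension, because pandas DataFrames are not well suited to handling sets (which is what this problem ultimately comes down to).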
import pandas as pd
df = pd.DataFrame([
    [1, 2017, 'a', 'b'],
    [1, 2018, 'a', 'b'],
    [1, 2019, 'a', 'c'],
    [2, 2017, 'a', 'b'],
    [2, 2018, 'c', 'd'],
    [3, 2018, 'd', 'd'],
], columns=['id', 'date', 'reserved', 'drove'])
# one (id, date, {classes}) tuple per row
list_of_sets = [(v[0], v[1], {v[2], v[3]}) for v in df.values]
sorted_list = sorted(list_of_sets, key=lambda t: (t[0], t[1])) # not necessary if already sorted
seen = {} # id -> every class that customer has met so far
rows = []
for cid, date, classes in sorted_list:
    # keep a running set per customer, so earlier rows are not forgotten
    seen.setdefault(cid, set()).update(classes)
    rows.append((cid, date, len(seen[cid])))
result = pd.DataFrame(rows, columns=['id', 'date', 'count'])
Answer 1 (score: 1)
Here is a numpy-based approach:
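import numpy as np
# sort the two class columns within each row (np.sort works along the last axis)
df[['reserved','drove']] = np.sort(df[['reserved','drove']])
# sort rows by id, reserved and drove (note: this orders rows by class, not by date)
df = df.sort_values(['id','reserved','drove'])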
Now let's define some conditions to get the desired output:
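# Does the id change from the previous row?
c1 = df.id.ne(df.id.shift()).values
# Does each column differ from the same column in the previous row?
c2 = (df[['reserved','drove']].ne(df[['reserved','drove']].shift(1))).values
# Does each value differ from the one to its left in the same row? (False for drove when reserved == drove)
c3 = (df[['reserved','drove']].ne(df[['reserved','drove']].shift(1, axis=1))).values
# Per row: count values that changed from the previous row (or every value on a new id), skipping in-row duplicates
df['experience'] = ((c2 + c1[:,None]) * c3).sum(1)
# Running total per id
df = df[['id','date']].assign(experience = df.groupby('id').experience.cumsum())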
print(df)
id date experience
0 1 2017 2
1 1 2018 2
2 1 2019 3
3 2 2017 2
4 2 2018 4
5 3 2018 1
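The expression `((c2 + c1[:,None]) * c3).sum(1)` relies on how numpy treats arithmetic on boolean arrays. As a minimal sketch, the toy masks below (illustrative names and values, chosen to mirror the three rows of customer 1) show that `+` acts as element-wise OR, `*` as element-wise AND, and that summing along axis 1 counts the `True` entries per row:

```python
import numpy as np

# Toy masks for three rows and two columns (illustrative values only)
id_changed = np.array([True, False, False])           # plays the role of c1
differs_from_prev = np.array([[True, True],
                              [False, False],
                              [False, True]])         # plays the role of c2
differs_from_left = np.array([[True, True],
                              [True, True],
                              [True, True]])          # plays the role of c3

# "+" on boolean arrays is element-wise OR; [:, None] broadcasts the
# per-row flag across both columns
either = differs_from_prev + id_changed[:, None]

# "*" on boolean arrays is element-wise AND
counted = either * differs_from_left

# Summing booleans along axis 1 counts the True entries in each row
new_per_row = counted.sum(1)
print(new_per_row)  # [2 0 1]
```

A cumulative sum of `[2, 0, 1]` gives 2, 2, 3, which matches the output for customer 1 above.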
Answer 2 (score: 1)
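It can be done in two lines once the rows are sorted by id and date (and I'm pretty sure someone could squeeze it into one):
Build a list of the reserved and driven classes for each row, concatenate those lists cumulatively within each id (using cumsum), then count the distinct entries
df = df.sort_values(['id', 'date']).reset_index(drop=True) # needed for unsorted input
df['aux'] = list(map(list, zip(df.reserved, df.drove)))
df['aux_cum'] = [len(set(x)) for x in df.groupby('id')['aux'].apply(lambda x: x.cumsum())]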
Output:
id date reserved drove aux aux_cum
0 1 2017 A B [A, B] 2
1 1 2018 B A [B, A] 2
2 1 2019 A C [A, C] 3
3 2 2017 A B [A, B] 2
4 2 2018 C D [C, D] 4
5 3 2018 D D [D, D] 1
Pretty format:
print(df.drop(['reserved','drove','aux'], axis=1))
id date aux_cum
0 1 2017 2
1 1 2018 2
2 1 2019 3
3 2 2017 2
4 2 2018 4
5 3 2018 1
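For the unsorted, 6-million-row case mentioned in the question, one possible variant avoids per-row Python objects altogether: reshape to one row per (id, date, class), keep only the first date on which each customer met each class, and take a cumulative sum of those first sightings. This is a sketch rather than a benchmarked solution. The helper name `cumulative_experience` and the column name `car_class` are made up for illustration, and it assumes the result should have one row per (id, date) pair.

```python
import pandas as pd

# Sample data from the question, deliberately shuffled to mimic unsorted input
df = pd.DataFrame(
    [
        [2, 2018, "C", "D"],
        [1, 2019, "A", "C"],
        [1, 2017, "A", "B"],
        [3, 2018, "D", "D"],
        [2, 2017, "A", "B"],
        [1, 2018, "B", "A"],
    ],
    columns=["id", "date", "reserved", "drove"],
)


def cumulative_experience(df):
    """Hypothetical helper: distinct classes seen per customer up to each date."""
    # Long format: one row per (id, date, car class)
    long = df.melt(
        id_vars=["id", "date"],
        value_vars=["reserved", "drove"],
        value_name="car_class",
    )
    # Keep only the earliest date on which each customer met each class
    first_seen = long.sort_values(["id", "date"]).drop_duplicates(["id", "car_class"])
    # Number of classes that are new on each (id, date)
    new_per_date = first_seen.groupby(["id", "date"]).size().rename("new")

    # One output row per (id, date), in chronological order per customer
    out = (
        df[["id", "date"]]
        .drop_duplicates()
        .sort_values(["id", "date"])
        .reset_index(drop=True)
    )
    out = out.merge(new_per_date.reset_index(), on=["id", "date"], how="left")
    # Dates that introduced no new class contribute zero
    out["new"] = out["new"].fillna(0).astype(int)
    out["experience"] = out.groupby("id")["new"].cumsum()
    return out.drop(columns="new")


result = cumulative_experience(df)
print(result)
```

On the sample data this gives experience values of 2, 2, 3 for customer 1, then 2, 4 for customer 2, and 1 for customer 3, in line with the result requested in the question.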