我想统计每个客户在每个订单日期完成的所有订单,以找出每个订单时完成的订单数量。
输入:
预期输出:
以下代码有效,但速度极慢。 10 万多行需要 10 个小时以上的时间。肯定有更好的方法。
orders_total = []
for y,row in df_dated_filt.iterrows():
orders_total.append(df_dated_filt[(df_dated_filt["order_id"] != row["order_id"]) &
(df_dated_filt["customer_id"] == row["customer_id"]) &
(pd.to_datetime(df_dated_filt['order_date'])<pd.to_datetime(row['order_date']))]["order_id"].count())
df_dated_filt["orders_total"] = orders_total
答案 0 :(得分:2)
尝试 sort_values
按升序获取日期,然后尝试 groupby cumcount
按顺序枚举组:
df['orders_total'] = df.sort_values('order_date').groupby('customer_id').cumcount()
df
:
customer_id order_id order_date orders_total
0 1 12 2019-01-06 1
1 1 22 2019-01-01 0
2 2 34 2018-05-08 0
3 2 33 2018-05-12 1
4 2 38 2018-05-29 2
完整的工作示例:
import pandas as pd
df = pd.DataFrame({
'customer_id': [1, 1, 2, 2, 2],
'order_id': [12, 22, 34, 33, 38],
'order_date': ['2019-01-06', '2019-01-01', '2018-05-08', '2018-05-12',
'2018-05-29']
})
df['order_date'] = pd.to_datetime(df['order_date'])
df['orders_total'] = (
df.sort_values('order_date')
.groupby('customer_id')
.cumcount()
)
print(df)
编辑
假设相同的日期应该通过 rank
每个组具有相同的值:
import pandas as pd
df = pd.DataFrame({
'customer_id': [1, 1, 1, 2, 2, 2],
'order_id': [15, 12, 22, 34, 33, 38],
'order_date': ['2019-01-06', '2019-01-06', '2019-01-01',
'2018-05-08', '2018-05-12', '2018-05-29']
})
df['order_date'] = pd.to_datetime(df['order_date'])
df['orders_total'] = (
df.sort_values('order_date')
.groupby('customer_id')['order_date']
.rank(method='dense').astype(int) - 1
)
df
:
customer_id order_id order_date orders_total
0 1 15 2019-01-06 1
1 1 12 2019-01-06 1
2 1 22 2019-01-01 0
3 2 34 2018-05-08 0
4 2 33 2018-05-12 1
5 2 38 2018-05-29 2