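I have a data frame containing millions of sales orders. Each row represents one line item in a shopping cart. I need to merge orders that were placed on the same day but are recorded as separate orders. More precisely, all orders from the same customer that were placed on the same day and ship on the same day should be assigned the same order ID (it does not matter which one).
Columns: 'customer_id', 'order_id', ..., 'order_date', 'ship_date'
My naive solution works, but it is very slow: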
for _, customer_groups in df.groupby(by='customer_id'):
    for _, same_day_orders in customer_groups.groupby(by=['order_date', 'ship_date']):
        # Only merge if multiple orders per day.
        if same_day_orders.shape[0] > 1:
            # Now step through the line items two at a time.
            row_iterator = same_day_orders.iterrows()
            _, last_row = next(row_iterator)
            for idx, current_row in row_iterator:
                # Check if the next line item has the same 'ship_date' and a different 'order_id'...
                same_shipping_date = (last_row.ship_date == current_row.ship_date)
                # Compare values with !=; 'is not' tests object identity, which is unreliable for strings.
                different_order_id = (last_row.order_id != current_row.order_id)
                # ... if so, merge the rows by assigning the second line item the same 'order_id' as its predecessor.
                if same_shipping_date and different_order_id:
                    df.loc[idx, 'order_id'] = last_row.order_id
                else:
                    # Leave last_row unchanged after a merge so the merged ID carries forward.
                    last_row = current_row
Example (the first table is the input, the second is the desired result):
index customer_id order_id order_date ship_date
1234 C0176 S0159 2018-03-24 2018-04-23
1235 C0176 S0163 2018-03-24 2018-04-23
1236 C0176 S0163 2018-03-24 2018-04-23
1237 C0176 S0171 2018-03-24 2018-05-01
index customer_id order_id order_date ship_date
1234 C0176 S0159 2018-03-24 2018-04-23
1235 C0176 S0159 2018-03-24 2018-04-23
1236 C0176 S0159 2018-03-24 2018-04-23
1237 C0176 S0171 2018-03-24 2018-05-01
How can I solve this in a smarter way, i.e. faster while staying readable?
Answer 0 (score: 2)
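This is a great job for transform, which performs a transformation on a grouped series but ensures that the index of the result matches the index of the input (rather than collapsing each group into a single result, as agg does). You can use it like this: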
# Get groups of equal customer_id, order_date, and ship_date:
groups = df.groupby(['customer_id', 'order_date', 'ship_date'])
# Get the last order_id value, but ensure its index matches df:
collapsed_orders = groups['order_id'].transform(lambda x: x.iloc[-1])
# Overwrite the original order_id with this new value:
df['order_id'] = collapsed_orders
Or, as a one-liner:
df['order_id'] = df.groupby(['customer_id', 'order_date', 'ship_date'])['order_id'].transform(lambda x: x.iloc[-1])
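The difference between agg and transform described above can be seen on the sample rows from the question. The following is only a minimal sketch, not part of the original answer. It assumes that the built-in "first" aggregation is an acceptable substitute for the lambda, since the question says any of the order IDs will do. Note that "first" picks the first non-null value in each group, and with this choice the result happens to match the desired table in the question (S0159 rather than S0163).

```python
import pandas as pd

# Sample rows copied from the question's example.
df = pd.DataFrame(
    {
        "customer_id": ["C0176"] * 4,
        "order_id": ["S0159", "S0163", "S0163", "S0171"],
        "order_date": ["2018-03-24"] * 4,
        "ship_date": ["2018-04-23", "2018-04-23", "2018-04-23", "2018-05-01"],
    },
    index=[1234, 1235, 1236, 1237],
)

keys = ["customer_id", "order_date", "ship_date"]

# agg collapses each group into a single row, indexed by the group keys.
per_group = df.groupby(keys)["order_id"].agg("first")

# transform broadcasts the per-group value back onto the original index,
# so the result can be assigned straight to a column of df.
merged_ids = df.groupby(keys)["order_id"].transform("first")

df["order_id"] = merged_ids
print(df)
```

Passing the aggregation by name instead of a lambda lets pandas use its built-in grouped implementation, which is usually faster on large frames, though the actual speedup on millions of rows would need to be measured.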