Problem:
I want to select only the most recent price record for each (uid, retailer) pair.
Data:
import pandas as pd
import numpy as np
data = {"uid":{"0":"123","1":"345","2":"678","3":"123","4":"345","5":"123","6":"678","7":"369","8":"890","9":"678"},"retailer":{"0":"GUY","1":"GUY","2":"GUY","3":"GUY","4":"GUY","5":"GAL","6":"GUY","7":"GAL","8":"GAL","9":"GUY"},"upload date":{"0":"11/17/17","1":"11/17/17","2":"11/16/17","3":"11/16/17","4":"11/16/17","5":"11/17/17","6":"11/17/17","7":"11/17/17","8":"11/17/17","9":"11/15/17"},"price":{"0":12.00,"1":1.23, "2":34.00, "3":69.69, "4":13.72, "5":49.98, "6":98.02, "7":1.02,"8":98.23,"9":12.69}}
df = pd.DataFrame(data=data)
df = df[['uid','retailer','upload date','price']]
df['upload date']=pd.to_datetime(df['upload date'])
Current solution:
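# Latest upload date per (uid, retailer), as a flat DataFrame.
# (Series.reset_index(inplace=True) cannot turn a Series into a DataFrame, so chain it instead.)
idx = df.groupby(['uid','retailer'])['upload date'].max().reset_index()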
solution = idx.merge(df, how='left', on=['uid','retailer','upload date'])
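Question:
I would like to be able to use the index to implement my solution. Alternatively, I would like to use a join, or find the max date for each pair with a function that preserves the original DataFrame's index.
Join error: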
idx.set_index(['uid','retailer','upload date']).join(df, on=['uid','retailer','upload date'])
Returns:
ValueError: len(left_on) must equal the number of levels in the index of "right"
Answer 0 (score: 1)
IIUC, use idxmax:
df.loc[df.groupby(['uid','retailer'])['upload date'].idxmax()]
Out[168]:
   uid retailer upload date  price
5  123      GAL  2017-11-17  49.98
0  123      GUY  2017-11-17  12.00
1  345      GUY  2017-11-17   1.23
7  369      GAL  2017-11-17   1.02
6  678      GUY  2017-11-17  98.02
8  890      GAL  2017-11-17  98.23
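As a self-contained check of the idxmax approach, here is a minimal sketch. It assumes the same rows as the question's data but builds the frame from plain lists (so the index is 0..9), and the names latest_labels and latest are illustrative only.

```python
import pandas as pd

# Same rows as the question's data, built from lists so the index is 0..9.
df = pd.DataFrame({
    "uid": ["123", "345", "678", "123", "345", "123", "678", "369", "890", "678"],
    "retailer": ["GUY", "GUY", "GUY", "GUY", "GUY", "GAL", "GUY", "GAL", "GAL", "GUY"],
    "upload date": ["11/17/17", "11/17/17", "11/16/17", "11/16/17", "11/16/17",
                    "11/17/17", "11/17/17", "11/17/17", "11/17/17", "11/15/17"],
    "price": [12.00, 1.23, 34.00, 69.69, 13.72, 49.98, 98.02, 1.02, 98.23, 12.69],
})
df["upload date"] = pd.to_datetime(df["upload date"], format="%m/%d/%y")

# For each (uid, retailer) group, idxmax gives the index label of the row
# holding the latest upload date.
latest_labels = df.groupby(["uid", "retailer"])["upload date"].idxmax()

# Selecting by those labels keeps the original index of df.
latest = df.loc[latest_labels]
print(latest)
```

Because the rows are selected by label, the result keeps the original index, which is what the question asks for. If two rows in a pair share the same latest date, idxmax returns only the first of them.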
Or reindex:
df.reindex(df.groupby(['uid','retailer'])['upload date'].idxmax().values)
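The two selections should give the same rows whenever every label exists in the frame. The sketch below checks that on a small illustrative frame (not the question's data) and shows the one difference I am aware of: reindex fills rows for missing labels instead of failing.

```python
import pandas as pd

# Small illustrative frame (not the question's data).
df = pd.DataFrame({
    "uid": ["123", "123", "345"],
    "retailer": ["GUY", "GUY", "GUY"],
    "upload date": pd.to_datetime(["2017-11-16", "2017-11-17", "2017-11-17"]),
    "price": [69.69, 12.00, 1.23],
})

labels = df.groupby(["uid", "retailer"])["upload date"].idxmax()

via_loc = df.loc[labels]                  # label-based selection
via_reindex = df.reindex(labels.values)   # conform df to the given labels
same = via_loc.equals(via_reindex)

# Label 99 does not exist: reindex returns a row of missing values for it.
padded = df.reindex([1, 99])
print(same)
print(padded)
```

In recent pandas versions, df.loc with a list containing a missing label raises a KeyError, so loc is the stricter choice here. Labels returned by idxmax always come from the frame itself, so either form works for this problem.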
If you want join, the docs say: join columns with another DataFrame either on the index or on a key column. So set the same keys as the index on both frames:
idx.set_index(['uid','retailer','upload date']).join(df.set_index(['uid','retailer','upload date']))
Out[175]:
                          price
uid retailer upload date
123 GAL      2017-11-17   49.98
    GUY      2017-11-17   12.00
345 GUY      2017-11-17    1.23
369 GAL      2017-11-17    1.02
678 GUY      2017-11-17   98.02
890 GAL      2017-11-17   98.23
To get the expected output, add .reset_index() at the end of that expression, or do something similar.
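One possible "something similar", sketched under assumptions rather than taken from the answer: keep the keys as ordinary columns on the left frame and pass them through on=, with the right frame indexed by the same keys. The small frame and the names keys, joined, and flat are illustrative only.

```python
import pandas as pd

keys = ["uid", "retailer", "upload date"]

# Small illustrative frame (not the question's data).
df = pd.DataFrame({
    "uid": ["123", "123", "345"],
    "retailer": ["GUY", "GUY", "GUY"],
    "upload date": pd.to_datetime(["2017-11-16", "2017-11-17", "2017-11-17"]),
    "price": [69.69, 12.00, 1.23],
})

# Latest upload date per (uid, retailer), as a flat DataFrame.
idx = df.groupby(["uid", "retailer"])["upload date"].max().reset_index()

# `on` names columns of the left frame; they are matched against the
# index of the right frame, so the right frame needs the keys as its index.
joined = idx.join(df.set_index(keys), on=keys)

# The answer's form: index both sides by the keys, then flatten again.
flat = idx.set_index(keys).join(df.set_index(keys)).reset_index()
print(joined)
```

This also suggests why the join in the question failed: it passed three keys in on= while the right frame, df, still had its default single-level index. Note that both join forms carry the index of idx, not the original row labels of df; the idxmax selection is the one that preserves those.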