For various reasons, I'd like to work with a Pandas DataFrame that has this general structure:
import datetime
import pandas
exampledf = pandas.DataFrame([
{'PersonId':'123','Interest':'Basketball','SubmittedDate':datetime.datetime.strptime('2018-04-18 13:00:08', '%Y-%m-%d %H:%M:%S'),'Question':'Cake or death?'},
{'PersonId':'123','Interest':'Baseball','SubmittedDate':datetime.datetime.strptime('1999-01-01 09:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Swallow speed?'},
{'PersonId':'456','Interest':'Swimming','SubmittedDate':datetime.datetime.strptime('2011-02-27 23:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Answer to life, universe, everything?'},
{'PersonId':'123','Interest':'Basketball','SubmittedDate':datetime.datetime.strptime('2018-04-18 13:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'N/A'},
{'PersonId':'789','Interest':'Racquetball','SubmittedDate':datetime.datetime.strptime('2018-05-02 12:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Will there be food?'},
{'PersonId':'789','Interest':'Racquetball','SubmittedDate':datetime.datetime.strptime('2002-05-28 02:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Swag?'},
{'PersonId':'789','Interest':'Racquetball','SubmittedDate':datetime.datetime.strptime('2018-05-02 12:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Good, thanks.'}
])
exampledf.set_index(['PersonId','Interest'], inplace=True)
print(exampledf)
So that it looks like this:
                                                   Question       SubmittedDate
PersonId Interest
123      Basketball                          Cake or death? 2018-04-18 13:00:08
         Baseball                            Swallow speed? 1999-01-01 09:00:00
456      Swimming     Answer to life, universe, everything? 2011-02-27 23:00:00
123      Basketball                                     N/A 2018-04-18 13:00:00
789      Racquetball                    Will there be food? 2018-05-02 12:00:00
         Racquetball                                  Swag? 2002-05-28 02:00:00
         Racquetball                          Good, thanks. 2018-05-02 12:00:00
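I'd like the output to keep the same structure as the input, minus any rows that don't have the latest SubmittedDate for their PersonId/Interest pair, with ties broken arbitrarily (the first row found is fine).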
I've found plenty of ways to get this done, but they all involve stripping and re-adding the index. For example, I can call exampledf.reset_index() before .groupby() and then .set_index() again once I'm done, but that seems awkward, and I'm struggling to do better.
Another attempt is .groupby(level=[0,1]), which adds redundant "PersonId" and "Interest" index levels. That causes no problem for .max(), and .reset_index(level=[0,1], drop=True) restores the general look and feel. But when I try to squeeze in a drop_duplicates() on "PersonId", "Interest", and "SubmittedDate", I can't get it to work without yet more grouping and resetting. For example, this gives me a KeyError: 'PersonId' error:
lastsubmittedperlookuptiesbrokendf = exampledf.groupby(level=[0,1]).apply(lambda x: x[x['SubmittedDate'] == x['SubmittedDate'].max()]).reset_index(level=[0,1], drop=True, inplace=False).drop_duplicates(subset=['PersonId','Interest','SubmittedDate'])
As does this:
lastsubmittedperlookuptiesbrokendf = exampledf.groupby(level=[0,1]).apply(lambda x: x[x['SubmittedDate'] == x['SubmittedDate'].max()]).drop_duplicates(subset=['PersonId','Interest','SubmittedDate']).reset_index(level=[0,1], drop=True, inplace=False)
What is the most Pythonic way to get the output described above?
(Note that my current clunky implementation rearranges the interests, but I don't care about the order they end up in.)
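The reset_index() / set_index() round trip mentioned above can be sketched roughly as follows. This is only an illustration, not the exact code from the question: the four-row frame df and the names flat, is_latest, and latest are made up for the example. It also hints at why the two one-liners fail: drop_duplicates(subset=...) only looks at columns, so PersonId and Interest have to be ordinary columns at that point, not index levels.

```python
import pandas as pd

# Hypothetical four-row frame with the same index layout as exampledf.
df = pd.DataFrame(
    {
        "PersonId": ["123", "123", "789", "789"],
        "Interest": ["Basketball", "Basketball", "Racquetball", "Racquetball"],
        "SubmittedDate": pd.to_datetime(
            [
                "2018-04-18 13:00:08",
                "2018-04-18 13:00:00",
                "2018-05-02 12:00:00",
                "2018-05-02 12:00:00",
            ]
        ),
        "Question": ["Cake or death?", "N/A", "Will there be food?", "Good, thanks."],
    }
).set_index(["PersonId", "Interest"])

# Move the index levels into ordinary columns so they can be used as a subset.
flat = df.reset_index()

# True for rows whose date equals the maximum date of their group.
group_max = flat.groupby(["PersonId", "Interest"])["SubmittedDate"].transform("max")
is_latest = flat["SubmittedDate"] == group_max

latest = (
    flat[is_latest]
    # Break ties: keep the first row found for each key/date combination.
    .drop_duplicates(subset=["PersonId", "Interest", "SubmittedDate"])
    # Restore the original two-level index.
    .set_index(["PersonId", "Interest"])
)
print(latest)
```

It works, but it needs three steps purely for index bookkeeping, which is the awkwardness the question is about.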
Answer 0 (score: 2)
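Sorting is fast, so don't worry too much about the extra work compared with just taking the max. One way is to sort by SubmittedDate and then take the last entry of each group after a groupby: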
In [11]: exampledf.sort_values("SubmittedDate").groupby(level=[0,1]).last()
Out[11]:
                                                   Question       SubmittedDate
PersonId Interest
123      Baseball                            Swallow speed? 1999-01-01 09:00:00
         Basketball                          Cake or death? 2018-04-18 13:00:08
456      Swimming     Answer to life, universe, everything? 2011-02-27 23:00:00
789      Racquetball                          Good, thanks. 2018-05-02 12:00:00
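For anyone who wants to run this outside an IPython session, here is a self-contained sketch of the same idea. Two things in it are my own additions rather than part of the one-liner above: kind="mergesort" is passed so that the sort is stable and ties are broken reproducibly, and a variant using tail(1) is shown next to last(). The names by_date, latest, and latest_rows are just illustrative.

```python
import pandas as pd

exampledf = pd.DataFrame(
    [
        {"PersonId": "123", "Interest": "Basketball",
         "SubmittedDate": "2018-04-18 13:00:08", "Question": "Cake or death?"},
        {"PersonId": "123", "Interest": "Baseball",
         "SubmittedDate": "1999-01-01 09:00:00", "Question": "Swallow speed?"},
        {"PersonId": "456", "Interest": "Swimming",
         "SubmittedDate": "2011-02-27 23:00:00",
         "Question": "Answer to life, universe, everything?"},
        {"PersonId": "123", "Interest": "Basketball",
         "SubmittedDate": "2018-04-18 13:00:00", "Question": "N/A"},
        {"PersonId": "789", "Interest": "Racquetball",
         "SubmittedDate": "2018-05-02 12:00:00", "Question": "Will there be food?"},
        {"PersonId": "789", "Interest": "Racquetball",
         "SubmittedDate": "2002-05-28 02:00:00", "Question": "Swag?"},
        {"PersonId": "789", "Interest": "Racquetball",
         "SubmittedDate": "2018-05-02 12:00:00", "Question": "Good, thanks."},
    ]
)
exampledf["SubmittedDate"] = pd.to_datetime(exampledf["SubmittedDate"])
exampledf = exampledf.set_index(["PersonId", "Interest"])

# Stable sort (assumption: added here so equal dates keep their input order).
by_date = exampledf.sort_values("SubmittedDate", kind="mergesort")

# As in the answer: last entry per (PersonId, Interest) group.
latest = by_date.groupby(level=[0, 1]).last()

# Variant: tail(1) returns the last whole row of each group, index untouched.
latest_rows = by_date.groupby(level=[0, 1]).tail(1)

print(latest)
print(latest_rows)
```

One difference worth knowing: last() takes the last non-null value of each column separately, so if the data contains missing values it can combine values from different rows, whereas tail(1) always returns an actual row. With the example data both give the same four rows, only in a different order.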