What is the most Pythonic way to group and filter a Pandas DataFrame that has already been set_index()'ed at the outermost grouping levels?

Time: 2018-05-18 16:25:46

Tags: python pandas

For various reasons, I'd like to work with a Pandas DataFrame that has this general structure:

import datetime
import pandas
exampledf = pandas.DataFrame([
    {'PersonId':'123','Interest':'Basketball','SubmittedDate':datetime.datetime.strptime('2018-04-18 13:00:08', '%Y-%m-%d %H:%M:%S'),'Question':'Cake or death?'},
    {'PersonId':'123','Interest':'Baseball','SubmittedDate':datetime.datetime.strptime('1999-01-01 09:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Swallow speed?'},
    {'PersonId':'456','Interest':'Swimming','SubmittedDate':datetime.datetime.strptime('2011-02-27 23:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Answer to life, universe, everything?'},
    {'PersonId':'123','Interest':'Basketball','SubmittedDate':datetime.datetime.strptime('2018-04-18 13:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'N/A'},
    {'PersonId':'789','Interest':'Racquetball','SubmittedDate':datetime.datetime.strptime('2018-05-02 12:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Will there be food?'},
    {'PersonId':'789','Interest':'Racquetball','SubmittedDate':datetime.datetime.strptime('2002-05-28 02:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Swag?'},
    {'PersonId':'789','Interest':'Racquetball','SubmittedDate':datetime.datetime.strptime('2018-05-02 12:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Good, thanks.'}
    ])
exampledf.set_index(['PersonId','Interest'], inplace=True)
print(exampledf)

So it looks like this:

                                                   Question       SubmittedDate
PersonId Interest                                                              
123      Basketball                          Cake or death? 2018-04-18 13:00:08
         Baseball                            Swallow speed? 1999-01-01 09:00:00
456      Swimming     Answer to life, universe, everything? 2011-02-27 23:00:00
123      Basketball                                     N/A 2018-04-18 13:00:00
789      Racquetball                    Will there be food? 2018-05-02 12:00:00
         Racquetball                                  Swag? 2002-05-28 02:00:00
         Racquetball                          Good, thanks. 2018-05-02 12:00:00

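I'd like the output to keep the same structure as the input, minus any rows that don't have the most recent SubmittedDate for their PersonId/Interest pair, with ties broken arbitrarily (the first row found is fine).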

I've found plenty of ways to get this done, all involving various extra stripping and re-adding of indexes. For example:

  • I can call exampledf.reset_index() before doing a .groupby() and then .set_index() again when I'm done, but that seems awkward.
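A minimal sketch of that round trip, for reference. The small frame is a hypothetical stand-in built from a few of the rows above, and picking the winner with idxmax() is one choice among several, not necessarily what the author used:

```python
import pandas as pd

# Hypothetical stand-in for exampledf: a subset of the rows shown above.
df = pd.DataFrame(
    {
        "PersonId": ["123", "123", "789", "789", "789"],
        "Interest": ["Basketball", "Basketball",
                     "Racquetball", "Racquetball", "Racquetball"],
        "SubmittedDate": pd.to_datetime([
            "2018-04-18 13:00:08", "2018-04-18 13:00:00",
            "2018-05-02 12:00:00", "2002-05-28 02:00:00",
            "2018-05-02 12:00:00",
        ]),
        "Question": ["Cake or death?", "N/A",
                     "Will there be food?", "Swag?", "Good, thanks."],
    }
).set_index(["PersonId", "Interest"])

keys = ["PersonId", "Interest"]

# Step 1: turn the index levels back into ordinary columns (rows get labels 0..n-1).
flat = df.reset_index()

# Step 2: for each group, find the label of the row with the latest date.
winners = flat.groupby(keys)["SubmittedDate"].idxmax()

# Step 3: select those rows and restore the original index structure.
round_trip = flat.loc[winners.to_numpy()].set_index(keys)
print(round_trip)
```

idxmax() returns the first label that holds the maximum, so tied dates resolve to the first row found.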

But I'm struggling to do it cleanly. For example:

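  • I can .groupby(level=[0,1]), which adds redundant extra "PersonId" and "Interest" levels. That causes no problems for .max(), and a .reset_index(level=[0,1], drop=True) gets me back to the general look and feel. But when I try to squeeze in a drop_duplicates() on "PersonId", "Interest", and "SubmittedDate", I can't get it to work in a way that doesn't involve more groupings and resets.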

For example, this gives me a KeyError: 'PersonId' error:

lastsubmittedperlookuptiesbrokendf = exampledf.groupby(level=[0,1]).apply(lambda x: x[x['SubmittedDate'] == x['SubmittedDate'].max()]).reset_index(level=[0,1], drop=True, inplace=False).drop_duplicates(subset=['PersonId','Interest','SubmittedDate'])

As does this:

lastsubmittedperlookuptiesbrokendf = exampledf.groupby(level=[0,1]).apply(lambda x: x[x['SubmittedDate'] == x['SubmittedDate'].max()]).drop_duplicates(subset=['PersonId','Interest','SubmittedDate']).reset_index(level=[0,1], drop=True, inplace=False)

What is the most Pythonic way to get the output described above?

(Note that my current clunky implementation rearranges the interests, but I don't care what order they end up sorted in.)
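One possible explanation for the KeyError, sketched under assumptions: drop_duplicates(subset=...) looks up column labels, and after set_index() 'PersonId' and 'Interest' are index levels rather than columns. The sketch below uses hypothetical stand-in data and illustrative variable names. It filters with transform('max') and breaks ties with index.duplicated(), so no reset_index() is needed:

```python
import pandas as pd

# Hypothetical stand-in for exampledf: a subset of the rows shown above.
df = pd.DataFrame(
    {
        "PersonId": ["123", "123", "789", "789", "789"],
        "Interest": ["Basketball", "Basketball",
                     "Racquetball", "Racquetball", "Racquetball"],
        "SubmittedDate": pd.to_datetime([
            "2018-04-18 13:00:08", "2018-04-18 13:00:00",
            "2018-05-02 12:00:00", "2002-05-28 02:00:00",
            "2018-05-02 12:00:00",
        ]),
        "Question": ["Cake or death?", "N/A",
                     "Will there be food?", "Swag?", "Good, thanks."],
    }
).set_index(["PersonId", "Interest"])

# The grouping keys live in the index, not in the columns.
print("PersonId" in df.columns)  # False: it is an index level
print(df.index.names)

# Latest date of each (PersonId, Interest) group, broadcast back to every row.
latest_date = df.groupby(level=[0, 1])["SubmittedDate"].transform("max")

# Keep only rows that carry their group's latest date (ties still present).
candidates = df[df["SubmittedDate"] == latest_date]

# Break ties on the index itself: keep the first row seen for each key pair.
result = candidates[~candidates.index.duplicated(keep="first")]
print(result)
```

Deduplicating on the index alone is enough here because every surviving row in a group already shares the same SubmittedDate.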

1 answer:

Answer 0: (score: 2)

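Sorting is fast, so I wouldn't worry too much about the extra work compared with a plain max. One approach is to sort by SubmittedDate and then take the last entry of each group after the groupby: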

In [11]: exampledf.sort_values("SubmittedDate").groupby(level=[0,1]).last()
Out[11]: 
                                                   Question       SubmittedDate
PersonId Interest                                                              
123      Baseball                            Swallow speed? 1999-01-01 09:00:00
         Basketball                          Cake or death? 2018-04-18 13:00:08
456      Swimming     Answer to life, universe, everything? 2011-02-27 23:00:00
789      Racquetball                          Good, thanks. 2018-05-02 12:00:00
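
The same idea as a self-contained script, offered as a sketch rather than the answerer's exact session. kind="mergesort" is an addition here that makes the sort stable, so tied dates keep their original relative order. The tail(1) variant is an alternative not in the answer, shown because last() takes the last non-null value of each column separately, whereas tail(1) returns whole rows:

```python
import pandas as pd

# The seven rows from the question, rebuilt column by column.
exampledf = pd.DataFrame(
    {
        "PersonId": ["123", "123", "456", "123", "789", "789", "789"],
        "Interest": ["Basketball", "Baseball", "Swimming", "Basketball",
                     "Racquetball", "Racquetball", "Racquetball"],
        "SubmittedDate": pd.to_datetime([
            "2018-04-18 13:00:08", "1999-01-01 09:00:00",
            "2011-02-27 23:00:00", "2018-04-18 13:00:00",
            "2018-05-02 12:00:00", "2002-05-28 02:00:00",
            "2018-05-02 12:00:00",
        ]),
        "Question": ["Cake or death?", "Swallow speed?",
                     "Answer to life, universe, everything?", "N/A",
                     "Will there be food?", "Swag?", "Good, thanks."],
    }
).set_index(["PersonId", "Interest"])

# Stable sort by date: rows with equal dates stay in their original order.
by_date = exampledf.sort_values("SubmittedDate", kind="mergesort")

# The answer's approach: last non-null value per column within each group.
via_last = by_date.groupby(level=[0, 1]).last()

# Alternative: the last whole row of each group, index left untouched.
via_tail = by_date.groupby(level=[0, 1]).tail(1)

print(via_last)
print(via_tail)
```

Both results keep the (PersonId, Interest) index. last() returns the groups in sorted key order, while tail(1) leaves the rows in date order.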