Question

For various reasons, I'd like to work with a Pandas DataFrame that has this general structure:

import datetime
import pandas
exampledf = pandas.DataFrame([
    {'PersonId':'123','Interest':'Basketball','SubmittedDate':datetime.datetime.strptime('2018-04-18 13:00:08', '%Y-%m-%d %H:%M:%S'),'Question':'Cake or death?'},
    {'PersonId':'123','Interest':'Baseball','SubmittedDate':datetime.datetime.strptime('1999-01-01 09:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Swallow speed?'},
    {'PersonId':'456','Interest':'Swimming','SubmittedDate':datetime.datetime.strptime('2011-02-27 23:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Answer to life, universe, everything?'},
    {'PersonId':'123','Interest':'Basketball','SubmittedDate':datetime.datetime.strptime('2018-04-18 13:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'N/A'},
    {'PersonId':'789','Interest':'Racquetball','SubmittedDate':datetime.datetime.strptime('2018-05-02 12:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Will there be food?'},
    {'PersonId':'789','Interest':'Racquetball','SubmittedDate':datetime.datetime.strptime('2002-05-28 02:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Swag?'},
    {'PersonId':'789','Interest':'Racquetball','SubmittedDate':datetime.datetime.strptime('2018-05-02 12:00:00', '%Y-%m-%d %H:%M:%S'),'Question':'Good, thanks.'}
    ])
exampledf.set_index(['PersonId','Interest'], inplace=True)
print(exampledf)

So it looks like this:

                                                   Question       SubmittedDate
PersonId Interest                                                              
123      Basketball                          Cake or death? 2018-04-18 13:00:08
         Baseball                            Swallow speed? 1999-01-01 09:00:00
456      Swimming     Answer to life, universe, everything? 2011-02-27 23:00:00
123      Basketball                                     N/A 2018-04-18 13:00:00
789      Racquetball                    Will there be food? 2018-05-02 12:00:00
         Racquetball                                  Swag? 2002-05-28 02:00:00
         Racquetball                          Good, thanks. 2018-05-02 12:00:00

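I'd like the output to keep the same structure as the input, minus any rows that don't have the most recent SubmittedDate for their PersonId and Interest, with ties broken arbitrarily (the first row found is fine).

I've found plenty of ways to get it done that involve stripping and re-adding the index. For example:

I can call exampledf.reset_index() before doing the .groupby() and then .set_index() again when I'm done, but that seems awkward.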
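For reference, that round trip might look like the sketch below. The frame is a trimmed-down stand-in for exampledf, and picking rows with idxmax is an assumption made for illustration, not necessarily what was originally tried. idxmax returns the first row holding the group's maximum, which fits the "first row found is fine" tie rule.

```python
import pandas as pd

# Trimmed-down stand-in for exampledf (same index levels and columns).
df = pd.DataFrame({
    'PersonId': ['123', '123', '123', '789', '789'],
    'Interest': ['Basketball', 'Baseball', 'Basketball',
                 'Racquetball', 'Racquetball'],
    'SubmittedDate': pd.to_datetime([
        '2018-04-18 13:00:08', '1999-01-01 09:00:00',
        '2018-04-18 13:00:00', '2018-05-02 12:00:00',
        '2018-05-02 12:00:00']),
    'Question': ['Cake or death?', 'Swallow speed?', 'N/A',
                 'Will there be food?', 'Good, thanks.'],
}).set_index(['PersonId', 'Interest'])

# 1. Strip the index so the levels become ordinary columns.
flat = df.reset_index()

# 2. For each (PersonId, Interest) group, find the row label of the
#    latest SubmittedDate (first occurrence wins on ties).
latest_rows = flat.groupby(['PersonId', 'Interest'])['SubmittedDate'].idxmax()

# 3. Select those rows and put the index back.
roundtrip = flat.loc[latest_rows].set_index(['PersonId', 'Interest'])
print(roundtrip)
```

It works, but the index is removed and rebuilt only to make the grouping keys addressable as columns.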

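But I'm struggling to do it in place, without that round trip. For example:

I can .groupby(level=[0,1]), which adds redundant "PersonId" & "Interest" levels to the index. That doesn't cause problems with .max(), and .reset_index(level=[0,1], drop=True) gets me back to the general look and feel. But when I try to squeeze in a drop_duplicates() on "PersonId", "Interest" and "SubmittedDate", I can't get it to work without yet more grouping and resetting.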

For example, this gives me a KeyError: 'PersonId' error:

lastsubmittedperlookuptiesbrokendf = exampledf.groupby(level=[0,1]).apply(lambda x: x[x['SubmittedDate'] == x['SubmittedDate'].max()]).reset_index(level=[0,1], drop=True, inplace=False).drop_duplicates(subset=['PersonId','Interest','SubmittedDate'])

As does this:

lastsubmittedperlookuptiesbrokendf = exampledf.groupby(level=[0,1]).apply(lambda x: x[x['SubmittedDate'] == x['SubmittedDate'].max()]).drop_duplicates(subset=['PersonId','Interest','SubmittedDate']).reset_index(level=[0,1], drop=True, inplace=False)

What is the most Pythonic way to get the output described above: one row per PersonId and Interest, the one with the latest SubmittedDate, still indexed the same way as the input?

(Note that my current clunky implementation rearranges the Interests, but I don't care what order they end up in.)
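A likely reason for the KeyError in both attempts: drop_duplicates(subset=...) only looks up column labels, and 'PersonId' and 'Interest' are index levels here, not columns. The small frame below is a made-up stand-in, used only to reproduce the behaviour in isolation.

```python
import pandas as pd

# Hypothetical miniature frame: PersonId and Interest live in the index.
df = pd.DataFrame(
    {'SubmittedDate': pd.to_datetime(
        ['2018-05-02 12:00:00', '2018-05-02 12:00:00', '2002-05-28 02:00:00'])},
    index=pd.MultiIndex.from_tuples(
        [('789', 'Racquetball')] * 3, names=['PersonId', 'Interest']),
)

# 'PersonId' is an index level, so it is not found among the columns.
try:
    df.drop_duplicates(subset=['PersonId', 'SubmittedDate'])
    raised = False
except KeyError:
    raised = True
print(raised)

# Once the levels are moved into columns, the same kind of call is valid.
deduped = df.reset_index().drop_duplicates(
    subset=['PersonId', 'Interest', 'SubmittedDate'])
print(deduped)
```

This is why the attempts keep drifting back toward reset_index(): the deduplication keys have to be columns at the moment drop_duplicates() runs.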

Answer 1

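Sorting is fast enough that the extra work over a plain max isn't worth worrying about, so one way is to sort on SubmittedDate and then take the last row of each group after the groupby: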

In [11]: exampledf.sort_values("SubmittedDate").groupby(level=[0,1]).last()
Out[11]: 
                                                   Question       SubmittedDate
PersonId Interest                                                              
123      Baseball                            Swallow speed? 1999-01-01 09:00:00
         Basketball                          Cake or death? 2018-04-18 13:00:08
456      Swimming     Answer to life, universe, everything? 2011-02-27 23:00:00
789      Racquetball                          Good, thanks. 2018-05-02 12:00:00
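The same idea as a self-contained script, with two hedged additions that are not part of the answer above. First, kind='mergesort' is passed so that rows with equal timestamps keep their input order and the tie-break is reproducible. Second, a variant using Index.duplicated is shown next to it: GroupBy.last() takes the last non-null value of each column separately, whereas the mask keeps whole rows and never touches the index. On this data the two select the same rows.

```python
import pandas as pd

# Same rows as exampledf, built column-wise for brevity.
df = pd.DataFrame({
    'PersonId': ['123', '123', '456', '123', '789', '789', '789'],
    'Interest': ['Basketball', 'Baseball', 'Swimming', 'Basketball',
                 'Racquetball', 'Racquetball', 'Racquetball'],
    'SubmittedDate': pd.to_datetime([
        '2018-04-18 13:00:08', '1999-01-01 09:00:00', '2011-02-27 23:00:00',
        '2018-04-18 13:00:00', '2018-05-02 12:00:00', '2002-05-28 02:00:00',
        '2018-05-02 12:00:00']),
    'Question': ['Cake or death?', 'Swallow speed?',
                 'Answer to life, universe, everything?', 'N/A',
                 'Will there be food?', 'Swag?', 'Good, thanks.'],
}).set_index(['PersonId', 'Interest'])

# Stable sort: rows with identical timestamps stay in input order.
ordered = df.sort_values('SubmittedDate', kind='mergesort')

# The answer's approach: last entry per (PersonId, Interest) group.
via_last = ordered.groupby(level=[0, 1]).last()

# Variant: keep only the last occurrence of each index value, no groupby.
via_mask = ordered[~ordered.index.duplicated(keep='last')]

print(via_last)
print(via_mask.sort_index())
```

via_mask comes out ordered by SubmittedDate rather than by index; sort_index() is only needed if a particular row order matters.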

Most Pythonic way to group and filter a Pandas DataFrame when set_index() has already been performed at the outermost grouping level?
