Question

I posted this question and need to expand on the application. I now need to get the Nth max date for each Vendor:

#import pandas as pd
#df = pd.read_clipboard()
#df['Insert_Date'] = pd.to_datetime(df['Insert_Date'])

# used in example below
#df2 = df.sort_values(['Vendor','Insert_Date']).drop_duplicates(['Vendor'],keep='last')

Vendor  Insert_Date Total
Steph   2017-10-25  2
Matt    2017-10-31  13
Chris   2017-11-03  3
Steve   2017-10-23  11
Chris   2017-10-27  3
Steve   2017-11-01  11

If I needed the 2nd max date, the expected output would be:

Vendor  Insert_Date Total
Steph   2017-10-25  2
Steve   2017-10-23  11
Matt    2017-10-31  13
Chris   2017-10-27  3

I can easily get the 2nd max date by using df2 from the example above:

df.loc[~df.index.isin(df2.index)]

but if I need the 50th max value, that is a lot of DataFrame building just to use isin()...

I have also tried

df.groupby('Vendor')['Insert_Date'].nlargest(N_HERE)

which gets me close, but I then need to pick out the Nth value for each Vendor.

I have also tried filtering df by Vendor:

df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)

but if I try to get the second record with df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)[2], it returns Timestamp('2017-11-03 00:00:00'). Instead I need to use df.loc[df['Vendor']=='Chris', 'Insert_Date'].nlargest(2)[1:2]. Why must I use slicing here rather than a single integer index?

In summary: how do I return the Nth largest date by Vendor?
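On the indexing puzzle above, here is a minimal sketch of what is probably going on. It assumes the sample data is rebuilt by hand (the DataFrame construction below is not from the original post). nlargest keeps the original row labels, so an integer inside [] on this Series is looked up as a label, while an integer slice such as [1:2] is positional. Using .loc and .iloc makes the intent explicit.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand for illustration
df = pd.DataFrame({
    'Vendor': ['Steph', 'Matt', 'Chris', 'Steve', 'Chris', 'Steve'],
    'Insert_Date': pd.to_datetime(['2017-10-25', '2017-10-31', '2017-11-03',
                                   '2017-10-23', '2017-10-27', '2017-11-01']),
    'Total': [2, 13, 3, 11, 3, 11],
})

top2 = df.loc[df['Vendor'] == 'Chris', 'Insert_Date'].nlargest(2)

# The result keeps the row labels of df, so its index is [2, 4], not [0, 1]
labels = list(top2.index)

by_label = top2.loc[2]      # label 2 is the row holding the largest date
by_position = top2.iloc[1]  # position 1 is the second-largest date
```

With this data, the label 2 happens to belong to the largest date, which would explain the Timestamp('2017-11-03 00:00:00') result reported above.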

Answer 1

I might have misunderstood your initial problem. You can sort, then group by Vendor, and then use apply + iloc in this manner:

n = 9
df.sort_values('Insert_Date')\
  .groupby('Vendor', as_index=False).apply(lambda x: x.iloc[-n])

For your example data, it seems n = 0 does the trick:

df.sort_values('Insert_Date')\
  .groupby('Vendor', as_index=False).apply(lambda x: x.iloc[0])

  Vendor Insert_Date  Total
0  Chris  2017-10-27      3
1   Matt  2017-10-31     13
2  Steph  2017-10-25      2
3  Steve  2017-10-23     11

Beware: this code will throw an error if any of your Vendor groups is smaller in size than n.
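One way to sidestep the error for small groups is to number the rows within each Vendor after sorting and then filter on that number. The sketch below is not from the answer above. The function name nth_largest_strict and the hand-built DataFrame are illustrative, and it assumes that vendors with fewer than n rows should simply be left out (the question's expected output keeps them instead).

```python
import pandas as pd

# Sample data from the question, rebuilt by hand for illustration
df = pd.DataFrame({
    'Vendor': ['Steph', 'Matt', 'Chris', 'Steve', 'Chris', 'Steve'],
    'Insert_Date': pd.to_datetime(['2017-10-25', '2017-10-31', '2017-11-03',
                                   '2017-10-23', '2017-10-27', '2017-11-01']),
    'Total': [2, 13, 3, 11, 3, 11],
})


def nth_largest_strict(frame, n, group_col='Vendor', date_col='Insert_Date'):
    """Return the row with the n-th largest date per group (n starts at 1).

    Groups with fewer than n rows are omitted instead of raising.
    """
    ordered = frame.sort_values(date_col, ascending=False)
    # cumcount numbers the rows of each group 0, 1, 2, ... in sorted order
    rank_in_group = ordered.groupby(group_col).cumcount()
    return ordered[rank_in_group == n - 1]


second = nth_largest_strict(df, 2)     # only Chris and Steve have two rows
fiftieth = nth_largest_strict(df, 50)  # empty frame, no exception
```

Because the filter is a boolean mask rather than a positional lookup, asking for the 50th largest date on this data yields an empty DataFrame.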

Answer 2

I would use head (you can pick the top n here; I use 2), followed by drop_duplicates keeping the last row per Vendor.

df.sort_values('Insert_Date',ascending=False).groupby('Vendor').\
     head(2).drop_duplicates('Vendor',keep='last').sort_index()
Out[609]: 
  Vendor Insert_Date  Total
0  Steph  2017-10-25      2
1   Matt  2017-10-31     13
3  Steve  2017-10-23     11
4  Chris  2017-10-27      3
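The same chain can be wrapped in a small function so that n becomes a parameter. This is a sketch only: the name nth_largest_or_last and the hand-built DataFrame are illustrative and not part of the answer. A side effect of head + drop_duplicates is that a Vendor with fewer than n rows falls back to its smallest date, which matches the expected output in the question.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand for illustration
df = pd.DataFrame({
    'Vendor': ['Steph', 'Matt', 'Chris', 'Steve', 'Chris', 'Steve'],
    'Insert_Date': pd.to_datetime(['2017-10-25', '2017-10-31', '2017-11-03',
                                   '2017-10-23', '2017-10-27', '2017-11-01']),
    'Total': [2, 13, 3, 11, 3, 11],
})


def nth_largest_or_last(frame, n, group_col='Vendor', date_col='Insert_Date'):
    """Row with the n-th largest date per group, or the smallest date
    when the group has fewer than n rows."""
    # Keep at most the n latest rows of every group, latest first
    top_n = (frame.sort_values(date_col, ascending=False)
                  .groupby(group_col)
                  .head(n))
    # The last surviving row of each group is its n-th largest (or its only) date
    return top_n.drop_duplicates(group_col, keep='last').sort_index()


result = nth_largest_or_last(df, 2)
far = nth_largest_or_last(df, 50)
```

For n = 2 this returns the rows with index 0, 1, 3 and 4, the same rows as in the output above.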

Answer 3

I like @COLDSPEED's answer since it is more direct. Here is one using nlargest, which involves an intermediate step of creating an nth_largest column:

n = 2
df1['nth_largest'] = df1.groupby('Vendor').Insert_Date.transform(lambda x: x.nlargest(n).min())
df1.drop_duplicates(subset = ['Vendor', 'nth_largest']).drop('Insert_Date', axis = 1)


    Vendor  Total   nth_largest
0   Steph   2   2017-10-25
1   Matt    13  2017-10-31
2   Chris   3   2017-10-27
3   Steve   11  2017-10-23

Get nth largest date pandas
