Pandas: ranking after groupby using min rank

Asked: 2019-04-24 14:53:55

Tags: python r pandas pandas-groupby

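I know that pandas groupby objects have a rank method, but I would like to know whether the "min" ranking method can be used to get the same result as R's min_rank for the following problem.

The dataset, copied to my GitHub, is only a few MB.

My attempt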

import numpy as np
import pandas as pd

flights = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
print(flights.shape)


df = (flights[flights.tailnum.notna()]
      .assign( on_time = lambda x: x.arr_time.notna() & (x.arr_delay <=0))
      .groupby('tailnum')['on_time']
      .agg([np.mean,'count',pd.Series.rank(method='min')]) # R uses min_rank
      .set_axis(['on_time','n','rank'],axis=1,inplace=False)
      .query( 'rank == 1.0')
     )

df.head()

This raises an error.

Required output

shape= 336776, 19

HEAD
tailnum on_time n
N121DE  0   2
N136DL  0   1
N143DA  0   1
N17627  0   2
N240AT  0   5
N26906  0   1

TAIL
tailnum on_time n
N939DN  0   1
N943DN  0   1
N953FR  0   3
N960DN  0   3
N965DN  0   2
N978SW  0   1

The R code works fine, but I would like to use pandas.

library(tidyverse)
library(nycflights13)
library(dplyr)

df = flights %>%
  filter(!is.na(tailnum)) %>%
  mutate(on_time = !is.na(arr_time) & (arr_delay <= 0)) %>%
  group_by(tailnum) %>%
  summarise(on_time = mean(on_time), n = n()) %>%
  filter(min_rank(on_time) == 1)


dim(flights)
head(df)
tail(df)

Thanks for your help.

Related links:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.rank.html

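1 Answer:

Answer 0 (score: 1)

In R's dplyr, min_rank is not an aggregate function but a window function that runs on the already aggregated columns (it is modeled on the ANSI SQL:2003 window function RANK() OVER ()). So add it as a computed column on the aggregated pandas DataFrame rather than inside agg(). Then call reindex or drop to exclude the helper column: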

df = (flights[flights.tailnum.notna()]
      .assign( on_time = lambda x: x.arr_time.notna() & (x.arr_delay <=0))
      .groupby('tailnum')['on_time']
      .agg(on_time='mean', n='count')  # named aggregation (pandas >= 0.25) sets the column names directly
      .assign(rank=lambda x: x['on_time'].rank(method='min'))  # ties share the lowest rank, like dplyr's min_rank
      .query("rank == 1")
      .reindex(columns=['on_time', 'n']) # OR .drop(columns=['rank'])
     )

print(flights.shape)
# (336776, 19)

print(df.head())
#          on_time  n
# tailnum
# N121DE       0.0  2
# N136DL       0.0  1
# N143DA       0.0  1
# N17627       0.0  2
# N240AT       0.0  5

print(df.tail())
#          on_time  n
# tailnum
# N943DN       0.0  1
# N953FR       0.0  3
# N960DN       0.0  3
# N965DN       0.0  2
# N978SW       0.0  1
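
To make the aggregate-versus-window distinction concrete, here is a minimal sketch on a made-up table. The `toy` frame and its values are hypothetical stand-ins for the flights data, not taken from it. The sketch also shows why the original attempt fails: `pd.Series.rank(method='min')` calls the method on the class with no Series to operate on, so Python raises a `TypeError` before `agg()` runs. For the specific `rank == 1` filter, comparing against the column minimum should select the same rows.

```python
import pandas as pd

nan = float("nan")

# Hypothetical toy data standing in for the flights table.
toy = pd.DataFrame({
    "tailnum":   ["A", "A", "B", "B", "C", "D"],
    "arr_time":  [517.0, nan, 830.0, 900.0, nan, 1200.0],
    "arr_delay": [-5.0, nan, 12.0, 3.0, nan, -1.0],
})

# The call from the question: the unbound method has no Series to rank.
try:
    pd.Series.rank(method="min")
    unbound_call_failed = False
except TypeError:
    unbound_call_failed = True

# Step 1: aggregate to one row per tailnum.
summary = (
    toy.assign(on_time=lambda x: x.arr_time.notna() & (x.arr_delay <= 0))
       .groupby("tailnum")["on_time"]
       .agg(on_time="mean", n="count")
)

# Step 2: window-style ranking over the aggregated column.
ranks = summary["on_time"].rank(method="min")
print(ranks.tolist())  # [3.0, 1.0, 1.0, 4.0] -> B and C tie at rank 1

worst = summary[ranks == 1]

# Alternative for this particular filter: rows equal to the column minimum.
worst_alt = summary[summary["on_time"] == summary["on_time"].min()]
print(worst.equals(worst_alt))
```

The minimum-based filter avoids the helper column entirely, but it only replaces `min_rank(...) == 1`; filtering on any other rank still needs the `rank(method='min')` step.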