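I am working through Hadley Wickham's book "R for Data Science" and trying to replicate the code in pandas.
I ran into this problem:
I need to create a new rank column, computed within each day (the R code below ranks by `dep_time`),
and keep only the rows with the minimum and maximum rank.
R code: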
library(nycflights13)
library(dplyr)
# remove nans
not_cancelled = flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
# create new column of rank based on dep_time for each day.
df = not_cancelled %>%
  group_by(year, month, day) %>%
  mutate(r = min_rank(desc(dep_time))) %>%
  filter(r %in% range(r)) %>% # filter only first and last value
  select(year, month, day, dep_delay, arr_delay, r)
dim(df)
head(df,10)
This gives:
(column abbreviations: m = month, d = day, dl = dep_delay, ad = arr_delay, r = r)
year m d dl ad r
2013 1 1 2 11 831
2013 1 1 -3 -12 1
2013 1 2 43 36 928
2013 1 2 -5 -24 1
2013 1 3 33 22 900
2013 1 3 -10 -11 1
2013 1 4 26 23 908
2013 1 4 -1 -8 1
2013 1 4 -1 -9 1 # Note: January 4 has 3 rows, because two flights tie for rank 1
2013 1 5 15 18 717
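I am trying to replicate it in pandas:
import pandas as pd

df = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
print(df.shape)
# print(df.iloc[:5,:5])
# drop cancelled flights (rows with a missing delay)
not_cancelled = df.dropna(subset=['dep_delay','arr_delay'])
# rank dep_time within each day; the latest departure gets rank 1
df['r'] = not_cancelled.groupby(['year','month','day'])['dep_time']\
            .rank(method='min', ascending=False)
# smallest and largest rank per day
g = df.groupby(['year','month','day'])['r']
g = g.agg(['min','max']).reset_index()
f = g.head()
print(f)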
Python output:
(336776, 19)
year month day min max
0 2013 1 1 1.0 831.0
1 2013 1 2 1.0 928.0
2 2013 1 3 1.0 900.0
3 2013 1 4 1.0 908.0
4 2013 1 5 1.0 717.0
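This is not quite right: I get one row per day instead of the matching flights. How do I do this correctly?
Any help is much appreciated. Hail pandas!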
Answer 0 (score: 4)
Your output is correct; you only need to reshape it.
Method 1: stack
g = df.groupby(['year','month','day'])['r']
g = g.agg(['min','max']).stack()
g = g.reset_index(level=[0,1,2])
Method 2: melt
g = df.groupby(['year','month','day'])['r'].agg(['min','max'])
g = g.reset_index().melt(id_vars=['year','month','day'])
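To make the two reshapes concrete, here is a minimal sketch on made-up data. The `toy` frame, its values, and the names `wide`, `long_stack` and `long_melt` are assumptions for illustration only; they are not part of the flights dataset.

```python
import pandas as pd

# Toy per-row ranks (hypothetical values) standing in for the `r` column.
toy = pd.DataFrame({
    "year": [2013] * 5,
    "month": [1] * 5,
    "day": [1, 1, 1, 2, 2],
    "r": [3.0, 2.0, 1.0, 2.0, 1.0],
})
keys = ["year", "month", "day"]

# Wide form: one row per day, with `min` and `max` columns.
wide = toy.groupby(keys)["r"].agg(["min", "max"])

# Long form via stack: the min/max labels move into the innermost index level.
long_stack = wide.stack().rename("r").reset_index(level=keys)

# Long form via melt: the min/max labels move into a `variable` column.
long_melt = wide.reset_index().melt(id_vars=keys, value_name="r")

print(long_stack)
print(long_melt)
```

Both results have two rows per day. Note that they contain only the grouping keys and the rank values, so columns such as `dep_delay` and `arr_delay` are not carried along, and tied rows are not duplicated.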
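Update: to keep the original rows (including ties) rather than one aggregate per day, compare each row's rank with its group's min and max using transform: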
g = df.groupby(['year','month','day'])['r']
g_max = g.transform('max')
g_min = g.transform('min')
yourdf=df.loc[(df.r==g_max)|(df.r==g_min),['year','month','day','r']]
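Here is a small self-contained sketch of the same transform-based filter on made-up data, so the tie behaviour is easy to see. The `flights` frame and its values are hypothetical; only the column names follow the question.

```python
import pandas as pd

# Toy data (hypothetical values): day 2 has two flights tied for the latest departure.
flights = pd.DataFrame({
    "year": [2013] * 7,
    "month": [1] * 7,
    "day": [1, 1, 1, 2, 2, 2, 2],
    "dep_time": [517, 900, 2356, 42, 1200, 2354, 2354],
})
keys = ["year", "month", "day"]

# Rank within each day; the latest departure gets rank 1 and ties share the lowest rank.
flights["r"] = flights.groupby(keys)["dep_time"].rank(method="min", ascending=False)

# transform broadcasts each day's smallest and largest rank back onto every row of that day.
r_by_day = flights.groupby(keys)["r"]
is_extreme = flights["r"].eq(r_by_day.transform("min")) | flights["r"].eq(r_by_day.transform("max"))

extremes = flights.loc[is_extreme, keys + ["dep_time", "r"]]
print(extremes)
```

Day 2 keeps three rows here, for the same reason January 4 has three rows in the R output: both tied flights have rank 1.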
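Answer 1 (score: 1)
I created two ranks: one in which the maximum gets rank 1, and one in which the minimum gets rank 1.
Then I can select the rows where either rank equals 1.
However, this gives me two columns instead of one: r_max
and r_min.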
import pandas as pd
df = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
# print(df.shape)
# print(df.iloc[:5,:5])
not_cancelled = df.dropna(subset=['dep_delay','arr_delay'])
gr = not_cancelled.groupby(['year','month','day'])
df['r_min'] = gr['dep_time'].rank('min', ascending=False)
df['r_max'] = gr['dep_time'].rank('max', ascending=True)
result = df[(df['r_min'] == 1) | (df['r_max'] == 1)]
print(result[['year','month','day','dep_delay','arr_delay','r_min', 'r_max']].head(10))
Result (note the three rows for January 4):
year month day dep_delay arr_delay r_min r_max
0 2013 1 1 2.0 11.0 831.0 1.0
837 2013 1 1 -3.0 -12.0 1.0 831.0
842 2013 1 2 43.0 36.0 928.0 1.0
1776 2013 1 2 -5.0 -24.0 1.0 928.0
1785 2013 1 3 33.0 22.0 900.0 1.0
2688 2013 1 3 -10.0 -11.0 1.0 900.0
2699 2013 1 4 26.0 23.0 908.0 1.0
3606 2013 1 4 -1.0 -8.0 1.0 908.0
3607 2013 1 4 -1.0 -9.0 1.0 908.0
3614 2013 1 5 15.0 18.0 717.0 1.0
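One thing that may be worth checking with this approach is how ties at the earliest departure are ranked. With `method='max'` and `ascending=True`, two flights tied for the earliest `dep_time` both get rank 2, so neither matches `== 1`. The sketch below shows this on made-up data; the `toy` frame and the column names `r_first` and `r_last` are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical values): two flights tie for the earliest departure.
toy = pd.DataFrame({
    "year": [2013] * 4,
    "month": [1] * 4,
    "day": [1] * 4,
    "dep_time": [500, 500, 1200, 2300],
})
by_day = toy.groupby(["year", "month", "day"])["dep_time"]

# method="max", ascending: the tied earliest flights both get rank 2, never 1.
toy["r_max"] = by_day.rank(method="max", ascending=True)

# method="min" gives rank 1 to every flight tied at the extreme, in either direction.
toy["r_first"] = by_day.rank(method="min", ascending=True)
toy["r_last"] = by_day.rank(method="min", ascending=False)

kept = toy[(toy["r_first"] == 1) | (toy["r_last"] == 1)]
print(kept)
```

Here `kept` contains both tied early flights plus the latest one. Since `r_last` is computed the same way as the `r` column in the question, it could also serve as the single rank column if only one is wanted.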