A couple of basic questions about DataCamp's introductory dplyr course:
Why does:
hflights %>%
group_by(UniqueCarrier,Dest) %>%
summarize(n=n()) %>%
mutate(rank=rank(n)) %>%
filter(rank==1)
produce a different answer from the following:
hflights %>%
group_by(UniqueCarrier, Dest) %>%
summarise(n = n()) %>%
mutate(rank = rank(desc(n))) %>%
filter(rank == 1)
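The only difference is the ranking order, but shouldn't the filter be independent of the order in which the items are ranked?
Second, why does mean(ArrDelay > 0) give the proportion of flights with ArrDelay > 0 in the code below? Shouldn't it just give the average delay of all flights that have a positive delay?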
hflights %>%
filter(!is.na(ArrDelay)) %>%
group_by(UniqueCarrier) %>%
summarize(p_delay=mean(ArrDelay>0)) %>%
mutate(rank=rank(p_delay)) %>%
arrange(rank)
Thanks!
Answer 0 (score: 2)
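I don't really understand the first question. Why would you expect the same results? Have a look at what desc actually does, e.g. desc(1:3): it negates the values, so rank() then assigns rank 1 to the largest value instead of the smallest. filter(rank == 1) therefore keeps the least frequent destination per carrier in your first pipeline and the most frequent one in the second. Clearly the ranks should be different: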
rank(1:3)
## [1] 1 2 3
rank(desc(1:3))
## [1] 3 2 1
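To see why this changes which row survives the filter, here is a minimal base R sketch. The counts and destination labels are made up for illustration (they are not taken from hflights), and -n stands in for desc(n) on the assumption that desc() simply negates a numeric vector.

```r
# Hypothetical flight counts per destination for one carrier
n <- c(IAH = 5, DFW = 42, ATL = 17)

# rank() gives rank 1 to the smallest value
asc_rank <- rank(n)
# Negating first (standing in for desc(n)) gives rank 1 to the largest value
desc_rank <- rank(-n)

# Equivalent of filter(rank == 1) in each pipeline
fewest <- names(n)[asc_rank == 1]   # destination with the fewest flights
most   <- names(n)[desc_rank == 1]  # destination with the most flights

print(fewest)
print(most)
```

The two filters keep opposite ends of the ordering, which is why the two pipelines return different rows.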
For your second question: ArrDelay > 0 is a logical vector. When you take the mean of a logical, it is converted to numeric first (TRUE -> 1, FALSE -> 0). The mean of those 0s and 1s is the proportion of TRUEs, i.e. the share of flights that arrived late. To get the mean delay over only the flights with a positive delay, use
hflights %>%
filter(!is.na(ArrDelay)) %>%
group_by(UniqueCarrier) %>%
summarize(p_delay=mean(ArrDelay[ArrDelay>0])) %>%
mutate(rank=rank(p_delay)) %>%
arrange(rank)
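The difference between the two summaries is easy to check on a small vector. This is only a sketch with made-up delays (in minutes), not hflights data:

```r
# Hypothetical arrival delays; negative means the flight arrived early
arr_delay <- c(-5, 10, 0, 30, -2)

# Logical vector: TRUE where the flight arrived late
is_late <- arr_delay > 0

# Mean of a logical is the proportion of TRUE values (2 of 5 flights)
prop_late <- mean(is_late)

# Subsetting first gives the average delay among late flights only
mean_late <- mean(arr_delay[is_late])

print(prop_late)
print(mean_late)
```

Here mean(is_late) answers "what fraction of flights were late?", while mean(arr_delay[is_late]) answers "how late were the late flights on average?".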