How do I remove NA values from a data frame using ddply?

Asked: 2014-11-03 02:28:29

Tags: r dataframe plyr na

I hope you can help me. I have been searching the web and could not find an answer. Here is my data frame:

name    city    state   stars    main_category
A   Pittsburgh  PA       5.0     Soul Food
B   Houston     TX       3.0     Professional Services
C   Lafayette   IN       3.0     NA
D   Los Angeles CA       4.0     Local Services
E   Los Angeles CA       3.0     Local Services
F   Lafayette   IN       3.5     Mongolian
G   Pittsburgh  PA       5.0     Doctors
H   Pittsburgh  PA       4.0     Soul Food
I   Houston     TX       4.0     Professional Services

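What I want is to group the rows by city (in alphabetical order) together with state, and then rank them within each group by the number of stars received. This is the output I am hoping for: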

name    city    state   stars    main_category              rank
I   Houston     TX       4.0     Professional Services       1  
B   Houston     TX       3.0     Professional Services       2
F   Lafayette   IN       3.5     Mongolian                   1
D   Los Angeles CA       4.0     Local Services              1
E   Los Angeles CA       3.0     Local Services              2
G   Pittsburgh  PA       5.0     Doctors                     1
A   Pittsburgh  PA       5.0     Soul Food                   1
H   Pittsburgh  PA       4.0     Soul Food                   2

Here is my line of code:

l <- ddply(d, c("city", "state", "main_category"), na.rm=T, transform, rank=rank(-stars, ties.method="max"))

This does not remove the NA that Lafayette has, and I don't know what to put there. I also tried na.omit, but when I did, the rank column did not show up.

3 answers:

Answer 0 (score: 1)

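Here is a base R solution. I'm not sure whether you have already started using dplyr, but this seems to work. I think the last row should get rank 3, because two values are tied at rank 1.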

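# Drop every row that contains an NA in any column
no <- na.omit(dat)
# Sort by city, then state, then stars in descending order
new <- no[do.call(order, with(no, list(city, state, -stars))),]
# Rank stars within each city; tied values share the lowest rank
within(new, {
    rank  <- Reduce(c, Map(rank, split(-stars, city), ties.method = "min"))
})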
#   name        city state stars         main_category rank
# 9    I     Houston    TX   4.0 Professional Services    1
# 2    B     Houston    TX   3.0 Professional Services    2
# 6    F   Lafayette    IN   3.5             Mongolian    1
# 4    D Los Angeles    CA   4.0        Local Services    1
# 5    E Los Angeles    CA   3.0        Local Services    2
# 1    A  Pittsburgh    PA   5.0             Soul Food    1
# 7    G  Pittsburgh    PA   5.0               Doctors    1
# 8    H  Pittsburgh    PA   4.0             Soul Food    3
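
The expected output in the question gives H rank 2 rather than 3, which corresponds to a dense ranking (tied values share a rank and no rank is skipped afterwards). Below is a minimal base R sketch of that variant. It rebuilds the sample data from the question, assumes groups are defined by city and state, and uses a helper named dense_rank_desc that is made up here for illustration.

```r
# Rebuild the sample data from the question (row C has NA in main_category)
dat <- data.frame(
  name  = LETTERS[1:9],
  city  = c("Pittsburgh", "Houston", "Lafayette", "Los Angeles", "Los Angeles",
            "Lafayette", "Pittsburgh", "Pittsburgh", "Houston"),
  state = c("PA", "TX", "IN", "CA", "CA", "IN", "PA", "PA", "TX"),
  stars = c(5, 3, 3, 4, 3, 3.5, 5, 4, 4),
  main_category = c("Soul Food", "Professional Services", NA, "Local Services",
                    "Local Services", "Mongolian", "Doctors", "Soul Food",
                    "Professional Services"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: dense rank in descending order
# (tied values share a rank, and the next distinct value gets the next integer)
dense_rank_desc <- function(x) match(x, sort(unique(x), decreasing = TRUE))

no  <- dat[complete.cases(dat), ]                 # drops row C
new <- no[order(no$city, no$state, -no$stars), ]  # city, state, then stars descending

# ave() applies the helper within each city/state group and keeps the row order
new$rank <- ave(new$stars, new$city, new$state, FUN = dense_rank_desc)
new
```

With this helper the two 5.0-star Pittsburgh rows both get rank 1 and H gets rank 2. Switching the helper back to rank(-x, ties.method = "min") would give the 1, 1, 3 pattern shown above.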

Answer 1 (score: 0)

Using dplyr:

library(dplyr)
filter(dat, complete.cases(dat)) %>%
  group_by(city) %>%
  arrange(city, state, desc(stars)) %>%
  mutate(rank = min_rank(desc(stars)))
 #   name        city state stars         main_category rank
 #1    I     Houston    TX   4.0 Professional Services    1
 #2    B     Houston    TX   3.0 Professional Services    2
 #3    F   Lafayette    IN   3.5             Mongolian    1
 #4    D Los Angeles    CA   4.0        Local Services    1
 #5    E Los Angeles    CA   3.0        Local Services    2
 #6    A  Pittsburgh    PA   5.0             Soul Food    1
 #7    G  Pittsburgh    PA   5.0               Doctors    1
 #8    H  Pittsburgh    PA   4.0             Soul Food    3

Answer 2 (score: 0)

ddply hands na.rm on to .fun, so in your case the NA handling has to be specified inside the call to rank.

Your call currently places na.rm like this:

ddply(d, c("city", "state", "main_category"), na.rm = T, transform, rank = rank(-stars, ties.method = "max"))

Passing the argument inside .fun instead should fix it. At least it works for me:

ddply(d, c("city", "state", "main_category"), transform, 
rank=rank(-stars, na.last = TRUE, ties.method="max"))
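
Two details may be worth checking against your own data. Here .fun is transform, and transform treats any extra named argument as a new column, so a stray na.rm ends up as a column rather than removing rows. Also, the NA in the sample data sits in main_category, not in stars, so the arguments of rank never see it. The sketch below filters the rows with complete.cases before ranking. It rebuilds d from the question, groups by city and state only (an assumption based on the desired output), and runs the ddply step only if plyr is installed.

```r
# Sample data from the question; row C has NA in main_category
d <- data.frame(
  name  = LETTERS[1:9],
  city  = c("Pittsburgh", "Houston", "Lafayette", "Los Angeles", "Los Angeles",
            "Lafayette", "Pittsburgh", "Pittsburgh", "Houston"),
  state = c("PA", "TX", "IN", "CA", "CA", "IN", "PA", "PA", "TX"),
  stars = c(5, 3, 3, 4, 3, 3.5, 5, 4, 4),
  main_category = c("Soul Food", "Professional Services", NA, "Local Services",
                    "Local Services", "Mongolian", "Doctors", "Soul Food",
                    "Professional Services"),
  stringsAsFactors = FALSE
)

# transform() turns an extra named argument into a new column,
# so na.rm = TRUE adds a column called na.rm and drops nothing
with_flag <- transform(d, na.rm = TRUE)
"na.rm" %in% names(with_flag)   # TRUE
nrow(with_flag) == nrow(d)      # TRUE: the NA row is still present

# Remove incomplete rows before any grouping or ranking
clean <- d[complete.cases(d), ]

if (requireNamespace("plyr", quietly = TRUE)) {
  # Assumption: groups are city + state, as in the desired output
  ranked <- plyr::ddply(clean, c("city", "state"), transform,
                        rank = rank(-stars, ties.method = "min"))
  print(ranked)
}
```

Filtering first keeps the NA handling separate from the ranking step, so the same clean data frame can be reused with whichever grouping or ties.method you settle on.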