I have a data frame like the following.
Hospital State Mortality Rank
aaa AK 9.7 1
bbb AK 10.5 2
ccc AK 11.3 3
ddd AL 5.6 1
eee AL 8.7 2
fff AL 9.1 3
ggg AL 9.3 4
hhh AR 9.9 1
iii AR 10.2 2
jjj AZ 6.5 1
kkk AZ 6.5 2
lll AZ 8.3 3
mmm AZ 8.4 4
Reproducible example:
df <- data.frame(Hospital=c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","jjj","kkk","lll","mmm"),State=c("AK","AK","AK","AL","AL","AL","AL","AR","AR","AZ","AZ","AZ","AZ"), Mortality=c(9.7,10.5,11.3,5.6,8.7,9.1,9.3,9.9,10.2,6.5,6.5,8.3,8.4),Rank=c(1,2,3,1,2,3,4,1,2,1,2,3,4))
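When I search for the hospital ranked 4th, I want a result like the one below, with NA returned for Hospital in every state that has no hospital at that rank.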
Hospital State
NA AK
ggg AL
NA AR
mmm AZ
Currently I only get the rows whose Rank equals 4.
Hospital State
ggg AL
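mmm AZ
Is there a faster way to do this than creating a df of rank-4 rows, with Hospital set to NA for the states that lack the expected rank value, and then filtering on it?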
Answer 0: (score: 1)
You can use merge to get this result, setting the all.y argument to TRUE:
merge(df[df$Rank == 4,], unique(df["State"]), all.y=TRUE)
State Hospital Mortality Rank
1 AK <NA> NA NA
2 AL ggg 9.3 4
3 AR <NA> NA NA
4 AZ mmm 8.4 4
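The idea here is to build a data.frame with a single variable holding the unique state names and merge it with the data.frame that contains the rank-4 hospitals. Because the data.frame of states is the second argument, all.y=TRUE tells merge to keep every state in the final data.frame.
To return only the two columns, you can further subset the first argument to merge, as in
merge(df[df$Rank == 4, c("Hospital", "State")], unique(df["State"]), all.y=TRUE)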
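As a minimal sketch, the merge approach could be wrapped in a small helper so that the rank is a parameter. The name hospital_at_rank is hypothetical (it is not part of base R), and stringsAsFactors = FALSE is an assumption added here so that Hospital and State stay character vectors on older R versions.

```r
# Example data from the question
df <- data.frame(
  Hospital = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg",
               "hhh", "iii", "jjj", "kkk", "lll", "mmm"),
  State = c("AK", "AK", "AK", "AL", "AL", "AL", "AL",
            "AR", "AR", "AZ", "AZ", "AZ", "AZ"),
  Mortality = c(9.7, 10.5, 11.3, 5.6, 8.7, 9.1, 9.3,
                9.9, 10.2, 6.5, 6.5, 8.3, 8.4),
  Rank = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# hospital_at_rank is a hypothetical helper name used for illustration only
hospital_at_rank <- function(data, rank) {
  # Rows at the requested rank, restricted to the two columns of interest
  hits <- data[data$Rank == rank, c("Hospital", "State")]
  # One row per state, kept as a one-column data.frame
  states <- unique(data["State"])
  # all.y = TRUE keeps every state; unmatched states get NA for Hospital
  out <- merge(hits, states, all.y = TRUE)
  # merge puts the key column first, so restore the Hospital, State order
  out[order(out$State), c("Hospital", "State")]
}

res <- hospital_at_rank(df, 4)
print(res)
```

With the example data this should give one row per state, with NA for AK and AR and the hospitals ggg and mmm for AL and AZ.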
Answer 1: (score: 0)
A solution using dplyr.
library(dplyr)
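df2 <- df %>%
  group_by(State) %>%
  # Keep only the highest rank present in each state
  summarise(Rank = max(Rank)) %>%
  # Bring back the hospital that holds that rank
  left_join(df, by = c("State", "Rank")) %>%
  # States whose highest rank is below 4 have no rank-4 hospital
  mutate(Hospital = ifelse(Rank < 4, NA_character_, as.character(Hospital))) %>%
  select(Hospital, State)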
df2
# A tibble: 4 x 2
Hospital State
<chr> <fctr>
1 <NA> AK
2 ggg AL
3 <NA> AR
4 mmm AZ
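Note that the pipeline above relies on max(Rank), so for a state with more than four hospitals it would return the hospital at the highest rank rather than the one at rank 4. A hedged alternative, sketched below, filters on the rank first and then joins onto the distinct states. The variable name target_rank is an illustrative assumption, and stringsAsFactors = FALSE is added so that the columns are plain character vectors.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  Hospital = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg",
               "hhh", "iii", "jjj", "kkk", "lll", "mmm"),
  State = c("AK", "AK", "AK", "AL", "AL", "AL", "AL",
            "AR", "AR", "AZ", "AZ", "AZ", "AZ"),
  Mortality = c(9.7, 10.5, 11.3, 5.6, 8.7, 9.1, 9.3,
                9.9, 10.2, 6.5, 6.5, 8.3, 8.4),
  Rank = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

target_rank <- 4  # illustrative name for the rank being looked up

res <- df %>%
  # One row per state, in order of first appearance
  distinct(State) %>%
  # Attach the hospital at the target rank; states without one get NA
  left_join(filter(df, Rank == target_rank), by = "State") %>%
  select(Hospital, State)

print(res)
```

Because the filter runs before the join, the result does not depend on how many hospitals each state has.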