My data frame has two columns (hospital_name, type). Both are character variables. The data looks like this:
hospital_name type
ABC rural
ABC rural
ABC urban
XYZ urban
XYZ urban
EFG rural
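I am writing code that groups by hospital name and counts how many times each type occurs within the group. It should then create a new column called type2 holding the value that occurs most often in the type column. The desired output is: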
hospital_name type type2
ABC rural rural
XYZ urban urban
EFG rural rural
I tried to solve this with dplyr, but I get an error. Here is my attempt:
library("dplyr")
df<-df%>%group_by(hospital_name)%>%mutate(type2=names(which.max(table(type))))
The error is:
Error: incompatible types, expecting a character vector
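Answer 0 (score: 5)
Given that the code above runs without error for me but does not produce the desired output, I tweaked it slightly to get what you want: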
library(dplyr)
dat <- dplyr::data_frame(hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"),
                         type = c("rural", "rural", "urban", "urban", "urban", "rural"))
# Store the result: the pipeline does not modify dat in place
result <- dat %>% group_by(hospital_name) %>%
  mutate(type2 = names(which.max(table(type)))) %>%
  filter(type == type2) %>%
  distinct()
result
# Source: local data frame [3 x 3]
# Groups: hospital_name [3]
#
# hospital_name type type2
# (chr) (chr) (chr)
# 1 ABC rural rural
# 2 XYZ urban urban
# 3 EFG rural rural
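If the goal is exactly one row per hospital, another option is to count the types first and keep the top one per group. This is only a minimal sketch, not the method used in the answer; it assumes a dplyr version that provides slice_max() (1.0.0 or later), and the name top_type is made up for illustration.

```r
library(dplyr)

# Same toy data as in the question.
dat <- data.frame(
  hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"),
  type = c("rural", "rural", "urban", "urban", "urban", "rural"),
  stringsAsFactors = FALSE
)

top_type <- dat %>%
  count(hospital_name, type) %>%               # one row per (hospital, type) with count n
  group_by(hospital_name) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%   # keep the most frequent type per hospital
  ungroup() %>%
  transmute(hospital_name, type, type2 = type)

top_type
```

Note that count() treats NA as a value of its own, so rows with a missing type would need to be filtered out beforehand if NA should never win.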
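The comments above suggest that the type column contains NA values and that this is what causes the error. However, that does not seem to be a problem on my machine.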
dat <- data.frame(hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"),
type = c("rural", "rural", "urban", "urban", NA, "rural"))
dat
# hospital_name type
# 1 ABC rural
# 2 ABC rural
# 3 ABC urban
# 4 XYZ urban
# 5 XYZ <NA>
# 6 EFG rural
sapply(dat, class)
# hospital_name type
# "factor" "factor"
dat %>%
group_by(hospital_name) %>%
mutate(type2 = names(which.max(table(type))))
# Source: local data frame [6 x 3]
# Groups: hospital_name [3]
# hospital_name type type2
# (fctr) (fctr) (chr)
# 1 ABC rural rural
# 2 ABC rural rural
# 3 ABC urban rural
# 4 XYZ urban urban
# 5 XYZ NA urban
# 6 EFG rural rural
So I was finally able to reproduce your error.
dat <- structure(list(NET_PARENT = c("COMMUNITY HEALTH SYSTEMS (CHS)",
"JEFFERSON HEALTH", "JEFFERSON HEALTH", "MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL)",
"TENET HEALTHCARE", "TENET HEALTHCARE", "TENET HEALTHCARE", "TENET HEALTHCARE",
"LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS)", "INDIAN HEALTH SERVICES"
), OWNERSHIP = c("for_profit", "non-profit", "non-profit", "non-profit",
"for_profit", NA, NA, NA, "for_profit", NA)), .Names = c("NET_PARENT",
"OWNERSHIP"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 10L,
13L), class = "data.frame")
dat
# NET_PARENT OWNERSHIP
# 1 COMMUNITY HEALTH SYSTEMS (CHS) for_profit
# 2 JEFFERSON HEALTH non-profit
# 3 JEFFERSON HEALTH non-profit
# 4 MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL) non-profit
# 5 TENET HEALTHCARE for_profit
# 6 TENET HEALTHCARE <NA>
# 7 TENET HEALTHCARE <NA>
# 8 TENET HEALTHCARE <NA>
# 10 LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS) for_profit
# 13 INDIAN HEALTH SERVICES <NA>
dat %>% group_by(NET_PARENT) %>% mutate(type2 = names(which.max(table(OWNERSHIP))))
# Error: incompatible types, expecting a character vector
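The failure can be isolated to a single group without dplyr. The sketch below uses base R only; the variable names all_na, counts, winner and mixed are made up for illustration.

```r
# A group whose values are all NA, like "INDIAN HEALTH SERVICES" in the data above.
all_na <- c(NA_character_)

counts <- table(all_na)   # table() drops NA by default, so nothing is counted
length(counts)            # 0

winner <- names(which.max(counts))
length(winner)            # 0: there is no name left to assign to type2

# A group with at least one non-NA value still yields exactly one name.
mixed <- c("for_profit", NA, NA, NA)
names(which.max(table(mixed)))   # "for_profit"
```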
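The error occurs because for dat$NET_PARENT == "INDIAN HEALTH SERVICES" and dat$NET_PARENT == "TENET HEALTHCARE" the most common value is NA. This raises an error in mutate, which expects a character value but gets a NULL value instead (the NULL comes from the group whose values are all NA). We can fix this with the following change.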
dat %>%
group_by(NET_PARENT) %>%
mutate(type2 = ifelse(length(which.max(table(OWNERSHIP))) == 0,
"NA",
names(which.max(table(OWNERSHIP)))))
# Source: local data frame [10 x 3]
# Groups: NET_PARENT [6]
# NET_PARENT OWNERSHIP type2
# (chr) (chr) (chr)
# 1 COMMUNITY HEALTH SYSTEMS (CHS) for_profit for_profit
# 2 JEFFERSON HEALTH non-profit non-profit
# 3 JEFFERSON HEALTH non-profit non-profit
# 4 MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL) non-profit non-profit
# 5 TENET HEALTHCARE for_profit for_profit
# 6 TENET HEALTHCARE NA for_profit
# 7 TENET HEALTHCARE NA for_profit
# 8 TENET HEALTHCARE NA for_profit
# 9 LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS) for_profit for_profit
# 10 INDIAN HEALTH SERVICES NA NA
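Note that type2 is "for_profit" for "TENET HEALTHCARE" even though the most frequent value there is NA. This is because table does not count NA and omits it from the result, so the only remaining value is recorded as the maximum. For "INDIAN HEALTH SERVICES", however, type2 is listed as "NA" (the string supplied in the ifelse call).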