Question

My data frame has two columns (hospital_name, type). Both variables are character vectors. The data looks like this:

hospital_name  type
ABC            rural
ABC            rural
ABC            urban
XYZ            urban
XYZ            urban
EFG            rural

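I am writing code that groups by hospital name and counts how often each type occurs within the group. It should then create a new column named type2 that holds the most frequent value of the type column. The desired output is: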

hospital_name  type  type2
ABC            rural rural
XYZ            urban urban
EFG            rural rural

I tried to solve this with dplyr, but I get an error. Here is my attempt:

library("dplyr")
df<-df%>%group_by(hospital_name)%>%mutate(type2=names(which.max(table(type))))

The error is:

Error: incompatible types, expecting a character vector

Answer 1

Your code above runs without error for me, but it does not produce the desired output, so I only tweaked it slightly to get what you want:

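library(dplyr)

dat <- tibble(hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"), 
              type = c("rural", "rural", "urban", "urban", "urban", "rural"))

# Tag every row with its group's most frequent type, keep the matching rows,
# and collapse duplicates. The result is assigned so that printing dat shows it.
dat <- dat %>% group_by(hospital_name) %>% 
  mutate(type2 = names(which.max(table(type)))) %>% 
  filter(type == type2) %>% 
  distinct()

dat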
# Source: local data frame [3 x 3]
# Groups: hospital_name [3]
#
#   hospital_name  type type2
#           (chr) (chr) (chr)
# 1           ABC rural rural
# 2           XYZ urban urban
# 3           EFG rural rural
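As a rough alternative sketch (not part of the tweak above), the same per-hospital mode can be computed by counting first and then keeping the top row of each group. This assumes dplyr 1.0.0 or later for slice_max(); the name modes is only illustrative.

```r
library(dplyr)

# Same toy data as in the question.
dat <- tibble(
  hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"),
  type          = c("rural", "rural", "urban", "urban", "urban", "rural")
)

# Count each (hospital_name, type) pair, then keep the most frequent pair
# per hospital. with_ties = FALSE keeps exactly one row when counts are tied.
modes <- dat %>%
  count(hospital_name, type) %>%
  group_by(hospital_name) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(hospital_name, type2 = type)

# Attach the mode to every original row, if the full table is needed.
with_mode <- left_join(dat, modes, by = "hospital_name")
```

Counting explicitly makes the tie-breaking rule visible, which the names(which.max(table(...))) idiom hides (it silently takes the first maximum).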

Update

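The comments above indicate that the data has NA values in the type column, and that this seems to be what causes the error. However, it does not appear to be a problem on my machine.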

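# stringsAsFactors = TRUE reproduces the factor columns shown below
# (it was the default before R 4.0.0).
dat <- data.frame(hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"), 
                  type = c("rural", "rural", "urban", "urban", NA, "rural"),
                  stringsAsFactors = TRUE)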
dat
#   hospital_name  type
# 1           ABC rural
# 2           ABC rural
# 3           ABC urban
# 4           XYZ urban
# 5           XYZ  <NA>
# 6           EFG rural

sapply(dat, class)
# hospital_name          type 
#      "factor"      "factor" 

dat %>% 
  group_by(hospital_name) %>% 
  mutate(type2 = names(which.max(table(type))))

# Source: local data frame [6 x 3]
# Groups: hospital_name [3]

#   hospital_name   type type2
#          (fctr) (fctr) (chr)
# 1           ABC  rural rural
# 2           ABC  rural rural
# 3           ABC  urban rural
# 4           XYZ  urban urban
# 5           XYZ     NA urban
# 6           EFG  rural rural
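A quick base R check of why the single NA does no harm here: by default table() leaves NA out of the counts, so the XYZ group still has one countable value. The vector x below is just a stand-in for that group.

```r
x <- c("urban", NA)                  # the type values of the XYZ group

counts <- table(x)                   # NA is dropped by default (useNA = "no")
top    <- names(which.max(counts))   # "urban"

# NA only gets its own cell when explicitly requested.
counts_with_na <- table(x, useNA = "ifany")
```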

Update 2

So I was finally able to reproduce your error.

dat <- structure(list(NET_PARENT = c("COMMUNITY HEALTH SYSTEMS (CHS)", 
"JEFFERSON HEALTH", "JEFFERSON HEALTH", "MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL)", 
"TENET HEALTHCARE", "TENET HEALTHCARE", "TENET HEALTHCARE", "TENET HEALTHCARE", 
"LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS)", "INDIAN HEALTH SERVICES"
), OWNERSHIP = c("for_profit", "non-profit", "non-profit", "non-profit", 
"for_profit", NA, NA, NA, "for_profit", NA)), .Names = c("NET_PARENT", 
"OWNERSHIP"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 10L, 
13L), class = "data.frame")

dat

#                                     NET_PARENT  OWNERSHIP
# 1               COMMUNITY HEALTH SYSTEMS (CHS) for_profit
# 2                             JEFFERSON HEALTH non-profit
# 3                             JEFFERSON HEALTH non-profit
# 4      MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL) non-profit
# 5                             TENET HEALTHCARE for_profit
# 6                             TENET HEALTHCARE       <NA>
# 7                             TENET HEALTHCARE       <NA>
# 8                             TENET HEALTHCARE       <NA>
# 10 LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS) for_profit
# 13                      INDIAN HEALTH SERVICES       <NA>

dat %>% group_by(NET_PARENT) %>% mutate(type2 = names(which.max(table(OWNERSHIP))))
# Error: incompatible types, expecting a character vector

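This happens because NA is the most frequent value for dat$NET_PARENT == "INDIAN HEALTH SERVICES" and dat$NET_PARENT == "TENET HEALTHCARE". When a group contains nothing but NA, table() is empty and names(which.max(...)) returns NULL, so mutate raises an error: it expects a character value but gets NULL instead. We can fix this with the following change.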

dat %>%
  group_by(NET_PARENT) %>%
  mutate(type2 = ifelse(length(which.max(table(OWNERSHIP))) == 0,
                        "NA",
                        names(which.max(table(OWNERSHIP)))))

# Source: local data frame [10 x 3]
# Groups: NET_PARENT [6]

#                                     NET_PARENT  OWNERSHIP      type2
#                                          (chr)      (chr)      (chr)
# 1               COMMUNITY HEALTH SYSTEMS (CHS) for_profit for_profit
# 2                             JEFFERSON HEALTH non-profit non-profit
# 3                             JEFFERSON HEALTH non-profit non-profit
# 4      MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL) non-profit non-profit
# 5                             TENET HEALTHCARE for_profit for_profit
# 6                             TENET HEALTHCARE         NA for_profit
# 7                             TENET HEALTHCARE         NA for_profit
# 8                             TENET HEALTHCARE         NA for_profit
# 9  LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS) for_profit for_profit
# 10                      INDIAN HEALTH SERVICES         NA         NA

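Note that type2 is "for_profit" for "TENET HEALTHCARE" even though NA is the most frequent value. This is because table does not count NA and omits it from the result. As a result, the only remaining value is recorded as the maximum. For "INDIAN HEALTH SERVICES", however, type2 is listed as "NA".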
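If a real missing value is preferable to the string "NA", the check can be wrapped in a small helper. This is only a sketch: most_common is a hypothetical name, and it assumes the column is a character vector (for a factor, table() would still list unused levels with a count of zero).

```r
library(dplyr)

# Hypothetical helper (not from the answer above): most frequent non-NA
# value of x, or NA_character_ when every value is missing.
most_common <- function(x) {
  counts <- table(x)                       # NAs are dropped by default
  if (length(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]         # first value wins on ties
}

# Reduced version of the data above, kept as character columns.
dat <- data.frame(
  NET_PARENT = c("TENET HEALTHCARE", "TENET HEALTHCARE", "INDIAN HEALTH SERVICES"),
  OWNERSHIP  = c("for_profit", NA, NA),
  stringsAsFactors = FALSE
)

res <- dat %>%
  group_by(NET_PARENT) %>%
  mutate(type2 = most_common(OWNERSHIP)) %>%
  ungroup()
```

Returning NA_character_ keeps type2 a character column while still letting is.na(type2) find the groups with no usable value.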
