Most frequent string value in a character variable in R

Date: 2016-01-22 14:29:30

Tags: r data-cleansing

My data frame has two columns (hospital_name, type). Both are character variables. The data looks like this:

hospital_name  type
ABC            rural
ABC            rural
ABC            urban
XYZ            urban
XYZ            urban
EFG            rural

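I am writing code that groups by hospital_name and counts how often each type occurs within each group. It should then create a new column named type2 that holds the value of type occurring most often in that group. The expected output is: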

hospital_name  type  type2
ABC            rural rural
XYZ            urban urban
EFG            rural rural        

I tried to solve this with dplyr, but I get an error. Here is my attempt:

library("dplyr")
df<-df%>%group_by(hospital_name)%>%mutate(type2=names(which.max(table(type))))

The error is:

Error: incompatible types, expecting a character vector

1 answer:

Answer 0 (score: 5)

The code above runs without error for me, but it does not produce the desired output, so I tweaked it slightly to get what you want:

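library(dplyr)

# data_frame() is deprecated in newer releases; tibble() is its replacement
dat <- dplyr::data_frame(hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"), 
                         type = c("rural", "rural", "urban", "urban", "urban", "rural"))

# Assign the result back so that printing dat shows the summarised rows
dat <- dat %>% group_by(hospital_name) %>% 
  mutate(type2 = names(which.max(table(type)))) %>%  # most frequent type per group
  filter(type == type2) %>%                          # keep rows matching that type
  distinct()                                         # one row per hospital

dat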
# Source: local data frame [3 x 3]
# Groups: hospital_name [3]
#
#   hospital_name  type type2
#           (chr) (chr) (chr)
# 1           ABC rural rural
# 2           XYZ urban urban
# 3           EFG rural rural
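
To make the key expression easier to follow, here is a minimal base R sketch of what `names(which.max(table(type)))` does for a single group. The vector below is simply the "ABC" rows from the example data; the variable names `counts` and `winner` are only illustrative.

```r
# Types recorded for hospital "ABC" in the example data
type <- c("rural", "rural", "urban")

counts <- table(type)        # frequency of each distinct value
winner <- which.max(counts)  # position of the first largest count, named by its value
names(winner)                # "rural": the most frequent value as a character string
```

Note that `which.max` returns only the first maximum, so if two values are tied, the one that comes first in the table's ordering wins.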

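Update

A comment above suggests that the data has NA in the type column, and that this is the cause of the error. However, that does not seem to be a problem on my machine.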

dat <- data.frame(hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"), 
                  type = c("rural", "rural", "urban", "urban", NA, "rural"))
dat
#   hospital_name  type
# 1           ABC rural
# 2           ABC rural
# 3           ABC urban
# 4           XYZ urban
# 5           XYZ  <NA>
# 6           EFG rural

sapply(dat, class)
# hospital_name          type 
#      "factor"      "factor" 

dat %>% 
  group_by(hospital_name) %>% 
  mutate(type2 = names(which.max(table(type))))

# Source: local data frame [6 x 3]
# Groups: hospital_name [3]

#   hospital_name   type type2
#          (fctr) (fctr) (chr)
# 1           ABC  rural rural
# 2           ABC  rural rural
# 3           ABC  urban rural
# 4           XYZ  urban urban
# 5           XYZ     NA urban
# 6           EFG  rural rural
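
A likely reason the NA in the "XYZ" group does not matter here is that `table` leaves NA out of its counts by default. The sketch below uses only the two "XYZ" values from the example above; `useNA = "ifany"` is shown purely for comparison.

```r
# The two values recorded for hospital "XYZ" in the example above
type <- c("urban", NA)

default_counts <- table(type)                   # NA is dropped by default
with_na_counts <- table(type, useNA = "ifany")  # NA counted as its own category

length(default_counts)          # 1
length(with_na_counts)          # 2
names(which.max(default_counts))  # "urban"
```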

Update 2

So I was finally able to reproduce your error.

dat <- structure(list(NET_PARENT = c("COMMUNITY HEALTH SYSTEMS (CHS)", 
"JEFFERSON HEALTH", "JEFFERSON HEALTH", "MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL)", 
"TENET HEALTHCARE", "TENET HEALTHCARE", "TENET HEALTHCARE", "TENET HEALTHCARE", 
"LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS)", "INDIAN HEALTH SERVICES"
), OWNERSHIP = c("for_profit", "non-profit", "non-profit", "non-profit", 
"for_profit", NA, NA, NA, "for_profit", NA)), .Names = c("NET_PARENT", 
"OWNERSHIP"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 10L, 
13L), class = "data.frame")

dat

#                                     NET_PARENT  OWNERSHIP
# 1               COMMUNITY HEALTH SYSTEMS (CHS) for_profit
# 2                             JEFFERSON HEALTH non-profit
# 3                             JEFFERSON HEALTH non-profit
# 4      MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL) non-profit
# 5                             TENET HEALTHCARE for_profit
# 6                             TENET HEALTHCARE       <NA>
# 7                             TENET HEALTHCARE       <NA>
# 8                             TENET HEALTHCARE       <NA>
# 10 LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS) for_profit
# 13                      INDIAN HEALTH SERVICES       <NA>

dat %>% group_by(NET_PARENT) %>% mutate(type2 = names(which.max(table(OWNERSHIP))))
# Error: incompatible types, expecting a character vector

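This happens because, for dat$NET_PARENT == "INDIAN HEALTH SERVICES" and dat$NET_PARENT == "TENET HEALTHCARE", the most common value is NA. That raises an error in mutate, because it expects a character value and gets a NULL value instead. We can fix this with the following change.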

dat %>%
  group_by(NET_PARENT) %>%
  mutate(type2 = ifelse(length(which.max(table(OWNERSHIP))) == 0,
                        "NA",
                        names(which.max(table(OWNERSHIP)))))

# Source: local data frame [10 x 3]
# Groups: NET_PARENT [6]

#                                     NET_PARENT  OWNERSHIP      type2
#                                          (chr)      (chr)      (chr)
# 1               COMMUNITY HEALTH SYSTEMS (CHS) for_profit for_profit
# 2                             JEFFERSON HEALTH non-profit non-profit
# 3                             JEFFERSON HEALTH non-profit non-profit
# 4      MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL) non-profit non-profit
# 5                             TENET HEALTHCARE for_profit for_profit
# 6                             TENET HEALTHCARE         NA for_profit
# 7                             TENET HEALTHCARE         NA for_profit
# 8                             TENET HEALTHCARE         NA for_profit
# 9  LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS) for_profit for_profit
# 10                      INDIAN HEALTH SERVICES         NA         NA

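Note that type2 is "for_profit" for "TENET HEALTHCARE" even though the most frequent value there is NA. This is because table does not count NA and omits it from the tally, so the only remaining value is recorded as the most frequent. For "INDIAN HEALTH SERVICES", however, type2 is the string "NA".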
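
If a real missing value is preferred over the string "NA", one option is to wrap the logic in a small helper. This is only a sketch: `most_common` is a hypothetical name, not part of the original answer or of dplyr, and it assumes that NA should never be reported as the most frequent value.

```r
# Hypothetical helper (not from the original answer): returns the most
# frequent non-NA value of x, or NA_character_ when x has no non-NA values.
most_common <- function(x) {
  x <- as.character(x[!is.na(x)])  # drop NA; also copes with factor input
  if (length(x) == 0) {
    return(NA_character_)          # nothing left to count
  }
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# OWNERSHIP values for two of the groups in the data above
ownership_tenet  <- c("for_profit", NA, NA, NA)
ownership_indian <- NA_character_

most_common(ownership_tenet)   # "for_profit"
most_common(ownership_indian)  # NA
```

Inside the grouped pipeline this could be used as `mutate(type2 = most_common(OWNERSHIP))`, which should leave type2 as a true NA for groups with no recorded ownership.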