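Recode all variables based on a condition

Time: 2019-08-25 01:35:20

Tags: r dplyr data-manipulation data-cleaning

I'm trying to recode all the variables in my dataset that are on an agree/disagree scale into numeric values. I tried using mutate_all with case_when, but it then returns NA for variables such as the ID column and var3 (data below). Here is the code I'm using: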

newdat <- olddat %>% mutate_all(funs(case_when(. == "Strongly Disagree (1)" ~ 1,
                                               . == "Disagree (2)" ~ 2,
                                               . == "Neutral (3)" ~ 3,
                                               . == "Agree (4)" ~ 4,
                                               . == "Strongly Agree (5)" ~ 5)))

What I want to happen is the following:

Data I have

id     var1                      var2           var3      var4
 1     Strongly Disagree (1)     Agree (4)      5         Agree (4)
 2     Strongly Disagree (1)     Neutral (3)    6         Neutral (3)
 3     Disagree (2)              Neutral (3)    4         Strongly Agree (5)
 4     Strongly Disagree (1)     Agree (4)      9         Disagree (2)
 5     Neutral (3)               Agree (4)      2         Agree (4)

Data I want

id     var1   var2   var3   var4
 1     1      4      5      4
 2     1      3      6      3
 3     2      3      4      5
 4     1      4      9      2
 5     3      4      2      4

P.S. I tried to find an existing answer but couldn't. Maybe I phrased my search badly?

3 answers:

Answer 0: (score: 4)

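You can simply extract the numeric code from each cell, since it is already included in parentheses. There is no need to recode. Here is an approach using stringr::str_extract():

library(dplyr)
library(stringr)
have %>% 
  mutate_at(vars(starts_with("var")), ~as.integer(str_extract(.x, "[0-9]")))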
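
For a reproducible version of the idea above, here is a minimal sketch. It assumes dplyr 1.0.0 or later (for across() and where()), rebuilds the question's data under the name `have`, and swaps in stringr::str_match() with a capture group so that only the digits inside the parentheses are taken. Restricting the recode to character columns is my assumption; it leaves id and var3 untouched.

```r
library(dplyr)
library(stringr)

# Sample data mirroring the question (rebuilt here for illustration)
have <- data.frame(
  id   = 1:5,
  var1 = c("Strongly Disagree (1)", "Strongly Disagree (1)", "Disagree (2)",
           "Strongly Disagree (1)", "Neutral (3)"),
  var2 = c("Agree (4)", "Neutral (3)", "Neutral (3)", "Agree (4)", "Agree (4)"),
  var3 = c(5L, 6L, 4L, 9L, 2L),
  var4 = c("Agree (4)", "Neutral (3)", "Strongly Agree (5)", "Disagree (2)",
           "Agree (4)"),
  stringsAsFactors = FALSE
)

# Take the digits between "(" and ")" in every character column;
# column 2 of the str_match() result is the first capture group
want <- have %>%
  mutate(across(where(is.character),
                ~ as.integer(str_match(.x, "\\((\\d+)\\)")[, 2])))
```

Because the pattern is anchored on the parentheses, it would also cope with codes of more than one digit, which a bare "[0-9]" would truncate to the first digit.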

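Answer 1: (score: 3)

You need to use mutate_at instead of mutate_all, because you only want to change selected columns: by default, values that match no condition in case_when become NA.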

library(dplyr)

df %>% mutate_at(vars(var1, var2, var4), 
                     ~(case_when(. == "Strongly Disagree (1)" ~ 1,
                                 . == "Disagree (2)" ~ 2,
                                 . == "Neutral (3)" ~ 3,
                                 . == "Agree (4)" ~ 4,
                                 . == "Strongly Agree (5)" ~ 5)))

#  id var1 var2 var3 var4
#1  1    1    4    5    4
#2  2    1    3    6    3
#3  3    2    3    4    5
#4  4    1    4    9    2
#5  5    3    4    2    4
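
To see why mutate_all produced NA in the question, here is a small sketch of case_when() applied to a vector that contains a value outside the scale. The vector `x` is made up for illustration.

```r
library(dplyr)

# "7" stands in for a value from a column such as id or var3
x <- c("Agree (4)", "Neutral (3)", "7")

recoded <- case_when(x == "Agree (4)"   ~ 4,
                     x == "Neutral (3)" ~ 3)

# The third element matches neither condition, so case_when() returns NA for it
print(recoded)
```

The same thing happens to every cell of id and var3 when the recode is applied to all columns, which is why those columns come back entirely NA.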

Since there are many columns to do this for, we can first find the columns that need changing and then use mutate_at:

cols <- which(colSums(sapply(df, grepl, pattern =  "Agree|Disagree")) > 0)

df %>%
    mutate_at(cols, ~case_when(. == "Strongly Disagree (1)" ~ 1,
                    . == "Disagree (2)" ~ 2,
                    . == "Neutral (3)" ~ 3,
                    . == "Agree (4)" ~ 4,
                    . == "Strongly Agree (5)" ~ 5))
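
An alternative way to pick the columns automatically, sketched here under the assumption of dplyr 1.0.0 or later, is a named lookup vector combined with across(). The names `likert` and `is_likert` are hypothetical helpers, and the small `df` is made up for the example.

```r
library(dplyr)

# Hypothetical lookup table: scale label -> numeric code
likert <- c("Strongly Disagree (1)" = 1, "Disagree (2)" = 2, "Neutral (3)" = 3,
            "Agree (4)" = 4, "Strongly Agree (5)" = 5)

df <- data.frame(id   = 1:3,
                 var1 = c("Disagree (2)", "Neutral (3)", "Agree (4)"),
                 var3 = c(5L, 6L, 4L),
                 stringsAsFactors = FALSE)

# A column qualifies only if it is character and every value is a known label
is_likert <- function(col) is.character(col) && all(col %in% names(likert))

# Index the lookup vector by label and drop the names it carries along
out <- df %>%
  mutate(across(where(is_likert), ~ unname(likert[.x])))
```

Note that a column containing missing values or stray labels would fail the is_likert() check and be left unchanged, so this sketch errs on the side of not recoding.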

Answer 2: (score: 1)

This looks ugly, and I'm sure there are simpler solutions, but it should do the job:

newdat <- as.data.frame(lapply(seq_along(olddat), function(x) {
  if (x %in% c(1, 4)) return(olddat[[x]])  # id and var3 are kept as-is
  sapply(strsplit(as.character(olddat[[x]]), split = " "),
         function(y) as.numeric(gsub("[()]", "", y[length(y)])))
}))
names(newdat) <- names(olddat)

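What it does is essentially loop over each column. If it is the first or fourth column (id and var3), the column is returned as-is. Otherwise, each cell is split on whitespace with strsplit(), the trailing piece holding the code in parentheses is taken, the parentheses are removed with gsub(), and the result is converted to a number with as.numeric().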
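
The per-cell steps described above can be traced on a single value. This is only an illustrative sketch in base R; the variable names are made up.

```r
cell <- "Strongly Disagree (1)"

# Step 1: split on whitespace
pieces <- strsplit(cell, split = " ")[[1]]   # "Strongly" "Disagree" "(1)"

# Step 2: keep the last piece, which holds the code in parentheses
last <- pieces[length(pieces)]               # "(1)"

# Step 3: strip the parentheses and convert to a number
code <- as.numeric(gsub("[()]", "", last))   # 1
```

Taking the last piece rather than a fixed position matters because the labels have different word counts ("Agree (4)" has two pieces, "Strongly Agree (5)" has three).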

Edit:

If you have many columns and do not want to specify them by hand, you can filter by column class instead:

newdat <- as.data.frame(lapply(seq_along(olddat), function(x) {
  if (is.numeric(olddat[[x]])) return(olddat[[x]])  # numeric columns are kept as-is
  sapply(strsplit(as.character(olddat[[x]]), split = " "),
         function(y) as.numeric(gsub("[()]", "", y[length(y)])))
}))
names(newdat) <- names(olddat)