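R: Replacing factors with integer values in multiple cells across many columns

Asked: 2016-05-25 08:39:05

Tags: r

My challenge is converting a csv of raw scale responses into a csv of scores. Across many columns, the cells of the file hold one of six levels ranging from "Strongly Agree" to "Strongly Disagree". These factors need to be converted to the integers 5 down to 0, respectively.

I tried using sapply, and I tried converting the table to a string; both failed. sapply works on a single vector, but it destroys the table structure.

Method 1: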

dat$Col<-sapply(dat$Col,switch,'Strongly Disagree'=0,'Disagree'=1,'Slightly Disagree'=2,'Slightly Agree'=3,'Agree'=4, 'Strongly Agree'=5)

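My second approach was to convert the csv to a string. When I inspected the dput output, I could see the region I wanted to target, starting at .Label = c("", "Strongly Agree"... That was wrong: my changes produced no useful result.

My third approach, found somewhere on the internet, seemed to suggest that gsub() could also handle the string approach. No; once again the underlying table structure was destroyed.

Method #3: convert to a string and pattern match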

dat <- textConnection("control/Surveys/StudyDat_1.csv")
#Score Scales
##"Strongly Agree"= 5
##"Agree"= 4
##"Strongly Disagree" = 0
#levels(dat$Col) <- gsub("Strongly Agree", "5", levels(dat$Col))
    df<- gsub("Strongly Agree", "5",dat)
    dat<-read.csv(textConnection(df),header=TRUE)

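In the end, I want to replace every "Strongly Agree" with 5 across many columns without destroying the retrievability of the data.

Perhaps I am using the wrong search terms and you know of a resource that solves this. I would rather avoid approaches that treat everything as character vectors, because a code answer of that kind would need to name each column. It needs to loop over all the columns.

Thanks

Data sample: problem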

    structure(list(last_updated = structure(c(3L, 1L, 7L, 2L, 10L, 6L, 8L, 9L, 7L, 5L, 4L), .Label = c("2016-05-13T12:53:56.704184Z", 
"2016-05-13T12:54:09.273359Z", "2016-05-13T12:54:22.757251Z", 
"2016-05-14T12:44:13.474992Z", "2016-05-14T12:44:31.736469Z", 
"2016-05-16T16:45:10.623410Z", "2016-05-16T16:46:17.881402Z", 
"2016-05-16T16:46:55.122257Z", "2016-05-16T16:47:14.160793Z", 
"2016-05-24T02:26:04.770799Z"), class = "factor"), feedback = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), A = structure(c(NA, 
NA, 2L, NA, 1L, NA, NA, NA, 2L, NA, NA), .Label = c("", "Slightly Disagree"
), class = "factor"), B = structure(c(NA, NA, 2L, NA, 1L, NA, 
NA, NA, 3L, NA, NA), .Label = c("", "Disagree", "Strongly Agree"
), class = "factor"), C = structure(c(NA, NA, 2L, NA, 1L, NA, 
NA, NA, 3L, NA, NA), .Label = c("", "Agree", "Disagree"), class = "factor"), 
    D = structure(c(NA, NA, 2L, NA, 1L, NA, NA, NA, 2L, NA, NA
    ), .Label = c("", "Agree"), class = "factor"), E = structure(c(NA, 
    NA, 2L, NA, 1L, NA, NA, NA, 3L, NA, NA), .Label = c("", "Agree", 
    "Strongly Disagree"), class = "factor")), .Names = c("last_updated", 
"feedback", "A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA, 
-11L))

Data sample: solution

df <- structure(list(last_updated = structure(c(3L, 1L, 7L, 2L, 10L, 6L, 8L, 9L, 7L, 5L, 4L), .Label = c("2016-05-13T12:53:56.704184Z", 
"2016-05-13T12:54:09.273359Z", "2016-05-13T12:54:22.757251Z", 
"2016-05-14T12:44:13.474992Z", "2016-05-14T12:44:31.736469Z", 
"2016-05-16T16:45:10.623410Z", "2016-05-16T16:46:17.881402Z", 
"2016-05-16T16:46:55.122257Z", "2016-05-16T16:47:14.160793Z", 
"2016-05-24T02:26:04.770799Z"), class = "factor"), feedback = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), A = c(NA, NA, 2L, NA, 
NA, NA, NA, NA, 2L, NA, NA), B = c(NA, NA, 1L, NA, NA, NA, NA, 
NA, 5L, NA, NA), C = c(NA, NA, 4L, NA, NA, NA, NA, NA, 1L, NA, 
NA), D = c(NA, NA, 4L, NA, NA, NA, NA, NA, 4L, NA, NA), E = c(NA, 
NA, 4L, NA, NA, NA, NA, NA, 0L, NA, NA)), .Names = c("last_updated", 
"feedback", "A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA, -11L))

3 Answers:

Answer 0: (score: 2)

We can use factor with the levels and labels specified:

 nm1 <- c('Strongly Disagree', 'Disagree',
     'Slightly Disagree','Slightly Agree','Agree', 'Strongly Agree')

 factor(dat$col, levels = nm1,
         labels = 0:5)

If there are multiple factor columns with the same levels, identify the factor columns ('i1'), loop over those columns with lapply, and specify the levels and labels:

i1 <- sapply(dat, is.factor)
dat[i1] <- lapply(dat[i1], factor, levels = nm1, labels= 0:5)

Update

Using the OP's dput output:

dat[-(1:2)] <- lapply(dat[-(1:2)], factor, levels = nm1, labels = 0:5)
dat
#                last_updated feedback    A    B    C    D    E
#1  2016-05-13T12:54:22.757251Z       NA <NA> <NA> <NA> <NA> <NA>
#2  2016-05-13T12:53:56.704184Z       NA <NA> <NA> <NA> <NA> <NA>
#3  2016-05-16T16:46:17.881402Z       NA    2    1    4    4    4
#4  2016-05-13T12:54:09.273359Z       NA <NA> <NA> <NA> <NA> <NA>
#5  2016-05-24T02:26:04.770799Z       NA <NA> <NA> <NA> <NA> <NA>
#6  2016-05-16T16:45:10.623410Z       NA <NA> <NA> <NA> <NA> <NA>
#7  2016-05-16T16:46:55.122257Z       NA <NA> <NA> <NA> <NA> <NA>
#8  2016-05-16T16:47:14.160793Z       NA <NA> <NA> <NA> <NA> <NA>
#9  2016-05-16T16:46:17.881402Z       NA    2    5    1    4    0
#10 2016-05-14T12:44:31.736469Z       NA <NA> <NA> <NA> <NA> <NA>
#11 2016-05-14T12:44:13.474992Z       NA <NA> <NA> <NA> <NA> <NA>

Another option is set from data.table:

library(data.table)
for(j in names(dat)[-(1:2)]){
  set(dat, i = NULL, j= j, value = factor(dat[[j]], levels = nm1, labels = 0:5))
 }
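
One thing worth noting about the factor approach above: the resulting columns are still factors whose labels happen to be digits, so arithmetic on them will not behave like arithmetic on scores. The following is a minimal sketch, using a small made-up data frame (toy is hypothetical, not the OP's data), of one way to go on to true integers by converting through character first:

```r
# Scale definition as in the answer above
nm1 <- c('Strongly Disagree', 'Disagree', 'Slightly Disagree',
         'Slightly Agree', 'Agree', 'Strongly Agree')

# Hypothetical toy data with empty strings and NA, similar in spirit to the sample
toy <- data.frame(A = factor(c('Agree', '', 'Strongly Disagree')),
                  B = factor(c('Strongly Agree', NA, 'Disagree')))

# Relabel as in the answer: each column is still a factor afterwards
toy[] <- lapply(toy, factor, levels = nm1, labels = 0:5)
sapply(toy, class)

# Convert via character so that the labels, not the internal codes, become the values
toy[] <- lapply(toy, function(x) as.integer(as.character(x)))
sapply(toy, class)
```

Calling as.integer directly on the factor would return the internal codes 1 to 6 rather than the labels 0 to 5, which is why the sketch goes through as.character.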

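Answer 1: (score: 2)

I would simply match each target column vector against a precomputed character vector to get integer indexes. You can subtract 1 afterwards to shift the range from 1:6 to 0:5.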

## define desired value order, ascending
o <- c(
    'Strongly Disagree',
    'Disagree',
    'Slightly Disagree',
    'Slightly Agree',
    'Agree',
    'Strongly Agree'
);

## convert target columns
for (cn in names(df)[-(1:2)]) df[[cn]] <- match(as.character(df[[cn]]),o)-1L;
df;
##                   last_updated feedback  A  B  C  D  E
## 1  2016-05-13T12:54:22.757251Z       NA NA NA NA NA NA
## 2  2016-05-13T12:53:56.704184Z       NA NA NA NA NA NA
## 3  2016-05-16T16:46:17.881402Z       NA  2  1  4  4  4
## 4  2016-05-13T12:54:09.273359Z       NA NA NA NA NA NA
## 5  2016-05-24T02:26:04.770799Z       NA NA NA NA NA NA
## 6  2016-05-16T16:45:10.623410Z       NA NA NA NA NA NA
## 7  2016-05-16T16:46:55.122257Z       NA NA NA NA NA NA
## 8  2016-05-16T16:47:14.160793Z       NA NA NA NA NA NA
## 9  2016-05-16T16:46:17.881402Z       NA  2  5  1  4  0
## 10 2016-05-14T12:44:31.736469Z       NA NA NA NA NA NA
## 11 2016-05-14T12:44:13.474992Z       NA NA NA NA NA NA
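
Since the question asks for something that walks over all columns without naming them, the match idea above could be wrapped so that it picks its own target columns. This is only a sketch under assumptions: score_likert, its lvls argument, and the toy data frame are hypothetical names introduced here, and the rule "a column qualifies if all of its non-missing, non-empty values are on the scale" is one possible choice, not something stated in the thread.

```r
# Desired value order, ascending (same as in the answer above)
o <- c('Strongly Disagree', 'Disagree', 'Slightly Disagree',
       'Slightly Agree', 'Agree', 'Strongly Agree')

# Hypothetical helper: score every column whose non-missing, non-empty
# values all belong to the scale; leave all other columns untouched
score_likert <- function(d, lvls = o) {
  is_scale <- vapply(d, function(x) {
    v <- as.character(x)
    v <- v[!is.na(v) & v != ""]
    length(v) > 0 && all(v %in% lvls)
  }, logical(1))
  d[is_scale] <- lapply(d[is_scale],
                        function(x) match(as.character(x), lvls) - 1L)
  d
}

# Made-up example data
toy <- data.frame(id = c("a", "b", "c"),
                  Q1 = c("Agree", "", "Strongly Disagree"),
                  Q2 = c("Strongly Agree", NA, "Disagree"),
                  stringsAsFactors = TRUE)
scored <- score_likert(toy)
scored
```

Columns that are entirely NA or empty (like feedback in the sample) are left alone under this rule, as are identifier or timestamp columns.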

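Answer 2: (score: 0)

The previous answers may well meet your needs, but note that changing the labels of a factor is not the same as changing the factor into an integer variable. One possibility is to use ifelse (I am posting a new data frame, because the one you posted does not actually contain variables with all of these levels):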

lev <- c('Strongly disagree', 'Disagree', 'Slightly disagree', 'Slightly agree', 'Agree', 'Strongly agree')

dta <- sample(lev, 55, replace = TRUE)
dta <- data.frame(matrix(dta, nrow = 11), stringsAsFactors = TRUE)  # needed in R >= 4.0 so the columns are factors
names(dta) <- LETTERS[1:5]

f_to_int <- function(f) {
  if (is.factor(f)){
   ifelse(f == 'Strongly disagree', 0, 
     ifelse(f == 'Disagree', 1, 
       ifelse(f == 'Slightly disagree', 2,
         ifelse(f == 'Slightly agree', 3,
           ifelse(f == 'Agree', 4,
             ifelse(f == 'Strongly agree', 5, f))))))
  } else f
}

dta2 <- sapply(dta, f_to_int)

Note that this returns a matrix, but it can easily be converted to a data frame if needed.
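
As a rough illustration of that last step, the sketch below shows two possible ways to end up with a data frame. To keep it short and self-contained, to_int is a hypothetical stand-in for f_to_int that uses match instead of the nested ifelse chain; the seed is set only so the made-up data is reproducible.

```r
lev <- c('Strongly disagree', 'Disagree', 'Slightly disagree',
         'Slightly agree', 'Agree', 'Strongly agree')

set.seed(1)  # only for reproducibility of the made-up data
dta <- data.frame(matrix(sample(lev, 55, replace = TRUE), nrow = 11),
                  stringsAsFactors = TRUE)
names(dta) <- LETTERS[1:5]

# Hypothetical stand-in for f_to_int (match-based rather than nested ifelse)
to_int <- function(f) if (is.factor(f)) match(as.character(f), lev) - 1L else f

# Option 1: sapply returns a matrix, which is then converted back
m <- sapply(dta, to_int)
dta2 <- as.data.frame(m)

# Option 2: assign into dta3[] so the data frame structure is never lost
dta3 <- dta
dta3[] <- lapply(dta, to_int)
```

The second form avoids the intermediate matrix, which may matter if some columns are not numeric after conversion, since a matrix can hold only one type.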