Recode multiple columns in R

Time: 2016-05-09 19:30:00

Tags: r multiple-columns flags recode

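I could not find an answer to this specific problem. I want to recode multiple character columns into numeric columns (about a hundred columns), but:

  • The columns will not always be in the same order (I recode data that is updated monthly).
  • The columns are separated by columns that I do not want to recode.
  • The dataset will not always contain the same columns.

So I don't think I can use a range of column indices. However, the columns I want to recode all start with the same column-name prefix. I want to recode every "Yes" to 1, every "No" to 0, and blanks to NA.

I can do this manually, one column at a time, with the following code: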

    #Recode columns one at a time

    library(car)
    #skip ID column
    #Skip Date column
    df$Q1<-as.numeric(as.character(recode(df$Q1,"NA=NA; 'No'=0; 'Yes'=1; ''=NA")))
    df$Q2<-as.numeric(as.character(recode(df$Q2,"NA=NA; 'No'=0; 'Yes'=1; ''=NA")))
    #skip Q2.Explanation column
    #do the above for a hundred more columns...

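But I want to recode a hundred specific columns at once. These columns are also separated by columns that I do not want to recode.

My data looks like the following. I don't know what dput is: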

    ID<-c(01,02,03,04,05)
    Q1<-c("Yes", NA,"", "No",NA)
    Q1.Explanation<-c (NA, NA,"","Respondent did not get the correct answer", NA)
    Q2<-c("No","Yes","Yes","", NA)
    Q2.Explanation <-c("The right answer was not proven", NA, NA, NA, NA)
    Q3<-c("", NA, "Yes", NA, NA)
    Mydata<-as.data.frame(cbind(ID,Q1,Q1.Explanation, Q2, Q2.Explanation,Q3))

2 answers:

Answer 0 (score: 2)

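If you know that the columns you want to change always have the same names and only sit at different positions in the table, you can subset with a regular expression on the column names and then use `apply()` to change the values in those columns.

This should recode every column starting with "Q", regardless of where it sits in any given month.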
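
A minimal sketch of that approach in base R, not necessarily the exact code this answer had in mind. It assumes the columns to recode are named "Q" followed only by digits (so the `.Explanation` columns are skipped), it uses `lapply()` rather than `apply()` so that each column stays a plain vector, and `recode_yes_no()` is a hypothetical helper introduced for illustration:

```r
# Sample data from the question, built directly as a data frame
Mydata <- data.frame(
  ID = 1:5,
  Q1 = c("Yes", NA, "", "No", NA),
  Q1.Explanation = c(NA, NA, "", "Respondent did not get the correct answer", NA),
  Q2 = c("No", "Yes", "Yes", "", NA),
  Q2.Explanation = c("The right answer was not proven", NA, NA, NA, NA),
  Q3 = c("", NA, "Yes", NA, NA),
  stringsAsFactors = FALSE
)

# Assumed naming pattern: "Q" followed only by digits (Q1, Q2, ...),
# which leaves Q1.Explanation and friends alone
q_cols <- grep("^Q[0-9]+$", names(Mydata), value = TRUE)

# Hypothetical helper: "Yes" -> 1, "No" -> 0, everything else (blank, NA) -> NA
recode_yes_no <- function(x) {
  lookup <- c(No = 0, Yes = 1)
  unname(lookup[as.character(x)])
}

# Recode only the selected columns, wherever they sit in the table
Mydata[q_cols] <- lapply(Mydata[q_cols], recode_yes_no)

Mydata
```

With the sample data, `Mydata$Q1` becomes `1 NA NA 0 NA`, while the ID and explanation columns are left untouched. Because the columns are picked by name, the result does not depend on their order or on which of them are present in a given month.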

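Answer 1 (score: 1)

For data.table fans, here is another solution. It also has the advantage of recoding with factors instead of plain integers, so the meaning of each value is still displayed (which makes the data more readable):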

    library(data.table)

    ID<-c(01,02,03,04,05)
    Q1<-c("Yes", NA,"", "No",NA)
    Q1.Explanation<-c (NA, NA,"","Respondent did not get the correct answer", NA)
    Q2<-c("No","Yes","Yes","", NA)
    Q2.Explanation <-c("The right answer was not proven", NA, NA, NA, NA)
    Q3<-c("", NA, "Yes", NA, NA)
    Mydata<-as.data.frame(cbind(ID,Q1,Q1.Explanation, Q2, Q2.Explanation,Q3))

    Mydata

    # The solution starts here... ----------------------------------------------

    setDT(Mydata)     # convert data.frame into data.table (by reference)

    # the regular expression selects all column names starting with a "Q" followed by digits until the end
    affected.cols <- grep("^Q\\d+$", colnames(Mydata), value = TRUE)

    # convert the columns to factors; trailing square brackets are only added to print the output
    Mydata[, (affected.cols) := lapply(.SD, factor, levels = c("No", "Yes")), .SDcols = affected.cols][]

    str(Mydata)           # Columns are encoded as factors ("enumerated types") now, which is an integer internally that has a string label

    # Proof: 1 = "No", 2 = "Yes"; values not listed in the "levels" argument of "factor()" (mainly empty strings) are translated into NAs
    as.numeric(Mydata$Q1)

The result is:

    > as.numeric(Mydata$Q1)
    [1]  2 NA NA  1 NA

    > Mydata
       ID  Q1                            Q1.Explanation  Q2                  Q2.Explanation  Q3
    1:  1 Yes                                        NA  No The right answer was not proven  NA
    2:  2  NA                                        NA Yes                              NA  NA
    3:  3  NA                                           Yes                              NA Yes
    4:  4  No Respondent did not get the correct answer  NA                              NA  NA
    5:  5  NA                                        NA  NA                              NA  NA

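The numeric codes come out in the right order only because the levels were given as "No" followed by "Yes", so the level index is 1 for "No" and 2 for "Yes". Note that these codes start at 1, whereas the question asks for 0 and 1.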
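
If the 0/1 coding from the question is needed, one option is to subtract 1 from the factor's integer codes. A minimal sketch in base R, assuming the factor was built with the levels in the order "No", "Yes" as above (the names `q1` and `q1_numeric` are only for illustration):

```r
# Factor built the same way as in this answer: only "No" and "Yes" are levels,
# so empty strings and NA both end up as NA
q1 <- factor(c("Yes", NA, "", "No", NA), levels = c("No", "Yes"))

# The internal codes are 1 ("No") and 2 ("Yes"); subtracting 1 gives 0 and 1
q1_numeric <- as.integer(q1) - 1L

q1_numeric
```

This relies entirely on the level order: with `levels = c("Yes", "No")` the same subtraction would give the opposite coding.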