R data.table: how to replace positive values with column names across multiple binary columns

Time: 2015-10-21 18:39:10

Tags: r replace logic data.table multiple-columns

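I am using R v. 3.2.1 and data.table v 1.9.6. I have a data.table like the example below. It contains some coded binary columns, classed as character with the values "0" and "1", and a vector of character strings containing phrases that include some of the same words as the binary column names. My ultimate goal is to create a wordcloud using both the words in the string vector and the positive responses in the binary vectors. To do this, I first need to convert the positive responses in the binary vectors to their column names, and that is where I am getting stuck.

A similar question has been asked here, but it is not quite the same: the poster there starts with a matrix, and the suggested solution does not seem to work for more complex data sets. Columns other than my binary columns also contain 1s, so the solution needs to identify my binary columns accurately first.

Here is some example data: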

library(data.table)

id <- c(1,2,3,4,5)
age <- c("5", "1", "11", "20", "21")
apple <- c("0", "1", NA, "1", "0")
pear <- c("1", "1", "1", "0", "0")
banana <- c("0", "1", "1", NA, "1")
favfood <- c("i love pear juice", "i eat chinese pears and crab apples every sunday", "i also like apple tart", "i like crab apple juice", "i hate most fruit except bananas" )

df <- as.data.frame(cbind(id, age, apple, pear, banana, favfood), stringsAsFactors=FALSE)
dt <- data.table(df)
dt[, id := as.numeric(id)]

Here is what the data looks like:

    id age apple pear banana                                          favfood
1:  1   5     0    1      0                                i love pear juice
2:  2   1     1    1      1 i eat chinese pears and crab apples every sunday
3:  3  11    NA    1      1                           i also like apple tart
4:  4  20     1    0     NA                          i like crab apple juice
5:  5  21     0    0      1                 i hate most fruit except bananas

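So each row should contribute a frequency of 1 for apple in the wordcloud if apple == 1, or if favfood contains the string "apple", or both; and so on for the other columns.

Here is my attempt (it does not do what I want, but gets about halfway there):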
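The counting rule described above can be sketched in base R as follows. This is only an illustration on a plain data.frame copy of the example; the names `fruit_cols` and `freq` are introduced here and are not part of the original post, and it assumes a simple substring match is acceptable (so "pears" counts as a mention of "pear").

```r
# Sketch: per-word frequency = number of rows with a positive binary response
# OR a mention of the word in favfood (each row counted at most once).
# `fruit_cols` and `freq` are illustrative names, not from the original post.
df <- data.frame(
  apple   = c("0", "1", NA, "1", "0"),
  pear    = c("1", "1", "1", "0", "0"),
  banana  = c("0", "1", "1", NA, "1"),
  favfood = c("i love pear juice",
              "i eat chinese pears and crab apples every sunday",
              "i also like apple tart",
              "i like crab apple juice",
              "i hate most fruit except bananas"),
  stringsAsFactors = FALSE
)

fruit_cols <- c("apple", "pear", "banana")

freq <- vapply(fruit_cols, function(col) {
  positive  <- df[[col]] %in% "1"                    # NA is treated as not positive
  mentioned <- grepl(col, df$favfood, fixed = TRUE)  # plain substring match
  sum(positive | mentioned)                          # one count per qualifying row
}, integer(1))

print(freq)
```

The resulting named vector could then be passed as word frequencies to whichever wordcloud function is used.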

# First define the logic columns.
# I've done this by name here but in my real data set this won't work because there are too many    
logicols <- c("apple", "pear", "banana")

# Next identify the location of the "1"s within the subset of logic columns:
ones <- which(dt==1 & colnames(dt) %in% logicols, arr.ind=T)

# Lastly, convert the "1"s in the subset to their column names:
dt[ones, ]<-colnames(dt)[ones[,2]]

This gives:

> dt
   id age apple pear banana                                          favfood
1:  1   5     0 pear      0                                i love pear juice
2:  2   1     1 pear banana i eat chinese pears and crab apples every sunday
3:  3  11    NA    1 banana                           i also like apple tart
4:  4  20     1    0     NA                          i like crab apple juice
5:  5  21     0    0      1                 i hate most fruit except bananas

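There are two problems with this approach:

(a) Identifying the columns to convert by name is not convenient for my real data set, because there are many of them. How can I identify this subset of columns without including other columns that contain 1s but also contain other values (in this example, "age" contains a 1 but is clearly not a logic column)? I have deliberately coded "age" as a character column in the example because, in my real data set, there are character columns containing 1s that are not logic columns. The feature that sets them apart is that my logic columns are character, but contain only the values 0, 1 or missing (NA).

(b) The index has not picked up all the 1s in the logic columns. Does anyone know why this is (for example, the 1 in the second row of the "apple" column is not converted)?

Many thanks for your help. I'm sure I am missing something relatively simple, but I am quite stuck on this.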
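On point (b), a likely explanation is vector recycling: `colnames(dt) %in% logicols` is a vector with one element per column, and when it is combined with the logical matrix from `dt==1`, R recycles it element by element down the columns (column-major order) instead of applying each flag to a whole column. The sketch below works on a plain data.frame copy of the example; `is_one`, `is_logicol`, `col_mask`, `recycled` and `aligned` are names introduced here for illustration.

```r
# Sketch: a length-ncol mask is recycled cell by cell, not column by column.
df <- data.frame(
  id      = c(1, 2, 3, 4, 5),
  age     = c("5", "1", "11", "20", "21"),
  apple   = c("0", "1", NA, "1", "0"),
  pear    = c("1", "1", "1", "0", "0"),
  banana  = c("0", "1", "1", NA, "1"),
  favfood = c("i love pear juice",
              "i eat chinese pears and crab apples every sunday",
              "i also like apple tart",
              "i like crab apple juice",
              "i hate most fruit except bananas"),
  stringsAsFactors = FALSE
)
logicols <- c("apple", "pear", "banana")

is_one     <- df == "1"                       # 5 x 6 logical matrix
is_logicol <- colnames(df) %in% logicols      # length-6 vector, one flag per column

# Misaligned: the 6 flags are recycled over the 30 cells in column-major order.
recycled <- which(is_one & is_logicol, arr.ind = TRUE)

# Aligned: expand the flags so that every row of the mask carries the column flags.
col_mask <- matrix(is_logicol, nrow = nrow(df), ncol = ncol(df), byrow = TRUE)
aligned  <- which(is_one & col_mask, arr.ind = TRUE)

print(recycled)  # only some of the 1s in pear and banana, none in apple
print(aligned)   # every 1 in apple, pear and banana
```

With the misaligned mask, only the cells in rows 1 and 2 of pear and rows 2 and 3 of banana are selected, which matches the partially converted output shown in the question.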

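1 Answer:

Answer 0: (score: 1)

Thanks to @Frank for pointing out that the logic/binary columns should have been converted to the proper class with as.logical().

This greatly simplifies identifying the values to change, and the indexing now seems to work as well: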

library(data.table)

# Starting with the data in its original format:
id <- c(1,2,3,4,5)
age <- c("5", "1", "11", "20", "21")
apple <- c("0", "1", NA, "1", "0")
pear <- c("1", "1", "1", "0", "0")
banana <- c("0", "1", "1", NA, "1")
favfood <- c("i love pear juice", "i eat chinese pears and crab apples every sunday", "i also like apple tart", "i like crab apple juice", "i hate most fruit except bananas" )

df <- as.data.frame(cbind(id, age, apple, pear, banana, favfood), stringsAsFactors=FALSE)

# Convert the "0" / "1" character columns to logical with a function:

recode.multi <- function(data, recode.cols, old.var, new.var, format = as.numeric){
  # Recode values in multiple columns of a data.frame.
  #
  # Args:        data: a data.frame
  #       recode.cols: a character vector containing the names of the
  #                    columns to recode
  #           old.var: a character vector containing the values to be recoded
  #           new.var: a vector containing the desired recoded values
  #            format: a function describing the desired format, e.g.
  #                    as.character, as.numeric, as.factor, etc.

  # check that old.var and new.var are of equal length
  if(length(old.var) != length(new.var)){
    stop("'old.var' and 'new.var' are of differing lengths")
  }

  # convert the selected columns to character
  if(length(recode.cols) == 1){
    data[, recode.cols] = as.character(data[, recode.cols])
  } else {
    data[, recode.cols] = data.frame(lapply(data[, recode.cols], as.character), stringsAsFactors=FALSE)
  }

  # recode old values to new values in the selected columns
  for(i in seq_along(old.var)){
    data[, recode.cols][data[, recode.cols] == old.var[i]] = new.var[i]
  }

  # convert the recoded columns to the desired format
  data[, recode.cols] = sapply(data[, recode.cols], format)

  data
}

df = recode.multi(data = df, recode.cols = c("apple", "pear", "banana"), old.var = c("0", "1", NA), new.var = c(FALSE, TRUE, NA), format = as.logical)

dt <- data.table(df)
dt[, id := as.numeric(id)]

# Identify the values to swap with column names:
convtoname <- which(dt==TRUE, arr.ind=T)

# Make the swap:
dt[convtoname, ]<-colnames(dt)[convtoname[,2]]

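This gives the desired result for the fruit columns. Note, however, that the id in row 1 has also been replaced by "id": the numeric 1 compares equal to TRUE, so dt==TRUE matches that cell as well: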

> dt
   id age apple  pear banana                                          favfood
1: id   5 FALSE  pear  FALSE                                i love pear juice
2:  2   1 apple  pear banana i eat chinese pears and crab apples every sunday
3:  3  11    NA  pear banana                           i also like apple tart
4:  4  20 apple FALSE     NA                          i like crab apple juice
5:  5  21 FALSE FALSE banana                 i hate most fruit except bananas
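One possible way to restrict the swap to the binary columns only, so that columns such as id and age are never touched, is to detect those columns by the rule stated in the question (character columns holding only "0", "1" or NA) and then update them one at a time with data.table's `set()`. This is a sketch under those assumptions rather than the approach used in the answer above; `is_binary` is a helper name introduced here, and the sketch skips the logical conversion, replacing "1" directly and leaving "0" and NA as they are. If logical columns are wanted as an intermediate step, note that `as.logical("1")` returns NA in R, so the 0/1 strings would need to go through `as.integer()` first.

```r
library(data.table)

dt <- data.table(
  id      = c(1, 2, 3, 4, 5),
  age     = c("5", "1", "11", "20", "21"),
  apple   = c("0", "1", NA, "1", "0"),
  pear    = c("1", "1", "1", "0", "0"),
  banana  = c("0", "1", "1", NA, "1"),
  favfood = c("i love pear juice",
              "i eat chinese pears and crab apples every sunday",
              "i also like apple tart",
              "i like crab apple juice",
              "i hate most fruit except bananas")
)

# Hypothetical helper: TRUE for character columns that hold only "0", "1" or NA.
is_binary <- function(x) is.character(x) && all(x %in% c("0", "1", NA))

# Names of the binary columns, detected without listing them by hand.
logicols <- names(dt)[vapply(dt, is_binary, logical(1))]

# Replace each "1" with its column name, one column at a time.
# which() drops NA comparisons, so missing values are left untouched.
for (col in logicols) {
  set(dt, i = which(dt[[col]] == "1"), j = col, value = col)
}

print(dt)
```

Because the loop only ever writes to the columns in `logicols`, a 1 in id or age cannot be picked up by accident.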