Question

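I am working with a data table of about 500 columns and several thousand rows. Each column represents an item that may or may not appear in a string. For example, my data looks like this: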

   String                Item 1 Item 2  Item 3 Item 4
1  "Item 1Item 2Item 4"    0     0        0     0
2  "Item 1Item 2"          0     0        0     0      
3  "Item 3"                0     0        0     0

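I have already parsed the strings into items, and the resulting items are stored in a list (so for the first observation above, list element 1 would contain the elements "Item 1", "Item 2", and "Item 4").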

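I am trying to change the column values programmatically by using each list element as the column names for that row and then assigning 1 to those columns. For example, I can construct a simple for loop that does what I want: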

for (i in 1:nrow(data)){
   data[i, eval(unlist(listofitems[[i]])) := 1]
}

This returns:

  String                Item 1 Item 2  Item 3 Item 4
1  "Item 1Item 2Item 4"    1     1        0     1
2  "Item 1Item 2"          1     1        0     0      
3  "Item 3"                0     0        1     0

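However, given the size of the data and how often this kind of situation comes up (that is, how often I want to perform a row-wise operation on a data table that assigns by reference to columns that are themselves looked up per row), I am hoping there is a more "data.table-y" way to get to the final answer.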

Answer 1

We can use cSplit_e from the splitstackshape package:

library(splitstackshape)
out <- cSplit_e(data[1], 'String', type = 'character', sep=":", fill = 0)
names(out)[-1] <- sub("String_", "", names(out)[-1])
out
#              String Item 1 Item 3 Item2 Item4
#1 Item 1:Item2:Item4      1      0     1     1
#2       Item 1:Item2      1      0     1     0
#3             Item 3      0      1     0     0

Data

data <- structure(list(String = c("Item 1:Item2:Item4", "Item 1:Item2", 
"Item 3"), Item1 = c(0L, 0L, 0L), Item2 = c(0L, 0L, 0L), Item3 = c(0L, 
0L, 0L), Item4 = c(0L, 0L, 0L)), class = "data.frame", row.names = c("1", 
"2", "3"))
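
For readers who want to see what the indicator expansion amounts to without the package, here is a minimal base R sketch. It is not how cSplit_e is implemented; it only reproduces the same kind of 0/1 columns, and the names `parsed`, `items`, and `indicators` are made up for illustration.

```r
# Toy input mirroring the answer's data (only the String column is needed).
data <- data.frame(String = c("Item 1:Item2:Item4", "Item 1:Item2", "Item 3"),
                   stringsAsFactors = FALSE)

# Split each string on ":" to get one character vector of items per row.
parsed <- strsplit(data$String, split = ":", fixed = TRUE)

# The distinct items become the indicator column names.
items <- sort(unique(unlist(parsed)))

# For each item, flag the rows whose parsed vector contains it.
# The result is an integer matrix with one column per item.
indicators <- vapply(
  items,
  function(it) vapply(parsed, function(p) as.integer(it %in% p), integer(1)),
  integer(length(parsed))
)

# Bind the indicator columns next to the original string column.
out <- cbind(data, indicators)
```

The column order depends on how `sort()` collates the item names in the current locale, so it may differ from the output shown above.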

Answer 2

One option is to assign using numeric matrix indexing:

cols <- setdiff(names(DT), c("String", "ParsedString"))
DT[, (cols) := {
    m <- cbind(rep(1L:.N, lengths(ParsedString)), 
        match(unlist(ParsedString), names(.SD)))
    ans <- as.matrix(.SD)
    ans[m] <- 1L
    as.data.table(ans)
}, .SDcols=cols]
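
The indexing step in the code above may be easier to follow in isolation. The sketch below applies the same idea to a plain matrix, using the toy values from the question; `parsed` is a stand-in for the ParsedString column.

```r
# A 3 x 4 matrix of zeros standing in for the item columns.
ans <- matrix(0L, nrow = 3, ncol = 4,
              dimnames = list(NULL, c("Item 1", "Item 2", "Item 3", "Item 4")))

# Parsed items per row, as in the question.
parsed <- list(c("Item 1", "Item 2", "Item 4"), c("Item 1", "Item 2"), "Item 3")

# Column 1: the row number, repeated once per item found in that row.
# Column 2: the position of each item among the column names.
m <- cbind(rep(seq_along(parsed), lengths(parsed)),
           match(unlist(parsed), colnames(ans)))

# Indexing with a two-column matrix addresses one cell per row of m,
# so all the required cells are set to 1 in a single assignment.
ans[m] <- 1L
```

An item that does not match any column name would produce an NA in `m`, so it is worth checking `anyNA(m)` before assigning on real data.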

Data:

library(data.table)
DT <- fread('String,Item 1,Item 2,Item 3,Item 4
"Item 1:Item 2:Item 4",0,0,0,0
"Item 1:Item 2",0,0,0,0      
"Item 3",0,0,0,0')
DT[, ParsedString := strsplit(String, split=":")]
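
As a possible alternative that is not part of either answer, the loop can be turned around: instead of one assignment per row, group the row numbers by item and call data.table's `set()` once per column. This is only a sketch under the assumption that every parsed item matches an existing column name; `rows_by_item` is a name chosen here for illustration.

```r
library(data.table)

# Toy table matching the question's layout.
DT <- data.table(String = c("Item 1:Item 2:Item 4", "Item 1:Item 2", "Item 3"),
                 `Item 1` = 0L, `Item 2` = 0L, `Item 3` = 0L, `Item 4` = 0L)

# One character vector of item names per row.
parsed <- strsplit(DT$String, split = ":", fixed = TRUE)

# For each distinct item, collect the row numbers in which it appears.
rows_by_item <- split(rep(seq_along(parsed), lengths(parsed)), unlist(parsed))

# One by-reference assignment per item column rather than one per row.
for (it in names(rows_by_item)) {
  set(DT, i = rows_by_item[[it]], j = it, value = 1L)
}
```

With roughly 500 columns and thousands of rows, this makes at most one `set()` call per column, which is likely to be fewer calls than a row-wise loop, though the actual gain would need to be measured on the real data.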

Is there a way to assign values to data.table columns that are selected programmatically per row?
