I want to create a binary matrix from a column of comma-separated strings.
dt = data.table(id = c('id1','id2','id3','id4','id5','id6'), sample = c("MER-1,MER-3,MER-4","MER-5","MER-2","MER-2,MER-3,MER-4,MER-5","MER_3","MER-5" ))
dt
id sample
1: id1 MER-1,MER-3,MER-4
2: id2 MER-5
3: id3 MER-2
4: id4 MER-2,MER-3,MER-4,MER-5
5: id5 MER_3
6: id6 MER-5
This should result in something like:
m_count = matrix(c(1,0,1,1,0, 0,0,0,0,1, 0,1,0,0,0, 0,1,1,1,1, 0,0,1,0,0, 0,0,0,0,1), nrow = 6, ncol = 5, byrow = TRUE, dimnames = list(paste0("id", 1:6), paste0("MER-", 1:5)))
m_count
    MER-1 MER-2 MER-3 MER-4 MER-5
id1     1     0     1     1     0
id2     0     0     0     0     1
id3     0     1     0     0     0
id4     0     1     1     1     1
id5     0     0     1     0     0
id6     0     0     0     0     1
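I could loop over every element of the list and fill in the matrix, but given the size of the table that is really slow. Is there a faster or more elegant way? Perhaps with dplyr / tidyverse? Thanks!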
Answer 0 (score: 4)
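Using the dt defined at the end of this answer, which fixes the typo in the question ("MER_3" instead of "MER-3"), use separate_rows to expand the data to one row per value, and then use table to compute the counts.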
library(data.table)
library(dplyr)
library(tidyr)
dt %>%
separate_rows(sample, sep = ",") %>%
table
giving:
sample
id MER-1 MER-2 MER-3 MER-4 MER-5
id1 1 0 1 1 0
id2 0 0 0 0 1
id3 0 1 0 0 0
id4 0 1 1 1 1
id5 0 0 1 0 0
id6 0 0 0 0 1
Note (input with the typo corrected):
library(data.table)
dt <- data.table(id = c('id1','id2','id3','id4','id5','id6'),
sample = c("MER-1,MER-3,MER-4","MER-5","MER-2","MER-2,MER-3,MER-4,MER-5","MER-3","MER-5" ))
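The table call above returns counts rather than a plain matrix. If a strict 0/1 matrix is wanted (for example, if a value could appear twice in one string), the counts can be clamped afterwards. The following is only a minimal base-R sketch of that idea, not part of the original answer; it assumes the corrected data and uses a plain data.frame named df (a hypothetical name) in place of dt.

```r
# Corrected input, as a plain data.frame (assumption: data.table is not needed here)
df <- data.frame(
  id = c("id1", "id2", "id3", "id4", "id5", "id6"),
  sample = c("MER-1,MER-3,MER-4", "MER-5", "MER-2",
             "MER-2,MER-3,MER-4,MER-5", "MER-3", "MER-5"),
  stringsAsFactors = FALSE
)

# Split each string on commas and repeat the id once per resulting value
parts <- strsplit(df$sample, ",", fixed = TRUE)
long <- data.frame(id = rep(df$id, lengths(parts)), sample = unlist(parts))

# Count id/value pairs, then clamp the counts to 0/1 in an ordinary integer matrix
tab <- table(long)
m <- matrix(as.integer(tab > 0), nrow = nrow(tab), dimnames = dimnames(tab))
```

Here m keeps the row and column labels of the table, so entries can be looked up by name, e.g. m["id4", "MER-2"].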
Answer 1 (score: 4)
You can use strsplit:
table(dt[,unlist(strsplit(sample,",")),by=id])
V1
id MER-1 MER-2 MER-3 MER-4 MER-5 MER_3
id1 1 0 1 1 0 0
id2 0 0 0 0 1 0
id3 0 1 0 0 0 0
id4 0 1 1 1 1 0
id5 0 0 0 0 0 1
id6 0 0 0 0 1 0
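The extra MER_3 column in the output above comes from the typo in the question's data. One way to avoid it is to normalize the strings before splitting. This is only a sketch in base R under the assumption that every underscore is meant to be a hyphen; the vector names ids, sample_raw and cleaned are made up for illustration.

```r
# The question's original data, including the "MER_3" typo
ids <- paste0("id", 1:6)
sample_raw <- c("MER-1,MER-3,MER-4", "MER-5", "MER-2",
                "MER-2,MER-3,MER-4,MER-5", "MER_3", "MER-5")

# Assumption: an underscore is always a mistyped hyphen
cleaned <- gsub("_", "-", sample_raw, fixed = TRUE)

# Split on commas and tabulate id against value
parts <- strsplit(cleaned, ",", fixed = TRUE)
tab <- table(id = rep(ids, lengths(parts)), sample = unlist(parts))
```

The same gsub call could presumably be applied to the sample column of dt before the strsplit step shown in this answer.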
Answer 2 (score: 3)
You can also use the splitstackshape library:
library(splitstackshape)
table(cSplit(dt, "sample", sep = ",", direction = "long"))
sample
id MER-1 MER-2 MER-3 MER-4 MER-5
id1 1 0 1 1 0
id2 0 0 0 0 1
id3 0 1 0 0 0
id4 0 1 1 1 1
id5 0 0 1 0 0
id6 0 0 0 0 1
Or use cSplit_e, which was created specifically for this scenario (courtesy of @A5C1D2H2I1M1N2O1R2T1):
cSplit_e(dt, "sample", sep = ",", type = "character", fill = 0, drop = TRUE)