I have a matrix in this format:
set.seed(1)
mat <- matrix(round(runif(25,0,1)),nrow=5,ncol=5)
colnames(mat) <- c("a1::C","a1::A","a1::B","b1::D","b1::A")
a1::C a1::A a1::B b1::D b1::A
[1,] 0 1 0 0 1
[2,] 0 1 0 1 0
[3,] 1 1 1 1 1
[4,] 1 1 0 0 0
[5,] 0 0 1 1 0
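In short, each column is a combination of a subject and a feature (encoded in the column name, where the two are separated by ::). In each row, a value of 1 means the subject has that feature and 0 means it does not. A subject may have 0 in all of its columns in a given row.
I want to build a new matrix whose columns are the subjects (one column per subject) and whose rows list the features that subject has, sorted alphabetically and separated by commas. If a subject has no features in a row (i.e., all of its columns are 0 in that row), the value "W" should be used (no feature is itself named "W").
Here is what the new matrix would look like, based on mat: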
cnames = unique(sapply(colnames(mat), function(x) strsplit(x,split="::")[[1]][1]))
new_mat <- matrix(c("A","A","A,B,C","A,C","B",
"A","D","A,D","W","D"),
nrow=nrow(mat),ncol=length(cnames))
colnames(new_mat) = cnames
a1 b1
[1,] "A" "A"
[2,] "A" "D"
[3,] "A,B,C" "A,D"
[4,] "A,C" "W"
[5,] "B" "D"
Any idea what would be an efficient and elegant way to achieve this?
Answer 0: (score: 4)
Step 1: reorder the matrix columns
mat <- mat[, order(colnames(mat))]
# a1::A a1::B a1::C b1::A b1::D
# [1,] 1 0 0 1 0
# [2,] 1 0 0 0 1
# [3,] 1 1 1 1 1
# [4,] 1 0 1 0 0
# [5,] 0 1 0 0 1
Step 2.1: decompose the column names
## decompose levels, get main levels (before ::) and sub levels (post ::)
decom <- strsplit(colnames(mat), "::")
main_levels <- sapply(decom, "[", 1)
# [1] "a1" "a1" "a1" "b1" "b1"
sub_levels <- sapply(decom, "[", 2)
# [1] "A" "B" "C" "A" "D"
Step 2.2: generate the grouping indices
## generating grouping index
main_index <- paste(rep(main_levels, each = nrow(mat)), rep(1:nrow(mat), times = ncol(mat)), sep = "#")
sub_index <- rep(sub_levels, each = nrow(mat))
sub_index[!as.logical(mat)] <- "" ## 0 values in mat implies ""
## if unclear about what "main_index" and "sub_index" are, check:
## matrix(main_index, nrow(mat))
# [,1] [,2] [,3] [,4] [,5]
# [1,] "a1#1" "a1#1" "a1#1" "b1#1" "b1#1"
# [2,] "a1#2" "a1#2" "a1#2" "b1#2" "b1#2"
# [3,] "a1#3" "a1#3" "a1#3" "b1#3" "b1#3"
# [4,] "a1#4" "a1#4" "a1#4" "b1#4" "b1#4"
# [5,] "a1#5" "a1#5" "a1#5" "b1#5" "b1#5"
## matrix(sub_index, nrow(mat))
# [,1] [,2] [,3] [,4] [,5]
# [1,] "A" "" "" "A" ""
# [2,] "A" "" "" "" "D"
# [3,] "A" "B" "C" "A" "D"
# [4,] "A" "" "C" "" ""
# [5,] "" "B" "" "" "D"
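The rep(..., each = nrow(mat)) and rep(..., times = ncol(mat)) calls line up with the matrix because R stores matrices column by column. A minimal sketch of that alignment on a toy matrix (m, col_of and row_of are illustrative names, not part of the answer):

```r
# Toy matrix, not the one from the question.
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("x", "y", "z")))

flat   <- as.vector(m)                             # column-major: 1 2 3 4 5 6
col_of <- rep(colnames(m), each = nrow(m))         # column name of each flattened cell
row_of <- rep(seq_len(nrow(m)), times = ncol(m))   # row number of each flattened cell

# The same labels can be obtained from the built-in col() and row() helpers.
stopifnot(identical(col_of, colnames(m)[col(m)]))
stopifnot(identical(row_of, as.vector(row(m))))
```

This is why sub_index can be blanked out with the logical version of mat directly: both vectors walk the cells in the same order.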
Step 2.3: conditional collapsed paste
## collapsed paste of "sub_index" conditional on "main_index"
x <- unname(tapply(sub_index, main_index, paste0, collapse = ""))
x[x == ""] <- "W"
# [1] "A" "A" "ABC" "AC" "B" "A" "D" "AD" "W" "D"
Step 3: post-processing
I am not happy with this step, but I have not found an alternative.
x <- sapply(strsplit(x, ""), paste0, collapse = ",")
# [1] "A" "A" "A,B,C" "A,C" "B" "A" "D" "A,D" "W" "D"
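The split-and-rejoin step above works character by character, so it relies on every sub level being a single letter. One possible way around it, offered as a sketch rather than as part of the original answer, is to drop the empty strings and insert the commas inside the function passed to tapply. The helper name join_features is hypothetical.

```r
set.seed(1)
mat <- matrix(round(runif(25, 0, 1)), nrow = 5, ncol = 5)
colnames(mat) <- c("a1::C", "a1::A", "a1::B", "b1::D", "b1::A")
mat <- mat[, order(colnames(mat))]

decom <- strsplit(colnames(mat), "::", fixed = TRUE)
main_levels <- sapply(decom, "[", 1)
sub_levels <- sapply(decom, "[", 2)

main_index <- paste(rep(main_levels, each = nrow(mat)),
                    rep(seq_len(nrow(mat)), times = ncol(mat)),
                    sep = "#")
sub_index <- rep(sub_levels, each = nrow(mat))
sub_index[!as.logical(mat)] <- ""

# Hypothetical helper: keep only the non-empty labels of one group,
# join them with commas, and fall back to "W" when none are left.
join_features <- function(s) {
  s <- s[nzchar(s)]
  if (length(s) == 0) "W" else paste(s, collapse = ",")
}

x <- unname(tapply(sub_index, main_index, join_features))
```

With this variant the result of tapply already has the comma-separated form, so it can go straight into the matrix() call of the next step.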
Step 4: matrix
x <- matrix(x, nrow = nrow(mat))
colnames(x) <- unique(main_levels)
# a1 b1
# [1,] "A" "A"
# [2,] "A" "D"
# [3,] "A,B,C" "A,D"
# [4,] "A,C" "W"
# [5,] "B" "D"
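Efficiency considerations
The approach is vectorized, so it is reasonably efficient on its own, and it does not require entering the grouping information by hand. For example, the same code can be used when you have hundreds of main groups (before ::) and hundreds of sub groups (after ::).
The only remaining consideration is to reduce unnecessary memory copies. To that end, we should pass intermediate results along anonymously where possible, instead of the explicit assignments (such as the reordered matrix) shown above. The following works (tested):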
decom <- strsplit(sort(colnames(mat)), "::")
main_levels <- sapply(decom, "[", 1)
sub_index <- rep(sapply(decom, "[", 2), each = nrow(mat))
sub_index[!as.logical(mat[, order(colnames(mat))])] <- ""
x <- unname(tapply(sub_index,
                   paste(rep(main_levels, each = nrow(mat)),
                         rep(1:nrow(mat), times = ncol(mat)),
                         sep = "#"),
                   paste0, collapse = ""))
x[x == ""] <- "W"  ## same replacement as in step 2.3; without it empty groups stay ""
x <- matrix(sapply(strsplit(x, ""), paste0, collapse = ","),
            nrow = nrow(mat))
colnames(x) <- unique(main_levels)
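One caveat with string keys such as a1#1: tapply orders its groups by sorted factor levels, and as strings a1#10 sorts before a1#2, so with ten or more rows the output order no longer matches the fill order of matrix(). A possible workaround, sketched here under my own naming (features_by_subject is a hypothetical helper, not part of the original answer), is to group by two separate indices, the integer row number and the subject, so that tapply returns the matrix directly:

```r
# Hypothetical helper, not part of the original answer.
features_by_subject <- function(mat, sep = "::", empty = "W") {
  parts   <- strsplit(colnames(mat), sep, fixed = TRUE)
  subject <- sapply(parts, "[", 1)
  feature <- sapply(parts, "[", 2)

  # One entry per matrix cell, in column-major order.
  row_id     <- as.vector(row(mat))
  subject_of <- factor(subject[col(mat)], levels = unique(subject))
  feature_of <- feature[col(mat)]
  feature_of[mat == 0] <- NA          # mark features the subject lacks

  join <- function(s) {
    s <- sort(s[!is.na(s)])           # alphabetical order within a cell
    if (length(s) == 0) empty else paste(s, collapse = ",")
  }

  # Two grouping indices: rows (sorted numerically) by subjects.
  res <- tapply(feature_of, list(row_id, subject_of), join)
  rownames(res) <- NULL
  res
}

set.seed(1)
mat <- matrix(round(runif(25, 0, 1)), nrow = 5, ncol = 5)
colnames(mat) <- c("a1::C", "a1::A", "a1::B", "b1::D", "b1::A")
new_mat <- features_by_subject(mat)
```

Because the features are sorted inside join, this version does not need the columns to be reordered first, and it does not assume single-character feature names.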
Answer 1: (score: 2)
This is a starting point. Depending on how many variables you have, it can become cumbersome.
library(data.table)
dt = data.table(id = seq_len(nrow(mat)), mat)
longDt <- melt(dt, id.vars = "id", measure = patterns("^a1::", "^b1::"))
longDt[, .(a1 = list(sort(c("C", "A", "B")[as.logical(value1)])),
b1 = list(sort(c("D", "A")[as.logical(value2)]))), .(id)]
id a1 b1
1: 1 A A
2: 2 A D
3: 3 A,B,C A,D
4: 4 A,C
5: 5 B D
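To avoid listing every subject and its features by hand, one option is to melt all columns at once, split the column names into subject and feature, and cast back to wide form. This is a sketch under the assumption that data.table is installed; the column names col_name, flag, subject, feature and features are my own choices, not from the answer above.

```r
library(data.table)

set.seed(1)
mat <- matrix(round(runif(25, 0, 1)), nrow = 5, ncol = 5)
colnames(mat) <- c("a1::C", "a1::A", "a1::B", "b1::D", "b1::A")

dt <- data.table(id = seq_len(nrow(mat)), mat)

# Long form: one row per (id, original column).
long <- melt(dt, id.vars = "id", variable.name = "col_name", value.name = "flag")

# Split "a1::C" into subject "a1" and feature "C".
long[, c("subject", "feature") := tstrsplit(as.character(col_name), "::", fixed = TRUE)]

# Collapse the features present for each (id, subject); use "W" when there are none.
res <- long[, .(features = if (any(flag == 1))
                             paste(sort(feature[flag == 1]), collapse = ",")
                           else "W"),
            by = .(id, subject)]

# Back to wide form: one column per subject.
wide <- dcast(res, id ~ subject, value.var = "features")
```

Here wide is a data.table with an id column followed by one character column per subject; as.matrix(wide[, -"id"]) would give a plain character matrix if that is what is needed.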