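I have a large dataset with more than 20 columns and over 2000 rows. I want to know how many times different variables occur together. It would also be nice to make a heat map (either a co-occurrence heat map or a correlation heat map). However, I'm not sure whether this can be done with dummy/binary variables. Any tips?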
I need to convert this example dataset (x)
A B C D E F
1 0 1 1 1 1 0
2 0 1 1 0 0 1
3 1 0 0 0 1 0
4 0 0 1 1 1 1
5 0 0 1 1 0 0
into something like this:
A B C D E F
A 0 0 0 0 1 0
B 0 0 2 1 1 1
C 0 2 0 3 2 2
D 0 1 3 0 2 1
E 1 1 2 2 0 1
F 0 1 2 1 1 0
Answer 0 (score: 2)
Given a matrix X, we have
(A <- t(X) %*% X)
# A B C D E F
# A 1 0 0 0 1 0
# B 0 2 2 1 1 1
# C 0 2 4 3 2 2
# D 0 1 3 3 2 1
# E 1 1 2 2 3 1
# F 0 1 2 1 1 2
If you want the diagonal to contain zeros, add diag(A) <- 0. The result can then be plotted with, for example,
heatmap(A, Rowv = NA, Colv = NA)
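For a self-contained version of the above, here is a minimal sketch. It assumes the example data from the question is typed in by hand as a numeric matrix X (how X is built is my assumption; with a data frame, as.matrix() would be needed first). crossprod(X) is base R's equivalent of t(X) %*% X, and scale = "none" stops heatmap() from rescaling each row, so the colours reflect the raw counts.

```r
# Example data from the question as a 5 x 6 binary matrix.
# (Constructing X this way is an assumption; adapt it to your own data.)
X <- matrix(c(0, 1, 1, 1, 1, 0,
              0, 1, 1, 0, 0, 1,
              1, 0, 0, 0, 1, 0,
              0, 0, 1, 1, 1, 1,
              0, 0, 1, 1, 0, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(as.character(1:5), LETTERS[1:6]))

# crossprod(X) computes t(X) %*% X: entry [i, j] counts the rows
# in which columns i and j are both 1.
A <- crossprod(X)
diag(A) <- 0   # drop each variable's own count

A
heatmap(A, Rowv = NA, Colv = NA, scale = "none")
```

Because this is a single matrix product, it should scale comfortably to 20+ columns and a few thousand rows.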
Answer 1 (score: 2)
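# A is the example data frame (see the dput output at the end).
# For every pair of columns, count the rows where both are 1.
temp = sapply(colnames(A), function(x)
    sapply(colnames(A), function(y)
        sum(rowSums(A[, c(x, y)]) == 2)))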
diag(temp) = 0
temp
# A B C D E F
#A 0 0 0 0 1 0
#B 0 0 2 1 1 1
#C 0 2 0 3 2 2
#D 0 1 3 0 2 1
#E 1 1 2 2 0 1
#F 0 1 2 1 1 0
library(reshape2)
library(ggplot2)
# Reshape the matrix to long format: one row per (Var1, Var2) pair
df1 = melt(temp)
graphics.off()
ggplot(df1, aes(x = Var1, y = Var2, fill = value)) +
    geom_tile() +
    theme_classic()
Data
A = structure(list(A = c(0L, 0L, 1L, 0L, 0L), B = c(1L, 1L, 0L, 0L,
0L), C = c(1L, 1L, 0L, 1L, 1L), D = c(1L, 0L, 0L, 1L, 1L), E = c(1L,
0L, 1L, 1L, 0L), F = c(0L, 1L, 0L, 1L, 0L)), .Names = c("A",
"B", "C", "D", "E", "F"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
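On the correlation heat map the question also asks about: this can be done with dummy variables too. The sketch below is only an illustration. It re-enters the same example data under the hypothetical name dat so that it runs on its own. The Pearson correlation of two 0/1 columns is the phi coefficient, so cor() can be applied directly. Note that a column that is constant (all 0 or all 1) has zero variance and would give NA.

```r
# Same example data as above, re-entered so the snippet is self-contained.
# (The name `dat` is hypothetical and used only for this illustration.)
dat <- data.frame(A = c(0, 0, 1, 0, 0),
                  B = c(1, 1, 0, 0, 0),
                  C = c(1, 1, 0, 1, 1),
                  D = c(1, 0, 0, 1, 1),
                  E = c(1, 0, 1, 1, 0),
                  F = c(0, 1, 0, 1, 0))

# Pearson correlation between binary columns (the phi coefficient)
phi <- cor(dat)

# symm = TRUE treats the matrix as symmetric; scale = "none" keeps the
# raw correlations instead of rescaling each row
heatmap(phi, Rowv = NA, Colv = NA, scale = "none", symm = TRUE)
```

Unlike the raw co-occurrence counts, the correlations also reflect how often two variables are both 0, so the two heat maps answer slightly different questions.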