I have data like this:
dat <- data.frame(
music = c("classical", "jazz", "baroque", "electronic", "ambient"),
john = c(1,1,0,1,1),
jeff = c(1,0,0,1,0),
jane = c(0,1,1,0,0)
)
       music john jeff jane
1  classical    1    1    0
2       jazz    1    0    1
3    baroque    0    0    1
4 electronic    1    1    0
5    ambient    1    0    0
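and I want to plot a graph of the overlap between the people in the columns: how many times do two people both have a 1 in the same row? If I could get the data into a data.frame like this: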
result <- data.frame(person1 = c("john", "john", "jeff"), person2 = c("jeff", "jane", "jane"), overlap = c(2, 1, 0))
  person1 person2 overlap
1    john    jeff       2
2    john    jane       1
3    jeff    jane       0
I could create the graph I have in mind:
library(igraph)
g <- graph_from_data_frame(result, directed = FALSE)
plot(g, edge.width = result$overlap * 3)
But I am struggling to transform the data so that it counts the row-wise overlap between each pair of columns. How can I do this?
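Answer 0 (score: 4)
A possibly simpler approach is to build the graph's adjacency matrix by taking the cross product of the 0/1 columns. You can then read that matrix directly into igraph.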
library(igraph)
# Take the crossproduct: assumes unique music types in each row
# otherwise aggregate terms
m <- crossprod(as.matrix(dat[-1]))
# You could remove the diagonal terms here
# although it is useful to see the sum for each individual
# You can also remove it in igraph, as below
# diag(m) <- 0
# Create graph
# The weights are stored in E(g)$weight
g <- graph_from_adjacency_matrix(m, mode="undirected", weighted = TRUE)
# Remove edge loops
g <- simplify(g)
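If the long person1/person2/overlap table from the question is still wanted, it can be read off the upper triangle of the cross-product matrix. The following is a minimal base R sketch under that assumption; the variable names `idx` and `result` are illustrative and not part of the original answer.

```r
# Same example data as in the question
dat <- data.frame(
  music = c("classical", "jazz", "baroque", "electronic", "ambient"),
  john = c(1, 1, 0, 1, 1),
  jeff = c(1, 0, 0, 1, 0),
  jane = c(0, 1, 1, 0, 0)
)

# Pairwise co-occurrence counts; the diagonal holds each person's own total
m <- crossprod(as.matrix(dat[-1]))

# Row/column positions of the upper triangle: each unordered pair exactly once
idx <- which(upper.tri(m), arr.ind = TRUE)

# Long format: one row per pair of people
result <- data.frame(
  person1 = rownames(m)[idx[, "row"]],
  person2 = colnames(m)[idx[, "col"]],
  overlap = m[idx],
  row.names = NULL
)
result
```

For the example data this gives 2 for john/jeff, 1 for john/jane and 0 for jeff/jane. Note that the weighted adjacency route drops zero-weight pairs such as jeff/jane as edges, whereas this table keeps them.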
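Answer 1 (score: 2)
Maybe you want to try different similarity/distance measures, such as Russel/Rao, Jaccard, etc. I mean: a 0 paired with a 0 could also be interpreted as a similarity. Anyway, here is another approach: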
library(proxy)
m <- (1-as.matrix(dist( t(dat[, -1]), method = "Russel")))*nrow(dat)
m[lower.tri(m, T)] <- NA
(res <- setNames(reshape2::melt(m, na.rm=T), c("p1", "p2", "ol")))
# p1 p2 ol
# 4 john jeff 2
# 7 john jane 1
# 8 jeff jane 0
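To illustrate how the choice of measure changes the numbers, here is a base R sketch of the Jaccard similarity computed from the same co-occurrence counts, without the proxy package. It assumes the usual definition for binary data (shared 1s divided by rows where at least one of the two has a 1); the names `a`, `n` and `jac` are illustrative only.

```r
# Same example data as in the question
dat <- data.frame(
  music = c("classical", "jazz", "baroque", "electronic", "ambient"),
  john = c(1, 1, 0, 1, 1),
  jeff = c(1, 0, 0, 1, 0),
  jane = c(0, 1, 1, 0, 0)
)

# a[i, j]: number of rows where both person i and person j have a 1
a <- crossprod(as.matrix(dat[-1]))

# n[i]: number of 1s for person i
n <- diag(a)

# Jaccard similarity: shared 1s / rows where at least one of the pair has a 1
# (a person with no 1s at all would produce NaN here)
jac <- a / (outer(n, n, "+") - a)
round(jac, 2)
```

Unlike the raw overlap count, this ratio ignores rows where both people have a 0 and scales each pair by how many 1s the two people have in total.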
Answer 2 (score: 1)
The following works for your example:
# build name matrix
nameMat <- t(combn(names(dat[,-1]), 2))
# pre-allocate count vector
overLap <- integer(nrow(nameMat))
# loop through name combos
for(i in 1:nrow(nameMat)) {
overLap[i] <- sum(rowSums(dat[, nameMat[i,]]) == 2)
}
# construct data.frame
df <- data.frame("person1"=nameMat[,1], "person2"=nameMat[,2], "overLap"=overLap)
If you don't like for loops, you can use sapply to get the overlap counts:
overLap <- sapply(1:(nrow(nameMat)),
function(i) sum(rowSums(dat[, nameMat[i,]]) == 2))
Following @user20650's suggestion, you can also use combn:
overLap <- combn(dat[-1], 2, FUN=function(i) sum(rowSums(i)==2))
A longer way to compute the overlap is the following:
overLap <- sapply(1:(nrow(nameMat)),
function(i) sum(rowSums(dat[, nameMat[i,]] == c(1,1)) == 2))
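This longer version has one advantage: it generalizes to the case where the measures are Likert-type scales (representing strength of affinity). With a 5-point scale, c(1,1) could be changed to c(3,3) to check for shared indifference, or to c(5,5). If the interest is in extreme opposing opinions, such as c(1,5), you would have to manipulate and duplicate nameMat so that each pair appears in both orders: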
newNameMat <- rbind(nameMat, cbind(nameMat[,2], nameMat[,1]))
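and perform the computation on this matrix. It would not be too hard to wrap these operations into a function that can count arbitrary combinations of Likert-scale values.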
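One way such a wrapper might look is sketched below. The function name `count_pairs` and the `likert` data are hypothetical and made up for illustration. One caveat: as far as I can tell, comparing a two-column data.frame with `== c(1, 5)` recycles the vector down the rows rather than matching one value to each column, so this sketch compares each column explicitly instead.

```r
# Hypothetical helper (not from the original answer): for each pair of columns,
# count the rows where the first column equals vals[1] and the second vals[2].
count_pairs <- function(scores, vals) {
  pairs <- t(combn(names(scores), 2))
  # For an asymmetric target such as c(1, 5), order matters,
  # so also include every pair in reversed order
  if (vals[1] != vals[2]) {
    pairs <- rbind(pairs, pairs[, 2:1, drop = FALSE])
  }
  n <- vapply(seq_len(nrow(pairs)), function(i) {
    sum(scores[[pairs[i, 1]]] == vals[1] & scores[[pairs[i, 2]]] == vals[2])
  }, integer(1))
  data.frame(person1 = pairs[, 1], person2 = pairs[, 2], count = n)
}

# Made-up 5-point ratings, for illustration only
likert <- data.frame(
  john = c(1, 5, 3, 1, 5),
  jeff = c(5, 5, 3, 5, 1),
  jane = c(1, 1, 3, 5, 5)
)

same_mid  <- count_pairs(likert, c(3, 3))  # both indifferent
opposites <- count_pairs(likert, c(1, 5))  # person1 rates 1, person2 rates 5
opposites
```

With `vals = c(1, 1)` on the 0/1 data from the question, the same function reproduces the plain overlap counts.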