Question

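I am facing a challenge I cannot solve. I have a list of observations x_i (large, about 30k) and a list of observations y_j (also large). x_i and y_j are IDs of the same kind of unit (firms, for example).

I have a two-column data frame linking x_i and y_j: if two IDs appear on the same row, they are connected. I want to turn this network into a large M x M matrix, where M = length(unique(union(x, y))), with the value 1 wherever two firms are connected.

Here is a small example: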

x1 x2
x3 x6
x4 x5
x1 x5

I would like to get this matrix:

0 1 0 0 1 0
0 0 0 0 0 0
0 0 0 0 0 1
0 0 0 0 1 0
0 0 0 0 0 0
0 0 0 0 0 0

Right now the only solution I can think of is a double loop combined with a lookup in the original data frame:

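# Sorted vector of every distinct ID that appears in either column
list_firm <- sort(unique(c(as.character(df$col1), as.character(df$col2))))
# Named square matrix so that M[i, j] can be indexed by ID
M <- matrix(0L, nrow = length(list_firm), ncol = length(list_firm),
            dimnames = list(list_firm, list_firm))

for (i in list_firm) {
    for (j in list_firm) {
        # which() returns integer(0), never NULL, so test with any() instead
        M[i, j] <- as.integer(any(df$col1 == i & df$col2 == j))
    }
}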

df is the two-column data frame. This obviously takes far too long.

Any suggestions? They would be much appreciated.

Answer 1

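We convert both columns to factor, with the levels set to the sorted unique elements of the two columns combined, and then use table to get the frequencies. Because both factors share the same levels, the resulting table is square.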

lvls <- sort(unique(unlist(df)))
df[] <- lapply(df, factor, levels = lvls)
table(df)
#  col2
#col1 x1 x2 x3 x4 x5 x6
#  x1  0  1  0  0  1  0
#  x2  0  0  0  0  0  0
#  x3  0  0  0  0  0  1
#  x4  0  0  0  0  1  0
#  x5  0  0  0  0  0  0
#  x6  0  0  0  0  0  0
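With about 30k IDs, a dense table has roughly 900 million cells, and table() counts repeated rows rather than flagging them with 1. A sparse variant of the same idea might look like the sketch below. It swaps in Matrix::sparseMatrix, which is not part of the answer above, and the names pairs and adj are only illustrative.

```r
library(Matrix)  # recommended package that ships with standard R installations

df <- data.frame(col1 = c("x1", "x3", "x4", "x1"),
                 col2 = c("x2", "x6", "x5", "x5"),
                 stringsAsFactors = FALSE)

# Same level set as in the table() approach
lvls <- sort(unique(unlist(df, use.names = FALSE)))

# Drop repeated links so that every stored entry is exactly 1
pairs <- unique(df)

# Only the linked (row, column) positions are stored
adj <- sparseMatrix(i = match(pairs$col1, lvls),
                    j = match(pairs$col2, lvls),
                    x = 1,
                    dims = c(length(lvls), length(lvls)),
                    dimnames = list(lvls, lvls))
adj
```

Calling as.matrix(adj) gives an ordinary dense matrix when the number of IDs is small enough to fit in memory.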

Data

df <- structure(list(col1 = c("x1", "x3", "x4", "x1"), col2 = c("x2", 
 "x6", "x5", "x5")), class = "data.frame", row.names = c(NA, -4L
 ))

Answer 2

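The answer @akrun gave in the comments is a good one. However, this is also a good case for using a data structure other than a data frame. Essentially, what you are looking for is an adjacency matrix, a data structure used in social network analysis. For this we can use the igraph package in R.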

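library(igraph)
library(dplyr)

# tibble() replaces the deprecated data_frame()
df = tibble(source=c('x1', 'x3', 'x4', 'x1'), target=c('x2', 'x6', 'x5', 'x5'))
# directed=FALSE treats every link as mutual
g = graph_from_data_frame(df, directed=FALSE)
# as_adjacency_matrix() is the current name for get.adjacency()
output = as.matrix(as_adjacency_matrix(g))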

   x1 x3 x4 x2 x6 x5
x1  0  0  0  1  0  1
x3  0  0  0  0  1  0
x4  0  0  0  0  0  1
x2  1  0  0  0  0  0
x6  0  1  0  0  0  0
x5  1  0  1  0  0  0

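The rows and columns are not in the same order as in your example, but that is trivial to fix. Note also that, because the graph is built with directed=FALSE, the matrix is symmetric, unlike the one in your example.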
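One way to fix the ordering is to index the matrix by its sorted dimnames. The sketch below is base R only: it rebuilds the same symmetric matrix without igraph (an assumption made so the example is self-contained), and the names edges, ids, ord and sorted are illustrative.

```r
edges <- data.frame(source = c("x1", "x3", "x4", "x1"),
                    target = c("x2", "x6", "x5", "x5"),
                    stringsAsFactors = FALSE)

# IDs in order of first appearance: x1 x3 x4 x2 x6 x5
ids <- unique(c(edges$source, edges$target))

# Fill a named matrix by indexing with a two-column character matrix
output <- matrix(0L, length(ids), length(ids), dimnames = list(ids, ids))
output[cbind(edges$source, edges$target)] <- 1L
output[cbind(edges$target, edges$source)] <- 1L  # mirror each link (undirected)

# Reorder rows and columns together by sorted ID
ord <- sort(rownames(output))
sorted <- output[ord, ord]
sorted
```

Leaving out the mirrored assignment keeps only the source-to-target direction, as in the matrix from the question. sort() orders the IDs as strings, so an ID such as x10 would come before x2.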

Converting a long table of linked observations into a wide adjacency matrix
