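I have a directed bipartite graph g with 215473 vertices and 2326714 edges. When creating a bipartite.projection of g, I keep running out of memory (it was using ~35 GB of RAM before crashing).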
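I tried to follow a previous thread on nongnu.org to calculate how much memory I would need.
According to the information provided in that thread, storing a graph in memory costs (in bytes):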
(4*|E|+2*|V|) * 8 + 4*|V|
Computing the projection requires the following memory (in bytes):
16*|V| + (2*|V|+2*|E|) * 8
So for my graph g, it would cost:
((4*2326714+2*215473) * 8 + 4*215473) + (16*215473 + (2*215473+2*2326714) * 8)
= 78764308 + 44122560
= 122886868 (bytes)
= 122.886868 (MB)
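Obviously this is not correct, so I must be doing something wrong.
Can anyone help me figure out how to create a bipartite projection of my graph?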
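For reference, the arithmetic above can be reproduced with a short R sketch. The two formulas are copied from the question as quoted from the mailing-list thread and are not checked against igraph's internals; the function names `graph_bytes` and `projection_bytes` are made up for illustration.

```r
# Formulas as quoted in the question (assumption: taken at face value)
graph_bytes      <- function(V, E) (4 * E + 2 * V) * 8 + 4 * V
projection_bytes <- function(V, E) 16 * V + (2 * V + 2 * E) * 8

V <- 215473   # vertices of the bipartite graph
E <- 2326714  # edges of the bipartite graph

total <- graph_bytes(V, E) + projection_bytes(V, E)
total        # total estimate in bytes
total / 1e6  # the same estimate in MB (decimal)
```

As used here, both formulas only see |V| and |E| of the bipartite graph. The number of edges in the projected graphs never enters the estimate, which may be why it comes out so far below the observed memory use.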
Answer 0 (score: 1)
Using sparse matrices might solve your problem.
# Load tiny toy data as an edge list
df <- data.frame(
  person = c('Sam','Sam','Sam','Greg','Tom','Tom','Tom','Mary','Mary'),
  group  = c('a','b','c','a','b','c','d','b','d'),
  stringsAsFactors = FALSE)
# Transform data to a sparse matrix (rows = persons, columns = groups)
library(Matrix)
A <- spMatrix(nrow = length(unique(df$person)),
              ncol = length(unique(df$group)),
              i = as.numeric(factor(df$person)),
              j = as.numeric(factor(df$group)),
              x = rep(1, nrow(df)))  # one entry per edge
rownames(A) <- levels(factor(df$person))
colnames(A) <- levels(factor(df$group))
To do the projection you actually have multiple possibilities; here are two:
# Use tcrossprod() (the Matrix package supplies the method for sparse matrices)
Arow <- tcrossprod(A)
# Alternatively, if you want to project on the other mode:
Acol <- tcrossprod(t(A))
# Use the igraph package, which works with sparse matrices
library(igraph)
g <- graph.incidence(A)
# The command bipartite.projection does both possible projections at once
proj <- bipartite.projection(g)
#proj[[1]]
#proj[[2]]
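To see what the sparse product contains, here is a minimal, self-contained sketch on the same toy data. It uses `sparseMatrix()` from the Matrix package instead of `spMatrix()`, a substitution made here only so that the dimnames can be passed directly. The off-diagonal entries count shared groups and the diagonal counts each person's groups.

```r
library(Matrix)

person <- factor(c("Sam","Sam","Sam","Greg","Tom","Tom","Tom","Mary","Mary"))
group  <- factor(c("a","b","c","a","b","c","d","b","d"))

# Person-by-group incidence matrix, stored sparsely
A <- sparseMatrix(i = as.integer(person), j = as.integer(group), x = 1,
                  dims = c(nlevels(person), nlevels(group)),
                  dimnames = list(levels(person), levels(group)))

Arow <- tcrossprod(A)   # person-by-person co-membership counts

Arow["Mary", "Tom"]     # number of groups Mary and Tom share
diag(Arow)              # number of groups each person belongs to

# Strictly upper triangle: one non-zero entry per edge of the projection
n_edges <- nnzero(triu(Arow, k = 1))
n_edges
```

The diagonal is usually dropped before treating the result as an adjacency matrix, since it describes each person's own memberships rather than ties between persons.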
You can also read in your data with data.table and do the transformations inside the spMatrix call, which will also speed up these operations.
Update:
Here is an example with a larger graph, along with some memory benchmarks:
# Load packages
library(data.table)
library(igraph)
library(Matrix)  # provides spMatrix() and the sparse tcrossprod() method
# Scientific collaboration dataset
# Descriptives as reported on https://toreopsahl.com/datasets/#newman2001
# mode 1 elements: 16726
# mode 2 elements: 22016
# two mode ties: 58595
# one mode ties: 47594
d <- fread("http://opsahl.co.uk/tnet/datasets/Newman-Cond_mat_95-99-two_mode.txt",
stringsAsFactors=TRUE, colClasses = "factor", header=FALSE)
# Transform data to a sparse matrix (V1 and V2 are factors, so their codes serve as indices)
A <- spMatrix(nrow = length(unique(d[, V1])),
              ncol = length(unique(d[, V2])),
              i = as.numeric(d[, V1]),
              j = as.numeric(d[, V2]),
              x = rep(1, nrow(d)))  # one entry per two-mode tie
rownames(A) <- levels(d[, V1])
colnames(A) <- levels(d[, V2])
# To do the projection you actually have multiple possibilities, here are two:
# Use tcrossprod() (the Matrix package supplies the method for sparse matrices)
Arow <- tcrossprod(A)
# Alternatively, if you want to project on the other mode:
Acol <- tcrossprod(t(A))
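How large a projection gets depends on co-membership rather than on the edge count of the bipartite graph: a mode-2 node with d neighbours can link up to d(d-1)/2 pairs in the mode-1 projection. A rough sketch of that upper bound, shown on the toy data from the first example (base R only; the variable names are illustrative), may help judge feasibility before running a projection:

```r
person <- c("Sam","Sam","Sam","Greg","Tom","Tom","Tom","Mary","Mary")
group  <- c("a","b","c","a","b","c","d","b","d")

# Drop repeated person-group pairs, then count members per group
edges <- unique(data.frame(person, group))
deg   <- as.vector(table(edges$group))

# Each group with d members creates at most choose(d, 2) person-person pairs.
# Pairs that share several groups are counted more than once,
# so this is an upper bound on the number of projected edges.
max_pairs <- sum(choose(deg, 2))
max_pairs
```

On the toy data the bound is 6 while the projection has 4 edges, because Sam and Tom, and likewise Mary and Tom, share two groups each. For graphs with a few very large groups this bound grows quadratically, which the |E|-based estimate in the question cannot capture.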
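Below is an overview of how much memory is used. The sparse matrix approach works on my laptop for this network, but the approach using regular matrices does produce a memory allocation error (even after removing the Brow object from memory with rm(Brow) and then calling the garbage collector gc()).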
object.size(A) # Sparse matrix: 3108520 bytes
object.size(Arow) # 2713768 bytes
object.size(Acol) # 5542104 bytes
# For comparison
object.size(B <- as.matrix(A)) # Regular matrix: 2945783320 bytes
object.size(Brow <- tcrossprod(B)) # 2239946368 bytes
object.size(Bcol <- tcrossprod(t(B))) # Memory allocation error on my laptop
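A dense double matrix needs roughly 8 bytes per cell, so its size can be estimated from the dimensions alone before attempting `as.matrix()` or a dense `tcrossprod()`. A minimal sketch, assuming a hypothetical helper named `dense_bytes` and using the mode sizes from the dataset description above (the actual dimensions of A may differ slightly), ignoring the small overhead of the object header and dimnames:

```r
# Approximate size in bytes of a dense numeric matrix with the given dimensions
dense_bytes <- function(nrow, ncol) nrow * ncol * 8

# Square one-mode projections for the reported mode sizes
brow_bytes <- dense_bytes(16726, 16726)  # mode-1 projection
bcol_bytes <- dense_bytes(22016, 22016)  # mode-2 projection

brow_bytes / 1e9  # size in GB
bcol_bytes / 1e9  # size in GB
```

The second product alone would need close to 4 GB as a dense matrix, on top of B and Brow, which is consistent with the allocation error reported above.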