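I have a data.frame with 93 million rows and 3 numeric variables. The third variable, "component", groups rows by ID. The data is the edge list of a huge graph, and the component number marks the rows that belong to the same connected component. There are about 83 million such components.
I am now trying to split the data frame into a list of 83 million data.frames, so that I can apply some igraph functions to each component.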
This SO answer suggests that split() is the way to do this.
# library() attaches one package per call
library(dplyr)
library(data.table)
library(igraph)
# d6a: data.frame with edge A, edge B, component; 93 million rows, 83 million components, object.size = 2.4 GB
d6b <- d6a %>% split(f = d6a$component )
# This takes 7.1 hours to run and creates a 94.8 GB object
# Then try to run igraph on each element of the list
d6b %>% lapply(graph_from_data_frame,directed = TRUE) -> g6a
# The code above ran for 20 hours without finishing
Is there a faster way to do this? Is there another structure that would not grow so large?
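One direction I have been wondering about is to skip the split entirely: build a single graph from the whole edge list and let igraph find the components. Below is a minimal sketch on toy data, not something I have run at full scale. The column names `from` and `to` and the size threshold are hypothetical; `graph_from_data_frame()`, `components()` and `induced_subgraph()` are igraph functions.

```r
library(igraph)

# Toy stand-in for d6a: two endpoint columns plus the component id.
# The column names `from` and `to` are hypothetical.
edges <- data.frame(
  from      = c("a1", "a2", "a2", "a3"),
  to        = c("b1", "b2", "b3", "b4"),
  component = c(1, 2, 2, 3)
)

# Build ONE graph from the whole edge list instead of one graph per component.
# Columns after the first two become edge attributes.
g <- graph_from_data_frame(edges, directed = TRUE)

# Weakly connected components of the whole graph.
comp <- components(g, mode = "weak")

# Per-component vertex counts, without creating any sub-data.frames.
sizes <- comp$csize

# Extract subgraphs only for the components of interest,
# here (arbitrarily) those with more than two vertices.
big <- which(sizes > 2)
subgraphs <- lapply(big, function(k) {
  induced_subgraph(g, which(comp$membership == k))
})
```

The `which(comp$membership == k)` lookup scans every vertex for each component, so this only seems reasonable if the number of components kept is small compared with the 83 million total.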
Edit: following Gregor's comment, I changed the workflow:
# Selecting only the non-trivial components
# removing all 1:n or n:1 (including the 70 million 1:1)
d6a %>% group_by(component) %>%
  mutate(N_edges = n(),
         N_cpf = n_distinct(cpf),
         N_pis = n_distinct(pis)) -> d6b # takes 1h
d6b_dt <- data.table(d6b) # takes 11min
d6b_dtf <- d6b_dt[N_cpf>1 & N_pis>1] # 5s
setkey(d6b_dtf, component) #1s
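The counting and filtering steps above can probably be written in data.table alone, which would avoid the dplyr pass and the conversion. This is a sketch on toy data that I have not timed on the full table. The toy values and the names `dt` and `nontrivial` are made up for illustration; `uniqueN()` and `:=` with `by` are data.table features.

```r
library(data.table)

# Toy stand-in for d6a with the same column names as above.
# Component 1 is 1:1, component 2 is 1:n, component 3 is n:m.
dt <- data.table(
  cpf       = c(1, 2, 2, 3, 4, 4),
  pis       = c(10, 20, 21, 30, 30, 31),
  component = c(1, 2, 2, 3, 3, 3)
)

# Add the per-component counts by reference (no copy of the table).
dt[, `:=`(
  N_edges = .N,
  N_cpf   = uniqueN(cpf),
  N_pis   = uniqueN(pis)
), by = component]

# Keep only components with more than one cpf AND more than one pis.
nontrivial <- dt[N_cpf > 1 & N_pis > 1]
setkey(nontrivial, component)
```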
Then I tried to implement the suggestion:
d6b_dtf %>% group_by(component) %>% select(cpf,pis) %>%
do(graph_from_data_frame, directed = TRUE) -> g_d6b_dtf
I get the following error message:
Adding missing grouping variables: `component`
Error: Arguments to do() must either be all named or all unnamed
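My reading of the message is that do() does not take a function followed by its arguments the way lapply() does. It seems to want expressions that refer to the current group as `.`, and here it sees one unnamed argument (the function) next to one named argument (`directed`). The sketch below shows the call shape I think it expects, on toy data and assuming that reading is right. The toy values and the output column name `g` are made up.

```r
library(dplyr)
library(igraph)

# Toy stand-in for d6b_dtf: two non-trivial components.
# Prefixed strings keep cpf and pis ids from colliding as vertex names.
edges <- data.frame(
  cpf       = c("c3", "c4", "c4", "c5", "c6", "c6"),
  pis       = c("p30", "p30", "p31", "p40", "p40", "p41"),
  component = c(3, 3, 3, 4, 4, 4)
)

# One named expression; `.` is the data for the current group.
# Selecting the two edge columns inside the expression avoids the
# "Adding missing grouping variables" message from select().
graphs <- edges %>%
  group_by(component) %>%
  do(g = graph_from_data_frame(.[, c("cpf", "pis")], directed = TRUE))

# `graphs` has one row per component; the graphs sit in the list column `g`.
first_graph <- graphs$g[[1]]
```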