I am using sparklyr to generate results with the Louvain algorithm. I have two csv files, one called nodes (13 GB) and the other called edges (209 GB). I am trying to read both files into memory in R with spark_read_csv and then convert them to a table format with tbl_df.

I am currently using sparklyr's client deployment method, with the master set to "local" and the number of cores set to 15.

I kept getting a recurring error when trying to read the entire dataset in one go, so I split the 13 GB csv file into two files of 6.2 GB and 6.3 GB.
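For reference, this is a minimal sketch of how the two parts of the split nodes file could be read on the Spark side and stacked back into one table with sdf_bind_rows (the part file names below are placeholders, not the actual paths used):

library(sparklyr)

sc <- spark_connect(master = "local[15]")

#Read each part of the split nodes file into Spark (hypothetical paths)
nodes_a <- spark_read_csv(sc, name = "nodes_a", path = "/etl/louvain/R/cat/node_part1.csv", header = TRUE)
nodes_b <- spark_read_csv(sc, name = "nodes_b", path = "/etl/louvain/R/cat/node_part2.csv", header = TRUE)

#Stack the two Spark DataFrames into a single table without collecting them into R
nodes_all <- sdf_bind_rows(nodes_a, nodes_b)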
The current configuration I have set for the sparklyr connection is as follows:
conf$`sparklyr.cores.local` <- 15
conf$`sparklyr.shell.driver-memory` <- "240G"
conf$spark.executor.memory <- "240G"
conf$spark.memory.fraction <- 0.8
conf$spark.driver.cores <- 15
conf$spark.driver.maxResultSize <- "24G"
The server I am currently using has 515 GB of available memory. I set sparklyr.shell.driver-memory to "240G" because that is roughly the memory available on the machine minus what is needed for other operations. I set spark.executor.memory to 24g and run with more than 10 cores. The heartbeat timeout interval has been set to 10000000 to get around any timeout issues. When I try to use the collect command on the 6.2 GB csv file, I receive a Java heap space error.

The two csv files add up to 222 GB in total, so I set the executor memory above that. I also set the driver memory to 250g to account for any processing the driver may need to do. The maximum heap size in the Java options has also been set to 24G, and the Java heap size is what the error refers to when I try to pull the csv file into a table data frame.
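The heartbeat and timeout settings mentioned above are not shown in the configuration block, so here is a minimal sketch of how they, and the driver's maximum heap, could be added to the same spark_config object. It assumes the standard Spark property names spark.executor.heartbeatInterval and spark.network.timeout plus sparklyr's shell pass-through option; adjust to your setup:

#Raise the executor heartbeat and overall network timeouts (assumed property names)
conf$spark.executor.heartbeatInterval <- "10000000"
conf$spark.network.timeout <- "10000000"
#Pass the maximum heap size to the driver JVM via spark-submit's --driver-java-options
conf$`sparklyr.shell.driver-java-options` <- "-XX:MaxHeapSize=24G"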
library(dplyr)
library(igraph)
library(sparklyr)
library(data.table)
library(tidyverse)
#Install Spark version 2.1.0 (the version requested in spark_connect below)
spark_install(version = "2.1.0")
#Disconnect any previous sparklyr sessions
spark_disconnect_all()
#Create a default config object for the sc connection
conf <- spark_config()
#The configurations for sparklyr
conf$spark.driver.cores <- 15
conf$spark.executor.memory <- '240g'
conf$spark.driver.memory <- '250g'
#conf$`spark.yarn.executor.memoryOverhead` <- "10g"
#conf$spark.driver.extraJavaOptions="-Xmx9g"
conf[["sparklyr.shell.conf"]] <- "spark.driver.extraJavaOptions=-
XX:MaxHeapSize=24G"
#Creates the sc connections
sc <- spark_connect(master = "local[15]",
                    version = "2.1.0",
                    config = conf)
#Reads in the nodes csv file
nodes <- spark_read_csv(sc, name = "nodes", path = "/etl/louvain/R/cat/node_R.csv", header = TRUE)
#Turns the above into a table format
#This is where the error is produced
#Error: org.apache.spark.SparkException: Job 6 cancelled because SparkContext was shut down
nodes1 <- tbl_df(nodes)
nodes_tbl <- copy_to(sc, nodes)
#Counts the node as below
nodes %>% count
#Creates four subsets of the nodes file
nodes2 <- nodes1 %>% slice(1:88805991)
nodes3 <- nodes1 %>% slice(88805991:177611982)
nodes4 <- nodes1 %>% slice(177611982:266417973)
nodes5 <- nodes1 %>% slice(266417973:355223963)
#Collects each of the above subsets into a local R data frame
nodes2 <- collect(nodes2)
nodes3 <- collect(nodes3)
nodes4 <- collect(nodes4)
nodes5 <- collect(nodes5)
#Combines the four collected subsets and removes duplicate node ids
Nodes <- rbind(nodes2, nodes3, nodes4, nodes5)
nodesNew <- Nodes[!duplicated(Nodes$id), ]
write.csv(nodesNew, file = "/etl/louvain/R/cat/nodesNew.csv", row.names = FALSE)
#Reads in the edges csv file
edges <- spark_read_csv(sc, name = "edges", path = "/etl/louvain/R/awk/newEdge.csv", header = TRUE)
edges1 <- tbl_df(edges)
#nodes <- distinct(nodes, "id")
#Collects the edge list into R and converts both endpoint columns to character
el1 <- collect(edges)
el <- as.matrix(el1)
el[, 1] <- as.character(el[, 1])
el[, 2] <- as.character(el[, 2])
install.packages("graphframes")
library(graphframes)
library(igraph)
#Builds an undirected igraph graph from the edge list, using the de-duplicated node list as vertices
clustergraph1 <- graph_from_data_frame(el, directed = FALSE, vertices = nodesNew)
#Runs the Louvain algorithm on the above graph
Community200k <- cluster_louvain(clustergraph1)
#Prints the values of each community
print(Community200k)
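As a follow-up, here is a small sketch of how the resulting communities object could be inspected with standard igraph accessors (these calls are illustrative and were not part of the original script):

#Number of communities and modularity of the partition
length(Community200k)
modularity(Community200k)
#Community membership per node and the size of each community
head(membership(Community200k))
sizes(Community200k)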