Sparklyr issue converting a CSV file into a table format

Time: 2018-08-30 21:48:27

Tags: java csv dplyr sparklyr collect

I am using Sparklyr to produce results with the Louvain algorithm. I have two csv files, one called nodes (13 GB) and the other called edges (209 GB). I am trying to read both files into memory in R with spark_read_csv and then convert them into a table format with tbl_df.
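For reference, the basic pattern described above is spark_read_csv() followed by pulling the Spark table into R; a minimal sketch of that two-step flow (the path here is hypothetical, and memory = FALSE simply avoids caching the whole table in Spark) would be:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

#Reads the CSV into a Spark DataFrame without caching it in Spark memory
nodes_sdf <- spark_read_csv(sc, name = "nodes",
                            path = "/path/to/node_R.csv",
                            header = TRUE, memory = FALSE)

#collect() is the step that materialises the table as an R tibble,
#so this is where the data has to fit in driver memory
nodes_df <- collect(nodes_sdf)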

I am currently using sparklyr in client mode, with the master set to "local" and the number of cores set to 15.

I kept getting recurring errors when trying to read in the entire dataset, so I split the 13 GB csv file into two files of 6.2 GB and 6.3 GB.
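If the two split node files are read back separately, one way they could be combined again on the Spark side (rather than in R) is sdf_bind_rows(); a rough sketch, assuming an existing connection sc and hypothetical file names for the 6.2 GB and 6.3 GB parts:

#Reads each half of the split nodes file into Spark (file names are assumed)
nodes_a <- spark_read_csv(sc, name = "nodes_a",
                          path = "/etl/louvain/R/cat/node_part1.csv", header = TRUE)
nodes_b <- spark_read_csv(sc, name = "nodes_b",
                          path = "/etl/louvain/R/cat/node_part2.csv", header = TRUE)

#Stacks the two Spark tables row-wise without collecting them into R
nodes_all <- sdf_bind_rows(nodes_a, nodes_b)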

The configuration I have currently set for the Sparklyr connection is as follows:

conf$`sparklyr.cores.local` <- 15
conf$`sparklyr.shell.driver-memory` <- "240G"
conf$spark.executor.memory <- "240G"
conf$spark.memory.fraction <- 0.8
conf$spark.driver.cores <- 15
conf$spark.driver.maxResultSize <- "24G"

The server I am currently working on has 515 GB of memory available. I set sparklyr.shell.driver-memory to "240G" because that is the amount of memory available on the machine minus what is needed for operations. I set spark.executor.memory to 24 GB and run with more than 10 cores. The heartbeat timeout interval has been set to 10000000 to get around any timeout issues. When I try to use the collect command on the 6.2 GB csv file, I get a Java heap size error.
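The heartbeat setting mentioned above is not shown in the configuration block; it would normally be raised through the same spark_config() object, for example as below (the property names are standard Spark settings, the exact values are an assumption, and spark.executor.heartbeatInterval must stay smaller than spark.network.timeout):

#Assumed way the heartbeat/timeout settings were raised via spark_config()
conf$spark.executor.heartbeatInterval <- "10000000s"
conf$spark.network.timeout <- "10000001s"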

The server I am currently working on has 515 GB available. The two csv files add up to 222 GB in total, so I set the executor memory higher than that. I also set the driver memory to 250 GB to account for any processing the driver may need to do. The maximum heap size in the Java options has also been set to 24 GB, but the Java heap size is what the error refers to when I try to turn the csv files into a table data frame.

library(dplyr)
library(igraph)
library(sparklyr)
library(data.table)
library(tidyverse)

#Install Spark version 2.0.0 for use with sparklyr
spark_install(version = "2.0.0")

#Disconnect any previous sparklyr sessions
spark_disconnect_all()

#Create an empty frame for sc config
conf <- spark_config()

#The configurations for sparklyr 
conf$spark.driver.cores <- 15 
conf$spark.executor.memory <- '240g'
conf$spark.driver.memory <- '250g'
#conf$`spark.yarn.executor.memoryOverhead` <- "10g"
#conf$spark.driver.extraJavaOptions="-Xmx9g"
conf[["sparklyr.shell.conf"]] <- "spark.driver.extraJavaOptions=- 
XX:MaxHeapSize=24G"

#Creates the sc connections
sc <- spark_connect(master = "local[15]",
                    version = "2.1.0",
                    config = conf)

#Reads in the nodes csv file
nodes <- spark_read_csv(sc, name = "nodes",
                        path = "/etl/louvain/R/cat/node_R.csv", header = TRUE)

#Turns the above into a table format
#This is where the error is produced
#Error: org.apache.spark.SparkException: Job 6 cancelled because
#SparkContext was shut down

nodes1 <- tbl_df(nodes) 
nodes_tbl <- copy_to(sc, nodes)

#Counts the rows in the nodes table
nodes %>% count

#Creates four subsets of the nodes file
nodes2 <- nodes1 %>% slice(1:88805991)
nodes3 <- nodes1 %>% slice(88805991:177611982)
nodes4 <- nodes1 %>% slice(177611982:266417973)
nodes5 <- nodes1 %>% slice(266417973:355223963)

#Collects each of the above data frames to use as an R data frame

nodes2 <- collect(nodes2)
nodes3 <- collect(nodes3)
nodes4 <- collect(nodes4)
nodes5 <- collect(nodes5)

Nodes <- rbind(nodes2, nodes3, nodes4, nodes5)
nodesNew <- Nodes[!duplicated(Nodes$id), ]

write.csv(nodesNew, file = "/etl/louvain/R/cat/nodesNew.csv",
          row.names = FALSE)

edges <- spark_read_csv(sc, name = "edges",
                        path = "/etl/louvain/R/awk/newEdge.csv", header = TRUE)
edges1 <- tbl_df(edges)


#nodes <- distinct(nodes, "id")

el1 <- collect(edges)

el = as.matrix(el1)
el[,1] = as.character(el[,1])
el[,2] = as.character(el[,2])

install.packages("graphframes")
library(graphframes)
library(igraph)
clustergraph1 <- graph_from_data_frame(el, directed = FALSE,
                                       vertices = nodesNew)
#Assigns the louvain algorithm to the above graph
Community200k <- cluster_louvain(clustergraph1)
#Prints the values of each community
print(Community200k)

0 Answers:

No answers