OutOfMemoryError when running logistic regression in SparkR

Asked: 2014-10-06 21:15:10

Tags: r apache-spark

I have successfully installed Apache Spark with Hadoop on Ubuntu 12.04 (single standalone mode) to run logistic regression. I tested it with a small csv dataset and it works, but it does not work with a large dataset of 269,369 rows.

library(SparkR)
sc <- sparkR.init()
iterations <- as.integer(11)
D <- 540

readPartition <- function(part) {
  part <- strsplit(part, ",", fixed = TRUE)
  list(matrix(as.numeric(unlist(part)), ncol = length(part[[1]])))
}
w <- runif(n=D, min = -1, max = 1)

cat("Initial w: ", w, "\n")

# Compute logistic regression gradient for a matrix of data points
gradient <- function(partition) {
  partition = partition[[1]]
  Y <- partition[, 1] # point labels (first column of input file)

  X <- partition[, -1] # point coordinates
  # For each point (x, y), compute gradient function
  #print(w)
  dot <- X %*% w      
  logit <- 1 / (1 + exp(-Y * dot))
  grad <- t(X) %*% ((logit - 1) * Y)
  list(grad)
}
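# For reference (assuming the labels in the first column are coded as -1/+1):
# the code above computes the gradient of the logistic loss
#   L(w) = sum_i log(1 + exp(-y_i * <x_i, w>)),
# namely grad = sum_i (sigma(y_i * <x_i, w>) - 1) * y_i * x_i, where
# sigma(z) = 1 / (1 + exp(-z)); t(X) %*% ((logit - 1) * Y) evaluates that
# sum over all rows of the partition in one matrix product.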


for (i in 1:iterations) {
  cat("On iteration ", i, "\n")
  w <- w - reduce(lapplyPartition(points, gradient), "+")
}

> points <- cache(lapplyPartition(textFile(sc, "hdfs://localhost:54310/henry/cdata_mr.csv"), readPartition))
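
As a quick sanity check of readPartition outside Spark, it can be run on a couple of made-up csv lines (a minimal sketch; the sample values are hypothetical):

# Two hypothetical lines: a label followed by two feature values each
sample_lines <- c("1,0.5,0.3", "-1,0.2,0.1")
str(readPartition(sample_lines))
# Note: matrix() fills column-wise by default, so with more than one input
# line the values are interleaved across rows; pass byrow = TRUE to matrix()
# in readPartition if each csv line is meant to become one matrix row.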

The error message I receive:

14/10/07 01:47:16 INFO FileInputFormat: Total input paths to process : 1
14/10/07 01:47:28 WARN CacheManager: Not enough space to cache partition rdd_23_0 in memory! Free memory is 235841615 bytes.
14/10/07 01:47:42 WARN CacheManager: Not enough space to cache partition rdd_23_1 in memory! Free memory is 236015334 bytes.
14/10/07 01:47:55 WARN CacheManager: Not enough space to cache partition rdd_23_2 in memory! Free memory is 236015334 bytes.
14/10/07 01:48:10 WARN CacheManager: Not enough space to cache partition rdd_23_3 in memory! Free memory is 236015334 bytes.
14/10/07 01:48:29 ERROR Executor: Exception in task 0.0 in stage 13.0 (TID 17)
java.lang.OutOfMemoryError: Java heap space
    at edu.berkeley.cs.amplab.sparkr.RRDD$$anon$2.read(RRDD.scala:144)
    at edu.berkeley.cs.amplab.sparkr.RRDD$$anon$2.<init>(RRDD.scala:156)
    at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:129)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:120)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:701)
14/10/07 01:48:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]

Dimensions of the data (sample):

data <- read.csv("/home/Henry/data.csv")

dim(data)

[1] 269369 541
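
A numeric value in an R matrix takes 8 bytes, so a rough estimate of one cached partition's matrix (assuming the file is split into the 4 partitions rdd_23_0..rdd_23_3 seen in the warnings above) already exceeds the ~236 MB of free cache memory reported there:

# Approximate in-memory size of one partition's double-precision matrix
rows_per_partition <- ceiling(269369 / 4)
print(object.size(matrix(0, nrow = rows_per_partition, ncol = 541)), units = "MB")
# ~278 MB per partition, versus ~236 MB of free space in the cache warnings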

I also tried hosting the same csv file on the local filesystem as well as on HDFS. Does it need more Hadoop data nodes to store a dataset of this size? If so, how should I set up a Spark/Hadoop cluster to get around this problem? (Or am I doing something wrong?)

Hint: I think increasing the Java and Spark heap space would let this run. I have tried a lot, but without success. Does anyone know a way to increase the heap space?

1 Answer:

Answer 0 (score: 1)

Can you try setting spark.executor.memory to a larger value (see the Spark configuration documentation)? As a back-of-the-envelope calculation, assuming each entry in the dataset takes 4 bytes, the whole file in memory will cost 269369 * 541 * 4 bytes ~= 560MB, which is over the default value of 512m for that parameter.
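
As a quick check of that figure in R:

# Back-of-the-envelope size of the full dataset at 4 bytes per entry
269369 * 541 * 4 / 1024^2
# ~556 MB, i.e. roughly the 560MB quoted above and more than the 512m default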

For example, try something like the following (assuming every worker node in the cluster has more than 1GB of memory available):

sc <- sparkR.init("local[2]", "SparkR", "/home/spark",
                  list(spark.executor.memory="1g"))