How to avoid a memory leak without using collect()

Date: 2016-02-12 08:36:26

Tags: java scala memory-leaks

I have an RDD[CassandraRow] in Scala. With the code below I am running into a memory leak:

val rowKeyRdd: Array[CassandraRow] =
  sc.cassandraTable(keyspace, table).select("customer_id", "uniqueaddress").collect()

val clientPartitionKeys = rowKeyRdd.map(x => ClientPartitionKey(
  x.getString("customer_id"), x.getString("uniqueaddress"))).toList

val clientRdd: RDD[CassandraRow] =
  sc.parallelize(clientPartitionKeys).joinWithCassandraTable(keyspace, table)
    .where("eventtime >= ?", startDate)
    .where("eventtime <= ?", endDate)
    .map(x => x._2)

clientRdd.cache()

I have removed the cache() call, but I still hit the problem:

 org.jboss.netty.channel.socket.nio.AbstractNioSelector
 WARNING: Unexpected exception in the selector loop.
 java.lang.OutOfMemoryError: Java heap space
at org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34)
at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
at org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
at org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:80)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR 2016-02-12 07:54:48 akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]

java.lang.OutOfMemoryError: GC overhead limit exceeded

How can I avoid this memory leak? I have tried 8 GB per core, and the table contains millions of records.
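(For context: the heap that matters for a collect() is the driver's, which is configured separately from executor memory. The property names below are real Spark settings, but the values are only illustrative of this setup, not a recommendation:)

```properties
# spark-defaults.conf (illustrative values)
spark.driver.memory        8g
spark.driver.maxResultSize 2g
```

Raising these only postpones the failure if the collected dataset keeps growing, which is why avoiding collect() altogether is the real fix.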

1 Answer:

Answer 0 (score: 1)

On this line, your variable name suggests you have an RDD, but in fact, because you call collect(), it is not an RDD; as your type annotation shows, it is an Array:

val rowKeyRdd: Array[CassandraRow] =
  sc.cassandraTable(keyspace, table).select("customer_id", "uniqueaddress").collect()

This fetches all of the data from the workers into the driver program, so the amount of memory on the workers (8 GB per core) is not the issue; the driver simply does not have enough memory to hold this collection.

Since all you are doing with this data is mapping over it and then re-parallelizing it back into an RDD, you should just map it without ever calling collect(). I have not tried the code below, since I don't have access to your dataset, but it should be approximately correct:

val rowKeyRdd: RDD[CassandraRow] =
  sc.cassandraTable(keyspace, table).select("customer_id", "uniqueaddress")

val clientPartitionKeysRDD = rowKeyRdd.map(x => ClientPartitionKey(
  x.getString("customer_id"), x.getString("uniqueaddress")))

val clientRdd: RDD[CassandraRow] =
  clientPartitionKeysRDD.joinWithCassandraTable(keyspace, table)
    .where("eventtime >= ?", startDate)
    .where("eventtime <= ?", endDate)
    .map(x => x._2)

clientRdd.cache()
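The underlying principle, transform lazily instead of materializing, can be illustrated with plain Scala collections (a sketch, not Spark code): an Iterator streams one element at a time through a map/filter pipeline, much like an RDD's lazy transformations, whereas calling toArray on it, like collect(), forces every element into memory at once.

```scala
// Sketch: lazy pipeline vs. eager materialization (plain-Scala analogy,
// not the Spark API). An Iterator behaves like an RDD's lazy
// transformations; toArray would behave like collect().
object LazyVsEager {
  def main(args: Array[String]): Unit = {
    val rows: Iterator[Int] = Iterator.range(0, 1000000)

    // Lazy: map and filter only build up a pipeline; nothing is computed yet.
    val keys = rows.map(_ * 2).filter(_ % 3 == 0)

    // A terminal operation pulls elements through one at a time, so the
    // full million-element sequence is never held in memory.
    println(keys.take(3).toList)  // List(0, 6, 12)
  }
}
```

The same reasoning is why the answer keeps everything as RDD transformations: nothing is materialized on the driver, and Spark only computes results when an action runs on the cluster.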