I am using Spark for some computation. Basically I do two things: read a batch of CSV files from HDFS into DataFrames in a loop, and run some transformations on each of them.

(You may ask why I read them in a loop. I do it for a reason: the files do not all appear at once; they arrive periodically, so I cannot read them all in one go. Spark Streaming could handle this, but I don't want to use Streaming, because with Streaming I would need to set a very long window, which is not simple to debug and test.)

The code is as follows:
# imports for the snippet below: the hdfs client library and PySpark SQL
# (sc and sqlContext come from the PySpark shell)
from hdfs import InsecureClient
from pyspark.sql import Row
from pyspark.sql import functions as func

# Get the file list in the HDFS directory
client = InsecureClient('http://10.79.148.184:50070')
file_list = client.list('/test')

df_total = None
counter = 0
for file in file_list:
    counter += 1
    # turn each file (CSV format) into a data frame
    lines = sc.textFile("/test/%s" % file)
    parts = lines.map(lambda l: l.split(","))
    rows = parts.map(lambda p: Row(router=p[0], interface=int(p[1]),
                                   protocol=p[7], bit=int(p[10])))
    df = sqlContext.createDataFrame(rows)
    # do some transformation on the data frame
    df_protocol = df.groupBy(['protocol']).agg(func.sum('bit').alias('bit'))
    # append the current data frame to the accumulated result
    if df_total is None:
        df_total = df_protocol
    else:
        df_total = df_total.unionAll(df_protocol)
    # cache df_total
    df_total.cache()
    if counter % 5 == 0:
        df_total.rdd.checkpoint()
    # show the df_total contents
    df_total.show()
I know df_total may become large over time, but in practice the code above throws an exception long before that matters.

After about 30 iterations of the loop, the code throws a "GC overhead limit exceeded" exception. The files are very small, so even after 300 iterations the total data would only be a few MB. I don't understand why it throws the GC error. (A sketch of the caching/checkpointing pattern I suspect is needed follows the stack trace below.)

The exception is as follows:
Exception in thread "dispatcher-event-loop-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Integer.toString(Integer.java:331)
at java.lang.Integer.toString(Integer.java:739)
at java.lang.String.valueOf(String.java:2854)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
at org.apache.spark.storage.RDDBlockId.name(BlockId.scala:53)
at org.apache.spark.storage.BlockId.equals(BlockId.scala:46)
at java.util.HashMap.getEntry(HashMap.java:471)
at java.util.HashMap.get(HashMap.java:421)
at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getLocations(BlockManagerMasterEndpoint.scala:371)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$org$apache$spark$storage$BlockManagerMasterEndpoint$$getLocationsMultipleBlockIds$1.apply(BlockManagerMasterEndpoint.scala:376)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$org$apache$spark$storage$BlockManagerMasterEndpoint$$getLocationsMultipleBlockIds$1.apply(BlockManagerMasterEndpoint.scala:376)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getLocationsMultipleBlockIds(BlockManagerMasterEndpoint.scala:376)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:72)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
16/04/20 09:52:00 ERROR TaskSchedulerImpl: Lost executor 0 on ES01: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/04/20 09:52:12 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=4721950849479578179, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=47]}} to ES01/10.79.148.184:53059; closing connection
io.netty.handler.codec.EncoderException: java.lang.OutOfMemoryError: Java heap space
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:107)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:691)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:626)
at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.OutOfMemoryError: Java heap space
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:602)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:204)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:256)
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:136)
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:127)
at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:77)
at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
... 13 more
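My current guess: the trace dies inside BlockManagerMasterEndpoint while it is looking up block locations, which would fit two problems in my loop. First, df_total.cache() registers a fresh set of cached blocks every iteration and the old ones are never unpersisted, so the driver has to track more and more blocks. Second, the repeated unionAll makes the lineage (and the query plan) grow without bound, and df_total.rdd.checkpoint() never actually truncates it, because .rdd returns a new RDD object each time and no action is ever run on the one that was marked for checkpointing. Below is a minimal sketch of the pattern I suspect is needed; the checkpoint directory, the count() calls used to force materialization, and rebuilding the DataFrame from the checkpointed RDD are my assumptions, not a confirmed fix:

from pyspark.sql import Row
from pyspark.sql import functions as func

# assumption: any writable HDFS path works as the checkpoint dir
sc.setCheckpointDir('/tmp/spark_checkpoint')

df_total = None
for counter, file in enumerate(file_list, start=1):
    lines = sc.textFile("/test/%s" % file)
    parts = lines.map(lambda l: l.split(","))
    rows = parts.map(lambda p: Row(router=p[0], interface=int(p[1]),
                                   protocol=p[7], bit=int(p[10])))
    df_protocol = (sqlContext.createDataFrame(rows)
                             .groupBy('protocol')
                             .agg(func.sum('bit').alias('bit')))

    new_total = df_protocol if df_total is None else df_total.unionAll(df_protocol)
    new_total.cache()
    new_total.count()                # materialize the new cache first
    if df_total is not None:
        df_total.unpersist()         # drop last iteration's cached blocks
    df_total = new_total

    if counter % 5 == 0:
        # a checkpoint only runs when an action hits the SAME RDD object,
        # and the DataFrame must be rebuilt from it to truncate the lineage
        rdd = df_total.rdd
        rdd.checkpoint()
        rdd.count()
        df_total.unpersist()
        df_total = sqlContext.createDataFrame(rdd, df_total.schema)
        df_total.cache()

df_total.show()

If this guess is right, the driver only tracks one cached copy of df_total at a time, and every fifth iteration the plan collapses back to a scan of the checkpointed data instead of a chain of 30+ unions.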