我是使用pyspark的新手,我一直在尝试运行一个在本地模式下运行良好的脚本,其中包含1000行数据子集,但现在在所有数据的独立模式下抛出错误,这是1GB。我认为这会发生在更多数据=更多问题,但我无法理解导致此问题的原因。这些是我的独立群集的详细信息:
脚本在我将spark数据帧转换为pandas数据帧以并行化某些操作的阶段抛出错误。我很困惑,这会导致问题,因为数据只有大约1G,我的执行器应该有更多的内存。这是我的代码段 - 错误发生在data = data.toPandas()
:
def num_cruncher(data, cols=[], target='RETAINED', lvl='univariate'):
if not cols:
cols = data.columns
del cols[data.columns.index(target)]
data = data.toPandas()
pop_mean = data.mean()[0]
if lvl=='univariate':
cols = sc.parallelize(cols)
all_df = cols.map(lambda x: calculate([x], data, target)).collect()
elif lvl=='bivariate':
cols = sc.parallelize(cols)
cols = cols.cartesian(cols).filter(lambda x: x[0]<x[1])
all_df = cols.map(lambda x: calculate(list(x), data, target)).collect()
elif lvl=='trivariate':
cols = sc.parallelize(cols)
cols = cols.cartesian(cols).cartesian(cols).filter(lambda x: x[0][0]<x[0][1] and x[0][0]<x[1] and x[0][1]<x[1]).map(lambda x: (x[0][0],x[0][1],x[1]))
all_df = cols.map(lambda x: calculate(list(x), data, target)).collect()
all_df = pd.concat(all_df)
return all_df, pop_mean
这是错误日志:
16/07/11 09:49:54 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:258)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:310)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:257)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:256)
at org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:588)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:577)
at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:170)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:104)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
所以我的问题是:
答案 0 :(得分:3)
对于那些可能发现这篇文章有用的人来说 - 似乎问题不是给工人/奴隶更多的记忆,而是给予司机更多的记忆,正如@KartikKannapur的评论中提到的那样。所以为了解决这个问题,我设置了:
spark.driver.maxResultSize 3g
spark.driver.memory 8g
spark.executor.memory 4g
可能有点矫枉过正,但它现在可以胜任。