I'm using Spark 1.5 (PySpark) with Python 2.7.3 and zlib version 1.0, on the following setup:
6 machines, each with 180 GB of RAM.
spark-submit --master yarn --num-executors 10 --executor-memory 35G --driver-memory 6G
I sometimes hit an OverflowError: size does not fit in an int
exception and the job fails.
As far as I can tell, the exception has nothing to do with my code; it comes from Spark's own mechanism for compressing data when moving it between servers.
I couldn't find any solution online. I tried changing the compression codec, but it didn't help. I also tried disabling compression entirely by setting spark.rdd.compress, spark.shuffle.compress, spark.shuffle.spill.compress, and spark.broadcast.compress
to false, but that didn't help either. Is there any workaround?
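For reference, this is roughly how I passed those settings, combined with my submit command above (the script name my_job.py is just a placeholder for my application):

```shell
# Sketch of the submit command with all four compression settings disabled.
# my_job.py stands in for the actual application script.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-memory 35G \
  --driver-memory 6G \
  --conf spark.rdd.compress=false \
  --conf spark.shuffle.compress=false \
  --conf spark.shuffle.spill.compress=false \
  --conf spark.broadcast.compress=false \
  my_job.py
```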
Here's the stack trace:
17/01/02 13:16:27 WARN TaskSetManager: Lost task 313.0 in stage 10.0 (TID 1532, server01.dom): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2355, in pipeline_func
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2355, in pipeline_func
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2355, in pipeline_func
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2355, in pipeline_func
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2355, in pipeline_func
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2355, in pipeline_func
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1789, in _mergeCombiners
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 317, in mergeCombiners
self._spill()
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 345, in _spill
self.serializer.dump_stream([(k, v)], streams[h])
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 267, in dump_stream
bytes = self.serializer.dumps(vs)
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 487, in dumps
return zlib.compress(self.serializer.dumps(obj), 1)
OverflowError: size does not fit in an int