PySpark PicklingError

Date: 2016-08-06 16:56:19

Tags: apache-kafka pyspark pickle rdd

I am seeing a pickling error:

  Could not pickle object as excessively deep recursion required.

Here is the traceback:

Traceback (most recent call last):
  File "/usr/hdp/current/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", line 62, in call
    r = self.func(t, *rdds)
  File "/usr/hdp/current/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 159, in 
    func = lambda t, rdd: old_func(rdd)
    if rdd.count() > 0:
  File "/usr/hdp/current/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1006, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/hdp/current/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 997, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/usr/hdp/current/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 871, in fold
    vals = self.mapPartitions(func).collect()
  File "/usr/hdp/current/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 773, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/hdp/current/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, in _jrdd
    pickled_cmd, bvars, env, includes = _prepare_for_python_RDD(self.ctx, command, self)
  File "/usr/hdp/current/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2308, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/usr/hdp/current/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 428, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/usr/hdp/current/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 646, in dumps
    cp.dump(obj)
  File "/usr/hdp/current/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 111, in dump
    raise pickle.PicklingError(msg)
PicklingError: Could not pickle object as excessively deep recursion required.

Here is a high-level excerpt of my code that leads to the error:

import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="my_app")
ssc = StreamingContext(sc, 1)  # 1-second batch interval

# full_topic_list, kafka_params, and offset_dict are defined elsewhere
kafka_stream = KafkaUtils.createDirectStream(ssc, full_topic_list, kafka_params, fromOffsets=offset_dict)

# Parse each Kafka message value as JSON (Python 2 tuple-unpacking lambda)
messages = kafka_stream.map(lambda (k, v): json.loads(v))

messages.foreachRDD(lambda rdd: process(rdd, topic_list, sqlcontext))

Inside my process function there is an RDD count, if topic_rdd.count() > 0:, and that is the line that throws the error.
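For reference, a minimal sketch of what such a process function might look like; the per-topic filtering and the DataFrame step are assumptions added for illustration, and only the count() guard comes from the question:

def process(rdd, topic_list, sqlcontext):
    for topic in topic_list:
        # Bind topic through a default argument to avoid Python's late-binding lambda pitfall
        topic_rdd = rdd.filter(lambda msg, t=topic: msg.get("topic") == t)  # hypothetical filter
        if topic_rdd.count() > 0:  # the call that raises the PicklingError
            sqlcontext.createDataFrame(topic_rdd).show()  # hypothetical downstream use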

1 Answer:

Answer 0 (score: 0)

You cannot pass an RDD into a distributed function (map, reduce, etc.) that runs within another RDD and process it there: RDD operations cannot be nested, because the closure referencing the inner RDD cannot be pickled.
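A hedged illustration of the pattern the answer is pointing at (the names rdd_a and rdd_b are made up for the example): when a closure shipped to a transformation references another RDD, Spark must pickle that RDD into the closure, which fails. The fix is to materialize the needed value on the driver first, or to combine the two datasets with a join:

# Broken: rdd_b is captured inside the closure passed to map(),
# so Spark attempts to pickle the RDD itself and fails.
# result = rdd_a.map(lambda x: x + rdd_b.count())

# Working: compute the value on the driver first; the closure then
# captures a plain Python int, which pickles cleanly.
b_count = rdd_b.count()
result = rdd_a.map(lambda x: x + b_count)

The same reasoning applies to SparkContext and SQLContext objects, which also cannot be pickled into closures sent to workers.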