PySpark RuntimeError: Set changed size during iteration

Time: 2017-03-24 02:13:48

Tags: python apache-spark pyspark

I am running a PySpark script and hit the error below. It reports "RuntimeError: Set changed size during iteration", apparently triggered by my line `if len(rdd.take(1)) > 0:`. I'm not sure whether that line is the real cause and would like to understand what is actually going wrong. Any help would be greatly appreciated.
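For reference, the failing check sits inside a per-batch callback roughly like the following (a minimal sketch reconstructed from the traceback; the stream source and the body of `_compute_glb_max` are assumptions, not the actual mongo_kafka_spark_script.py):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # Sketch of the call site implied by the traceback; names marked below are assumed.
    sc = SparkContext(appName="sketch")
    ssc = StreamingContext(sc, 2)  # 2-second batches

    def _compute_glb_max(rdd):
        # rdd.take(1) launches a Spark job; while preparing it, the driver re-pickles
        # all registered broadcast variables, which is where the
        # "RuntimeError: Set changed size during iteration" surfaces.
        if len(rdd.take(1)) > 0:
            print(rdd.max())  # placeholder for the real per-batch work

    lines = ssc.socketTextStream("localhost", 9999)  # assumed source; the real job reads from Kafka
    lines.map(float).foreachRDD(_compute_glb_max)

    ssc.start()
    ssc.awaitTermination()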

Thanks!

17/03/23 21:54:17 INFO DStreamGraph: Updated checkpoint data for time 1490320070000 ms
17/03/23 21:54:17 INFO JobScheduler: Finished job streaming job 1490320072000 ms.0 from job set of time 1490320072000 ms
17/03/23 21:54:17 INFO JobScheduler: Starting job streaming job 1490320072000 ms.1 from job set of time 1490320072000 ms
17/03/23 21:54:17 ERROR JobScheduler: Error running job streaming job 1490320072000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", line 65, in call
    r = self.func(t, *rdds)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 159, in <lambda>
    func = lambda t, rdd: old_func(rdd)
  File "/home/richard/Documents/spark_code/with_kafka/./mongo_kafka_spark_script.py", line 96, in _compute_glb_max
    if len(rdd.take(1)) > 0:
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 965, in runJob
    port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2439, in _jrdd
    self._jrdd_deserializer, profiler)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2372, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2363, in _prepare_for_python_RDD
    broadcast_vars = [x._jbroadcast for x in sc._pickled_broadcast_vars]
RuntimeError: Set changed size during iteration

  at org.apache.spark.streaming.api.python.TransformFunction.callPythonTransformFunction(PythonDStream.scala:95)
  at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:78)
  at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179)
  at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
  at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
  at scala.util.Try$.apply(Try.scala:192)
  at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:254)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:253)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
  File "/home/richard/Documents/spark_code/with_kafka/./mongo_kafka_spark_script.py",
     

第224行,在           ssc.awaitTermination();         文件“/usr/lib/spark/python/lib/pyspark.zip/pyspark/streaming/context.py”,   第206行,在awaitTermination中         文件“/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py”,   第1133行,致电         文件“/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py”,第63行,   装饰         文件“/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py”,行   319,在get_return_value中       py4j.protocol.Py4JJavaError:调用o38.awaitTermination时发生错误。       :org.apache.spark.SparkException:Python引发了一个异常:       Traceback(最近一次调用最后一次):         文件“/usr/lib/spark/python/lib/pyspark.zip/pyspark/streaming/util.py”,   第65行,正在通话中           r = self.func(t,* rdds)         文件“/usr/lib/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py”,   第159行,在           func = lambda t,rdd:old_func(rdd)         文件“/home/richard/Documents/spark_code/with_kafka/./mongo_kafka_spark_script.py”,   第96行,在_compute_glb_max中           如果len(rdd.take(1))> 0:         文件“/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py”,第1343行,           res = self.context.runJob(self,takeUpToNumLeft,p)         在runJob中输入文件“/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py”,第965行           port = self._jvm.PythonRDD.runJob(self._jsc.sc(),mappedRDD._jrdd,partitions)         在_jrdd中输入文件“/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py”,第2439行           self._jrdd_deserializer,profiler)         在_wrap_function中输入文件“/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py”,第2372行           pickled_command,broadcast_vars,env,includes = _prepare_for_python_RDD(sc,command)         在_prepare_for_python_RDD中输入文件“/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py”,第2363行           broadcast_vars = [sc._pickled_broadcast_vars中x的x._jbroadcast]       RuntimeError:在迭代期间设置更改的大小

  at org.apache.spark.streaming.api.python.TransformFunction.callPythonTransformFunction(PythonDStream.scala:95)
  at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:78)
  at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179)
  at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:179)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
  at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
  at scala.util.Try$.apply(Try.scala:192)
  at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:254)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:253)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

2 answers:

Answer 0 (score: 2)

Creating broadcast variables inside an iteration does not seem to be best practice. If you need stateful data, always use updateStateByKey instead.
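For example, a running maximum can be carried across batches with updateStateByKey instead of re-broadcasting a value every batch (a minimal sketch; the key/value layout and source are assumptions, not taken from the question):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="stateful-max")
    ssc = StreamingContext(sc, 2)
    ssc.checkpoint("/tmp/spark-checkpoint")  # updateStateByKey requires checkpointing

    def update_max(new_values, current_max):
        # new_values: values seen for this key in the current batch
        # current_max: state carried over from previous batches (None on first sight)
        candidates = new_values + ([current_max] if current_max is not None else [])
        return max(candidates) if candidates else current_max

    lines = ssc.socketTextStream("localhost", 9999)          # assumed source
    pairs = lines.map(lambda line: ("global_max", float(line)))
    running_max = pairs.updateStateByKey(update_max)
    running_max.pprint()

    ssc.start()
    ssc.awaitTermination()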

Answer 1 (score: 1)

Try

select-object -unique

take() can throw exceptions; however, if more details were available we could pinpoint the error.
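If the goal of `len(rdd.take(1)) > 0` is only to skip empty batches, `RDD.isEmpty()` expresses that intent more directly (a sketch of that substitution; whether it avoids this particular broadcast-set race is not guaranteed):

    # Hypothetical replacement for the check inside _compute_glb_max:
    def _compute_glb_max(rdd):
        if not rdd.isEmpty():      # same intent as len(rdd.take(1)) > 0
            print(rdd.max())       # placeholder for the real per-batch work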