ValueError: too many values to unpack when using reduceByKey() on an RDD - PySpark

Time: 2018-07-28 12:13:49

Tags: python apache-spark pyspark

I have a piece of code where the RDD contains tuples of the form (a, b, c), where c is itself a DataFrame. I only need to group it by the second value (b), so that the result looks like [(b, <iterable of (a, b, c) tuples>)]. Currently I am using groupByKey():


rdd_new = rdd.map(lambda x: (x[1], x[0:])).groupByKey()
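
For context, here is a minimal runnable sketch of what that approach produces (the data is made up, and plain strings stand in for the real DataFrames in c):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Toy (a, b, c) tuples; strings stand in for the DataFrames in c.
    rdd = sc.parallelize([(1, "x", "df1"), (2, "x", "df2"), (3, "y", "df3")])

    # Key each tuple by its second element, then group; x[0:] is the whole tuple.
    grouped = rdd.map(lambda x: (x[1], x[0:])).groupByKey()
    print([(k, list(v)) for k, v in grouped.collect()])
    # [('x', [(1, 'x', 'df1'), (2, 'x', 'df2')]), ('y', [(3, 'y', 'df3')])]
    # (group order may vary)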

This seems expensive, so I tried converting it to reduceByKey():


rdd_new = rdd.reduceByKey(lambda x: (x[1], x[0:]))

I don't know what went wrong, but I get an error when collecting the RDD:

ValueError: too many values to unpack

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more
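
For reference, the error seems reproducible with plain tuples (a minimal sketch with made-up data; the DataFrame payloads are again stand-in strings):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # reduceByKey() expects an RDD of (key, value) pairs: during the shuffle,
    # each element appears to be unpacked as a 2-tuple, so these 3-tuples
    # fail before the lambda is ever called.
    rdd = sc.parallelize([(1, "x", "df1"), (2, "x", "df2")])
    rdd.reduceByKey(lambda x: (x[1], x[0:])).collect()
    # ValueError: too many values to unpack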

Please help!

0 Answers:

No answers yet