Why does Spark force the RDD to be built from a list of tuples when performing a reduceByKey transformation?
reduce_rdd = sc.parallelize([{'k1': 1}, {'k2': 2}, {'k1': -2}, {'k3': 4}, {'k2': -5}, {'k1': 4}])
print(reduce_rdd.reduceByKey(lambda x, y: x + y).take(100))
Error:
for k, v in iterator:
ValueError: need more than 1 value to unpack
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
If reduceByKey() is intended to work on a collection of key:value pairs, it seems to me that each pair should live in the Python object type designed for key:value pairs, which is a dict, not a tuple.
Answer (score: 5)
reduceByKey works on pair RDDs. A pair RDD is essentially a distributed version of a list of tuples. Because such data structures are easy to partition, they are the natural choice for distributed computation over key:value data.
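For illustration, a minimal sketch (assuming the same SparkContext `sc` as in the question) of building the RDD from (key, value) tuples directly, so reduceByKey works without any extra mapping:

pairs_rdd = sc.parallelize([('k1', 1), ('k2', 2), ('k1', -2),
                            ('k3', 4), ('k2', -5), ('k1', 4)])
# reduceByKey merges the values for each key across partitions.
print(pairs_rdd.reduceByKey(lambda x, y: x + y).collect())
# e.g. [('k1', 3), ('k2', -3), ('k3', 4)] (ordering may vary)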
Some projects implement an IndexedRDD, but at the time of writing they have not been integrated into the spark-core codebase. If you are interested, you can install a PySpark version of IndexedRDD from its GitHub repository.
Coming back to your question, it is easy to solve without IndexedRDD:
reduce_rdd = sc.parallelize([{'k1': 1}, {'k2': 2}, {'k1': -2},
                             {'k3': 4}, {'k2': -5}, {'k1': 4}])
# Turn each single-entry dict into a (key, value) tuple, then reduce by key.
# Wrapping items() in list() keeps this working on Python 3, where it returns a view.
reduce_rdd.map(lambda x: list(x.items())[0]).reduceByKey(lambda x, y: x + y).collectAsMap()
which returns the following output:
{'k1': 3, 'k2': -3, 'k3': 4}
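If any of the dictionaries could hold more than one entry, taking only the first item would silently drop data; a safer variant (a sketch, not from the original answer) is to flatMap over all the items before reducing:

# Hypothetical variant: emit every (key, value) pair from each dict, then reduce.
pairs = reduce_rdd.flatMap(lambda d: list(d.items()))
print(pairs.reduceByKey(lambda x, y: x + y).collectAsMap())
# {'k1': 3, 'k2': -3, 'k3': 4}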