Using dictionaries

Time: 2018-01-18 11:39:08

Tags: python apache-spark pyspark rdd reduce

Why does Spark force you to build an RDD from a list of tuples when performing a reduceByKey transformation?

reduce_rdd = sc.parallelize([{'k1': 1}, {'k2': 2}, {'k1': -2}, {'k3': 4}, {'k2': -5}, {'k1': 4}])
print(reduce_rdd.reduceByKey(lambda x, y: x + y).take(100))

Error:

for k, v in iterator:
ValueError: need more than 1 value to unpack

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

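If reduceByKey() is meant to operate on a collection of key-value pairs, then it seems to me that each pair should live in the Python object type designed for key-value pairs, which is a dictionary, not a tuple.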

1 answer:

Answer 0: (score: 5)

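reduceByKey works on pair RDDs. A pair RDD is effectively a distributed version of a list of tuples. Because such a structure can be partitioned easily, it is a natural choice for distributed computation on key:value data.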
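The ValueError in the question comes from the `for k, v in iterator` line in the traceback, which unpacks every record into a key and a value. A minimal plain-Python sketch of that unpacking, with no Spark involved (`try_unpack` is a hypothetical helper used only for illustration):

```python
def try_unpack(record):
    """Attempt the same `k, v = record` unpacking that the traceback shows."""
    try:
        k, v = record
        return (k, v)
    except ValueError as exc:
        return "ValueError: {}".format(exc)

# A 2-tuple unpacks cleanly into key and value.
print(try_unpack(('k1', 1)))
# Iterating a dict yields only its keys, so a one-entry dict gives a single value.
print(try_unpack({'k1': 1}))
```

Note that a dict with exactly two entries would unpack without an error, but into its two keys rather than a key and a value, so dicts are not a safe substitute for tuples here.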

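There are projects that implement an IndexedRDD, but at the time of writing they have not been integrated into the spark-core code. If you are interested, you can install a PySpark version of IndexedRDD from this GitHub repository.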

Coming back to your question, it can be solved easily without IndexedRDD:

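reduce_rdd = sc.parallelize([{'k1': 1}, {'k2': 2}, {'k1': -2},
                             {'k3': 4}, {'k2': -5}, {'k1': 4}])
# Turn each single-entry dict into a (key, value) tuple, then reduce by key.
# list(...) is needed on Python 3, where dict.items() returns a view that cannot be indexed.
reduce_rdd.map(lambda x: list(x.items())[0]).reduceByKey(lambda x, y: x + y).collectAsMap()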

This returns the following output:

{'k1': 3, 'k2': -3, 'k3': 4}
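To check the result without a cluster, the same steps can be mimicked in plain Python. This is only a sketch of the semantics, not how Spark implements reduceByKey; `local_reduce_by_key` is a hypothetical helper and is not part of PySpark.

```python
def local_reduce_by_key(pairs, func):
    """Merge values that share a key with func, like reduceByKey followed by collectAsMap."""
    result = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are combined with func.
        result[key] = func(result[key], value) if key in result else value
    return result

records = [{'k1': 1}, {'k2': 2}, {'k1': -2}, {'k3': 4}, {'k2': -5}, {'k1': 4}]
# Flatten every dict into (key, value) tuples; this also handles dicts with several entries.
pairs = [item for record in records for item in record.items()]
totals = local_reduce_by_key(pairs, lambda x, y: x + y)
print(totals)
```

The flattening step here keeps every entry of every dict, whereas the answer's `map` keeps only the first. In PySpark the analogous step would presumably be `flatMap(lambda d: d.items())` in place of the `map` call, which matters only when a record holds more than one key.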