I'm trying to learn combineByKey in PySpark, basically by recreating groupByKey using combineByKey.
My RDD holds objects like the one below, stored in pairRdds. Each pair is an object of a class (MyClass), kept in the variable myObj.
An example myObj: obj1 = ((a, b), (A, 0, 0))
where the key is (a, b)
and the value is (A, 0, 0).
Say my example RDD is the following:
Rdd = [((a, b),(A,3,0)), ((a, b),(B,2,7)), ((a, c),(C,5,2)), ((a, d),(D,8,6))]
The final output I want is:
Output = [((a, b),[(A,3,0), (B,2,7)]),((a, c),(C,5,2)), ((a, d),(D,8,6))]
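For comparison, the combineByKey-as-groupByKey pattern does behave as expected when the RDD elements are plain (key, value) tuples. A minimal sketch of that baseline, using string placeholders ("a", "A", etc.) in place of my real data:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Plain (key, value) tuples -- placeholders standing in for my real objects
rdd = sc.parallelize([
    (("a", "b"), ("A", 3, 0)),
    (("a", "b"), ("B", 2, 7)),
    (("a", "c"), ("C", 5, 2)),
    (("a", "d"), ("D", 8, 6)),
])

grouped = rdd.combineByKey(
    lambda v: [v],             # createCombiner: wrap the first value in a list
    lambda acc, v: acc + [v],  # mergeValue: append a value within a partition
    lambda a, b: a + b,        # mergeCombiners: concatenate lists across partitions
)

# Order may vary; note that single-value keys also come back as one-element
# lists, e.g. (('a', 'c'), [('C', 5, 2)]), which matches groupByKey's behavior.
print(grouped.collect())
```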
Here are some examples I followed: `combineByKey`, pyspark and Apache Spark: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()?
comb = cell_location_flat.combineByKey(
    lambda row: [row],                    # createCombiner: start a list from the first value
    lambda rows, row: rows + [row],       # mergeValue: append a value within a partition
    lambda rows1, rows2: rows1 + rows2,   # mergeCombiners: concatenate partition lists
)
print(comb.collect())
I get the following error message:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2499, in pipeline_func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2499, in pipeline_func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 352, in func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1861, in combineLocally
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 238, in mergeValues
    for k, v in iterator:
TypeError: 'MyClass' object is not iterable
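The failing line `for k, v in iterator` in shuffle.py suggests combineByKey expects each RDD element to already be a (key, value) pair it can unpack, while my elements are MyClass instances. I suspect I need to map each object to a plain tuple first; a sketch of what I mean, with hypothetical attribute names `key` and `value` on MyClass (my real class may differ):

```python
# Hypothetical attributes: suppose MyClass exposes the pair as .key and .value
pair_rdd = cell_location_flat.map(lambda obj: (obj.key, obj.value))

comb = pair_rdd.combineByKey(
    lambda row: [row],
    lambda rows, row: rows + [row],
    lambda rows1, rows2: rows1 + rows2,
)
```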
Any ideas about what I'm doing wrong? Thanks for any replies!