I am running a simple example through Spark's Python API:
x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
def f(x): return x
def add(a, b): return a + str(b)
sorted(x.combineByKey(str, add, add).collect())
It works fine in local mode (on both Spark 1.0 and 1.1), but the error occurs in cluster mode. A snippet of the traceback is below. A similar problem also shows up when testing the RDD function cogroup(). This is my first time using Spark's Python API.
Do you have any ideas?
[duplicate 561]
14/12/19 23:04:53 INFO TaskSetManager: Loss was due to org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-5.1.4-1.cdh5.1.4.p0.15/lib/spark/python/pyspark/worker.py", line 77, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/cloudera/parcels/CDH-5.1.4-1.cdh5.1.4.p0.15/lib/spark/python/pyspark/rdd.py", line 1404, in pipeline_func
return func(split, prev_func(split, iterator))
File "/opt/cloudera/parcels/CDH-5.1.4-1.cdh5.1.4.p0.15/lib/spark/python/pyspark/rdd.py", line 283, in func
def func(s, iterator): return f(iterator)
File "/opt/cloudera/parcels/CDH-5.1.4-1.cdh5.1.4.p0.15/lib/spark/python/pyspark/rdd.py", line 1118, in combineLocally
combiners = {}
AttributeError: 'cell' object has no attribute 'iteritems'
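
For reference, here is a self-contained version of the example that runs correctly for me in local mode. This is a minimal sketch assuming a standalone script rather than the pyspark shell, so the SparkContext is created explicitly (the app name is arbitrary):

from pyspark import SparkContext

# In the pyspark shell `sc` already exists; in a standalone script it must be created.
sc = SparkContext(appName="combineByKey-example")

x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

def add(a, b):
    # mergeValue / mergeCombiners: append the next value as a string
    return a + str(b)

# createCombiner=str turns the first value seen for each key into a string
print(sorted(x.combineByKey(str, add, add).collect()))
# Expected output in local mode: [('a', '11'), ('b', '1')]

sc.stop()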