When I set the value of local to 1, everything works fine, but when I set it to 2, the following error is reported:
from pyspark import SparkContext
# Changing 1 to 2 will give you an error
sc = SparkContext("local[2]", "sort")
class MySort:
    def __init__(self, tup):
        self.tup = tup

    def __gt__(self, other):
        # Order by the first element, then by the second
        if self.tup[0] > other.tup[0]:
            return True
        elif self.tup[0] == other.tup[0]:
            if self.tup[1] >= other.tup[1]:
                return True
            else:
                return False
        else:
            return False
r1 = sc.parallelize([(1, 2), (2, 2), (2, 3), (2, 1), (1, 3)])
r2 = r1.sortBy(MySort)
print(r2.collect())
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "E:\spark2.3.1\spark-2.3.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 230, in main
  File "E:\spark2.3.1\spark-2.3.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 225, in process
  File "E:\spark2.3.1\spark-2.3.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 376, in dump_stream
    bytes = self.serializer.dumps(vs)
  File "E:\spark2.3.1\spark-2.3.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 555, in dumps
    return pickle.dumps(obj, protocol)
_pickle.PicklingError: Can't pickle <class '__main__.MySort'>: attribute lookup MySort on __main__ failed
Answer 0 (score: 0)
I think you need to put the class in its own file and pass it to spark-submit with the parameter:
--py-files your_file.py
because Spark needs to ship this class to the other workers (see the sketch below).
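A minimal sketch of that layout, assuming the class is moved into a hypothetical module my_sort.py and the driver script is named sort_driver.py (both file names are illustrative, not from the original post):

# my_sort.py -- the comparator class now lives outside __main__,
# so pickle can look it up by module path on the worker processes.
class MySort:
    def __init__(self, tup):
        self.tup = tup

    def __gt__(self, other):
        # Order by the first element, then by the second
        if self.tup[0] != other.tup[0]:
            return self.tup[0] > other.tup[0]
        return self.tup[1] >= other.tup[1]

# sort_driver.py -- submitted with:
#   spark-submit --py-files my_sort.py sort_driver.py
from pyspark import SparkContext
from my_sort import MySort

sc = SparkContext("local[2]", "sort")
r1 = sc.parallelize([(1, 2), (2, 2), (2, 3), (2, 1), (1, 3)])
r2 = r1.sortBy(MySort)  # each tuple is wrapped in a comparable MySort key
print(r2.collect())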
Answer 1 (score: 0)
That is a really interesting bit of Spark behaviour I didn't know about before. I think that when you use a single core the class never needs to be pickled (shipping it to other worker processes is what requires pickling). But you can still use a plain function as the key (I assume you want to sort by the first two values of each tuple):
key_func = lambda tup : tup[:2]
r1 = sc.parallelize([(1, 2), (2, 2), (2, 3), (2, 1), (1, 3)])
r2 = r1.sortBy(key_func)
print(r2.collect())
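For reference, sortBy is ascending by default and tuples compare lexicographically, so sorting on the first two elements of each tuple should print:
[(1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]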