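I want to write a custom PySpark serializer. Apart from the details here, there is very little documentation online.
My logic is as follows: objects of my custom classes are serialized with their own serialize() methods, and everything else falls back to cPickle. The custom serializer looks like this: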
from pyspark.serializers import FramedSerializer

class CustomSerializer(FramedSerializer):
    import cPickle as pickle
    import CustomClass

    def dumps(self, obj):
        if isinstance(obj, CustomClass):
            bytes_str = obj.serialize()
            bytes_str = '\1' + bytes_str  # tag 1: CustomClass payload
        elif isinstance(obj, CustomClass.Location):
            bytes_str = obj.serialize()
            bytes_str = '\2' + bytes_str  # tag 2: CustomClass.Location payload
        else:
            bytes_str = pickle.dumps(obj)
            bytes_str = '\0' + bytes_str  # tag 0: pickle fallback
        return bytes_str

    def loads(self, bytes_str):
        c = bytes_str[0]  # the leading tag byte selects the decoder
        if c == '\1':
            obj = CustomClass()
            obj.parse_from_string(bytes_str[1:])
        elif c == '\2':
            obj = CustomClass.Location()
            obj.parse_from_string(bytes_str[1:])
        else:
            obj = pickle.loads(bytes_str[1:])
        return obj
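To make the tagging scheme above easier to reason about outside Spark, here is a minimal standalone sketch of the same idea. It is only an illustration under assumptions: `Point` is a hypothetical stand-in for CustomClass, the tag constants and the module-level `dumps`/`loads` functions are names invented for this example, and it uses Python 3 `bytes` with the standard `pickle` module rather than cPickle.

```python
import pickle


# Hypothetical stand-in for CustomClass: only the serialize() and
# parse_from_string() methods matter for this sketch.
class Point(object):
    def __init__(self, x=0, y=0):
        self.x, self.y = x, y

    def serialize(self):
        return ("%d,%d" % (self.x, self.y)).encode("ascii")

    def parse_from_string(self, data):
        self.x, self.y = (int(v) for v in data.decode("ascii").split(","))


# One leading byte records which encoding produced the payload.
TAG_PICKLE, TAG_POINT = b"\x00", b"\x01"


def dumps(obj):
    if isinstance(obj, Point):
        return TAG_POINT + obj.serialize()
    return TAG_PICKLE + pickle.dumps(obj)  # fallback for everything else


def loads(data):
    tag, payload = data[:1], data[1:]  # slicing keeps the tag as bytes
    if tag == TAG_POINT:
        obj = Point()
        obj.parse_from_string(payload)
        return obj
    return pickle.loads(payload)


restored = loads(dumps(Point(3, 4)))   # goes through the custom branch
fallback = loads(dumps({"a": 1}))      # goes through the pickle branch
```

The round trip only exercises the dispatch logic itself; it says nothing about how Spark calls the serializer on the workers.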
When starting the SparkContext, I make sure to specify the custom serializer:
serializer = CustomSerializer()
sc = SparkContext(appName='MyApp', serializer=serializer)
However, I still get this error:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/user/appcache/application_1507044435666_0035/container_1507044435666_0035_01_000002/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/mnt/yarn/usercache/user/appcache/application_1507044435666_0035/container_1507044435666_0035_01_000002/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/user/appcache/application_1507044435666_0035/container_1507044435666_0035_01_000002/pyspark.zip/pyspark/serializers.py", line 272, in dump_stream
    bytes = self.serializer.dumps(vs)
  File "<ipython-input-2-536808351108>", line 14, in dumps
PicklingError: Can't pickle <class 'CustomClass.Location'>: attribute lookup Location failed
What am I missing?
Thanks.