Pickling a Spark RDD and reading it into Python

Asked: 2015-11-19 15:50:29

Tags: python apache-spark pickle pyspark

I am trying to serialize a Spark RDD by pickling it and then read the pickled file directly into Python.


a = sc.parallelize(['1','2','3','4','5'])
a.saveAsPickleFile('test_pkl')

I then copied the test_pkl files to my local machine. How can I read them directly into Python? When I use the normal pickle package, it fails as soon as I try to read the first pickle part of 'test_pkl':

pickle.load(open('part-00000','rb'))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/pickle.py", line 1370, in load
    return Unpickler(file).load()
  File "/usr/lib64/python2.6/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib64/python2.6/pickle.py", line 970, in load_string
    raise ValueError, "insecure string pickle"
ValueError: insecure string pickle

I assume the pickling method Spark uses is different from the Python pickle method (correct me if I am wrong). Is there any way for me to pickle data from Spark and read this pickled object directly into Python from a file?

3 Answers:

Answer 0 (score: 2)

You can use the sparkpickle project. It is as simple as:

import sparkpickle

with open("/path/to/file", "rb") as f:
    print(sparkpickle.load(f))
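
If the directory copied from Spark contains several part files, a minimal sketch along the same lines (the local directory name is hypothetical, and it assumes sparkpickle.load returns the objects stored in one part file, as the snippet above suggests) would be:

import os
import sparkpickle

path = "test_pkl/"  # hypothetical local copy of the saveAsPickleFile output
data = []
for part in sorted(os.listdir(path)):
    if part.startswith("part-"):  # skip metadata files such as "_SUCCESS"
        with open(os.path.join(path, part), "rb") as f:
            data.extend(sparkpickle.load(f))
print(data)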

Answer 1 (score: 1)

A better method might be to pickle the data in each partition, encode it, and write it to a text file (base64 encoding keeps the pickled bytes on a single line, so saveAsTextFile stores each partition as one record):

import cPickle
import base64

def partition_to_encoded_pickle_object(partition):
    p = [i for i in partition] # convert the RDD partition to a list
    p = cPickle.dumps(p, protocol=2) # pickle the list
    return [base64.b64encode(p)] # base64 encode the pickled bytes, and return them in an iterable

my_rdd.mapPartitions(partition_to_encoded_pickle_object).saveAsTextFile("your/hdfs/path/")

After downloading the files to a local directory, you can read them in with the following snippet:

# you first need to download the files; that step is not shown here
import os
import base64
import cPickle

path = "your/local/path/to/downloaded/files/"
data = []
for part in os.listdir(path):
    if part[0] != "_": # this prevents system generated files from getting read - e.g. "_SUCCESS"
        data += cPickle.loads(base64.b64decode(open(os.path.join(path, part), 'rb').read()))

Answer 2 (score: 1)

The problem is that the format is not a pickle file; it is a SequenceFile of pickled objects. The SequenceFile can be opened in Hadoop and Spark environments, but it is not meant to be consumed in Python and uses JVM-based serialization to serialize what is, in this case, a list of strings.
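
If the goal is simply to get the data onto the local machine in a format the plain pickle module understands, one workaround (a minimal sketch, run from a pyspark shell where sc is available; the output file name is hypothetical) is to let Spark read its own SequenceFile and then write an ordinary pickle file:

import cPickle

# Spark knows how to read the SequenceFile that saveAsPickleFile produced
rdd = sc.pickleFile('test_pkl')

# collect on the driver and dump a regular pickle file that plain Python can load
with open('test_pkl_local.pkl', 'wb') as f:
    cPickle.dump(rdd.collect(), f, protocol=2)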