Inspired by this question, I wrote some code to store an RDD (read from a Parquet file, with a schema of (photo_id, data)) as tab-delimited pairs, with the data base64-encoded, like this:
import base64
import cPickle

def do_pipeline(itr):
    ...
    item_id = x.photo_id

def toTabCSVLine(data):
    return '\t'.join(str(d) for d in data)

# (photo_id, data) -> (photo_id, base64-encoded pickled data)
serialize_vec_b64pkl = lambda x: (x[0], base64.b64encode(cPickle.dumps(x[1])))

def format(data):
    return toTabCSVLine(serialize_vec_b64pkl(data))

dataset = sqlContext.read.parquet('mydir')
lines = dataset.map(format)
lines.saveAsTextFile('outdir')
Now, the point of interest: how can I read that dataset back and, for example, print its deserialized data?
I am using Python 2.6.6.
Here is my attempt; just to verify that the whole round trip can be done, I wrote this code:
deserialize_vec_b64pkl = lambda x: (x[0], cPickle.loads(base64.b64decode(x[1])))
base64_dataset = sc.textFile('outdir')
collected_base64_dataset = base64_dataset.collect()
print(deserialize_vec_b64pkl(collected_base64_dataset[0].split('\t')))
Calling collect() is fine for testing, but in a real-world scenario it would be a struggle...
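For a quick look at a single record without pulling the whole RDD to the driver, a minimal sketch (assuming the same outdir and helpers as above) is to use take() instead of collect():

sample = base64_dataset.take(1)  # fetch only one line instead of the full dataset
print(deserialize_vec_b64pkl(sample[0].split('\t')))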
Edit:
When I tried zero323's suggestion:
foo = (base64_dataset.map(str.split).map(deserialize_vec_b64pkl)).collect()
I got this error, which boils down to:
PythonRDD[2] at RDD at PythonRDD.scala:43
16/08/04 18:32:30 WARN TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4, gsta31695.tan.ygrid.yahoo.com): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/grid/0/tmp/yarn-local/usercache/gsamaras/appcache/application_1470212406507_56888/container_e04_1470212406507_56888_01_000009/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/grid/0/tmp/yarn-local/usercache/gsamaras/appcache/application_1470212406507_56888/container_e04_1470212406507_56888_01_000009/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/grid/0/tmp/yarn-local/usercache/gsamaras/appcache/application_1470212406507_56888/container_e04_1470212406507_56888_01_000009/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
UnpicklingError: NEWOBJ class argument has NULL tp_new
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/08/04 18:32:30 ERROR TaskSetManager: Task 12 in stage 0.0 failed 4 times; aborting job
16/08/04 18:32:31 WARN TaskSetManager: Lost task 14.3 in stage 0.0 (TID 38, gsta31695.tan.ygrid.yahoo.com): TaskKilled (killed intentionally)
16/08/04 18:32:31 WARN TaskSetManager: Lost task 13.3 in stage 0.0 (TID 39, gsta31695.tan.ygrid.yahoo.com): TaskKilled (killed intentionally)
16/08/04 18:32:31 WARN TaskSetManager: Lost task 16.3 in stage 0.0 (TID 42, gsta31695.tan.ygrid.yahoo.com): TaskKilled (killed intentionally)
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
/homes/gsamaras/code/read_and_print.py in <module>()
17 print(base64_dataset.map(str.split).map(deserialize_vec_b64pkl))
18
---> 19 foo = (base64_dataset.map(str.split).map(deserialize_vec_b64pkl)).collect()
20 print(foo)
/home/gs/spark/current/python/lib/pyspark.zip/pyspark/rdd.py in collect(self)
769 """
770 with SCCallSiteSync(self.context) as css:
--> 771 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
772 return list(_load_from_socket(port, self._jrdd_deserializer))
773
/home/gs/spark/current/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/home/gs/spark/current/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
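A minimal sketch of a workaround, assuming the failure comes from shipping str.split itself to the workers (in Python 2 it is an unbound method and does not round-trip through the serializer), is to wrap the split in a lambda:

foo = base64_dataset.map(lambda x: x.split('\t')).map(deserialize_vec_b64pkl).collect()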
Answer (score: 2)
Let's try a simple example. For convenience I'll use the handy toolz library, but it isn't really required here.
import sys
import base64

if sys.version_info < (3, ):
    import cPickle as pickle
else:
    import pickle

from toolz.functoolz import compose

rdd = sc.parallelize([(1, {"foo": "bar"}), (2, {"bar": "foo"})])
Now, your code is not exactly portable. In Python 2, base64.b64encode returns str, while in Python 3 it returns bytes. Let's illustrate that:
Python 2
type(base64.b64encode(pickle.dumps({"foo": "bar"})))
## str
Python 3
type(base64.b64encode(pickle.dumps({"foo": "bar"})))
## bytes
Therefore, we'll add decoding to our pipeline:
# Equivalent to
# def pickle_and_b64(x):
#     return base64.b64encode(pickle.dumps(x)).decode("ascii")
pickle_and_b64 = compose(
    lambda x: x.decode("ascii"),
    base64.b64encode,
    pickle.dumps
)
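As a quick sanity check (a sketch, not required for the pipeline), the composed function turns a plain Python object into an ASCII string; the exact payload depends on the Python version and pickle protocol:

pickle_and_b64({"foo": "bar"})
## e.g. u'KGRwMApTJ2ZvbycKcDEKUydiYXInCnAyCnMu' under Python 2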
Note that this doesn't assume any particular shape of the data. Because of that, we'll use mapValues to serialize only the values:
serialized = rdd.mapValues(pickle_and_b64)
serialized.first()
## (1, u'KGRwMApTJ2ZvbycKcDEKUydiYXInCnAyCnMu')
Now we can follow it with simple formatting and save:
from tempfile import mkdtemp
import os
outdir = os.path.join(mkdtemp(), "foo")
serialized.map(lambda x: "{0}\t{1}".format(*x)).saveAsTextFile(outdir)
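As a small sketch to confirm the on-disk layout, each saved line is just the key, a tab, and the base64 payload:

sc.textFile(outdir).first()
## e.g. u'1\tKGRwMApTJ2ZvbycKcDEKUydiYXInCnAyCnMu'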
To read the file back, we reverse the process:
# Equivalent to
# def b64_and_unpickle(x):
#     return pickle.loads(base64.b64decode(x))
b64_and_unpickle = compose(
    pickle.loads,
    base64.b64decode
)
decoded = (sc.textFile(outdir)
           .map(lambda x: x.split("\t"))  # In Python 3 we could simply use str.split
           .mapValues(b64_and_unpickle))
decoded.first()
## (u'1', {'foo': 'bar'})
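Note that after the round trip the keys come back as strings (u'1' instead of 1), since everything went through a plain text file. If the numeric photo_id is needed again, a small sketch of the extra cast:

decoded_int_keys = decoded.map(lambda kv: (int(kv[0]), kv[1]))
decoded_int_keys.first()
## (1, {'foo': 'bar'})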