rdd.collect() fails for a large RDD in PySpark

Posted: 2019-07-17 14:10:03

Tags: pyspark rdd

I run into an error when processing a large RDD in PySpark in the following cases:
a) trying to print the RDD's contents with `.collect()`
b) trying to store the RDD in a NumPy array with `arr = np.array(rdd.collect())`

The same logic works fine for relatively small RDDs.

print(dsmRDD.collect())
arr = np.array(rdd.collect())

Both calls raise the error.

The error looks like:

--- Logging error ---
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37353)
Traceback (most recent call last):
  File "/home/sukriti/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-5a79f087d967>", line 76, in <module>
    print(xySPT.collect())
  File "/home/sukriti/spark/python/pyspark/rdd.py", line 834, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/home/sukriti/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/sukriti/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/sukriti/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: <unprintable Py4JJavaError object>

0 answers:

There are no answers yet.