I am filtering a DataFrame in a loop like this (where calculations and statuses are arrays of strings):
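result_df = None
for c in calculations:
    for s in statuses:
        # Keep the filtered result: DataFrame.filter returns a new DataFrame
        filtered_df = df.filter(f"""
            ...
        """)
        if result_df is None:
            result_df = filtered_df
        else:
            result_df = result_df.union(filtered_df)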
My code works with a small amount of data, but with a large amount of data I see the following error in stdout:
py4j.protocol.Py4JJavaErrorERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
: <exception str() failed>
This appears to be caused by the following error, which I see in stderr:
java.lang.StackOverflowError
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:14)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
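I have seen some posts saying this may be caused by recursion in Spark, but I don't know how to fix it. What code or configuration do I specifically need to change to resolve this? Is there a better way to do this kind of filtering? Do I need to call result_df.cache() somewhere? Do I need to call sc.setCheckpointDir("/")? If so, is there anything else I need to do to make checkpointing work? Thanks.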
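On the "better way to filter" question, one direction that might sidestep the long chain of unions is to combine the per-pair conditions into a single predicate and filter once. The following is only a sketch under assumptions: it presumes every filter runs against the same DataFrame and can be written as a SQL predicate string, and the column names `calculation` and `status`, the helper names, and the sample values are hypothetical, since the real filter body is elided above.

```python
from itertools import product


def build_combined_predicate(calculations, statuses, make_predicate):
    """Join one SQL predicate per (calculation, status) pair with OR."""
    predicates = [make_predicate(c, s) for c, s in product(calculations, statuses)]
    if not predicates:
        # No pairs at all: return a predicate that matches nothing.
        return "1 = 0"
    return " OR ".join(f"({p})" for p in predicates)


def example_predicate(c, s):
    # Hypothetical columns; substitute the real (elided) filter body here.
    # Values are interpolated without escaping, so this assumes they
    # contain no single quotes.
    return f"calculation = '{c}' AND status = '{s}'"


combined = build_combined_predicate(["a", "b"], ["x"], example_predicate)
print(combined)
```

With a real DataFrame the result would be used as a single call, `result_df = df.filter(combined)`, so the plan holds one filter rather than one union per pair. Note the semantics differ slightly from the loop: `union` keeps a row once per matching pair, whereas a single OR filter returns each row once.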
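Also, I tried increasing the executor and driver memory, but nothing changed. I have also read this post here, but the answer does not explain how to use checkpointing beyond the initial setup call.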
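For reference, here is a sketch of what using checkpointing beyond the setup call might look like if the unions are kept. It assumes the installed Spark version provides `DataFrame.checkpoint()` and that a checkpoint directory has been set first. The helper name `union_with_checkpoints`, the interval `every=20`, and the directory path in the comment are assumptions chosen for illustration, not tuned or confirmed values, and whether this removes the StackOverflowError here is untested.

```python
def union_with_checkpoints(frames, every=20):
    """Union DataFrames one by one, checkpointing every `every` frames.

    `frames` is any iterable of objects exposing `union` and `checkpoint`,
    as pyspark.sql.DataFrame does. `every` is an arbitrary, untuned value.
    """
    result = None
    for count, frame in enumerate(frames, start=1):
        result = frame if result is None else result.union(frame)
        if count % every == 0:
            # checkpoint() returns a new DataFrame whose plan no longer
            # carries the chain of unions built up so far.
            result = result.checkpoint()
    return result


# Intended use with PySpark (not executed here; the path is a placeholder
# and on a cluster would normally be a shared location such as HDFS):
#   spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
#   result_df = union_with_checkpoints(filtered_frames)
```

The point of reassigning `result` is that `checkpoint()` does not modify the DataFrame in place; the returned DataFrame is the one with the truncated lineage, so later unions have to build on it.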