Spark: filtering in a static loop causes java.lang.StackOverflowError

Asked: 2019-11-22 13:50:49

Tags: python apache-spark pyspark apache-spark-sql

I am filtering a DataFrame in a loop like this (where calculations and statuses are arrays of strings):

result_df = None
for c in calculations:
    for s in statuses:
        # DataFrame.filter returns a new DataFrame, so keep the result.
        filtered_df = df.filter(f"""
            ...
        """)
        if result_df is None:
            result_df = filtered_df
        else:
            result_df = result_df.union(filtered_df)

My code works with a small amount of data, but with a large amount of data I see the following error in stdout:

py4j.protocol.Py4JJavaErrorERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
: <exception str() failed>

This appears to be caused by the following error, which I see in stderr:

java.lang.StackOverflowError
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:14)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:32)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:46)
    at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:39)

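I have seen some posts saying this may be caused by recursion inside Spark, but I don't know how to fix it. What specifically do I need to change in my code or configuration? Is there a better way to do this kind of filtering? Do I need to call result_df.cache() somewhere? Do I need to call sc.setCheckpointDir("/")? If so, is there anything else I need to do to make checkpointing work? Thanks.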
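One direction that might be worth trying, sketched here under assumptions: a left-to-right chain of n unions nests n levels deep, and deep nesting is the usual suspect when Spark overflows the stack while walking a plan. Unioning the pieces pairwise keeps the nesting near log2(n). The helper name `union_balanced` is hypothetical; it relies only on `DataFrame.union`, and whether it is enough for this particular trace is not confirmed.

```python
def union_balanced(frames):
    """Union a non-empty list of DataFrames pairwise, as a balanced tree.

    A left-to-right chain of n unions nests n levels deep; pairing
    neighbours round by round keeps the nesting near log2(n).
    Assumes every element exposes DataFrame.union(other).
    """
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one DataFrame")
    while len(frames) > 1:
        paired = []
        for i in range(0, len(frames) - 1, 2):
            paired.append(frames[i].union(frames[i + 1]))
        if len(frames) % 2 == 1:
            # The odd one out is carried into the next round unchanged.
            paired.append(frames[-1])
        frames = paired
    return frames[0]
```

To use it, the loop would append each filtered DataFrame to a list instead of unioning immediately, and `union_balanced` would be called once on that list after the loop.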

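I also tried increasing executor and driver memory, but nothing changed. I have also read this post here, but the answer does not explain how to use checkpointing beyond the initial setup call.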
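On the checkpointing question, a minimal sketch, not a confirmed fix: setting the checkpoint directory alone does nothing until `DataFrame.checkpoint()` is called, and it is the returned DataFrame that carries the truncated plan, so it has to replace the old reference. The helper name `union_with_checkpoints` and the interval of 10 are hypothetical choices for illustration.

```python
def union_with_checkpoints(frames, every=10):
    """Union DataFrames left to right, checkpointing after every `every` inputs.

    DataFrame.checkpoint() materializes the data and returns a new
    DataFrame whose plan no longer references the earlier steps.
    Assumes sc.setCheckpointDir(...) was called beforehand with a
    directory the cluster can write to.
    """
    result = None
    for count, frame in enumerate(frames, start=1):
        result = frame if result is None else result.union(frame)
        if count % every == 0:
            # Keep the returned DataFrame; the old one retains its full lineage.
            result = result.checkpoint()
    return result
```

Each checkpoint triggers a job and writes data out, so a smaller interval trades extra I/O for a shallower plan. `DataFrame.localCheckpoint()` is an alternative that skips the checkpoint directory, at the cost of being less fault tolerant.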

0 answers:

No answers yet.