Spark Thrift Server loads the full dataset into memory before transferring it over JDBC

Date: 2018-11-01 08:37:48

Tags: apache-spark spark-thriftserver

The Spark Thrift Server tries to load the full dataset into memory before transferring it over JDBC, and on the JDBC client I get this error:

SQL Error: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 48 tasks (XX GB) is bigger than spark.driver.maxResultSize (XX GB)

The query is: select * from table. Is it possible to enable something like a streaming mode for the Thrift Server? The main goal is to give Pentaho ETL access to the Hadoop cluster through Spark SQL over a JDBC connection, but if the Thrift Server has to load the full dataset into memory before transferring it, this approach won't work.
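
For context, the access path from Pentaho is an ordinary Hive JDBC connection against the Thrift Server. A minimal sketch of such a client is below; the host, port, table name, and credentials are placeholders, and it assumes the Hive JDBC driver (org.apache.hive:hive-jdbc) is on the classpath:

    import java.sql.DriverManager

    object ThriftClientSketch {
      def main(args: Array[String]): Unit = {
        // Placeholder URL; the Thrift Server listens on port 10000 by default.
        val url = "jdbc:hive2://thrift-server-host:10000/default"
        val conn = DriverManager.getConnection(url, "user", "")
        try {
          val stmt = conn.createStatement()
          // The query from the question; "my_table" is a placeholder name.
          val rs = stmt.executeQuery("SELECT * FROM my_table")
          while (rs.next()) {
            // Rows are consumed one at a time on the client side.
            println(rs.getString(1))
          }
        } finally {
          conn.close()
        }
      }
    }

The client itself iterates over the result set row by row; the problem is that the server collects the whole result to the driver before it starts serving those rows.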

2 answers:

Answer 0 (score: 0)

Solution: set spark.sql.thriftServer.incrementalCollect = true
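
This is a server-side setting, so it needs to be in place when the Thrift Server is started, for example via --conf or spark-defaults.conf (paths below assume a standard Spark distribution):

    # Pass the flag when starting the Thrift Server ...
    $SPARK_HOME/sbin/start-thriftserver.sh \
      --conf spark.sql.thriftServer.incrementalCollect=true

    # ... or set it permanently in conf/spark-defaults.conf
    spark.sql.thriftServer.incrementalCollect  true

With incremental collect the server fetches results partition by partition instead of collecting the entire result set to the driver at once, which trades some transfer speed for a much smaller driver memory footprint.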

Answer 1 (score: 0)

In my case the fix was to increase the Spark driver memory and the maximum result size, i.e. spark.driver.memory = xG and spark.driver.maxResultSize = xG.
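
Both settings apply to the Thrift Server process itself (it acts as the Spark driver), so they are passed at start time; the 8g values below are placeholders to be sized to the actual result set:

    $SPARK_HOME/sbin/start-thriftserver.sh \
      --conf spark.driver.memory=8g \
      --conf spark.driver.maxResultSize=8g

Note that this only raises the ceiling: the full result is still collected into driver memory, so for truly large extracts the incremental-collect approach from the other answer scales better.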