Question

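For the past few days I have been getting errors when saving a Dataset in Parquet format. The failing step runs correctly for several tenants in our system, but for one tenant it always fails, and I don't know why.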

When the process fails, it generates a java_pid######.hprof file. I analyzed that file with Eclipse Memory Analyzer, and two classes account for most of the memory:

org.apache.spark.sql.catalyst.InternalRow (50%)
org.apache.spark.sql.catalyst.plans.logical.Project (28%)

Below is a screenshot taken from Eclipse Memory Analyzer.

Configuration:

Apache Spark 2.1.1, running in standalone mode
Job Server 0.8.0
JOBSERVER_MEMORY = 4G
Memory per node = 10G

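This is the first time I have analyzed an hprof file, and I'm not sure how to interpret the results. The code snippet where the error occurs is below: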

    Dataset<Row> dsCube = sqlContext.sql(cube_view_query);
    /* I'm doing this loop because some entities are imported from a CSV file
       and inferSchema doesn't work correctly for them. This is probably why
       org.apache.spark.sql.catalyst.plans.logical.Project takes so much memory,
       since in this case the loop has about 400 iterations. */
    for (ColumnMetadata columnMetadata : columnsMetadata) {
        // Replace each column with a copy cast to the type declared in the metadata.
        dsCube = dsCube
                .withColumn(columnMetadata.getField(), dsCube.col(columnMetadata.getField())
                        .cast(columnMetadata.GetSparkDataType()));
    }
    dsCube
            .write()
            .mode(saveMode)
            .parquet(parquetLocation);
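For reference, each withColumn call adds another projection on top of the previous plan, so a loop of about 400 iterations builds a deeply nested plan. One thing that might be worth trying is to apply all the casts in a single projection instead. The following is only a minimal sketch of that idea, not a confirmed fix for this error. The class CastExpressions, the build helper, and the column names and type names in main are hypothetical; the sketch assumes column names contain no backticks and that the target types can be written as Spark SQL type names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CastExpressions {

    /**
     * Builds one "CAST(`name` AS type) AS `name`" expression per column.
     * columnTypes is a hypothetical map: column name -> Spark SQL type name.
     * Assumes column names contain no backticks.
     */
    public static String[] build(Map<String, String> columnTypes) {
        List<String> exprs = new ArrayList<>();
        for (Map.Entry<String, String> entry : columnTypes.entrySet()) {
            String quoted = "`" + entry.getKey() + "`";
            exprs.add("CAST(" + quoted + " AS " + entry.getValue() + ") AS " + quoted);
        }
        return exprs.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical columns; LinkedHashMap keeps the insertion order.
        Map<String, String> columnTypes = new LinkedHashMap<>();
        columnTypes.put("user_id", "bigint");
        columnTypes.put("signup_date", "date");

        for (String expr : build(columnTypes)) {
            System.out.println(expr);
        }

        // With Spark on the classpath, all casts could then be applied in one step:
        //   Dataset<Row> casted = dsCube.selectExpr(build(columnTypes));
    }
}
```

Note that selectExpr keeps only the columns that are listed, so every column that should reach the Parquet file needs an entry in the map, including those that require no cast.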

The query being executed is:

    SELECT
    [ABOUT 400 COLUMNS]
    FROM
    ((`users` `u`
    LEFT JOIN `user_profiles` `up` ON (((`u`.`tenant_id` = `up`.`tenant_id`)
        AND (`u`.`user_id` = `up`.`user_id`)
        AND (`u`.`user_domain` = `up`.`user_domain`))))
    LEFT JOIN `user_custom_profiles` `ucp` ON (((`u`.`tenant_id` = `ucp`.`tenant_id`)
        AND (`u`.`user_domain` = `ucp`.`user_domain`)
        AND (`u`.`user_id` = `ucp`.`user_id`))))

Is org.apache.spark.sql.catalyst.InternalRow taking up so much memory because Spark keeps the entire dataset in memory?

Can anyone point me in the right direction?


0 answers: