Question

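I am trying to use Spark to join a DataFrame (say 100 records) with an ORC file that holds 100 million records (this may grow to 400-500 million, at about 25 bytes per record). The ORC file was also created with the Spark hiveContext API.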

ORC file creation code

//fsdtRdd is JavaRDD, fsdtSchema is StructType schema
DataFrame fsdtDf = hiveContext.createDataFrame(fsdtRdd,fsdtSchema);
fsdtDf.write().mode(SaveMode.Overwrite).orc("orcFileToRead");

ORC file reading code

HiveContext hiveContext = new HiveContext(sparkContext);
DataFrame orcFileData= hiveContext.read().orc("orcFileToRead");
// allRecords is a DataFrame
// Join type corrected to "left_outer"; Spark does not accept "left_outer_join" as a join type
DataFrame processDf = allRecords.join(orcFileData, allRecords.col("id").equalTo(orcFileData.col("id").as("ID")), "left_outer");
processDf.show();

Spark log (running locally)

Input split: file:/C:/spark/orcFileToRead/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc:0+3163348
min key = null, max key = null
Reading ORC rows from file:/C:/spark/orcFileToRead/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc with {include: [true, true, true], offset: 0, length: 9223372036854775807}
Finished task 55.0 in stage 2.0 (TID 59). 2455 bytes result sent to driver
Starting task 56.0 in stage 2.0 (TID 60, localhost, partition 56,PROCESS_LOCAL, 2220 bytes)
Finished task 55.0 in stage 2.0 (TID 59) in 5846 ms on localhost (56/84)
Running task 56.0 in stage 2.0 (TID 60)

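Although the Spark job completes successfully, I think it is not able to make use of the ORC index feature, and therefore scans the entire block of ORC data before moving on.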
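
To make the expectation concrete, here is a minimal plain-Java sketch of the kind of skipping I have in mind. It is not Spark or ORC code: `StripeStats`, `stripesToRead`, and the id ranges are hypothetical names and values made up for illustration. ORC keeps min/max statistics for column values, and a reader that is given a predicate on `id` can skip chunks whose range cannot match. As far as I understand, in Spark this only applies to filter predicates pushed down to the ORC reader (the `spark.sql.orc.filterPushdown` setting), and a join condition on its own is not such a predicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only: mimics how min/max statistics let a reader skip
// chunks of a file. It does not use the real ORC or Spark APIs.
class StripeSkippingSketch {

    // Hypothetical stand-in for the statistics kept for the "id" column of one stripe.
    static final class StripeStats {
        final int stripeIndex;
        final long minId;
        final long maxId;

        StripeStats(int stripeIndex, long minId, long maxId) {
            this.stripeIndex = stripeIndex;
            this.minId = minId;
            this.maxId = maxId;
        }

        // A stripe can contain a wanted id only if that id lies within [minId, maxId].
        boolean mightContain(long id) {
            return id >= minId && id <= maxId;
        }
    }

    // Returns the indexes of the stripes that must be read for the given ids;
    // every other stripe can be skipped without touching its rows.
    static List<Integer> stripesToRead(List<StripeStats> stripes, long[] wantedIds) {
        List<Integer> result = new ArrayList<>();
        for (StripeStats stripe : stripes) {
            for (long id : wantedIds) {
                if (stripe.mightContain(id)) {
                    result.add(stripe.stripeIndex);
                    break; // one possible match is enough to require reading this stripe
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed layout: data written sorted by id, so stripe ranges do not overlap.
        List<StripeStats> stripes = Arrays.asList(
                new StripeStats(0, 1L, 1000L),
                new StripeStats(1, 1001L, 2000L),
                new StripeStats(2, 2001L, 3000L));
        long[] idsFromSmallSide = {42L, 2500L};
        System.out.println(stripesToRead(stripes, idsFromSmallSide)); // prints [0, 2]
    }
}
```

In this sketch only stripes 0 and 2 would be read for the ids 42 and 2500. The sketch also shows why the layout matters: if the ids were spread randomly across all stripes, every min/max range would cover most values and almost nothing could be skipped.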

Questions

- Is this normal behavior, or do I have to set some configuration before saving the data in ORC format?

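- If it is NORMAL, what is the best way to do the join so that non-matching records are discarded at the disk level (perhaps so that only the index data of the ORC file is loaded)?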

Spark log "min key = null, max key = null" while reading ORC file

0 answers: