How do I use Spark ORC indexes?

Time: 2017-10-29 21:09:36

Tags: apache-spark orc

What is the option for enabling ORC indexing from Spark?

.option("index", uid)

I am making that option up; it is just where I assume I would have to put something in order to get ORC to index the column "user_id".

2 answers:

Answer 0: (score: 2)

Have you tried .partitionBy("user_id")?

 df
   .write()
   .option("mode", "DROPMALFORMED")
   .option("compression", "snappy")
   .mode("overwrite")
   .format("orc")
   .partitionBy("user_id")
   .save(...)
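
Partitioning by user_id is not the same as ORC's built-in stripe/row-group indexes, but it lets Spark prune entire user_id=... directories when you filter on that column. Below is a minimal sketch (not from the answer) of reading the data back; the path /data/events_orc and the example user id are assumptions for illustration.

 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.functions.col

 object ReadPartitionedOrc {
   def main(args: Array[String]): Unit = {
     val spark = SparkSession.builder()
       .appName("read-partitioned-orc")
       .getOrCreate()

     // Hypothetical path: wherever the partitioned ORC data was saved above.
     val events = spark.read.format("orc").load("/data/events_orc")

     // Filtering on the partition column lets Spark list and read only the
     // matching user_id=... directories instead of scanning every file.
     events.filter(col("user_id") === "12345").show()

     spark.stop()
   }
 }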

Answer 1: (score: 0)

According to the original blog post about bringing ORC support into Apache Spark, there is a configuration knob you can turn on in your Spark context to enable ORC indexes.

# enable filters in ORC
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
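
As a hedged sketch of how this is used end to end (the path, app name, and filter value below are assumptions for illustration, not from the answer): with the knob enabled, a predicate on the indexed column can be pushed down to the ORC reader, which skips stripes and row groups whose min/max statistics rule out the value. You can check the physical plan with explain() to see the pushed filters.

 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.functions.col

 object OrcFilterPushdown {
   def main(args: Array[String]): Unit = {
     val spark = SparkSession.builder()
       .appName("orc-filter-pushdown")
       // Same knob as above, set via the session builder instead of sqlContext.
       .config("spark.sql.orc.filterPushdown", "true")
       .getOrCreate()

     // Hypothetical path to the ORC data.
     val events = spark.read.format("orc").load("/data/events_orc")

     // With pushdown enabled, this predicate is handed to the ORC reader so it
     // can skip data that cannot contain user_id = "12345".
     val oneUser = events.filter(col("user_id") === "12345")
     oneUser.explain() // physical plan should list the predicate under PushedFilters
     oneUser.show()

     spark.stop()
   }
 }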

Reference: https://databricks.com/blog/2015/07/16/joint-blog-post-bringing-orc-support-into-apache-spark.html