How do I use Spark ORC indexes?

Time: 2017-10-29 21:09:36

Tags: apache-spark orc

What is the option for enabling ORC indexing from Spark?

.option("index", uid)

I am making that option up; it is just where I assume I would have to put something in order to get ORC to index the column "user_id".

2 answers:

Answer 0: (score: 2)

Have you tried .partitionBy("user_id")?

 df
   .write()
   .option("mode", "DROPMALFORMED")
   .option("compression", "snappy")
   .mode("overwrite")
   .format("orc")
   .partitionBy("user_id")
   .save(...)
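
Partitioning by user_id is not the same as ORC's built-in stripe/row-group indexes, but it lets Spark prune entire user_id=... directories when you filter on that column. Below is a minimal sketch (not from the answer) of reading the data back; the path /data/events_orc and the example user id are assumptions for illustration.

 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.functions.col

 object ReadPartitionedOrc {
   def main(args: Array[String]): Unit = {
     val spark = SparkSession.builder()
       .appName("read-partitioned-orc")
       .getOrCreate()

     // Hypothetical path: wherever the partitioned ORC data was saved above.
     val events = spark.read.format("orc").load("/data/events_orc")

     // Filtering on the partition column lets Spark list and read only the
     // matching user_id=... directories instead of scanning every file.
     events.filter(col("user_id") === "12345").show()

     spark.stop()
   }
 }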

Answer 1: (score: 0)

According to the original blog post about bringing ORC support into Apache Spark, there is a configuration knob you can turn on in your Spark context to enable ORC indexes.

# enable filters in ORC
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
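
As a hedged sketch of how this is used end to end (the path, app name, and filter value below are assumptions for illustration, not from the answer): with the knob enabled, a predicate on the indexed column can be pushed down to the ORC reader, which skips stripes and row groups whose min/max statistics rule out the value. You can check the physical plan with explain() to see the pushed filters.

 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.functions.col

 object OrcFilterPushdown {
   def main(args: Array[String]): Unit = {
     val spark = SparkSession.builder()
       .appName("orc-filter-pushdown")
       // Same knob as above, set via the session builder instead of sqlContext.
       .config("spark.sql.orc.filterPushdown", "true")
       .getOrCreate()

     // Hypothetical path to the ORC data.
     val events = spark.read.format("orc").load("/data/events_orc")

     // With pushdown enabled, this predicate is handed to the ORC reader so it
     // can skip data that cannot contain user_id = "12345".
     val oneUser = events.filter(col("user_id") === "12345")
     oneUser.explain() // physical plan should list the predicate under PushedFilters
     oneUser.show()

     spark.stop()
   }
 }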

Reference: https://databricks.com/blog/2015/07/16/joint-blog-post-bringing-orc-support-into-apache-spark.html