What is the option to enable ORC indexing from Spark?
.option("index", uid)
I am making that up.
What do I have to put there to index the column "user_id" in ORC?
Answer 0 (score: 2)
Have you tried .partitionBy("user_id")?
df
  .write()
  .option("mode", "DROPMALFORMED")  // note: a read-side parse mode; the ORC writer ignores it
  .option("compression", "snappy")  // write snappy-compressed ORC files
  .mode("overwrite")
  .format("orc")
  .partitionBy("user_id")           // one output directory per user_id value
  .save(...)
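Strictly speaking, partitionBy gives you partition pruning rather than an ORC-level index, but it serves the same goal: a later read that filters on user_id only scans the matching directories. A minimal read-side sketch, assuming data written as above (the path and the filter value are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("orc-partition-read").getOrCreate()

// Because the data was written with partitionBy("user_id"), Spark resolves
// this filter by partition pruning: only the user_id=42 directory is scanned.
val events = spark.read
  .format("orc")
  .load("/tmp/events_orc")          // hypothetical path for the write above
  .filter(col("user_id") === 42)    // hypothetical user id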
Answer 1 (score: 0)
According to the original blog post on bringing ORC support into Apache Spark, there is a configuration knob you can turn on in your Spark context to enable ORC indexes.
# enable filters in ORC
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
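With that flag on, filters applied at read time are pushed into the ORC reader, which checks them against the min/max statistics ORC keeps per stripe and row group and skips data that cannot match. A short sketch using the newer SparkSession API (the path and predicate are invented for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("orc-filter-pushdown").getOrCreate()

// The same knob as above, set through the SparkSession config.
spark.conf.set("spark.sql.orc.filterPushdown", "true")

// The user_id predicate is pushed into the ORC reader, which consults the
// file's built-in indexes and skips stripes/row groups that cannot contain 42.
val hits = spark.read
  .format("orc")
  .load("/tmp/events_orc")          // hypothetical path
  .filter(col("user_id") === 42)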
Reference: https://databricks.com/blog/2015/07/16/joint-blog-post-bringing-orc-support-into-apache-spark.html