Question

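I have a Hive table that consists of many small Parquet files, and I am creating a Spark DataFrame from it to do some processing with SparkSQL. Because there are so many splits/files, my Spark job creates a lot of tasks, which I don't want. Basically I want the same functionality Hive provides: combining these small input splits into larger ones by specifying a maximum split size setting. How can I achieve this with Spark? I tried the coalesce function, but with it I can only specify the number of partitions (so I can only control the number of output files). What I really want is control over the (combined) input split size that each task processes.

Edit: I am using Spark itself, not Hive on Spark.

Edit 2: Here is my current code: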

// needed for the $"column" syntax
import sqlContext.implicits._

// create a data frame from a test table
val df = sqlContext.table("schema.test_table").filter($"my_partition_column" === "12345")

// coalesce it to a fixed number of partitions. But as I said in my question,
// with coalesce I cannot control the file sizes, I can only specify
// the number of partitions
df.coalesce(8).write.mode(org.apache.spark.sql.SaveMode.Overwrite)
  .insertInto("schema.test_table")

Answer 1

I have not tried it myself, but according to the Getting Started guide, setting the property "hive.merge.sparkfiles=true" should work:

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

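If you are using Spark on Hive, Spark's abstraction does not provide an explicit way to split the data. However, there are several ways to control the parallelism.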

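You can use DataFrame.repartition(numPartitions: Int) to control the number of partitions explicitly.
If you are using a Hive Context, make sure hive-site.xml specifies a combine input format (such as Hive's CombineHiveInputFormat). That may help.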
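
Since repartition and coalesce only accept a partition count, one way to approximate a target split size is to derive that count from the total input size. The following is a minimal sketch, not part of the original answer: the object name PartitionSizing and the function partitionsFor are hypothetical, and it assumes you already know the total input size in bytes (for example from the file system).

```scala
// Hypothetical helper (not a Spark API): turns a desired bytes-per-partition
// target into the partition count that repartition/coalesce expect.
object PartitionSizing {

  /** Smallest number of partitions such that each one holds at most
    * `targetBytes` of input on average. Always returns at least 1. */
  def partitionsFor(totalBytes: Long, targetBytes: Long): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(targetBytes > 0, "targetBytes must be positive")
    // Ceiling division written without an addition that could overflow.
    val full = totalBytes / targetBytes
    val extra = if (totalBytes % targetBytes == 0) 0L else 1L
    // Clamp into the Int range accepted by repartition/coalesce.
    math.max(1L, math.min(full + extra, Int.MaxValue.toLong)).toInt
  }
}
```

With a DataFrame df and a known input size totalBytes, the call would look like df.coalesce(PartitionSizing.partitionsFor(totalBytes, 128L * 1024 * 1024)). This only approximates a split size, because the partitions are balanced by count rather than by bytes.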

For more details, see the Spark documentation on data parallelism: http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism

How can I combine small Parquet files with Spark?
