I have two DataFrames bucketed on the same column.
scala> (1 to 10).map(i => (i, "element"+i))
res21: scala.collection.immutable.IndexedSeq[(Int, String)] = Vector((1,element1), (2,element2), (3,element3), (4,element4), (5,element5), (6,element6), (7,element7), (8,element8), (9,element9), (10,element10))
scala> spark.createDataFrame(res21).toDF("a", "b")
res22: org.apache.spark.sql.DataFrame = [a: int, b: string]
scala> res22.write.bucketBy(2, "a").saveAsTable("tab1")
17/10/17 23:07:50 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`tab1` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
scala> res22.write.bucketBy(2, "a").saveAsTable("tab2")
17/10/17 23:07:54 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`tab2` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
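Despite the HiveExternalCatalog warning, the bucketing spec is still stored for Spark's own use; as a quick check (not part of the original session), it should show up under the detailed table information, e.g. with something like:
scala> spark.sql("DESCRIBE EXTENDED tab1").show(100, false)
The output should list the number of buckets (2) and the bucket column (a).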
If I perform a union of these DataFrames, Spark can no longer avoid the shuffle.
scala> spark.table("tab1").union(spark.table("tab2")).groupBy("a").count().explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#149], functions=[count(1)], output=[a#149, count#166L])
+- Exchange hashpartitioning(a#149, 200)
   +- *HashAggregate(keys=[a#149], functions=[partial_count(1)], output=[a#149, count#172L])
      +- Union
         :- *FileScan parquet default.tab1[a#149] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
         +- *FileScan parquet default.tab2[a#154] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
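For comparison (an untested sketch against the same tables), the same aggregation over a single bucketed table alone should not need the Exchange node, because the table's hash partitioning on a already satisfies the aggregation's required distribution:
scala> spark.table("tab1").groupBy("a").count().explain(true)
The shuffle only reappears once the Union is introduced.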
Is there a workaround?