Question

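I am trying to join two large Spark dataframes using Scala, and I can't get it to perform well. I really hope someone can help me out.

I have the following two text files:

dfPerson.txt (PersonId: String, GroupId: String), 2 million rows (100 MB)
dfWorld.txt (PersonId: String, GroupId: String, PersonCharacteristic: String), 30 billion rows (1 TB)

First I parse the text files to Parquet and partition on GroupId, which has 50 distinct values plus one remainder group.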

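// Name the columns explicitly; this assumes the files have no header row.
val dfPerson = spark.read.csv("input/dfPerson.txt").toDF("PersonId", "GroupId")
dfPerson.write.partitionBy("GroupId").parquet("output/dfPerson")

val dfWorld = spark.read.csv("input/dfWorld.txt")
  .toDF("PersonId", "GroupId", "PersonCharacteristic")
dfWorld.write.partitionBy("GroupId").parquet("output/dfWorld")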

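Note: a GroupId can contain anywhere from 1 to 6 billion PersonIds, so because of this skew it might not be the best partition column, but it is all I could think of.

Next, I read the Parquet files and join them. I tried the following approaches:

Method 1: Basic Spark join operation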

val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")
dfWorld.as("w").join(
    dfPerson.as("p"),
    $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
    "right"
  )
  .drop($"w.GroupId")
  .drop($"w.PersonId")

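This doesn't perform well and shuffles over 1 TB of data.

Method 2: Broadcast hash join

Since dfPerson might be small enough to fit in memory, I thought this approach could solve my problem.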

val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")
dfWorld.as("w").join(
    broadcast(dfPerson).as("p"),
    $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
    "right"
  )
  .drop($"w.GroupId")
  .drop($"w.PersonId")

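This doesn't perform well either and also shuffles over 1 TB of data, which makes me believe the broadcast did not take effect?

Method 3: Bucket and sort the dataframes

I first bucket and sort the dataframes before writing them to Parquet, and then join them: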

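// Name the columns explicitly; this assumes the file has no header row.
val dfPersonInput = spark.read.csv("input/dfPerson.txt").toDF("PersonId", "GroupId")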
dfPersonInput
  .write
  .format("parquet")
  .partitionBy("GroupId")
  .bucketBy(4,"PersonId")
  .sortBy("PersonId")
  .mode("overwrite")
  .option("path", "output/dfPerson")
  .saveAsTable("dfPerson")
val dfPerson = spark.table("dfPerson")

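// Name the columns explicitly; this assumes the file has no header row.
val dfWorldInput = spark.read.csv("input/dfWorld.txt")
  .toDF("PersonId", "GroupId", "PersonCharacteristic")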
dfWorldInput
  .write
  .format("parquet")
  .partitionBy("GroupId")
  .bucketBy(4,"PersonId")
  .sortBy("PersonId")
  .mode("overwrite")
  .option("path", "output/dfWorld")
  .saveAsTable("dfWorld")
val dfWorld = spark.table("dfWorld")

dfWorld.as("w").join(
    dfPerson.as("p"),
    $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
    "right"
  )
  .drop($"w.GroupId")
  .drop($"w.PersonId")

This gives the following execution plan:

== Physical Plan ==
*(5) Project [PersonId#743]
+- SortMergeJoin [GroupId#73, PersonId#71], [GroupId#745, PersonId#743], RightOuter
   :- *(2) Sort [GroupId#73 ASC NULLS FIRST, PersonId#71 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(GroupId#73, PersonId#71, 200)
   :     +- *(1) Project [PersonId#71, PersonCharacteristic#72, GroupId#73]
   :        +- *(1) Filter isnotnull(PersonId#71)
   :           +- *(1) FileScan parquet default.dfWorld[PersonId#71,PersonCharacteristic#72,GroupId#73] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/F:/Output/dfWorld..., PartitionCount: 52, PartitionFilters: [isnotnull(GroupId#73)], PushedFilters: [IsNotNull(PersonId)], ReadSchema: struct<PersonId:string,PersonCharacteristic:string>, SelectedBucketsCount: 4 out of 4
   +- *(4) Sort [GroupId#745 ASC NULLS FIRST, PersonId#743 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(GroupId#745, PersonId#743, 200)
         +- *(3) FileScan parquet default.dfPerson[PersonId#743,GroupId#745] Batched: true, Format: Parquet, Location: CatalogFileIndex[file:/F:/Output/dfPerson], PartitionCount: 45, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<PersonId:string,GroupId:string>, SelectedBucketsCount: 4 out of 4

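This doesn't perform well either.

Summary

All approaches would take approximately 150-200 hours (estimated from the progress of stages and tasks in the Spark jobs after 24 hours) and follow this strategy: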

DAG visualization

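I guess I am missing something with the partitioning, the bucketing, the sorting of the Parquet files, or all of them.

Any help would be greatly appreciated.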

Answer 1

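What is the goal you are trying to achieve? Why do you need the join at all?

Joining for the sake of joining won't get you anywhere, unless you have enough memory/disk space to collect 1TB x 100MB of data.

Edit based on reply

If you only need the records related to people that appear in dfPerson, then you don't need a right/left join; an inner join is what you want.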

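Automatic broadcasting only happens if your DF is smaller than the broadcast threshold in Spark (spark.sql.autoBroadcastJoinThreshold, 10 MB by default); a larger DF is not broadcast automatically. An explicit broadcast() hint cannot be honored either when it sits on the right-hand side of a right outer join, as in your method 2, because Spark can only broadcast the left side for that join type. With an inner join the hint can take effect.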
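To check whether a broadcast actually ends up in the plan, something along these lines may help. It is only a minimal sketch, written as a Scala script with Spark on the classpath: the local session, the app name, the toy data and the `usesBroadcast` helper are all assumptions made up for illustration. With size-based broadcasting switched off, one would expect the hinted inner join to plan a broadcast hash join and the hinted right outer join not to.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("broadcast-check") // hypothetical app name
  .config("spark.sql.adaptive.enabled", "false") // keep the plan static for inspection
  .getOrCreate()
import spark.implicits._

// Tiny stand-ins for the real tables (hypothetical data).
val dfPerson = Seq(("p1", "g1"), ("p2", "g2")).toDF("PersonId", "GroupId")
val dfWorld = Seq(("p1", "g1", "tall"), ("p3", "g1", "short"))
  .toDF("PersonId", "GroupId", "PersonCharacteristic")

// Disable size-based broadcasting so that only the explicit hint matters here.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

// Hypothetical helper: does the physical plan contain a broadcast hash join?
def usesBroadcast(df: DataFrame): Boolean =
  df.queryExecution.executedPlan.toString.contains("BroadcastHashJoin")

val cond = $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId"

// Same hint, two join types: the hinted table is the right-hand side in both.
val rightJoined = dfWorld.as("w").join(broadcast(dfPerson).as("p"), cond, "right")
val innerJoined = dfWorld.as("w").join(broadcast(dfPerson).as("p"), cond, "inner")

println(s"right outer uses broadcast: ${usesBroadcast(rightJoined)}")
println(s"inner uses broadcast: ${usesBroadcast(innerJoined)}")
```

On the real tables, calling `explain()` on the joined dataframe shows the same information without running the job.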

dfPerson.as("p").join(
    dfWorld.select(
        $"GroupId", $"PersonId", 
        $"<feature1YouNeed>", $"<feature2YouNeed>" 
    ).as("w"), 
    Seq("GroupId", "PersonId")
)

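This should give you the features you need.

NB: replace <feature1YouNeed> and <feature2YouNeed> with the actual column names.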
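As a rough illustration of the inner join on toy data, here is a sketch that uses PersonCharacteristic as the only feature column. The local session, the app name and the sample rows are hypothetical and stand in for the real Parquet tables.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("inner-join-sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows; p2 has no match in dfWorld, p3 is not in dfPerson.
val dfPerson = Seq(("p1", "g1"), ("p2", "g2")).toDF("PersonId", "GroupId")
val dfWorld = Seq(("p1", "g1", "tall"), ("p1", "g1", "blond"), ("p3", "g1", "short"))
  .toDF("PersonId", "GroupId", "PersonCharacteristic")

// Select only the needed columns before joining, then inner join on both keys.
val result = dfPerson.as("p").join(
  dfWorld.select($"GroupId", $"PersonId", $"PersonCharacteristic").as("w"),
  Seq("GroupId", "PersonId")
)

result.show()
```

Joining on `Seq("GroupId", "PersonId")` keeps a single copy of each key column, so the separate `drop` calls from the question are not needed. Only the rows for people present in both tables remain.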
