How to write records of the same key to multiple files (custom partitioner)

Asked: 2017-11-20 15:38:30

Tags: apache-spark

I want to use Spark to dynamically write the data in a directory into partitions. Here is the sample code:

val input_DF = spark.read.parquet("input path")
input_DF.write.mode("overwrite").partitionBy("colname").parquet("output path...")

As shown below, the number of records varies per key and the keys are skewed:

input_DF.groupBy($"colname").agg(count("colname")).show()

+-----------------+------------------------+
|colname          |count(colname)          |
+-----------------+------------------------+
|               NA|                14859816|  --> far more records than the rest
|                A|                 2907930|
|                D|                 1118504|
|                B|                  485151|
|                C|                  435305|
|                F|                  370095|
|                G|                  170060|
+-----------------+------------------------+

So the job fails when each executor is given a reasonable amount of memory (8GB). When each executor is given high memory (15GB), the job completes successfully but takes too long to finish.

I have tried using repartition, expecting it to distribute the data evenly across partitions. However, since it uses the default HashPartitioner, all records for a given key go to a single partition.

input_DF.repartition(numPartitions, $"colname")  --> creates hash-based partitions

But this creates the number of part files given to repartition, while moving all records of a key into a single partition (all records with the column value NA go to one partition). The remaining part files contain no records (only Parquet metadata, 38,634 bytes each).

        -rw-r--r--   2 hadoop hadoop          0 2017-11-20 14:29 /user/hadoop/table/_SUCCESS
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00000-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00001-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00002-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00003-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:07 /user/hadoop/table/part-r-00004-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00005-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00006-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop 1038264502 2017-11-20 13:20 /user/hadoop/table/part-r-00007-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00008-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00009-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00010-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00011-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00012-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00013-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00014-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop  128212247 2017-11-20 13:09 /user/hadoop/table/part-r-00015-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00016-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00017-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00018-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop  117142244 2017-11-20 13:08 /user/hadoop/table/part-r-00019-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00020-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop  347033731 2017-11-20 13:11 /user/hadoop/table/part-r-00021-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00022-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00023-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00024-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop  100306686 2017-11-20 13:08 /user/hadoop/table/part-r-00025-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop   36961707 2017-11-20 13:07 /user/hadoop/table/part-r-00026-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00027-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00028-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00029-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00030-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00031-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00032-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00033-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00034-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00035-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00036-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:07 /user/hadoop/table/part-r-00037-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00038-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00039-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00040-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00041-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      68859 2017-11-20 13:06 /user/hadoop/table/part-r-00042-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop 4031720288 2017-11-20 14:29 /user/hadoop/table/part-r-00043-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00044-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00045-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00046-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00047-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00048-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00049-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
        -rw-r--r--   2 hadoop hadoop      38634 2017-11-20 13:06 /user/hadoop/table/part-r-00050-5736c707-fad4-4ba7-ab38-226cfbc3bf10.snappy.parquet
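
For reference, the same skew can be confirmed from inside Spark without inspecting the output files. A minimal sketch (the partition count of 50 matches the listing above; spark_partition_id just labels which physical partition each row ended up in):

import org.apache.spark.sql.functions.spark_partition_id

// Count rows per physical partition after hash-repartitioning on the skewed key.
// Every row of a given colname value lands in the same partition, so a handful
// of partitions carry almost all the data while the rest stay empty.
input_DF
  .repartition(50, $"colname")
  .groupBy(spark_partition_id().as("partition_id"))
  .count()
  .orderBy($"partition_id")
  .show(50, truncate = false)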

I would like to know:

  1. Is there a way to write records of the same key to different partitions of a DataFrame/RDD? Perhaps a custom partitioner that sends every Nth record to the Nth partition, for example (a salting-based approximation is sketched after this list):

    (1st rec to partition 1)
    (2nd rec to partition 2)
    (3rd rec to partition 3)
    (4th rec to partition 4)
    (5th rec to partition 1)
    (6th rec to partition 2)
    (7th rec to partition 3)
    (8th rec to partition 4) 
    
  2. If so, can this be controlled with parameters such as a maximum number of bytes per partition of the DataFrame/RDD?

  3. Since the expected result is simply to write the data into different sub-directories (Hive partitions) based on the key, I would like to write it by distributing a key's records across multiple tasks, each of which writes one part file into that key's directory.
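
Since the DataFrame API does not expose custom partitioners directly, the round-robin behaviour asked about in point 1 can be approximated with a salt column. This is only a sketch under assumptions not in the question: numBuckets is a hand-picked number of part files per key, and the salt is derived from a monotonically increasing id.

import org.apache.spark.sql.functions.{col, lit, monotonically_increasing_id, pmod}

val numBuckets = 4  // hypothetical: how many part files to spread each key over

// Derive a salt in [0, numBuckets) so that records of the same key are spread
// across numBuckets shuffle partitions instead of a single one.
val salted = input_DF
  .withColumn("salt", pmod(monotonically_increasing_id(), lit(numBuckets)))

// Repartition on (key, salt): each key now maps to up to numBuckets partitions,
// so several tasks write part files under the same partitionBy sub-directory.
salted
  .repartition(col("colname"), col("salt"))
  .drop("salt")
  .write.mode("overwrite").partitionBy("colname").parquet("output path...")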

1 Answer:

Answer 0: (score: 0)

The issue was solved by repartitioning on a unique key, rather than on the key used in "partitionBy". If the DataFrame is missing a unique column for some reason, a pseudo column can be added with

import org.apache.spark.sql.functions.monotonically_increasing_id
df.withColumn("Unique_ID", monotonically_increasing_id())

Then repartition on "Unique_ID", which distributes the data evenly across the partitions. To further improve performance, the data can be sorted within the DataFrame partitions by the key that is used for joins/grouping/partitioning.
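
Putting the answer together, a minimal end-to-end sketch might look like the following (paths and column names are taken from the question; sorting within partitions on colname is optional):

import org.apache.spark.sql.functions.monotonically_increasing_id

val input_DF = spark.read.parquet("input path")

// Add a pseudo unique id and repartition on it, so every task gets a roughly
// even share of each key's records; then sort within partitions by the key
// that is used later for joins/grouping.
val balanced = input_DF
  .withColumn("Unique_ID", monotonically_increasing_id())
  .repartition($"Unique_ID")
  .sortWithinPartitions($"colname")
  .drop("Unique_ID")

// Each task now writes part files under the colname sub-directories it holds
// data for, instead of a single task writing all of the NA data.
balanced.write.mode("overwrite").partitionBy("colname").parquet("output path...")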