Spark write to Parquet with partitionBy is very slow

Time: 2020-01-31 05:32:18

Tags: apache-spark parquet

When writing Parquet with partitionBy, the job takes considerably longer. Analyzing the logs, I found that Spark first lists the files in the directory; after the file listing I observed the following behaviour, where the job spends more than an hour apparently idle and then starts again.

20/01/30 07:33:09 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/01/30 07:33:09 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1 blocks
20/01/30 07:33:09 INFO Executor: Finished task 195.0 in stage 241.0 (TID 15820). 18200 bytes result sent to driver
20/01/30 07:33:09 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/01/30 07:33:09 INFO Executor: Finished task 198.0 in stage 241.0 (TID 15823). 18200 bytes result sent to driver
20/01/30 07:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedDirectMemory, value=50331648
20/01/30 07:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedHeapMemory, value=50331648
20/01/30 07:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedDirectMemory, value=50331648
20/01/30 07:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedHeapMemory, value=50331648

And again:

20/01/30 07:55:22 INFO metrics: type=HISTOGRAM, name=application_1577238363313_38955.3.CodeGenerator.compilationTime, count=484, min=2, max=622, mean=16.558694661661132, stddev=13.859676272407238, median=12.0, p75=20.0, p95=47.0, p98=62.0, p99=64.0, p999=70.0
20/01/30 07:55:22 INFO metrics: type=HISTOGRAM, name=application_1577238363313_38955.3.CodeGenerator.generatedClassSize, count=990, min=546, max=97043, mean=2058.574386565769, stddev=2153.50835266105, median=1374.0, p75=2693.0, p95=5009.0, p98=11509.0, p99=11519.0, p999=11519.0
20/01/30 07:55:22 INFO metrics: type=HISTOGRAM, name=application_1577238363313_38955.3.CodeGenerator.generatedMethodSize, count=4854, min=1, max=1574, mean=95.19245880884911, stddev=158.289763457333, median=39.0, p75=142.0, p95=339.0, p98=618.0, p99=873.0, p999=1234.0
20/01/30 07:55:22 INFO metrics: type=HISTOGRAM, name=application_1577238363313_38955.3.CodeGenerator.sourceCodeSize, count=484, min=430, max=467509, mean=4743.632894656119, stddev=5893.941708479697, median=2346.0, p75=4946.0, p95=24887.0, p98=24890.0, p99=24890.0, p999=24890.0
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedDirectMemory, value=50331648
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedHeapMemory, value=50331648
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedDirectMemory, value=50331648
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedHeapMemory, value=50331648
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.executor.filesystem.file.largeRead_ops, value=0
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.executor.filesystem.file.read_bytes, value=0

And again:

20/01/30 08:55:28 INFO TaskMemoryManager: Memory used in task 15249
20/01/30 08:55:28 INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@3cadc5a3: 65.0 MB
20/01/30 08:55:28 INFO TaskMemoryManager: Acquired by HybridRowQueue(org.apache.spark.memory.TaskMemoryManager@7c64db53,/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1577238363313_38955/spark-487c8d3d-391c-47b3-9a1b-d816d9505f5c,11,org.apache.spark.serializer.SerializerManager@55a990cc): 4.2 GB
20/01/30 08:55:28 INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@785b4080: 65.0 MB
20/01/30 08:55:28 INFO TaskMemoryManager: 0 bytes of memory were used by task 15249 but are not associated with specific consumers
20/01/30 08:55:28 INFO TaskMemoryManager: 4643196305 bytes of memory are used for execution and 608596591 bytes of memory are used for storage
20/01/30 09:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedDirectMemory, value=50331648
20/01/30 09:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedHeapMemory, value=50331648
20/01/30 09:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedDirectMemory, value=50331648
20/01/30 09:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedHeapMemory, value=50331648

Right now the job takes about 3 hours to complete. Is there any way to improve performance?
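For context, a minimal sketch of the kind of write described above (the DataFrame name, partition columns, and output path here are assumptions for illustration, not taken from the question):

df.write
  .partitionBy("year", "month", "day")   // one output directory per distinct partition value
  .parquet("hdfs:///path/to/output")     // the step that appears to sit idle in the logs above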

1 answer:

Answer 0 (score: 1)

I noticed the same behaviour when writing to HDFS with partitionBy through the DataFrame API. I later found that I should apply in-memory partitioning (repartition) before the on-disk partitioning (partitionBy).

So, first repartition the DataFrame on the same columns you want to use in partitionBy, as shown below:

val df2 = df1.repartition($"year", $"month", $"day")   // in-memory partitioning; the $"col" syntax needs import spark.implicits._
df2.write.mode("overwrite").partitionBy("year", "month", "day").save("path to hdfs")   // then on-disk partitioning
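Repartitioning on the same columns means all rows for a given (year, month, day) combination land in the same task, so each task writes into a single partition directory instead of every task writing a small file into every directory. Fewer output files should also mean less work in the final file-listing and commit phase, which is where the job appeared to be idle. The exact speedup will depend on your data and cluster.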