What determines the size of Parquet part files

Time: 2018-11-13 17:27:17

Tags: apache-spark hadoop hdfs parquet

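I wrote a DataFrame to HDFS from spark-shell and got the output below. What I want to understand is: what determines the size of the Parquet files that get written? My dfs.block.size is set to: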

scala> spark.sparkContext.hadoopConfiguration.get("dfs.block.size")
res1: String = 134217728

That is 128 MB, so why are my files in the 20,000,000-byte range?

-rw-r--r--   1 hadoop supergroup          0 2018-11-13 11:51 /new_sample_parquet_test/_SUCCESS
-rw-r--r--   1 hadoop supergroup   23631191 2018-11-13 11:51 /new_sample_parquet_test/part-00000-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23435545 2018-11-13 11:51 /new_sample_parquet_test/part-00001-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22568091 2018-11-13 11:51 /new_sample_parquet_test/part-00002-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23385544 2018-11-13 11:51 /new_sample_parquet_test/part-00003-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23335676 2018-11-13 11:51 /new_sample_parquet_test/part-00004-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23423372 2018-11-13 11:51 /new_sample_parquet_test/part-00005-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22182760 2018-11-13 11:51 /new_sample_parquet_test/part-00006-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20906453 2018-11-13 11:51 /new_sample_parquet_test/part-00007-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22918107 2018-11-13 11:51 /new_sample_parquet_test/part-00008-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21655224 2018-11-13 11:51 /new_sample_parquet_test/part-00009-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20366872 2018-11-13 11:51 /new_sample_parquet_test/part-00010-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22658141 2018-11-13 11:51 /new_sample_parquet_test/part-00011-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22246580 2018-11-13 11:51 /new_sample_parquet_test/part-00012-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20648612 2018-11-13 11:51 /new_sample_parquet_test/part-00013-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22369663 2018-11-13 11:51 /new_sample_parquet_test/part-00014-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23396027 2018-11-13 11:51 /new_sample_parquet_test/part-00015-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23382811 2018-11-13 11:51 /new_sample_parquet_test/part-00016-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   17470540 2018-11-13 11:51 /new_sample_parquet_test/part-00017-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22669018 2018-11-13 11:51 /new_sample_parquet_test/part-00018-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21899425 2018-11-13 11:51 /new_sample_parquet_test/part-00019-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21378060 2018-11-13 11:51 /new_sample_parquet_test/part-00020-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21578176 2018-11-13 11:51 /new_sample_parquet_test/part-00021-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21079291 2018-11-13 11:51 /new_sample_parquet_test/part-00022-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21526313 2018-11-13 11:51 /new_sample_parquet_test/part-00023-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22446489 2018-11-13 11:51 /new_sample_parquet_test/part-00024-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21770955 2018-11-13 11:51 /new_sample_parquet_test/part-00025-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23199003 2018-11-13 11:51 /new_sample_parquet_test/part-00026-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21833916 2018-11-13 11:51 /new_sample_parquet_test/part-00027-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   25090443 2018-11-13 11:51 /new_sample_parquet_test/part-00028-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20725755 2018-11-13 11:51 /new_sample_parquet_test/part-00029-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20666565 2018-11-13 11:51 /new_sample_parquet_test/part-00030-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22299474 2018-11-13 11:51 /new_sample_parquet_test/part-00031-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22327133 2018-11-13 11:51 /new_sample_parquet_test/part-00032-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22207468 2018-11-13 11:51 /new_sample_parquet_test/part-00033-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22630251 2018-11-13 11:51 /new_sample_parquet_test/part-00034-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21648270 2018-11-13 11:51 /new_sample_parquet_test/part-00035-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22230127 2018-11-13 11:51 /new_sample_parquet_test/part-00036-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22439910 2018-11-13 11:51 /new_sample_parquet_test/part-00037-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22252551 2018-11-13 11:51 /new_sample_parquet_test/part-00038-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22160655 2018-11-13 11:51 /new_sample_parquet_test/part-00039-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   17637580 2018-11-13 11:51 /new_sample_parquet_test/part-00040-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21743969 2018-11-13 11:51 /new_sample_parquet_test/part-00041-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22071235 2018-11-13 11:51 /new_sample_parquet_test/part-00042-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21854771 2018-11-13 11:51 /new_sample_parquet_test/part-00043-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   25243330 2018-11-13 11:51 /new_sample_parquet_test/part-00044-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22297865 2018-11-13 11:51 /new_sample_parquet_test/part-00045-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22070057 2018-11-13 11:51 /new_sample_parquet_test/part-00046-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22018671 2018-11-13 11:51 /new_sample_parquet_test/part-00047-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21796749 2018-11-13 11:51 /new_sample_parquet_test/part-00048-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22597634 2018-11-13 11:51 /new_sample_parquet_test/part-00049-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20728588 2018-11-13 11:51 /new_sample_parquet_test/part-00050-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22137701 2018-11-13 11:51 /new_sample_parquet_test/part-00051-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22387635 2018-11-13 11:51 /new_sample_parquet_test/part-00052-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20965957 2018-11-13 11:51 /new_sample_parquet_test/part-00053-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20314451 2018-11-13 11:51 /new_sample_parquet_test/part-00054-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22538965 2018-11-13 11:51 /new_sample_parquet_test/part-00055-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20923261 2018-11-13 11:51 /new_sample_parquet_test/part-00056-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20984805 2018-11-13 11:51 /new_sample_parquet_test/part-00057-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20513317 2018-11-13 11:51 /new_sample_parquet_test/part-00058-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   25493903 2018-11-13 11:51 /new_sample_parquet_test/part-00059-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21178862 2018-11-13 11:51 /new_sample_parquet_test/part-00060-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20696540 2018-11-13 11:51 /new_sample_parquet_test/part-00061-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21011416 2018-11-13 11:51 /new_sample_parquet_test/part-00062-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   15752503 2018-11-13 11:51 /new_sample_parquet_test/part-00063-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet

1 answer:

Answer 0 (score: 2)

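The Parquet writer does not care about the HDFS block size; after all, you can also save Parquet to a local hard drive, for example. What determines the number and size of the individual part-*.parquet files is the number of partitions in the DataFrame (64 in your case), because each partition is written out as its own part file. If you ran df.coalesce(1).write.parquet(...), you would get a single, very large part file.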
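As a minimal sketch of that relationship, the following program builds a stand-in DataFrame with 64 partitions, checks the partition count, and writes it with fewer partitions. The object name, the generated data, and the output path /tmp/parquet_coalesce_demo are assumptions for illustration only; they are not from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartFileCount {
  // Hypothetical helper: narrow `df` to at most `numFiles` partitions,
  // then write it as Parquet. Each remaining partition becomes one part file.
  def writeCoalesced(df: DataFrame, path: String, numFiles: Int): Unit =
    df.coalesce(numFiles).write.mode("overwrite").parquet(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("part-file-count")
      .master("local[*]") // assumption: local run, not the asker's cluster
      .getOrCreate()

    // Stand-in data with 64 partitions, mirroring the 64 part files in the question.
    val df = spark.range(0L, 1000000L, 1L, 64).toDF("id")

    // Number of partitions before and after coalescing.
    println(df.rdd.getNumPartitions)
    println(df.coalesce(10).rdd.getNumPartitions)

    writeCoalesced(df, "/tmp/parquet_coalesce_demo", 10)
    spark.stop()
  }
}
```

In spark-shell the same check is just df.rdd.getNumPartitions on the DataFrame you are about to write.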

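If you want the part files to be about 128 MB each, the coalesce argument should be roughly 20 * 64 / 128 = 10 (about 20 MB per file, times 64 files, divided by the 128 MB target). Note, however, that part-file size does not scale strictly linearly with the number of coalesced partitions: the fewer the part files, the more efficient the encoding and compression.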
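The estimate above can be written as a small helper. This is only a rough sketch: the function name and parameters are hypothetical, and it assumes the total output size stays about the same after coalescing, which the caveat about encoding and compression says is not exactly true.

```scala
object CoalesceEstimate {
  // Rough partition count for a desired part-file size.
  // Assumes total bytes written stay roughly constant after coalescing.
  def targetPartitions(currentFiles: Int, avgFileBytes: Long, targetFileBytes: Long): Int = {
    val totalBytes = currentFiles.toLong * avgFileBytes
    // Round up so no file is expected to exceed the target; never go below 1.
    math.max(1, math.ceil(totalBytes.toDouble / targetFileBytes).toInt)
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // 64 files of about 20 MB each, aiming for 128 MB per file.
    println(targetPartitions(64, 20L * mb, 128L * mb)) // prints 10
  }
}
```

Treat the result as a starting point for df.coalesce(n) and check the actual file sizes after writing.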

For details, see the description of the coalesce method.