我有一个正在运行作业的GCP dataproc集群。作业的输入是一个文件夹,其中有200个零件文件。每个零件文件约为1.2 GB。
我的工作就是地图操作
val df = spark.read.parquet("gs://bucket/data/src/....")
df.withColumn("a", lit("b")).write.save("gs://bucket/data/dest/...")
属性parquet.block.size
设置为128 MB
,这意味着每个零件文件在作业期间将被读取10次。
我启用了存储桶访问日志记录,并查看了统计信息,我很惊讶地看到每个零件文件的访问次数都高达 85次。我可以看到只有10个请求发送实际数据,其他请求要么返回0字节,要么发送很小的字节。
我确实知道,拆分时读取大型拼花地板文件是标准的Spark行为。同样也必须有一些元数据交换请求,但是 8X 调用是很奇怪的。另外,如果我看看传输的数据量和所花费的时间,看起来数据正在以100 MB / mins的速度传输,这对于Google内部数据传输(从GCS到dataproc)来说非常慢。我将一个字节的附件,时间,URL附加到一个CSV文件中。
有人用dataproc经历过这种行为吗?是否有这么多文件请求和如此慢的传输速率的解释?
请注意,存储桶和dataproc集群都在同一区域。有50名工人使用n1-standard-16机器。
由于我无法附加文件,因此我在此处粘贴了格式化的内容。
| sc_bytes | time_taken_micros | cs_uri |
|-----------|-------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| 0 | 21000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 22000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 709922 | 164000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 709922 | 86000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 709922 | 173000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 18000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 8 | 47000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 18000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 8 | 51000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 18000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 12000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 8 | 103000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 709922 | 98000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 8 | 42000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 18000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 18000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 709922 | 88000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 8 | 42000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 18000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 18000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 20000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 20000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 8 | 40000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 20000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 15000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 19000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 18000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 143092175 | 63484000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 16000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 19000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 32000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 16000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 19000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 18000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 14000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 137585202 | 66010000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 136726977 | 66732000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 176684024 | 101921000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 32000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 709922 | 113000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 16000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 23000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 16000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 19000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 134187229 | 64401000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 135450987 | 73632000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 24000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 21000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 15000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 15000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 27000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 709922 | 106000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 137020002 | 66333000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 17000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 24000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 8 | 41000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 25000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 16000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 8 | 39000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 20000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 709922 | 135000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 16000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 0 | 19000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 709922 | 126000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 8 | 41000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 0 | 18000 | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet |
| 135686216 | 71676000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
| 179573683 | 90877000 | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media |
答案 0 :(得分:2)
在这种情况下,预计会有相对大量的GCS元数据请求(表中不含?alt=media
参数的URL)。作业驱动程序执行元数据请求以列出文件并获取其大小以生成拆分,然后为每个拆分工作人员执行多个元数据请求以检查文件是否存在,获取其大小等。我认为这种看似无效的原因是由于Spark使用HDFS interface来访问GCS,并且因为HDFS请求的延迟比GCS低得多,所以我不认为整个Hadoop / Spark堆栈都经过了优化以减少HDFS请求的数量。
要解决此问题,在Spark级别上,您可能希望使用spark.sql.parquet.cacheMetadata=true
属性启用元数据缓存。
在GCS连接器级别上,要减少GCS元数据请求的数量,可以启用具有fs.gs.performance.cache.enable=true
属性(Spark具有metadata cache前缀)的spark.hadoop.
,但是它可能会导致元数据过时。
此外,要利用GCS连接器的最新改进(包括减少的GCS元数据请求数量和对random reads的支持),您可能想在集群中update it到latest version或使用预先安装的Dataproc 1.3。
关于读取速度,您可能希望为每个VM分配更多的工作任务,这将通过增加同时读取的次数来提高读取速度。
此外,您可能想检查读速度是否受工作负载的写速度的限制,方法是完全删除对GCS的写操作,或者将其替换为对HDFS的写操作或某些计算。