GCP Dataproc: slow read speed from GCS

Time: 2018-11-12 11:00:17

Tags: apache-spark google-cloud-platform google-cloud-dataproc

I have a GCP Dataproc cluster running a job. The job's input is a folder containing 200 part files, each about 1.2 GB.

My job is just a map operation:

import org.apache.spark.sql.functions.lit

val df = spark.read.parquet("gs://bucket/data/src/....")
df.withColumn("a", lit("b")).write.save("gs://bucket/data/dest/...")

The parquet.block.size property is set to 128 MB, which means each part file will be read about 10 times during the job (once per split).
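As a rough check of that estimate, here is a minimal sketch of the arithmetic. It assumes every split is exactly 128 MB and every part file is exactly 1.2 GB; `SplitEstimate` and `splitsPerFile` are names made up for this illustration.

```scala
object SplitEstimate {
  // Ceiling division: how many splits of `splitBytes` are needed to cover `fileBytes`.
  def splitsPerFile(fileBytes: Long, splitBytes: Long): Long = {
    require(fileBytes >= 0 && splitBytes > 0, "sizes must be non-negative / positive")
    (fileBytes + splitBytes - 1) / splitBytes
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val fileBytes = (1.2 * 1024 * mb).toLong // ~1.2 GB per part file (from the question)
    val splitBytes = 128L * mb               // 128 MB split size (from the question)
    val perFile = splitsPerFile(fileBytes, splitBytes)
    println(s"splits per file: $perFile, across 200 files: ${perFile * 200}")
  }
}
```

Under these assumptions the expected number of data reads per part file is 10, which is the baseline the access-log counts below are compared against.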

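I enabled bucket access logging and looked at the statistics, and I was surprised to see that each part file is accessed as many as 85 times. Only about 10 of those requests return actual data; the others return either 0 bytes or just a few bytes.

I do understand that reading a large Parquet file in splits is standard Spark behavior, and that there must be some metadata-exchange requests as well, but 8x as many calls is strange. Also, judging by the amount of data transferred and the time taken, the data appears to be moving at about 100 MB/min, which is very slow for Google-internal data transfer (from GCS to Dataproc). I am attaching a CSV file with the bytes, time taken, and URL for one part file.

Has anyone experienced this behavior with Dataproc? Is there an explanation for so many requests per file and such a slow transfer rate?

Note that the bucket and the Dataproc cluster are in the same region. There are 50 workers using n1-standard-16 machines.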


Since I could not attach the file, I am pasting its formatted contents here.

| sc_bytes  | time_taken_micros | cs_uri                                                                                                                                  | 
|-----------|-------------------|-----------------------------------------------------------------------------------------------------------------------------------------| 
| 0         | 21000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 22000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 709922    | 164000            | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 709922    | 86000             | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 709922    | 173000            | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 18000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 8         | 47000             | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 18000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 8         | 51000             | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 18000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 12000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 8         | 103000            | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 709922    | 98000             | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 8         | 42000             | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 18000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 18000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 709922    | 88000             | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 8         | 42000             | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 18000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 18000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 20000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 20000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 8         | 40000             | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 20000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 15000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 19000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 18000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 143092175 | 63484000          | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 16000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 19000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 32000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 16000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 19000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 18000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 14000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 137585202 | 66010000          | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 136726977 | 66732000          | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 176684024 | 101921000         | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 32000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 709922    | 113000            | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 16000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 23000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 16000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 19000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 134187229 | 64401000          | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 135450987 | 73632000          | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 24000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 21000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 15000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 15000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 27000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 709922    | 106000            | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 137020002 | 66333000          | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 17000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 24000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 8         | 41000             | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 25000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 16000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 8         | 39000             | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 20000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 709922    | 135000            | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 16000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 0         | 19000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 709922    | 126000            | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 8         | 41000             | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 0         | 18000             | /storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet                    | 
| 135686216 | 71676000          | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 
| 179573683 | 90877000          | /download/storage/v1/b/spark-ml-tkrt/o/Dfs%2F/massiveDf%2Fpart-00184-35441be5-85ca-4b21-85bd-cb99f9aa3093-c000.snappy.parquet?alt=media | 

1 Answer:

Answer 0 (score: 2)

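In this case a relatively large number of GCS metadata requests (the URLs without the ?alt=media parameter in your table) is expected. The job driver issues metadata requests to list the files and get their sizes in order to generate splits; then, for each split, the workers issue several more metadata requests to check that the file exists, get its size, and so on. I think this apparent inefficiency arises because Spark accesses GCS through the HDFS interface. Since HDFS requests have much lower latency than GCS requests, I don't think the Hadoop/Spark stack as a whole was ever heavily optimized to reduce the number of HDFS requests.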
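To make the metadata/data distinction concrete, here is a small sketch that classifies access-log rows by the presence of `alt=media` and computes per-request throughput. The `AccessLogSummary` object and its helpers are hypothetical names, and the URIs are abbreviated stand-ins; only the byte counts and timings are taken from the table in the question.

```scala
object AccessLogSummary {
  // One access-log row: bytes sent, time taken in microseconds, request URI.
  final case class LogRow(scBytes: Long, timeTakenMicros: Long, csUri: String)

  // Data reads carry the alt=media parameter; everything else is treated as metadata.
  def isDataRead(row: LogRow): Boolean = row.csUri.contains("alt=media")

  // Bytes per microsecond is numerically equal to MB/s (with 1 MB = 10^6 bytes).
  def mbPerSecond(row: LogRow): Double =
    row.scBytes.toDouble / row.timeTakenMicros.toDouble

  def main(args: Array[String]): Unit = {
    // Abbreviated, hypothetical URIs; sizes and timings copied from the question's table.
    val rows = Seq(
      LogRow(0L, 21000L, "/storage/v1/b/bucket/o/part"),
      LogRow(8L, 47000L, "/download/storage/v1/b/bucket/o/part?alt=media"),
      LogRow(709922L, 164000L, "/download/storage/v1/b/bucket/o/part?alt=media"),
      LogRow(143092175L, 63484000L, "/download/storage/v1/b/bucket/o/part?alt=media")
    )
    val (dataReads, metadata) = rows.partition(isDataRead)
    println(s"metadata requests: ${metadata.size}, data requests: ${dataReads.size}")
    // Only the large reads say anything meaningful about sustained throughput.
    dataReads.filter(_.scBytes > 1000000L).foreach { r =>
      println(f"${r.scBytes}%d bytes at ${mbPerSecond(r)}%.2f MB/s")
    }
  }
}
```

For the 143092175-byte row this works out to roughly 2.25 MB/s for a single stream, which is the same order of magnitude as the ~100 MB/min reported in the question.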

To address this at the Spark level, you may want to enable metadata caching with the spark.sql.parquet.cacheMetadata=true property.
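A minimal sketch of setting that property when building the session, assuming Spark SQL is on the classpath and the job is launched through spark-submit. The object and application names are hypothetical, and the Spark version on your cluster may already default this property to true, so it is worth checking the effective value.

```scala
import org.apache.spark.sql.SparkSession

object ParquetMetadataCacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-metadata-cache-example")         // hypothetical app name
      .config("spark.sql.parquet.cacheMetadata", "true") // property named in this answer
      .getOrCreate()

    // Print the effective value to confirm the setting was picked up.
    println(spark.conf.get("spark.sql.parquet.cacheMetadata"))

    spark.stop()
  }
}
```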

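At the GCS connector level, to reduce the number of GCS metadata requests, you can enable the metadata cache with the fs.gs.performance.cache.enable=true property (with the spark.hadoop. prefix when setting it through Spark), but this can introduce some metadata staleness.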
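The prefix rule can be sketched as follows. Spark passes any `spark.hadoop.*` setting on to the Hadoop configuration with the prefix removed, so a connector property just needs the prefix added. `GcsConnectorConf` and `toSparkConf` are illustrative names, not part of Spark or the connector.

```scala
object GcsConnectorConf {
  // Prefix that makes Spark forward a setting to the Hadoop configuration.
  val SparkHadoopPrefix = "spark.hadoop."

  // Turn plain Hadoop/GCS-connector properties into Spark configuration entries.
  def toSparkConf(hadoopProps: Map[String, String]): Map[String, String] =
    hadoopProps.map { case (key, value) => (SparkHadoopPrefix + key, value) }

  def main(args: Array[String]): Unit = {
    val gcsProps = Map("fs.gs.performance.cache.enable" -> "true")
    // Print the entries in the form accepted by spark-submit's --conf option.
    toSparkConf(gcsProps).foreach { case (key, value) =>
      println(s"--conf $key=$value")
    }
  }
}
```

The resulting key, spark.hadoop.fs.gs.performance.cache.enable, can be passed with --conf on spark-submit or set on the SparkSession builder.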

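Also, to take advantage of the latest improvements in the GCS connector (including a reduced number of GCS metadata requests and support for random reads), you may want to update it to the latest version in your cluster, or use Dataproc 1.3, which has it pre-installed.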

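Regarding read speed, you may want to allocate more worker tasks per VM, which should increase read speed by increasing the number of simultaneous reads.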
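As a back-of-the-envelope sketch of what "more tasks per VM" buys, the number of simultaneous reads is bounded by the number of task slots. The helper below is hypothetical. It assumes the per-worker core count is whatever you hand to executors (for example through spark.executor.cores and the number of executors per node) and that spark.task.cpus is the number of cores each task reserves.

```scala
object ReadParallelism {
  // Upper bound on tasks (and therefore simultaneous file reads) running at once.
  def concurrentTasks(workers: Int, executorCoresPerWorker: Int, cpusPerTask: Int): Int = {
    require(workers > 0 && executorCoresPerWorker > 0 && cpusPerTask > 0, "all inputs must be positive")
    workers * (executorCoresPerWorker / cpusPerTask)
  }

  def main(args: Array[String]): Unit = {
    // 50 workers of type n1-standard-16 (16 vCPUs each), as described in the question.
    println(s"all 16 vCPUs used for tasks: ${concurrentTasks(50, 16, 1)} concurrent reads")
    println(s"only 8 vCPUs used for tasks: ${concurrentTasks(50, 8, 1)} concurrent reads")
  }
}
```

If the cluster's effective slot count is well below workers × vCPUs, raising it is the cheapest way to get more parallel GCS streams.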

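Also, you may want to check whether read speed is being limited by the write speed of your workload, either by removing the write to GCS entirely or by replacing it with a write to HDFS or with some computation.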
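One way to run that experiment is sketched below. It replaces the write with an action that touches every row and discards it, so only the read path is timed. The object name and the use of a command-line argument for the input path are assumptions, and it needs Spark SQL on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object ReadOnlyBenchmark {
  def main(args: Array[String]): Unit = {
    require(args.nonEmpty, "usage: ReadOnlyBenchmark <input path>")
    val inputPath = args(0) // e.g. the gs:// source folder from the question

    val spark = SparkSession.builder()
      .appName("read-only-benchmark") // hypothetical app name
      .getOrCreate()

    val df = spark.read.parquet(inputPath).withColumn("a", lit("b"))

    val start = System.nanoTime()
    // Materialize every row and discard it: all columns are read, nothing is written.
    df.rdd.foreach(_ => ())
    val seconds = (System.nanoTime() - start) / 1e9

    println(f"read-only pass took $seconds%.1f s")
    spark.stop()
  }
}
```

If this pass is much faster than the original job, the write side is the likelier bottleneck. If it takes about as long, the read path is the place to look.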