A simple piece of code takes about 130 seconds to write to s3-minio, while writing to local disk takes only 1 second. What is going wrong?
I followed this article, but it did not help: https://docs.min.io/docs/disaggregated-spark-and-hadoop-hive-with-minio.html
Running with 3 executors is faster, 52 seconds, but still not fast enough.
master('local[32]') gets it down to 21 seconds.
master('local[1]') -> 130 seconds.
Environment:
A single-node Kubernetes cluster running on a local machine (16 cores / 32 GB), with one s3-minio pod (backed by local disk), a Spark driver pod, and a few Spark executor pods.
iotop shows traffic between minio and Spark of only about 100 KB~1 MB. CPU usage is also low. The number of goroutines in minio is around 150~450 (max).
In the log below I see a lot of API calls retrieving S3 object status. Could that be the cause?
2020-01-05 03:00:42,674 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem - object_delete_requests += 1 -> 24456
2020-01-05 03:00:42,676 DEBUG org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol - Committing files staged for absolute locations Map()
2020-01-05 03:00:42,676 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem - op_get_file_status += 1 -> 61698
2020-01-05 03:00:42,676 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem - Getting path status for s3a://dataplatform/tmp/test_pp_60m/.spark-staging-466619ae-8b30-4be3-9c92-49e079bd449c (tmp/test_pp_60m/.spark-staging-466619ae-8b30-4be3-9c92-49e079bd449c)
2020-01-05 03:00:42,676 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem - object_metadata_requests += 1 -> 141711
2020-01-05 03:00:42,677 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem - object_metadata_requests += 1 -> 141712
2020-01-05 03:00:42,677 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem - object_list_requests += 1 -> 55793
2020-01-05 03:00:42,678 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem - Not Found: s3a://dataplatform/tmp/test_pp_60m/.spark-staging-466619ae-8b30-4be3-9c92-49e079bd449c
2020-01-05 03:00:42,678 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem - Couldn't delete s3a://dataplatform/tmp/test_pp_60m/.spark-staging-466619ae-8b30-4be3-9c92-49e079bd449c - does not exist
2020-01-05 03:00:42,678 INFO org.apache.spark.sql.execution.datasources.FileFormatWriter - Write Job 1a68dddd-fd88-49cd-957d-36e050d31de3 committed.
2020-01-05 03:00:42,679 INFO org.apache.spark.sql.execution.datasources.FileFormatWriter - Finished processing stats for write job 1a68dddd-fd88-49cd-957d-36e050d31de3.
2020-01-05 03:08:59,183 DEBUG org.apache.spark.broadcast.TorrentBroadcast - Unpersisting TorrentBroadcast 1
2020-01-05 03:08:59,184 DEBUG org.apache.spark.storage.BlockManagerSlaveEndpoint - removing broadcast 1
from pyspark.sql import Row
from pyspark.sql import SparkSession
import random, time

spark = SparkSession.builder \
    .enableHiveSupport() \
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2) \
    .master("local[1]") \
    .getOrCreate()

fixed_date = ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04']
refs = ['0', '1', '2']
data = bytearray(random.getrandbits(8) for _ in range(100))  # 100 random bytes per row

start = int(time.time())
print("start=%s" % start)

# 3 refs x 4 dates x 1 camera x 1000 = 12,000 rows
rows = []
for ref_id in refs:
    for d in fixed_date:
        for camera_id in range(1):
            for c in range(1000):
                rows.append(Row(ref_id=ref_id,
                                camera_id="c_" + str(camera_id),
                                date=d,
                                data=data))

df = spark.sparkContext.parallelize(rows).toDF()
print("partition number=%s, row size=%s" % (df.rdd.getNumPartitions(), len(rows)))

df.write.mode("overwrite") \
    .partitionBy('ref_id', 'date', 'camera_id') \
    .parquet('s3a://mybucket/tmp/test_data')
Update on results
I think Hadoop's S3A layer is simply slow (whether I use fast upload or the plain S3 transfer manager), especially when I write many files to S3; each file costs roughly 80-100 API calls. Would EMR or Alluxio help?
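For context, these are the kinds of S3A switches involved in the fast-upload path; below is a minimal sketch of passing them through a PySpark session (the endpoint, buffer type, and connection limit are placeholder values for a local MinIO setup, not a recommendation):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[1]")
         # placeholder MinIO endpoint; path-style access is required for MinIO
         .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         # standard S3A fast-upload switches (fast upload is already implied on newer Hadoop 3.x)
         .config("spark.hadoop.fs.s3a.fast.upload", "true")
         .config("spark.hadoop.fs.s3a.fast.upload.buffer", "bytebuffer")
         # allow more parallel connections to the object store
         .config("spark.hadoop.fs.s3a.connection.maximum", "64")
         .getOrCreate())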
Answer 0 (score: 0)
For parquet to pick up the new committers you need two pieces. Docs on #2 are at https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/bk_cloud-data-access/content/ch03s08s05.html. Item #1 is in the Spark trunk; I don't think it is in any released ASF version. If you want to try them, they are in the HDP-3.0/3.1 Spark binaries.
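If you do end up on a build with those pieces, the wiring from PySpark looks roughly like this (a sketch based on the committer documentation linked above; verify the exact class and property names against your build):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # use the S3A "directory" staging committer instead of the rename-based commit
         .config("spark.hadoop.fs.s3a.committer.name", "directory")
         # Spark-side bindings from the spark-hadoop-cloud module
         .config("spark.sql.sources.commitProtocolClass",
                 "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
         .config("spark.sql.parquet.output.committer.class",
                 "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
         .getOrCreate())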
Also, ask for smaller block and multipart sizes:
fs.s3a.block.size=64M
fs.s3a.multipart.size=64M
fs.s3a.multipart.threshold=64M
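Those can go into core-site.xml or be passed straight through the Spark session config; a sketch of the latter, assuming the same 64M values:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # smaller S3A block/multipart sizes, per the settings above
         .config("spark.hadoop.fs.s3a.block.size", "64M")
         .config("spark.hadoop.fs.s3a.multipart.size", "64M")
         .config("spark.hadoop.fs.s3a.multipart.threshold", "64M")
         .getOrCreate())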