I am currently trying to write delta-lake parquet files to S3, which I replace locally with MinIO.
I can read/write standard parquet files to S3 without any problem.
However, when I use the delta lake example, it seems I cannot write the delta_log/ to my MinIO.
So I tried setting fs.AbstractFileSystem.s3a.impl and fs.s3a.impl.
I am using pyspark[sql]==2.4.3 in a venv.
src/.env:
# pyspark packages
DELTA = io.delta:delta-core_2.11:0.3.0
HADOOP_COMMON = org.apache.hadoop:hadoop-common:2.7.3
HADOOP_AWS = org.apache.hadoop:hadoop-aws:2.7.3
PYSPARK_SUBMIT_ARGS = ${HADOOP_AWS},${HADOOP_COMMON},${DELTA}
src/spark_session.py:
# configure s3 connection for read/write operation (native spark)
hadoop_conf = sc.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", self.aws_endpoint_url)
hadoop_conf.set("fs.s3a.access.key", self.aws_access_key_id)
hadoop_conf.set("fs.s3a.secret.key", self.aws_secret_access_key)
# hadoop_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") # when using hadoop 2.8.5
# hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") # alternative to above hadoop 2.8.5
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("spark.history.fs.logDirectory", 's3a://spark-logs-test/')
src/apps/raw_to_parquet.py:
# Trying to write pyspark dataframe to MinIO (S3)
raw_df.coalesce(1).write.format("delta").save(s3_url)
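For reference, when a delta write succeeds the target path should contain the parquet part files plus a _delta_log/ directory (the delta_log mentioned above) holding JSON commit files such as 00000000000000000000.json, and the table should be readable back with the delta format. A minimal sketch, assuming spark is the same SparkSession used for the write:
# Read the delta table back to verify the commit log was written.
delta_df = spark.read.format("delta").load(s3_url)
delta_df.show()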
bash:
# RUN CODE
spark-submit --packages $(PYSPARK_SUBMIT_ARGS) src/run_onlineretailer.py
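With the variables from src/.env expanded (assuming the $(...) syntax is resolved by a Makefile or similar wrapper), this is equivalent to:
spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3,org.apache.hadoop:hadoop-common:2.7.3,io.delta:delta-core_2.11:0.3.0 src/run_onlineretailer.py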
With hadoop-common: 2.7.3 and hadoop-aws: 2.7.3 I get the error:
java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.fs.s3a.S3AFileSystem.<init>(java.net.URI, org.apache.hadoop.conf.Configuration)
Because of this error I then updated to hadoop-common: 2.8.5 and hadoop-aws: 2.8.5 to fix the NoSuchMethodException, since delta requires S3AFileSystem. That gives:
py4j.protocol.Py4JJavaError: An error occurred while calling o89.save.
: java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration
To me it looks like the parquet files themselves are written without problems; however, delta creates the delta_log folders, which are not recognized (I think?).
Current source code.
I have read several different similar questions, but nobody seems to be trying to use delta lake files.
UPDATE
It currently works with the following settings:
#pyspark packages
DELTA_LOGSTORE = spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
DELTA = io.delta:delta-core_2.11:0.3.0
HADOOP_COMMON = org.apache.hadoop:hadoop-common:2.7.7
HADOOP_AWS = org.apache.hadoop:hadoop-aws:2.7.7
PYSPARK_SUBMIT_ARGS = ${HADOOP_AWS},${HADOOP_COMMON},${DELTA}
PYSPARK_CONF_ARGS = ${DELTA_LOGSTORE}
# configure s3 connection for read/write operation (native spark)
hadoop_conf = sc.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", self.aws_endpoint_url)
hadoop_conf.set("fs.s3a.access.key", self.aws_access_key_id)
hadoop_conf.set("fs.s3a.secret.key", self.aws_secret_access_key)
spark-submit --packages $(PYSPARK_SUBMIT_ARGS) --conf $(PYSPARK_CONF_ARGS) src/run_onlineretailer.py
The strange thing is that it only works like this. If I try to set it with sc.conf or hadoop_conf it does not work; see the commented lines in the code below:
def spark_init(self) -> SparkSession:
sc: SparkSession = SparkSession \
.builder \
.appName(self.app_name) \
.config("spark.sql.warehouse.dir", self.warehouse_location) \
.getOrCreate()
# set log level
sc.sparkContext.setLogLevel("WARN")
# Enable Arrow-based columnar data transfers
sc.conf.set("spark.sql.execution.arrow.enabled", "true")
# sc.conf.set("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") # does not work
# configure s3 connection for read/write operation (native spark)
hadoop_conf = sc.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", self.aws_endpoint_url)
hadoop_conf.set("fs.s3a.access.key", self.aws_access_key_id)
hadoop_conf.set("fs.s3a.secret.key", self.aws_secret_access_key)
#hadoop_conf.set("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") # does not work
return sc
It would be great if someone could explain this. Is it because of .getOrCreate()? It seems the conf cannot be set without this call, except on the command line when submitting the application.
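For comparison, a builder-time variant of the snippet above would look like this (a sketch only, not verified here; whether it behaves the same as passing --conf on the command line is exactly the open question):
def spark_init(self) -> SparkSession:
    # Sketch: pass the delta LogStore class at builder time, before
    # .getOrCreate() materialises the SparkContext, instead of setting
    # it on sc.conf / hadoop_conf afterwards.
    sc: SparkSession = SparkSession \
        .builder \
        .appName(self.app_name) \
        .config("spark.sql.warehouse.dir", self.warehouse_location) \
        .config("spark.delta.logStore.class",
                "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
        .getOrCreate()
    return sc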
Answer 0 (score: 0)
You are mixing hadoop-* JARs; just like the spark ones, they only work when they all come from the same release.
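In this setup the NoSuchMethodError above is most likely exactly that: hadoop-aws 2.8.5 calls ProviderUtils.excludeIncompatibleCredentialProviders, a method that only exists in hadoop-common 2.8+, while the hadoop-common 2.7.x classes bundled with the Spark 2.4.3 distribution are still on the classpath. Keeping every hadoop-* artifact on a single release, as in the working update above, avoids this; for example:
# all hadoop-* artifacts pinned to one release (2.7.7 here, matching the update above)
HADOOP_COMMON = org.apache.hadoop:hadoop-common:2.7.7
HADOOP_AWS = org.apache.hadoop:hadoop-aws:2.7.7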