I am trying to get started with Delta Lake using PySpark.
To be able to use Delta Lake, I invoke pyspark from the Anaconda shell prompt as:
pyspark --packages io.delta:delta-core_2.11:0.3.0
Here is the Delta Lake reference I am following: https://docs.delta.io/latest/quick-start.html
All the Delta Lake commands run fine from the Anaconda shell prompt.
In a Jupyter notebook, however, references to Delta Lake tables raise an error. This is the code I am running in the notebook:
df_advisorMetrics.write.mode("overwrite").format("delta").save("/DeltaLake/METRICS_F_DELTA")
spark.sql("create table METRICS_F_DELTA using delta location '/DeltaLake/METRICS_F_DELTA'")
Below is the code I use at notebook startup to connect to pyspark:
import findspark
findspark.init()
findspark.find()
import pyspark
findspark.find()
This is the error I get:
Py4JJavaError: An error occurred while calling o116.save. : java.lang.ClassNotFoundException: Failed to find data source: delta. Please find packages at http://spark.apache.org/third-party-projects.html
Any suggestions?
Answer 0 (score: 0)
One possible solution is to follow the technique described in Import PySpark packages with a regular Jupyter notebook.
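For reference, a minimal sketch of that kind of approach (assumed, not copied from the linked answer): set PYSPARK_SUBMIT_ARGS before findspark.init() so that pyspark pulls the Delta package in when the JVM starts.

import os
import findspark

# Must be set before the JVM is launched; the trailing "pyspark-shell" token is required.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages io.delta:delta-core_2.11:0.3.0 pyspark-shell"

findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-notebook").getOrCreate()

# The delta data source should now resolve instead of raising ClassNotFoundException.
# (The path below is just an example.)
spark.range(5).write.format("delta").mode("overwrite").save("/DeltaLake/demo_delta")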
Another possible solution is to download the delta-core JAR and place it in the $SPARK_HOME/jars folder, so that when you run pyspark it automatically includes the Delta Lake JAR.
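If you want to script that download, here is a rough sketch; the Maven Central URL layout and the 2.11/0.6.1 coordinates are assumptions, so match them to your Spark and Scala build.

import os
import urllib.request

# Assumed Maven Central coordinates for delta-core; adjust the Scala and Delta versions as needed.
version = "0.6.1"
jar_name = f"delta-core_2.11-{version}.jar"
url = f"https://repo1.maven.org/maven2/io/delta/delta-core_2.11/{version}/{jar_name}"

# Drop the JAR into $SPARK_HOME/jars so every pyspark session picks it up.
jars_dir = os.path.join(os.environ["SPARK_HOME"], "jars")
urllib.request.urlretrieve(url, os.path.join(jars_dir, jar_name))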
Answer 1 (score: 0)
I have created a Google Colab / Jupyter notebook example that shows how to run Delta Lake:
https://github.com/prasannakumar2012/spark_experiments/blob/master/examples/Delta_Lake.ipynb
It has all the steps needed to run, and it uses the latest Spark and Delta versions; change the versions to match your environment.
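For a notebook set up from scratch with recent releases, something along these lines should also work. It assumes the delta-spark pip package, whose configure_spark_with_delta_pip helper adds the matching delta-core package to the session for you (install with: pip install pyspark delta-spark).

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-quickstart")
    # Enable the Delta SQL extensions and catalog, as in the Delta Lake quick start.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Quick smoke test: write and read a small Delta table (path is just an example).
spark.range(10).write.format("delta").mode("overwrite").save("/tmp/delta-table")
spark.read.format("delta").load("/tmp/delta-table").show()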
Answer 2 (score: -1)
I have been using Delta Lake in Jupyter notebooks.
Try the following in a Jupyter notebook running Python 3.x.
### import Spark libraries (os is used further down for credentials)
import os

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
### spark package maven coordinates - in case you are loading more than just delta
spark_packages_list = [
'io.delta:delta-core_2.11:0.6.1',
]
spark_packages = ",".join(spark_packages_list)
### SparkSession
spark = (
SparkSession.builder
.config("spark.jars.packages", spark_packages)
.config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate()
)
sc = spark.sparkContext
### Python library in delta jar.
### Must create sparkSession before import
from delta.tables import *
### assuming df is an existing Spark DataFrame you want to write
### overwrite, change mode="append" if you prefer
(df.write.format("delta")
.save("my_delta_file", mode="overwrite", partitionBy="partition_column_name")
)
df_delta = spark.read.format("delta").load("my_delta_file")
### Spark S3 access
hdpConf = sc._jsc.hadoopConfiguration()
user = os.getenv("USER")
### Assuming you have your AWS credentials in a jceks keystore.
hdpConf.set("hadoop.security.credential.provider.path", f"jceks://hdfs/user/{user}/awskeyfile.jceks")
hdpConf.set("fs.s3a.fast.upload", "true")
### optimize s3 bucket-level parquet column selection
### un-comment to use
# hdpConf.set("fs.s3a.experimental.fadvise", "random")
### Pick one upload buffer option
hdpConf.set("fs.s3a.fast.upload.buffer", "bytebuffer") # JVM off-heap memory
# hdpConf.set("fs.s3a.fast.upload.buffer", "array") # JVM on-heap memory
# hdpConf.set("fs.s3a.fast.upload.buffer", "disk") # DEFAULT - directories listed in fs.s3a.buffer.dir
s3_bucket_path = "s3a://your-bucket-name"
s3_delta_prefix = "delta" # or whatever
### overwrite, change mode="append" if you prefer
(df.write.format("delta")
.save(f"{s3_bucket_path}/{s3_delta_prefix}/", mode="overwrite", partitionBy="partition_column_name")
)
df_delta = spark.read.format("delta").load(f"{s3_bucket_path}/{s3_delta_prefix}/")
This does not directly answer the original question, but for completeness you can also do the following.
Add the following to your spark-defaults.conf file:
spark.jars.packages io.delta:delta-core_2.11:0.6.1
spark.delta.logStore.class org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
Then pass that properties file to spark-submit:
spark-submit \
--properties-file /path/to/your/spark-defaults.conf \
--name your_spark_delta_app \
--py-files /path/to/your/supporting_pyspark_files.zip \
/path/to/your/pyspark_script.py