I am using: EMR 5.20, Zeppelin 0.8.0, Spark 2.4.0
I was able to add a Redshift interpreter, but I cannot pull the data into a pyspark dataframe. All I want to do is copy a Redshift table into a Spark SQL dataframe.
I used wget to drop the Redshift JDBC driver into the Zeppelin lib directory:
wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar
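Placing the jar under /usr/lib/zeppelin/lib/ makes it reachable for z.load, but it does not by itself put the driver on the Spark interpreter's classpath. An alternative sketch, assuming a stock Zeppelin layout, is to pass the jar via SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh (or the spark.jars interpreter property) and then restart the Spark interpreter:

export SPARK_SUBMIT_OPTIONS="--jars /usr/lib/zeppelin/lib/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar"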
With that interpreter added, I can query the database with the following paragraph:
%Redshift
select * from public.debtors
However, I cannot get the driver working in Spark to pull the data. If there is an easier way, please let me know. I run the following from the top because z.load needs to happen first.
%dep
z.load("/usr/lib/zeppelin/lib/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar")
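Once the jar is loaded, it is worth confirming the driver class is actually visible to the JVM behind pyspark. A quick sanity check (a sketch; py4j class loading can differ from the interpreter's, so treat a failure here as a hint rather than proof):

%pyspark
# A ClassNotFoundException here suggests the jar never reached the classpath.
sc._jvm.java.lang.Class.forName("com.amazon.redshift.jdbc42.Driver")
# The code below uses org.postgresql.Driver, which is NOT bundled in the
# Redshift jar, so it has to be present separately:
sc._jvm.java.lang.Class.forName("org.postgresql.Driver")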
%pyspark
from pyspark.sql import SQLContext

sql_context = SQLContext(sc)

import os
# Confirm the notebook can see the directory holding the driver jar.
os.chdir('/usr/lib/zeppelin/lib/')
print(os.getcwd())

def redshift_to_spark(sql_context, user, host, port, database, redshift_password,
                      sql_query, spark_table, partition_count=100):
    # The scheme must be 'jdbc:postgresql://'; 'jdbc:postgres://' is not
    # recognized by org.postgresql.Driver.
    url = 'jdbc:postgresql://{host}:{port}/{database}'.format(
        host=host,
        port=port,
        database=database
    )
    properties = {'user': user, 'password': redshift_password, 'driver': 'org.postgresql.Driver'}
    # 'table' must be a table name or an aliased, parenthesized subquery.
    # Note that pyspark's jdbc() silently ignores numPartitions unless
    # column/lowerBound/upperBound are also supplied.
    data_frame = sql_context.read.jdbc(
        url=url, table=sql_query, properties=properties, numPartitions=partition_count
    )
    # registerTempTable is deprecated in Spark 2.x in favor of
    # createOrReplaceTempView.
    data_frame.createOrReplaceTempView(spark_table)
    return data_frame

# Wrap the raw SELECT so it is a valid 'table' argument.
redshift_dates_sql = "(select * from dates) dates_src"
dates = redshift_to_spark(
    sql_query=redshift_dates_sql,
    sql_context=sql_context,
    user="******",
    host="******",
    port=******,
    database='dev',
    redshift_password=******,
    spark_table='dates'
)
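Separately, the jar fetched above is Amazon's Redshift driver, not the PostgreSQL one, so org.postgresql.Driver may simply not be on the classpath. A minimal sketch of the same read using the driver class and URL scheme that ship with that jar (credentials redacted as above; note the aliased subquery):

%pyspark
# Sketch: same query, but through the Redshift JDBC 4.2 driver that was
# actually downloaded.
dates_df = sql_context.read.jdbc(
    url="jdbc:redshift://******:******/dev",
    table="(select * from dates) dates_src",  # Spark requires an alias here
    properties={
        "user": "******",
        "password": "******",
        "driver": "com.amazon.redshift.jdbc42.Driver",
    }
)
dates_df.createOrReplaceTempView('dates')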
This has been working for my colleague from the terminal. However, I run into problems when running it in Zeppelin. I get the following error:
Py4JJavaError: An error occurred while calling o249.jdbc.
: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:238)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o249.jdbc.\n', JavaObject id=o251), <traceback object at 0x7fb781d01200>)
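One way to narrow down a NullPointerException in resolveTable is to take Spark out of the loop and open the connection directly through py4j. A sketch with the same redacted values (if this raises too, the problem is the driver, URL, or credentials rather than Spark; DriverManager's class-loader visibility through py4j can also produce false negatives, so treat it as a hint):

%pyspark
# Direct JDBC connection via py4j; bypasses Spark's JDBC data source entirely.
conn = sc._jvm.java.sql.DriverManager.getConnection(
    "jdbc:redshift://******:******/dev", "******", "******"
)
print(conn.getMetaData().getDatabaseProductName())
conn.close()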
I'm willing to try almost anything. Let me know if you have any ideas.