JDBC error connecting Zeppelin PySpark to Redshift

Date: 2019-02-22 18:25:05

Tags: apache-spark pyspark amazon-redshift amazon-emr apache-zeppelin

I am using: EMR 5.20, Zeppelin 0.8.0, Spark 2.4.0

I was able to add a Redshift interpreter, but I cannot pull data into a PySpark dataframe. All I want to do is copy a Redshift table into a Spark SQL dataframe.

I used wget to put the Redshift JDBC driver into the Zeppelin lib directory:

wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar

With the interpreter added, I can then query the database with the following:

%Redshift
select * from public.debtors

However, I cannot use the driver to pull the data into Spark. If there is an easier way, please let me know. I run this from a fresh start, because z.load needs to run first:

%dep 
z.load("/usr/lib/zeppelin/lib/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar")
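As a side note, %dep only takes effect when it runs before the Spark interpreter starts, which is why the restart from scratch is needed. An alternative worth knowing about (a sketch, assuming the jar path from the wget step above and the default EMR config location) is to register the jar once in zeppelin-env.sh so it is on the classpath for every interpreter launch:

```shell
# /etc/zeppelin/conf/zeppelin-env.sh (path may differ on your EMR setup)
# Ship the Redshift JDBC jar with every Spark interpreter launch,
# removing the need for %dep / z.load at the top of the notebook.
export SPARK_SUBMIT_OPTIONS="--jars /usr/lib/zeppelin/lib/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar"
```

The Spark interpreter has to be restarted after editing this file for the option to take effect.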


%pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SQLContext
from pyspark.sql import DataFrameReader
sql_context = SQLContext(sc)

import os 
print(os.chdir('/usr/lib/zeppelin/lib/'))
print(os.getcwd())

def redshift_to_spark(sql_context, user, host, port, database, redshift_password, sql_query, spark_table, partition_count=100):
    url = 'jdbc:postgres://{host}:{port}/{database}'.format(
        host=host,
        port=port,
        database=database
    )
    properties = {'user': user, 'password': redshift_password, 'driver': 'org.postgresql.Driver'} 
    data_frame = DataFrameReader(sql_context).jdbc(
        url=url, table=sql_query, properties=properties, numPartitions=partition_count
    )
    data_frame.registerTempTable(spark_table)
    return data_frame

redshift_dates_sql = "select * from dates"

dates = redshift_to_spark(
    sql_query=redshift_dates_sql,
    sql_context=sql_context,
    user="******",
    host="******",
    port=******,
    database='dev',
    redshift_password=******,
    spark_table='dates'
)
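One thing worth double-checking while debugging: org.postgresql.Driver only accepts URLs that start with `jdbc:postgresql://`. With a non-matching scheme such as `jdbc:postgres://`, the driver's connect() returns null rather than throwing, which Spark can then surface as a NullPointerException like the one below. A minimal sketch of a URL builder that fails fast on this (build_jdbc_url is a hypothetical helper, not part of the code above):

```python
def build_jdbc_url(host, port, database):
    """Build a PostgreSQL-protocol JDBC URL.

    org.postgresql.Driver silently rejects any URL that does not
    start with 'jdbc:postgresql://' (connect() returns null), so
    validate here instead of deep inside Spark's JDBC code path.
    """
    url = 'jdbc:postgresql://{host}:{port}/{database}'.format(
        host=host, port=port, database=database
    )
    assert url.startswith('jdbc:postgresql://'), 'bad JDBC scheme'
    return url

# Example with a hypothetical host:
# build_jdbc_url('example.redshift.amazonaws.com', 5439, 'dev')
# -> 'jdbc:postgresql://example.redshift.amazonaws.com:5439/dev'
```

Note also that the jar loaded above is Amazon's own driver, which expects `jdbc:redshift://` URLs with driver class `com.amazon.redshift.jdbc42.Driver`; pairing it with `org.postgresql.Driver` URLs only works if the PostgreSQL driver is also on the classpath.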

This has worked for a colleague of mine from the terminal. However, I run into problems when running it in Zeppelin. I get the following error:

Py4JJavaError: An error occurred while calling o249.jdbc.
: java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:238)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o249.jdbc.\n', JavaObject id=o251), <traceback object at 0x7fb781d01200>)

I am willing to try just about anything. Let me know if you have any ideas.

0 answers