Writing a Spark DataFrame to a Postgres database

Asked: 2016-08-08 09:40:13

Tags: postgresql apache-spark dataframe apache-spark-sql

My Spark cluster is configured as follows:

conf['SparkConfiguration'] = SparkConf() \
    .setMaster('yarn-client') \
    .setAppName("test") \
    .set("spark.executor.memory", "20g") \
    .set("spark.driver.maxResultSize", "20g") \
    .set("spark.executor.instances", "20") \
    .set("spark.executor.cores", "3") \
    .set("spark.memory.fraction", "0.2") \
    .set("user", "test_user") \
    .set("spark.executor.extraClassPath", "/usr/share/java/postgresql-jdbc3.jar")

When I try to write the DataFrame to the Postgres DB with the following code:

from pyspark.sql import DataFrameWriter
my_writer = DataFrameWriter(df)

url_connect = "jdbc:postgresql://198.123.43.24:1234"
table = "test_result"
mode = "overwrite"
properties = {"user":"postgres", "password":"password"}

my_writer.jdbc(url_connect, table, mode, properties)

I get the following error:

Py4JJavaError: An error occurred while calling o1120.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:278)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:50)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:50)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:49)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:278)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

Can anyone offer some suggestions? Thanks!

3 Answers:

Answer 0 (score: 4)

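Try write.jdbc, passing parameters that are created separately, outside the write.jdbc() call. Also check which port your Postgres instance accepts connections on: 5432 is the PostgreSQL default; on my machine Postgres 9.6 listens on 5432 and Postgres 8.4 on 5433. Note that the URL below also includes the database name after the port, and that the properties name the driver class explicitly.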

mode = "overwrite"
url = "jdbc:postgresql://198.123.43.24:5432/kockpit"
properties = {"user": "postgres","password": "password","driver": "org.postgresql.Driver"}
data.write.jdbc(url=url, table="test_result", mode=mode, properties=properties)
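
The URL in the question stops at the port, while the one in this answer also names a database and the properties carry a `driver` entry. A minimal sketch of collecting those arguments in one place follows. The helper names `build_jdbc_url` and `jdbc_write_args` are hypothetical (they are not part of PySpark), the host, port, and database values are copied from the answer, and insisting on a database name is a design choice of this sketch rather than a rule of the driver.

```python
def build_jdbc_url(host, port, database):
    """Return a URL of the form jdbc:postgresql://host:port/database."""
    if not database:
        # Design choice of this sketch: always name the target database.
        raise ValueError("a database name is expected in the JDBC URL")
    return "jdbc:postgresql://{}:{}/{}".format(host, int(port), database)


def jdbc_write_args(host, port, database, table, user, password,
                    mode="overwrite"):
    """Collect keyword arguments for DataFrameWriter.jdbc (hypothetical helper)."""
    return {
        "url": build_jdbc_url(host, port, database),
        "table": table,
        "mode": mode,
        "properties": {
            "user": user,
            "password": password,
            # Names the driver class explicitly, as in the answer above.
            "driver": "org.postgresql.Driver",
        },
    }


args = jdbc_write_args("198.123.43.24", 5432, "kockpit", "test_result",
                       "postgres", "password")
print(args["url"])

# With a live Spark session and a DataFrame named `data`, the call would be:
# data.write.jdbc(**args)
```

Keeping the arguments in a plain dictionary makes it easy to print and inspect the URL and properties before Spark is involved at all.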

Answer 1 (score: 1)

Have you downloaded the PostgreSQL JDBC driver? You can get it here: https://jdbc.postgresql.org/download.html

For the pyspark shell, use the SPARK_CLASSPATH environment variable:

$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark

To submit a script via spark-submit, use the --driver-class-path flag:

$ spark-submit --driver-class-path /path/to/downloaded/jar script.py
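
If the job is launched from a Python wrapper, the same command can be assembled programmatically. The sketch below is only an illustration: `spark_submit_command` is a hypothetical helper, and adding `--jars` (a standard spark-submit option that also distributes the jar to the executors) is an assumption that goes beyond what this answer suggests. The jar path is the placeholder from the answer, so the existence check is switched off in the demo.

```python
import os
import shlex


def spark_submit_command(jar_path, script, check_exists=True):
    """Build a spark-submit argument list that exposes a JDBC driver jar."""
    if check_exists and not os.path.isfile(jar_path):
        raise FileNotFoundError("JDBC driver jar not found: " + jar_path)
    return [
        "spark-submit",
        "--driver-class-path", jar_path,  # classpath of the driver JVM
        "--jars", jar_path,               # assumption: also ship to executors
        script,
    ]


# Placeholder path from the answer, so skip the existence check here.
cmd = spark_submit_command("/path/to/downloaded/jar", "script.py",
                           check_exists=False)
print(" ".join(shlex.quote(part) for part in cmd))
```

The list form can be handed directly to `subprocess.run(cmd)`, which avoids shell quoting problems when the jar path contains spaces.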

Answer 2 (score: 0)

Perhaps you can try passing the JDBC driver class explicitly (note that you may need to put the driver jar on the classpath of every Spark node):

df.write.option('driver', 'org.postgresql.Driver').jdbc(url_connect, table, mode, properties)
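
Before changing the Spark configuration, it can help to confirm that the jar on the classpath really contains the class named above. A jar is a zip archive, so a rough check needs only the standard library. This is a sketch under assumptions: `jar_contains_class` is a hypothetical helper, and the demo builds an empty stand-in jar in a temporary directory instead of reading a real driver jar.

```python
import os
import tempfile
import zipfile

DRIVER_CLASS = "org.postgresql.Driver"


def jar_contains_class(jar_path, class_name=DRIVER_CLASS):
    """Return True if the jar has a .class entry for class_name."""
    # org.postgresql.Driver is stored as org/postgresql/Driver.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Stand-in jar for the demo; point jar_contains_class at the real jar instead.
demo_dir = tempfile.mkdtemp()
demo_jar = os.path.join(demo_dir, "fake-postgresql.jar")
with zipfile.ZipFile(demo_jar, "w") as jar:
    jar.writestr("org/postgresql/Driver.class", b"")

found = jar_contains_class(demo_jar)
missing = jar_contains_class(demo_jar, "com.example.Missing")
print(found, missing)
```

Running the check against the jar path used in spark.executor.extraClassPath on each node is one way to rule out a wrong or missing file before looking at the Spark side.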