I installed Spark on a Windows machine and want to use it from Spyder. After some troubleshooting, the basics seem to work:
import os
os.environ["SPARK_HOME"] = r"D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
textFile = sc.textFile("D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6\\README.md")
textFile.count()
textFile.filter(lambda line: "Spark" in line).count()
sc.stop()
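This runs as expected. I now want to connect to a Postgres 9.3 database running on the same server. I downloaded the JDBC driver and put it in the folder D:\Analytics\Spark\spark_jars. Then I created a new file, D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6\conf\spark-defaults.conf, containing the following line: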
spark.driver.extraClassPath 'D:\\Analytics\\Spark\\spark_jars\\postgresql-9.3-1103.jdbc41.jar'
I ran the following code to test the connection:
import os
os.environ["SPARK_HOME"] = r"D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
df = (sqlContext
      .load(source="jdbc",
            url="jdbc:postgresql://[hostname]/[database]?user=[username]&password=[password]",
            dbtable="pubs")
      )
sc.stop()
But I get the following error:
Py4JJavaError: An error occurred while calling o22.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://uklonana01/stonegate?user=analytics&password=pMOe8jyd
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:113)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
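How can I check whether I downloaded the correct .jar file, or where else the error might be coming from?
Answer 0 (score: 2)
I tried the SPARK_CLASSPATH environment variable, but it did not work with Spark 1.6.
Other answers from the posts below suggest passing extra arguments to the pyspark command, and that works: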
Not able to connect to postgres using jdbc in pyspark shell
Apache Spark : JDBC connection not working
pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
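When the session is started from an IDE such as Spyder rather than from the pyspark command, the same flags can usually be passed through the PYSPARK_SUBMIT_ARGS environment variable. The following is a minimal sketch, not a tested recipe for every Spark version: the helper name build_submit_args is made up for illustration, the jar path is the one from the question written with forward slashes, and it assumes the path contains no spaces. Recent Spark versions expect the value to end with pyspark-shell.

```python
import os


def build_submit_args(jar_path, master="local[8]"):
    """Build a PYSPARK_SUBMIT_ARGS value mirroring the pyspark flags above.

    Hypothetical helper for illustration; assumes jar_path has no spaces.
    """
    parts = [
        "--master", master,
        "--conf", "spark.executor.extraClassPath=" + jar_path,
        "--driver-class-path", jar_path,
        "--jars", jar_path,
        # The value is expected to end with the "pyspark-shell" resource.
        "pyspark-shell",
    ]
    return " ".join(parts)


# Jar location taken from the question; adjust to the real path.
JAR = "D:/Analytics/Spark/spark_jars/postgresql-9.3-1103.jdbc41.jar"

# This must be set before the first SparkContext is created in the process,
# because the JVM is launched with these arguments only once.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(JAR)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

After setting the variable, create the SparkContext as in the question. If a context already exists in the same Python console, restart the console first, since the arguments are only read when the JVM starts.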
Answer 1 (score: 1)
Delete spark-defaults.conf and add SPARK_CLASSPATH to the system environment from Python, like this:
os.environ["SPARK_CLASSPATH"] = 'PATH\\TO\\postgresql-9.3-1101.jdbc41.jar'
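A mistyped path here produces the same "No suitable driver found" error as a missing jar, so it can help to check the file before starting Spark. The sketch below is one possible check, assuming the jar is a standard PostgreSQL JDBC driver whose driver class is org.postgresql.Driver; the function name jar_has_postgres_driver is hypothetical.

```python
import os
import zipfile

# Archive entry for the PostgreSQL JDBC driver class (org.postgresql.Driver).
DRIVER_CLASS_ENTRY = "org/postgresql/Driver.class"


def jar_has_postgres_driver(jar_path):
    """Return True if jar_path is a readable jar containing the driver class."""
    if not os.path.isfile(jar_path):
        return False  # wrong path, or the file was never downloaded
    if not zipfile.is_zipfile(jar_path):
        return False  # truncated or corrupted download (a jar is a zip file)
    with zipfile.ZipFile(jar_path) as jar:
        return DRIVER_CLASS_ENTRY in jar.namelist()


if __name__ == "__main__":
    # Placeholder path from the answer above; replace with the real location.
    print(jar_has_postgres_driver('PATH\\TO\\postgresql-9.3-1101.jdbc41.jar'))
```

A True result only shows that the file is a driver jar. It does not prove that Spark has picked it up on its classpath.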
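Answer 2 (score: 0)
Another way to connect pyspark to your PostgreSQL database.
1) Install Spark with pip: pip install pyspark
2) Download the latest version of the PostgreSQL JDBC connector from: https://jdbc.postgresql.org/download.html
3) Fill in this code with your database credentials: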
from __future__ import print_function
from pyspark.sql import SparkSession


def jdbc_dataset_example(spark):
    df = spark.read \
        .jdbc("jdbc:postgresql://[your_db_host]:[your_db_port]/[your_db_name]",
              "com_dim_city",
              properties={"user": "[your_user]", "password": "[your_password]"})
    df.createOrReplaceTempView("[your_table]")
    sqlDF = spark.sql("SELECT * FROM [your_table] LIMIT 10")
    sqlDF.show()


if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL data source example") \
        .getOrCreate()
    jdbc_dataset_example(spark)
    spark.stop()
Finally, run the application with:
spark-submit --driver-class-path /path/to/your_jdbc_jar/postgresql-42.2.6.jar --jars postgresql-42.2.6.jar /path/to/your_jdbc_jar/test_pyspark_to_postgresql.py
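If the jar is on the classpath and the error persists, naming the driver class explicitly in the connection properties sometimes helps, since Spark's JDBC source accepts a driver option. The sketch below only builds the arguments for spark.read.jdbc; the helper name postgres_jdbc_args and the sample host, port, and credentials are placeholders for illustration, not values from this setup.

```python
def postgres_jdbc_args(host, port, database, user, password):
    """Build the URL and properties dict for spark.read.jdbc.

    Hypothetical helper; the "driver" entry names the PostgreSQL driver class
    so that Spark loads it explicitly instead of relying on auto-discovery.
    """
    url = "jdbc:postgresql://{0}:{1}/{2}".format(host, port, database)
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    return url, properties


if __name__ == "__main__":
    # Placeholder values; substitute your own connection details.
    url, props = postgres_jdbc_args("localhost", 5432, "mydb", "me", "secret")
    print(url)
    # With an active SparkSession named spark, the call would look like:
    # df = spark.read.jdbc(url, "com_dim_city", properties=props)
```

Keeping the credentials in the properties dict rather than in the URL also keeps the password out of error messages that echo the connection string.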