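Connecting to Microsoft SQL Server with PySpark throws an error

Date: 2016-10-17 06:38:34

Tags: pyspark apache-spark-sql spark-dataframe pyspark-sql

Please guide me through the steps to connect to MS SQL Server and read data with PySpark. Below are my code and the error message I get when I try to load data from MS SQL Server.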

import urllib
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext

from pyspark.sql import SQLContext

APP_NAME = 'My Spark Application'

conf = SparkConf().setAppName(APP_NAME).setMaster("local[4]")
sc = SparkContext(conf=conf)

sqlcontext = SQLContext(sc)

jdbcDF = sqlcontext.read.format("jdbc") \
    .option("url", "jdbc:sqlserver:XXXX:1433") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("dbtable", "dbo.XXXX") \
    .option("user", "XXXX") \
    .option("password", "XXX") \
    .load()

********* ERROR **************** ***********************
teway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\spark-2.0.1-bin-hadoop2.6\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\spark-2.0.1-bin-hadoop2.6\python\lib\py4j-0.10.3-src.zip\py4j\protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o66.load.
: java.lang.NullPointerException
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:167)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)

2 answers:

Answer 0 (score: 0)

  1. Download the mssql-jdbc-x.x.x.jrex.jar file (https://docs.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server?view=sql-server-ver15).
  2. Run the following code:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf, SQLContext

appName = "PySpark SQL Server Example - via JDBC"
master = "local[*]"
conf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("spark.driver.extraClassPath","path/to/mssql-jdbc-x.x.x.jrex.jar")
sc = SparkContext.getOrCreate(conf=conf)
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession

database = "mydatabase"
table = "dbo.mytable"
user = "username"
password  = "password"

jdbcDF = spark.read.format("jdbc") \
    .option("driver" , "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("url", f"jdbc:sqlserver://serverip:1433;databaseName={database}") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .load()

jdbcDF.show()
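
One detail worth noting in the code above is the shape of the URL: the Microsoft driver expects `jdbc:sqlserver://host:port;property=value`, with `//` before the host and `;`-separated properties after it. The sketch below builds such a string. `build_sqlserver_url` is a hypothetical helper (not part of PySpark or the driver), the host and database names are placeholders, and values containing special characters such as `;` are not escaped here.

```python
def build_sqlserver_url(host, port=1433, **properties):
    """Return a JDBC URL of the form jdbc:sqlserver://host:port;key=value;..."""
    url = f"jdbc:sqlserver://{host}:{port}"
    # Connection properties are appended as ;key=value pairs.
    for key, value in properties.items():
        url += f";{key}={value}"
    return url


# Placeholder host and database, matching the names used in the answer.
url = build_sqlserver_url("serverip", databaseName="mydatabase")
print(url)  # jdbc:sqlserver://serverip:1433;databaseName=mydatabase
```

A URL without `//`, such as the `jdbc:sqlserver:XXXX:1433` in the question, does not match this form, which may be why the load fails there.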

Answer 1 (score: -1)

The following solution worked for me:

Place the mssql-jdbc-7.0.0.jre8.jar file in Spark's jars subfolder (for example: C:\spark\spark-2.2.2-bin-hadoop2.7\jars), or use whichever version of the driver jar matches your system.
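
Before starting Spark, it can help to confirm that the driver jar really is in that folder. This is a minimal sketch, assuming the Spark install directory is available through the `SPARK_HOME` environment variable; `find_mssql_jars` is a hypothetical helper name, not a Spark API.

```python
import glob
import os


def find_mssql_jars(spark_home):
    """List mssql-jdbc*.jar files in <spark_home>/jars, sorted by name."""
    pattern = os.path.join(spark_home, "jars", "mssql-jdbc*.jar")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Assumption: SPARK_HOME is set; otherwise the current directory is searched.
    home = os.environ.get("SPARK_HOME", ".")
    jars = find_mssql_jars(home)
    print(jars if jars else f"No mssql-jdbc jar found under {home}")
```

An empty result suggests the jar was copied to a different Spark installation than the one PySpark is using.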

Then use the following command to connect to MS SQL Server and create a Spark DataFrame:

dbData = spark.read.jdbc("jdbc:sqlserver://servername;databaseName=ExampleDB;user=username;password=password", "tablename")
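
As an alternative to embedding the credentials in the URL, `DataFrameReader.jdbc` also accepts a `properties` dictionary. The sketch below shows that variant; `jdbc_read_args` and `load_table` are hypothetical helper names, the server, database, and table names are the placeholders from the command above, and an existing `SparkSession` named `spark` is assumed.

```python
def jdbc_read_args(server, database, table, user, password):
    """Return (url, table, properties) suitable for DataFrameReader.jdbc."""
    url = f"jdbc:sqlserver://{server};databaseName={database}"
    # Credentials and the driver class go in properties instead of the URL.
    properties = {
        "user": user,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }
    return url, table, properties


def load_table(spark, server, database, table, user, password):
    """Read one table into a DataFrame using an existing SparkSession."""
    url, table, properties = jdbc_read_args(server, database, table, user, password)
    return spark.read.jdbc(url, table, properties=properties)
```

With a running session, this would be called as `dbData = load_table(spark, "servername", "ExampleDB", "tablename", "username", "password")`. Keeping the password out of the URL string also keeps it out of anything that logs the URL.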