JDBC load fails the first time on EMR Spark, then works

Asked: 2017-02-28 17:29:03

Tags: hadoop apache-spark spark-dataframe emr elastic-map-reduce

I'm using spark-shell with Spark 2.1.0 on AWS Elastic Map Reduce 5.3.1 to load data from a Postgres database. `loader.load` always fails on the first call and then succeeds on the second. Why does this happen?

[hadoop@[SNIP] ~]$ SPARK_PRINT_LAUNCH_COMMAND=1 spark-shell --driver-class-path ~/postgresql-42.0.0.jar 
Spark Command: /etc/alternatives/jre/bin/java -cp /home/hadoop/postgresql-42.0.0.jar:/usr/lib/spark/conf/:/usr/lib/spark/jars/*:/etc/hadoop/conf/ -Dscala.usejavacp=true -Xmx640M -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p org.apache.spark.deploy.SparkSubmit --conf spark.driver.extraClassPath=/home/hadoop/postgresql-42.0.0.jar --class org.apache.spark.repl.Main --name Spark shell spark-shell
========================================
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/28 17:17:52 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/02/28 17:18:56 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://[SNIP]
Spark context available as 'sc' (master = yarn, app id = application_1487878172787_0014).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val loader = spark.read.format("jdbc") // connection options removed
loader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@46067a74

scala> loader.load
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 48 elided

scala> loader.load
res1: org.apache.spark.sql.DataFrame = [id: int, fsid: string ... 4 more fields]

2 Answers:

Answer 0 (score: 0):

I was facing the same issue. I was trying to connect to Vertica through Spark over JDBC, using spark-shell with Spark 2.2.0 and Java 1.8.

External jars used for the connection: vertica-8.1.1_spark2.1_scala2.11-20170623.jar and vertica-jdbc-8.1.1-0.jar.

Connection code:

import java.sql.DriverManager
import java.util.Properties
import com.vertica.jdbc.Driver

// Connection details (placeholders)
val jdbcUsername = "<username>"
val jdbcPassword = "<password>"
val jdbcHostname = "<vertica server>"
val jdbcPort = <vertica port>
val jdbcDatabase = "<vertica DB>"
val jdbcUrl = s"jdbc:vertica://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}"

// Properties passed to DriverManager
val connectionProperties = new Properties()
connectionProperties.put("user", jdbcUsername)
connectionProperties.put("password", jdbcPassword)

// The first attempt fails with "No suitable driver":
val connection = DriverManager.getConnection(jdbcUrl, connectionProperties)
java.sql.SQLException: No suitable driver found for jdbc:vertica://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}

  at java.sql.DriverManager.getConnection(Unknown Source)
  at java.sql.DriverManager.getConnection(Unknown Source)
  ... 56 elided

If I run the same command a second time, I get the following output and the connection is established:

scala> val connection = DriverManager.getConnection(jdbcUrl, connectionProperties)
connection: java.sql.Connection = com.vertica.jdbc.VerticaJdbc4ConnectionImpl@7d994c
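One way to avoid the failing first attempt in this situation is to register the driver with DriverManager explicitly before the first getConnection call. This is not part of the original answer, just a common JDBC workaround; a minimal sketch, assuming the Vertica JDBC jar is already on the driver classpath:

// Hedged workaround sketch (not from the original answer): force the driver
// class to load and register itself with java.sql.DriverManager up front,
// so the first getConnection call can already find a suitable driver.
Class.forName("com.vertica.jdbc.Driver")
val connection = DriverManager.getConnection(jdbcUrl, connectionProperties)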

Answer 1 (score: 0):

I hit this problem today with PySpark and the SQL Server JDBC driver. At first I built a simple workaround: catch the Py4JJavaException and retry, since the call succeeds on the second attempt.

The trick is to specify the driver class in the DataFrameReader.jdbc call.

With PySpark:

spark.read.jdbc(..., properties={'driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver'})

Then all that is needed is:

spark-submit --jars s3://somebucket/sqljdbc42.jar script.py

With Scala, building on @Raje's example above: connectionProperties.put("driver", "...")
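Applied back to the original Postgres question, a minimal Scala sketch of this fix might look like the following. The connection details and table name are placeholders; org.postgresql.Driver is the standard driver class shipped in the postgresql JDBC jar:

import java.util.Properties

// Placeholder connection details for the asker's Postgres setup
val url = "jdbc:postgresql://<host>:5432/<database>"
val table = "<schema.table>"

val props = new Properties()
props.put("user", "<username>")
props.put("password", "<password>")
// Naming the driver class explicitly avoids the "No suitable driver"
// failure on the first load attempt.
props.put("driver", "org.postgresql.Driver")

val df = spark.read.jdbc(url, table, props)

// Equivalently, with the option-style reader the asker used:
// spark.read.format("jdbc").option("url", url).option("dbtable", table)
//   .option("driver", "org.postgresql.Driver").load()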