I am able to connect to a HANA database from Spark using the Python JDBC DataFrame API. dataframe.printSchema() works, but an action like dataframe.show() throws an error saying the connection is not serializable. How do we make the connection serializable in PySpark? The code used is below:
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext(appName="hdfspush")
sqlctx = SQLContext(sc)
df = sqlctx.read.format('jdbc').options(driver='com.sap.db.jdbc.Driver',url=urlname,dbtable='abcd').load()
df.printSchema()
df.show()
Below is the error message:
Traceback (most recent call last):
File "C:/spark-1.4.1-bin-hadoop2.6/bin/testhana4.py", line 15, in <module>
df.show()
File "C:\spark-1.4.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\sql\dataframe.py", line 258, in show
File "C:\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
File "C:\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o25.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: com.sap.db.jdbc.topology.Host
Serialization stack:
- object not serializable (class: com.sap.db.jdbc.topology.Host, value: saphdbdev03.isus.emc.com:30415)
- writeObject data (class: java.util.ArrayList)
- object (class java.util.ArrayList, [saphdbdev03:30415])
- writeObject data (class: java.util.Hashtable)
- object (class java.util.Properties, {dburl=jdbc:sap://saphdbdev03:30415, user=SAPSR3, password=*****, url=jdbc:sap://saphdbdev03.isus.emc.com:30415?user=SAPSR3&password=******,
- field (class: org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1, name: properties$1, type: class java.util.Properties)
- object (class org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1, <function0>)
- field (class: org.apache.spark.sql.jdbc.JDBCRDD, name: org$apache$spark$sql$jdbc$JDBCRDD$$getConnection, type: interface scala.Function0)
- object (class org.apache.spark.sql.jdbc.JDBCRDD, JDBCRDD[0] at showString at NativeMethodAccessorImpl.java:-2)
- field (class: org.apache.spark.NarrowDependency, name: _rdd, type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.OneToOneDependency, org.apache.spark.OneToOneDependency@2e079958)
- writeObject data (class: scala.collection.immutable.$colon$colon)
- object (class scala.collection.immutable.$colon$colon, List(org.apache.spark.OneToOneDependency@2e079958))
- field (class: org.apache.spark.rdd.RDD, name: org$apache$spark$rdd$RDD$$dependencies_, type: interface scala.collection.Seq)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[1] at showString at NativeMethodAccessorImpl.java:-2)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (MapPartitionsRDD[1] at showString at NativeMethodAccessorImpl.java:-2,<function2>))
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:878)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1426)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/12/11 14:21:34 INFO spark.SparkContext: Invoking stop() from shutdown hook
15/12/11 14:21:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
15/12/11 14:21:35 INFO ui.SparkUI: Stopped Spark web UI at http://10.30.117.16:4040
15/12/11 14:21:35 INFO scheduler.DAGScheduler: Stopping DAGScheduler
Answer 0 (score: 0)
Resolved this by writing a new Scala program that handles serialization of the connection properties.
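The stack trace shows that the non-serializable com.sap.db.jdbc.topology.Host object ends up inside the java.util.Properties that Spark ships to executors, and that the URL in those properties has user and password embedded in it (url=jdbc:sap://...?user=SAPSR3&password=...). A commonly suggested workaround, sketched below in PySpark rather than Scala, is to keep the JDBC URL plain and pass credentials as separate options (Spark forwards extra options to the driver as connection properties). This is a hedged sketch under that assumption, not a confirmed fix for every HANA driver version; the host, user, password, and table names are placeholders.

```python
def jdbc_options(url, user, password, table):
    """Build the option dict for sqlctx.read.format('jdbc').options(**opts).

    The URL is kept free of credentials; user and password are passed as
    separate options so they travel as plain string properties.
    """
    return {
        'driver': 'com.sap.db.jdbc.Driver',
        'url': url,            # plain jdbc:sap://host:port, no ?user=...&password=...
        'user': user,          # credentials as separate options
        'password': password,
        'dbtable': table,
    }

# Placeholder values standing in for the real connection details:
opts = jdbc_options('jdbc:sap://hostname:30415', 'SAPSR3', 'secret', 'abcd')
# df = sqlctx.read.format('jdbc').options(**opts).load()
```

If this still fails, another avenue reported for this exact exception is upgrading the HANA JDBC driver (ngdbc.jar), since the serializability of what the driver stores in Properties depends on the driver version.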