Problems with Spark SQL, Postgres, and DataFrames

Asked: 2017-06-20 17:38:56

Tags: postgresql scala apache-spark apache-spark-sql

I am trying to pull records from Postgres via Spark SQL and create an RDD from that data.

So far I have written code that queries Postgres and loads the table from Postgres into a SQL context, but I cannot confirm that the data is actually there, or what form it is in.

Below is the code I am using, but I cannot seem to inspect the DataFrame, because doing so throws a null pointer exception. The same thing happens if I convert it to an RDD and try to look at the RDD's first row.

Does anyone know what I am doing wrong? I want to confirm that the rows hold the data I expect. Calling count on the DataFrame returns a number of rows that I assume is the number of records I pulled, and the column count matches what I expect, but I cannot figure out how to view the rows to confirm that the records are actually in the DataFrame.
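For reference, here is a minimal sketch of every inspection call I can think of (take, head, and printSchema are additions here that I have not shown above, but I would expect them to behave like first and show):

df.count()        // works: returns a plausible record count
df.columns.length // works: matches the expected number of columns
df.printSchema()  // schema-only, so it should not trigger the failure
df.show(5)        // throws the NullPointerException below
df.first()        // throws the NullPointerException below
df.take(5)        // untested, but presumably fails the same way
df.head()         // untested, but presumably fails the same way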

The data looks like this in Postgres:

[patch_id: string, number: bigint, products: array, type: string, restart_required: string, superseded: array]

[ RECORD 1 ]
patch_id            | PATCH-2013-3862
number              | 2872339
products            | 
type                | Vendor Fix
restart_required    | 
superseded          |

[ RECORD 2 ]
patch_id            | PATCH-2015-2368
number              | 3072631
products            | 
type                | Vendor Fix
restart_required    | 
superseded          | 

The code:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(context)

// Load the patches table from Postgres over JDBC, pushed down as a subquery
val df = sqlContext.read.format("jdbc")
  .option("url", config.get(CONFIG_VAR_CONNECTION_STRING))
  .option("dbtable", "(SELECT * FROM patches) patches")
  .option("driver", "org.postgresql.Driver")
  .load()

logger.error("Count: " + df.count())
logger.error("Column Count: " + df.columns.length)
logger.error("First: " + df.first()) // Fails here, even fails with .show()

val patchInfoRDD = df.rdd.map(_.mkString(","))

logger.error("First: " + patchInfoRDD.first()) // Fails here
  

Caused by: java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:422)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:498)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328)
    at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
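My current suspicion is that the empty array columns (products and superseded) are involved: the trace points into JDBCRDD's row-materialization code, which is where the driver's array values would be unwrapped, and NULL arrays are the only unusual thing about my data. A workaround sketch I am considering (untested) is to cast the arrays to text inside the pushdown subquery, so the JDBC source only ever sees plain strings:

// Workaround sketch: cast the Postgres array columns to text in the
// subquery so Spark never has to decode a java.sql.Array.
val dfNoArrays = sqlContext.read.format("jdbc")
  .option("url", config.get(CONFIG_VAR_CONNECTION_STRING))
  .option("dbtable",
    "(SELECT patch_id, number, products::text AS products, type, " +
    "restart_required, superseded::text AS superseded FROM patches) patches")
  .option("driver", "org.postgresql.Driver")
  .load()

dfNoArrays.show(5) // should no longer touch the array-decoding path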

0 Answers:

There are no answers yet.