Null pointer exception when trying to use a persisted table in Spark Streaming

Time: 2015-09-08 12:39:17

Tags: apache-spark apache-spark-sql spark-streaming

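I create "gpsLookUpTable" at startup and persist it so that I don't have to redo the mapping over and over. However, when I try to access it inside foreach, I get a null pointer exception. Any help is appreciated. Thanks.

Here is the code snippet: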

def main(args: Array[String]): Unit = {

  val conf = new SparkConf() ...

  val sc = new SparkContext(conf)
  val ssc = new StreamingContext(sc, Seconds(20))
  val sqc = new SQLContext(sc)

  //////Trying to cache table here to use it below
  val gpsLookUpTable = MapInput.cacheMappingTables(sc, sqc).persist(StorageLevel.MEMORY_AND_DISK_SER_2)
  //sc.broadcast(gpsLookUpTable)
  ssc.textFileStream("hdfs://localhost:9000/inputDirectory/")
    .foreachRDD { rdd =>
      if (!rdd.partitions.isEmpty) {

        val allRows = sc.textFile("hdfs://localhost:9000/supportFiles/GeoHashLookUpTable")
        sqc.read.json(allRows).registerTempTable("GeoHashLookUpTable")
        val header = rdd.first().split(",")
        val rowsWithoutHeader = Utils.dropHeader(rdd)

        rowsWithoutHeader.foreach { row =>

          val singleRowArray = row.split(",")
          singleRowArray.foreach(println)
          (header, singleRowArray).zipped
            .foreach { (x, y) =>
              ///Trying to access persisted table but getting null pointer exception
              val selectedRow = gpsLookUpTable
                .filter("geoCode LIKE '" + GeoHash.subString(lattitude, longitude) + "%'")
                .withColumn("Distance", calculateDistance(col("Lat"), col("Lon")))
                .orderBy("Distance")
                .select("TrackKM", "TrackName").take(1)
              if (selectedRow.length != 0) {
                // do something
              } else {
                // do something
              }
            }
        }
      }
    }

1 Answer:

Answer 0 (score: 0)

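I assume you are running on a cluster; your foreach will then run as a closure on other nodes. The NullPointerException is thrown because that closure runs on a node where gpsLookUpTable has not been initialized. You evidently tried to broadcast gpsLookUpTable with this commented-out line: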
//sc.broadcast(gpsLookUpTable) 

However, the broadcast needs to be bound to a variable, basically like this:

val tableBC = sc.broadcast(gpsLookUpTable) 

Then, inside the foreach, you would replace this:

foreach { (x, y) => 
///Trying to access persisted table but getting null pointer exception 
val selectedRow = gpsLookUpTable 

with this:

foreach { (x, y) => 
///Access the table through the broadcast variable instead
val selectedRow = tableBC.value 

which effectively gives you access to the broadcast value.
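
One caveat: in the question, gpsLookUpTable appears to be a DataFrame (it is queried with filter, withColumn, orderBy and select). DataFrame and RDD operations can only be issued from the driver, so a broadcast DataFrame handle may still fail inside an executor-side foreach. A common workaround, assuming the table is small enough to fit in memory, is to collect it into a plain Scala collection on the driver, broadcast that collection, and do the lookup with ordinary collection operations. The sketch below shows only the local lookup part. GpsEntry, GpsLookup, nearest and squaredDistance are hypothetical names, the field types are guesses, and the squared Euclidean distance is just a stand-in for the question's calculateDistance, whose definition is not shown.

```scala
// Hypothetical row type; field names mirror the columns used in the question,
// and the field types are assumptions.
final case class GpsEntry(geoCode: String, lat: Double, lon: Double,
                          trackKM: Double, trackName: String)

object GpsLookup {
  // Assumed stand-in for the question's calculateDistance (not shown in the source).
  def squaredDistance(e: GpsEntry, lat: Double, lon: Double): Double = {
    val dLat = e.lat - lat
    val dLon = e.lon - lon
    dLat * dLat + dLon * dLon
  }

  // Roughly the local equivalent of
  //   filter("geoCode LIKE '<prefix>%'").orderBy("Distance").select(...).take(1)
  // Returns None when no entry matches the geohash prefix.
  def nearest(table: Seq[GpsEntry], geoPrefix: String,
              lat: Double, lon: Double): Option[(Double, String)] = {
    val candidates = table.filter(_.geoCode.startsWith(geoPrefix))
    if (candidates.isEmpty) None
    else {
      val best = candidates.minBy(e => squaredDistance(e, lat, lon))
      Some((best.trackKM, best.trackName))
    }
  }
}
```

With this in place, the driver would build a Seq[GpsEntry] once (for example by calling collect() on gpsLookUpTable and mapping each Row to a GpsEntry), bind it with val tableBC = sc.broadcast(localTable), and call GpsLookup.nearest(tableBC.value, prefix, lat, lon) inside the foreach. If the table is too large to collect, restructuring the lookup as a join on the driver side is the usual alternative.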