I am using Spark 2.1 Streaming to process event data from Kafka. After aggregating the data, I want to enrich it with reference data stored in HDFS (Parquet files).
The driver code is below.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ss: SparkSession = SparkSession.builder()
  .appName("app name").master("local[2]")
  .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
  .config("spark.sql.session.timeZone", "UTC")
  .getOrCreate()

val sc  = ss.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))

// read the data from Kafka here .....

SomeService.aggregate(kafkaInputStream).foreachRDD(rdd => {
  // load the reference data once per batch
  val df = ss.read.parquet(filePath + "/*.parquet")
  println("Record Count in DF: " + df.count())

  rdd.foreachPartition(partition => {
    val futures = partition.map(event => {
      sentMsgsNo.add(1L)
      val eventEnriched = enrichment(event, df)
      kafkaSinkVar.value.sendCef(eventEnriched)
    })

    // by calling get() on the futures, we make sure to wait for all
    // producers started during this partition to finish before moving on
    futures.foreach(f => {
      if (f.get() == null) {
        failedSentMsgNo.add(1L)
      } else {
        confirmedSentMsgsNo.add(1L)
      }
    })
  })
})
def enrichment(event: SomeEventType, df: DataFrame): String = {
  ...
  try {
    df.select(col("id")).first().getString(0)
  } catch {
    case e: Exception =>
      println("no record found")
      null
  }
}
Basically, for every RDD I load the reference data into a DataFrame and pass that DataFrame along to enrich each record based on some ID. The code runs without errors, but the enrichment never happens. The problem is that the df (DataFrame) is an invalid tree. What is wrong with my code/logic?
Another question: is this the right way to do it? Basically, I only want to read the reference data from HDFS once per partition, not once per record.
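To make the second question concrete, the direction I was considering is to collect the reference data once per batch into a plain lookup map on the driver and broadcast it, so the executors never touch a DataFrame. This is only a rough sketch, not working code; the "value" column and the event.id field are made up for illustration:

SomeService.aggregate(kafkaInputStream).foreachRDD(rdd => {
  // build an id -> value lookup map on the driver
  // (assumes the reference data is small enough to fit in driver memory)
  val refMap: Map[String, String] = ss.read.parquet(filePath + "/*.parquet")
    .select(col("id"), col("value"))
    .collect()
    .map(r => r.getString(0) -> r.getString(1))
    .toMap
  val refMapVar = sc.broadcast(refMap)

  rdd.foreachPartition(partition => {
    partition.foreach(event => {
      // look up the reference value by id instead of querying a DataFrame on the executor
      val enrichedValue = refMapVar.value.getOrElse(event.id, "")
      kafkaSinkVar.value.sendCef(enrichedValue)
    })
  })

  // release the broadcast once the batch has been processed
  refMapVar.unpersist()
})

Would something like this be the preferred pattern, or is there a better way (e.g. a join) when the reference data is too large to collect?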
Thanks!