Enriching Spark Streaming with data from HDFS

Posted: 2019-05-10 15:25:24

Tags: scala apache-spark dataframe apache-spark-sql rdd

I am using Spark 2.1 Streaming to process event data coming from Kafka. After aggregating the data, I want to enrich it with reference data stored in HDFS (Parquet files).

The driver code looks like this.

val ss: SparkSession = SparkSession.builder()
    .appName("app name").master("local[2]")
    .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()

val sc = ss.sparkContext 
val ssc = new StreamingContext(sc, Seconds(5))

//read the data from Kafka here .....

SomeService.aggregate(kafkaInputStream).foreachRDD(rdd => {
    val df = ss.read.parquet( filePath + "/*.parquet" )
    println("Record Count in DF: " + df.count())

    rdd.foreachPartition(partition => {
        val futures = partition.map(event => {
            sentMsgsNo.add(1L)
            val eventEnriched = enrichment(event, df)
            kafkaSinkVar.value.sendCef(eventEnriched)
        })
        // by calling get() on the futures, we make sure to wait for all
        // Producers started during this partition
        // to finish before moving on.
        futures.foreach(f => {
            if (f.get() == null) {
                failedSentMsgNo.add(1L)
            } else {
                confirmedSentMsgsNo.add(1L)
            }
        })
    })
})
def enrichment(event: SomeEventType, df: DataFrame): String = {
    ...
    try {
        df.select(col("id")).first().getString(0) 
    } catch {
        case e: Exception => println("no record found"); null
    }
}

Basically, for each RDD I load the reference data into a DataFrame and pass that DataFrame along to enrich each record based on some ID. The code runs without errors, but the enrichment never happens: the problem is that the df (DataFrame) is an invalid tree. What is wrong with my code/logic?

Another question: is this even the right way to do it? Basically, I only want to read the reference data from HDFS once per partition, not once per record.
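
To illustrate that second question, this is roughly what I have in mind (just a rough sketch, not tested; the "value" column and the eventId helper are placeholders for my real schema): load the reference data once per batch on the driver, broadcast it, and do a plain map lookup per event on the executors instead of calling df.select there.

SomeService.aggregate(kafkaInputStream).foreachRDD(rdd => {
    // Runs on the driver once per batch: read the Parquet reference data,
    // collect it into a map (assuming it fits in driver memory) and broadcast it.
    val refMap = ss.read.parquet(filePath + "/*.parquet")
        .select("id", "value")   // "value" is a placeholder for the real reference column
        .collect()
        .map(row => row.getString(0) -> row.getString(1))
        .toMap
    val refBroadcast = ss.sparkContext.broadcast(refMap)

    rdd.foreachPartition(partition => {
        partition.foreach(event => {
            // Plain map lookup on the executor; no DataFrame is used here.
            val enrichedValue = refBroadcast.value.get(eventId(event)) // eventId is a placeholder
            // ... build the enriched message and send it via kafkaSinkVar as before
        })
    })
})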

Thanks!

0 Answers:

No answers