Question

JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(1000));

My HDFS directory contains JSON files.

Answer 1

You can use textFileStream to read the files as text and convert them afterwards.

val dstream = ssc.textFileStream("path to hdfs directory")

This gives you a DStream[String], which is a collection of RDD[String].

You can then get the RDD for each batch interval:

dstream.foreachRDD(rdd => {
  // Apply a transformation or any other operation to each RDD here
  spark.read.json(rdd) // converts the RDD to a DataFrame (assumes `spark` is a SparkSession in scope)
})

ssc.start()             // Start the computation
ssc.awaitTermination()   // Wait for the computation to terminate
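
Since the question uses the Java API, here is a rough Java equivalent of the Scala snippets above. It is only a sketch under a few assumptions: Spark 2.x or later (where SparkSession is available), a placeholder directory path, and a hypothetical class name HdfsJsonStream. The isEmpty check is an addition of mine to skip batches in which no data arrived.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Hypothetical class name, used only for illustration.
public class HdfsJsonStream {
    public static void main(String[] args) throws InterruptedException {
        // The master URL is expected to be supplied by spark-submit.
        SparkConf conf = new SparkConf().setAppName("HdfsJsonStream");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(1000));

        // Placeholder path: replace with your own HDFS directory.
        JavaDStream<String> lines = ssc.textFileStream("hdfs:///path/to/json/dir");

        lines.foreachRDD(rdd -> {
            // Assumption: skip batches in which no data arrived.
            if (!rdd.isEmpty()) {
                // Reuse (or lazily create) a SparkSession from the RDD's configuration.
                SparkSession spark = SparkSession.builder()
                        .config(rdd.context().getConf())
                        .getOrCreate();
                // Each line is parsed as one JSON record.
                Dataset<Row> df = spark.read().json(rdd);
                df.show();
            }
        });

        ssc.start();            // Start the computation
        ssc.awaitTermination(); // Wait for the computation to terminate
    }
}
```

Note that spark.read().json(JavaRDD<String>) is marked deprecated in newer Spark releases in favour of the Dataset<String> overload, although it still works.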

Hope this helps.

How do I read data from HDFS using Spark Streaming?
