I am trying to create a custom streaming receiver that reads from an RDBMS.
val dataDStream = ssc.receiverStream(new inputReceiver())
dataDStream.foreachRDD((rdd: RDD[String], time: Time) => {
  val newdata = rdd.flatMap(x => x.split(","))
  newdata.foreach(println) // ******* This line has a problem: newdata has no records
})
ssc.start()
ssc.awaitTermination()
}
import org.apache.spark.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class inputReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("RDBMS data Receiver") {
      override def run() {
        receive()
      }
    }.start()
  }

  def onStop() {
  }

  def receive() {
    val sqlcontext = SQLContextSingleton.getInstance()
    // **** I am assuming something is wrong in the following code
    val DF = sqlcontext.read.json("/home/cloudera/data/s.json")
    for (data <- DF.rdd) {
      store(data.toString())
    }
    logInfo("Stopped receiving")
    restart("Trying to connect again")
  }
}
The code runs without errors, but no records from the DataFrame are printed.
I am using Spark 1.6 with Scala.
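For reference, SQLContextSingleton is not shown above; it is presumably the lazily initialized singleton from the Spark 1.6 Streaming programming guide. A minimal sketch, assuming a no-argument getInstance() that picks up the already-running SparkContext:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Hypothetical helper, modeled on the Spark 1.6 streaming guide's
// SQLContextSingleton; the no-arg getInstance() is an assumption made
// to match the call sites in the receiver above.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(): SQLContext = synchronized {
    if (instance == null) {
      instance = new SQLContext(SparkContext.getOrCreate())
    }
    instance
  }
}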
Answer 0 (score: 0)
To get the code working, you should change the following:
def receive() {
  val sqlcontext = SQLContextSingleton.getInstance()
  val DF = sqlcontext.read.json("/home/cloudera/data/s.json")
  // **** this: collect to the driver first, so that store() runs inside
  // the receiver instead of inside a distributed foreach on the executors
  DF.rdd.collect.foreach(data => store(data.toString()))
  logInfo("Stopped receiving")
  restart("Trying to connect again")
}
HOWEVER, this is not advisable, because the driver will process all of the data in the JSON file, and it does not properly address the receiver's reliability.
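If a custom receiver is kept anyway, a slightly more defensive shape is possible. A sketch only, assuming the same SQLContextSingleton and file path, which batches the rows into a single store() call and restarts on failure:

def receive(): Unit = {
  try {
    val sqlcontext = SQLContextSingleton.getInstance()
    // toJSON turns the DataFrame into an RDD[String], one JSON document per row
    val rows = sqlcontext.read.json("/home/cloudera/data/s.json").toJSON.collect()
    // Hand the whole batch to Spark in one call; a truly reliable receiver
    // would acknowledge its source only after store() returns
    store(rows.toIterator)
  } catch {
    case e: Exception => restart("Error reading data, restarting receiver", e)
  }
}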
I suspect Spark Streaming is not a good fit for your use case. Reading between the lines, either you really are streaming, in which case you need a proper producer, or you are reading data that was dumped from the RDBMS into JSON, in which case you don't need Spark Streaming at all.
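For the second case, a plain batch job is enough. A minimal sketch (the file path comes from the question; the object and app name are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object BatchJsonJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdbms-json-batch"))
    val sqlContext = new SQLContext(sc)

    // Read the RDBMS dump directly; no receiver or StreamingContext needed
    val df = sqlContext.read.json("/home/cloudera/data/s.json")
    df.show() // prints records, which the streaming version never did

    sc.stop()
  }
}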