Spark Streaming Empty RDD Problem

Date: 2017-02-14 01:17:11

Tags: scala apache-spark spark-streaming

I am trying to create a custom streaming receiver that reads from an RDBMS.

val dataDStream = ssc.receiverStream(new inputReceiver())
dataDStream.foreachRDD((rdd: RDD[String], time: Time) => {
  val newdata = rdd.flatMap(x => x.split(","))
  newdata.foreach(println)  // ******* This line has the problem: newdata has no records
})

ssc.start()
ssc.awaitTermination()

class inputReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("RDBMS data Receiver") {
      override def run() {
        receive()
      }
    }.start()
  }
  def onStop() {
  }

  def receive() {
    val sqlcontext = SQLContextSingleton.getInstance()

    // **** I am assuming something wrong in following code
    val DF = sqlcontext.read.json("/home/cloudera/data/s.json")
    for (data <- DF.rdd) {
      store(data.toString())
    }
    logInfo("Stopped receiving")
    restart("Trying to connect again")
  }
}

The code executes without errors, but no records from the DataFrame are ever printed.

I am using Spark 1.6 with Scala.

1 Answer:

Answer 0 (score: 0)

To make your code work, you should change the following:

def receive() {
  val sqlcontext = SQLContextSingleton.getInstance()
  val DF = sqlcontext.read.json("/home/cloudera/data/s.json")

  // **** this: collect the rows to the driver before calling store()
  DF.rdd.collect.foreach(data => store(data.toString()))

  logInfo("Stopped receiving")
  restart("Trying to connect again")
}

HOWEVER, this is not advisable: the driver ends up processing all of the data in the json file (collect pulls every row back to it), and the design still does not properly address receiver reliability. The reason your original version stored nothing is that for (data <- DF.rdd) { store(...) } runs its closure on the executors, where the receiver's store() cannot hand data to Spark Streaming; collecting first moves the loop onto the driver, where the receiver actually runs.
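
If you do stick with a custom receiver, the usual shape is to read the source directly inside receive() and hand records to Spark in batches, rather than creating a SQLContext inside the receiver. Here is a minimal sketch under that assumption; the FileLineReceiver name, the batch size of 100, and reading the dump line by line with scala.io.Source are all illustrative, not from the original post:

import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that reads a text source directly and stores records in blocks.
class FileLineReceiver(path: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("FileLineReceiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {}

  private def receive(): Unit = {
    try {
      val source = Source.fromFile(path)
      try {
        var batch = new ArrayBuffer[String]()
        for (line <- source.getLines()) {
          batch += line
          if (batch.size >= 100) {  // hand records to Spark in blocks, not one by one
            store(batch)            // buffered store overload: one block per call
            batch = new ArrayBuffer[String]()
          }
        }
        if (batch.nonEmpty) store(batch)
      } finally {
        source.close()
      }
    } catch {
      case e: Exception => restart("Error reading source, retrying", e)
    }
  }
}

The multi-record store blocks until the block has been saved inside Spark, which is what lets a reliable receiver acknowledge its source only after the data is safe, unlike the one-record-at-a-time loop above.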

I suspect Spark Streaming is not the right fit for your use case. Reading between the lines, it seems that either you really do want streaming, in which case you need a proper producer to feed the receiver, or you are reading data that was dumped from the RDBMS into json, in which case you do not need Spark Streaming at all.
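
For the second case, a plain batch job does the same work with none of the receiver machinery. A minimal sketch, assuming the same file path from the question (the BatchJsonJob object name is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object BatchJsonJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BatchJsonJob"))
    val sqlContext = new SQLContext(sc)

    // Read the dump once and process it as an ordinary batch DataFrame.
    val df = sqlContext.read.json("/home/cloudera/data/s.json")
    df.toJSON                  // RDD[String] in Spark 1.6, one json string per row
      .flatMap(_.split(","))   // same per-record split as the streaming code
      .take(20)                // bring a small sample back to the driver
      .foreach(println)

    sc.stop()
  }
}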