Apache Spark Streaming - graceful shutdown when the stream has no more data

Date: 2016-02-20 09:43:15

Tags: apache-spark spark-streaming

I am working with the Spark Streaming API, and I want to continuously stream a set of pre-downloaded web log files to simulate a real-time stream. I wrote a script that gunzips the compressed logs and pipes the output to nc on port 7777.

The script looks like this:

BASEDIR=/home/mysuer/data/datamining/internet_traffic_archive
zipped_files=`find $BASEDIR -name "*.gz"`

for zfile in $zipped_files
do
  echo "Unzipping $zfile..."
  gunzip -c $zfile | nc -l -p 7777 -q 20
done

I have streaming code written in Scala that processes the stream. It works fine most of the time, but when it runs out of files to stream it throws the following error:

16/02/19 23:04:35 WARN ReceiverSupervisorImpl: 
Restarting receiver with delay 2000 ms: Socket data stream had no more data
16/02/19 23:04:35 ERROR ReceiverTracker: Deregistered receiver for stream 0: 
Restarting receiver with delay 2000ms: Socket data stream had no more data
16/02/19 23:04:35 WARN BlockManager: Block input-0-1455941075600 replicated to only 0 peer(s) instead of 1 peers
....
16/02/19 23:04:40 ERROR Executor: Exception in task 2.0 in stage 15.0 (TID 47)
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:313)
at scala.None$.get(Option.scala:311)
at com.femibyte.learningsparkaddexamples.scala.StreamingLogEnhanced$$anonfun$2.apply(StreamingLogEnhanced.scala:42)
at com.femibyte.learningsparkaddexamples.scala.StreamingLogEnhanced$$anonfun$2.apply(StreamingLogEnhanced.scala:42)

How can I implement a graceful shutdown so that the program exits cleanly when it no longer detects any data in the stream?

My Scala code looks like this:

import org.apache.log4j.Logger
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingLogEnhanced {
  def main(args: Array[String]) {
    val master = args(0)
    val conf = new SparkConf().setMaster(master).setAppName("StreamingLogEnhanced")
    // Create a StreamingContext with a 10 second batch size
    val ssc = new StreamingContext(conf, Seconds(10))
    val log = Logger.getLogger(getClass.getName)

    sys.ShutdownHookThread {
      log.info("Gracefully stopping Spark Streaming Application")
      ssc.stop(stopSparkContext = true, stopGracefully = true)
      log.info("Application stopped")
    }
    // Create a DStream from all the input on port 7777
    val lines = ssc.socketTextStream("localhost", 7777)
    // Create a count of log hits by IP
    val ipCounts = countByIp(lines)
    ipCounts.print()

    // Start our streaming context and wait for it to "finish"
    ssc.start()
    // Wait for up to 6,000 seconds (10000 * 600 ms), then exit
    ssc.awaitTermination(10000 * 600)
    ssc.stop()
  }

  def countByIp(lines: DStream[String]) = {
    val parser = new AccessLogParser
    val accessLogDStream = lines.map(line => parser.parseRecord(line))
    // NOTE: entry.get throws NoSuchElementException when parseRecord
    // returns None; this is the None.get in the stack trace above
    val ipDStream = accessLogDStream.map(entry => (entry.get.clientIpAddress, 1))
    ipDStream.reduceByKey((x, y) => x + y)
  }
}
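
A note on the crash itself: the None.get in the stack trace points at entry.get in countByIp, which throws whenever AccessLogParser.parseRecord fails to parse a line (for example, a partial line produced while the receiver reconnects). Assuming parseRecord returns an Option, a defensive variant (a sketch; countByIpSafe is a hypothetical name) simply drops unparseable lines:

  def countByIpSafe(lines: DStream[String]): DStream[(String, Int)] = {
    val parser = new AccessLogParser
    // flatMap over the Option: None becomes an empty list, so bad lines vanish
    lines.flatMap(line => parser.parseRecord(line).toList)
      .map(entry => (entry.clientIpAddress, 1))
      .reduceByKey(_ + _)
  }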
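
As for the graceful shutdown question itself, one commonly suggested pattern (a minimal sketch, assuming that several consecutive empty batches mean the source is exhausted; maxEmptyBatches is a made-up threshold) is to count empty batches with foreachRDD and stop the context from the main thread once the threshold is reached:

    import java.util.concurrent.atomic.AtomicInteger

    // Assumption: N consecutive empty batches means nc has finished streaming.
    // Tune the threshold to the batch interval and the gaps between files.
    val maxEmptyBatches = 3
    val emptyBatches = new AtomicInteger(0)

    // Register before ssc.start(): reset the counter whenever data arrives
    lines.foreachRDD { rdd =>
      if (rdd.isEmpty()) emptyBatches.incrementAndGet()
      else emptyBatches.set(0)
    }

    ssc.start()
    // Poll instead of blocking forever; awaitTerminationOrTimeout returns
    // true if the context has already stopped for some other reason
    var terminated = false
    while (!terminated && emptyBatches.get() < maxEmptyBatches) {
      terminated = ssc.awaitTerminationOrTimeout(10000)
    }
    if (!terminated) {
      // Graceful: lets received data finish processing before shutdown
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }

Stopping the context from inside foreachRDD or a StreamingListener can deadlock, which is why this sketch keeps the polling loop on the driver's main thread.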

0 Answers:

No answers