Flume + Spark - Storing a DStream in HDFS

Posted: 2016-04-01 07:05:52

Tags: apache-spark spark-streaming flume

I have a Flume stream that I want to store in HDFS via Spark. Below is the Spark code I am running:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePull {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(
        "Usage: FlumePollingEventCount <host> <port>")
      System.exit(1)
    }

    val batchInterval = Milliseconds(60000)
    val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
    val ssc = new StreamingContext(sparkConf, batchInterval)

    // Pull-based receiver: polls the Spark sink running inside the Flume agent
    val stream = FlumeUtils.createPollingStream(ssc, "localhost", 9999)

    stream.map(x => x + "!!!!")
          .saveAsTextFiles("/user/root/spark/flume_Map_", "_Mapout")

    ssc.start()
    ssc.awaitTermination()
  }
}
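For reference, createPollingStream pulls from Spark's custom Flume sink rather than a plain Avro sink, so the Flume agent must be configured with that sink. A minimal sketch of the relevant part of the agent's properties file, assuming an agent named agent1 and a channel named memoryChannel (both names are placeholders, not from the original post):

# Spark sink that the polling receiver above connects to on localhost:9999
agent1.sinks = spark
agent1.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent1.sinks.spark.hostname = localhost
agent1.sinks.spark.port = 9999
agent1.sinks.spark.channel = memoryChannel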

When I start my Spark streaming job, it does store output in HDFS, but the output looks like this:

[root@sandbox ~]# hadoop fs -cat /user/root/spark/flume_Map_-1459450380000._Mapout/part-00000
org.apache.spark.streaming.flume.SparkFlumeEvent@1b9bd2c9!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@33fd3a48!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@35fd67a2!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@f9ed85f!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@58f4cfc!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@307373e!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@4ebbc8ff!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@a8905bb!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@29d73d64!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@71ff85b1!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@3ea261ef!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@16cbb209!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@17157890!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@29e41c7!!!!

It is storing the SparkFlumeEvent objects rather than the data coming from Flume. How do I get the data out of them?

Thanks

1 Answer:

Answer 0 (score: 0):

You need to extract the underlying buffer from the SparkFlumeEvent and save that instead. For example, if your event body is a String:

// getBody returns the Avro event payload as a ByteBuffer; decode it to a String
stream.map(x => new String(x.event.getBody.array) + "!!!!")
      .saveAsTextFiles("/user/root/spark/flume_Map_", "_Mapout")
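If you also want the Flume headers alongside the body, or an explicit charset, a slightly fuller sketch (the UTF-8 charset and the tab-separated output format are assumptions, not part of the original answer):

import java.nio.charset.StandardCharsets

stream.map { x =>
  // SparkFlumeEvent wraps an AvroFlumeEvent: the payload is a ByteBuffer,
  // the headers a java.util.Map[CharSequence, CharSequence]
  val body = new String(x.event.getBody.array, StandardCharsets.UTF_8) // assumes UTF-8 text
  val headers = x.event.getHeaders
  s"$headers\t$body"
}.saveAsTextFiles("/user/root/spark/flume_Map_", "_Mapout")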