Streaming files from Hadoop

Time: 2015-05-06 21:07:56

Tags: scala hadoop apache-spark spark-streaming

I have two Scala applications: one watches an HDFS folder for new files (which it then transforms and pushes to an API), and the other reads data from DB2 and writes the records as JSON to the HDFS file system:

Streaming:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.json4s.DefaultFormats

/**
 * Read data from HDFS, transform and push to API
 */
object DataMerge extends App {
  implicit val formats = DefaultFormats

  val conf =  new SparkConf().setMaster("local[2]").setAppName("data_merger")
  val sc = new SparkContext(conf)

  val ssc = new StreamingContext(sc, Seconds(1))

  val mainStream = ssc.textFileStream("hdfs://localhost:9000/spark/messages")

  val messages = mainStream.filter(_.contains("objectUid"))
  messages.print()

  ssc.checkpoint("hdfs://localhost:9000/spark/checkpoint")

  ssc.start()
  ssc.awaitTermination()
}

Writing to HDFS:

import java.sql.{DriverManager, ResultSet}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD
import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization          // or org.json4s.native.Serialization
import org.json4s.jackson.Serialization.write

/**
 * Read data from DB2 and write to HDFS
 */
object DB2DataExtractor extends App {
  implicit val formats = Serialization.formats(NoTypeHints)

  val conf =  new SparkConf().setMaster("local[2]").setAppName("smd_migrator_2")
  val sc = new SparkContext(conf)

  def createConnection() = {
    Class.forName("com.ibm.db2.jcc.DB2Driver").newInstance()
    DriverManager.getConnection("jdbc:db2://server:1234/database", "user", "password")
  }

  def extractValues(rs: ResultSet): String = {
    write(new Message(rs.getString(1), rs.getLong(2), rs.getLong(3), rs.getString(4), rs.getString(5)))
  }

  val query =
    """SELECT *
      |FROM (
      |   SELECT ROW_NUMBER() OVER(ORDER BY o.insert_time DESC) AS rn,
      |   o.*,
      |   m.*
      |   FROM schema.table o
      |   JOIN schema2.table2 m ON
      |   o.external_id = m.external_id
      |   )
      |WHERE rn BETWEEN ? and ?""".stripMargin

  val data = new JdbcRDD(sc, createConnection, query, lowerBound = 1, upperBound = 2000,
    numPartitions = 5, mapRow = extractValues)

  val file = "hdfs://localhost:9000/spark/messages"
  data.saveAsTextFile(file)
}

I start the DataMerge application first, so that it listens on the (not yet existing) directory and can pick up files as soon as they appear. After running DB2DataExtractor, the data is written to HDFS files, but the DataMerge application never outputs any data to the command line. There is evidence that it looked for new files:

15/05/06 13:26:28 INFO FileInputDStream: Finding new files took 1 ms
15/05/06 13:26:28 INFO FileInputDStream: New files at time 1430943988000 ms:

15/05/06 13:26:28 INFO JobScheduler: Added jobs for time 1430943988000 ms
15/05/06 13:26:28 INFO JobGenerator: Checkpointing graph for time 1430943988000 ms
15/05/06 13:26:28 INFO DStreamGraph: Updating checkpoint data for time 1430943988000 ms
15/05/06 13:26:28 INFO JobScheduler: Starting job streaming job 1430943988000 ms.0 from job set of time 1430943988000 ms
-------------------------------------------
Time: 1430943988000 ms
-------------------------------------------

So far I don't see any output from the files that are ready. Am I missing something?

I am running these applications from the command line.
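
One thing I wonder about: the Spark Streaming docs say files must be moved atomically into the monitored directory, while saveAsTextFile creates its part files (plus the _SUCCESS marker) in place. As a workaround I am considering having DB2DataExtractor write to a staging path first and then renaming the finished files into the watched directory. A rough sketch (HdfsAtomicPublish, stagingDir and watchedDir are names I made up, untested):

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: saveAsTextFile writes to a staging directory first,
// then the completed part files are moved into the directory that
// textFileStream is watching, so they appear there atomically.
object HdfsAtomicPublish {
  def publish(stagingDir: String, watchedDir: String): Unit = {
    val fs = FileSystem.get(new URI("hdfs://localhost:9000"), new Configuration())
    val target = new Path(watchedDir)
    if (!fs.exists(target)) fs.mkdirs(target)

    // Move only the data files; skip markers such as _SUCCESS.
    fs.listStatus(new Path(stagingDir))
      .map(_.getPath)
      .filterNot(_.getName.startsWith("_"))
      .foreach(p => fs.rename(p, new Path(target, p.getName)))
  }
}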

UPDATE

I updated the code to:

val ssc = new StreamingContext(sc, Seconds(5))

val mainStream = ssc.textFileStream("hdfs://localhost:9000/spark/messages")
mainStream.print
mainStream.foreachRDD(_.foreach(println))

ssc.checkpoint("hdfs://localhost:9000/spark/checkpoint")

ssc.start()
ssc.awaitTermination()

I can see the files being picked up:

15/05/06 16:32:35 INFO FileInputDStream: New files at time 1430955155000 ms:
hdfs://localhost:9000/spark/messages/_SUCCESS
hdfs://localhost:9000/spark/messages/part-00000
hdfs://localhost:9000/spark/messages/part-00001
hdfs://localhost:9000/spark/messages/part-00002
hdfs://localhost:9000/spark/messages/part-00003
hdfs://localhost:9000/spark/messages/part-00004

But then I see the following stack trace:

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/spark/messages/_SUCCESS
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
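
If the failure really is the stream trying to build input splits for the _SUCCESS marker, one workaround I may try is to switch from textFileStream to the fileStream overload that takes a path filter and skip anything starting with an underscore (untested; I am assuming this overload otherwise behaves the same as textFileStream):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Only part-* files become input; _SUCCESS and other markers are filtered out
// before the batch job ever tries to read them.
val mainStream = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs://localhost:9000/spark/messages",
    (path: Path) => !path.getName.startsWith("_"),
    newFilesOnly = true
  ).map(_._2.toString)

val messages = mainStream.filter(_.contains("objectUid"))
messages.print()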

0 Answers:

There are no answers yet.