I have two Scala applications: one watches an HDFS folder for new files (and then, after transforming the data, pushes it to an API), the other reads data from DB2 and writes the records as JSON to the HDFS file system:
Streaming:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.json4s.DefaultFormats

/**
 * Read data from HDFS, transform and push to API
 */
object DataMerge extends App {

  implicit val formats = DefaultFormats

  val conf = new SparkConf().setMaster("local[2]").setAppName("data_merger")
  val sc = new SparkContext(conf)
  val ssc = new StreamingContext(sc, Seconds(1))

  // Watch the HDFS directory for new text files and keep only the relevant records
  val mainStream = ssc.textFileStream("hdfs://localhost:9000/spark/messages")
  val messages = mainStream.filter(_.contains("objectUid"))
  messages.print()

  ssc.checkpoint("hdfs://localhost:9000/spark/checkpoint")
  ssc.start()
  ssc.awaitTermination()
}
Writing to HDFS:
import java.sql.{DriverManager, ResultSet}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD
import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization
import org.json4s.jackson.Serialization.write

/**
 * Read data from DB2 and write to HDFS
 */
object DB2DataExtractor extends App {

  implicit val formats = Serialization.formats(NoTypeHints)

  val conf = new SparkConf().setMaster("local[2]").setAppName("smd_migrator_2")
  val sc = new SparkContext(conf)

  def createConnection() = {
    Class.forName("com.ibm.db2.jcc.DB2Driver").newInstance()
    DriverManager.getConnection("jdbc:db2://server:1234/database", "user", "password")
  }

  // Serialise each row as a JSON string
  def extractValues(rs: ResultSet): String = {
    write(new Message(rs.getString(1), rs.getLong(2), rs.getLong(3), rs.getString(4), rs.getString(5)))
  }

  val query =
    """SELECT *
      |FROM (
      |  SELECT ROW_NUMBER() OVER(ORDER BY o.insert_time DESC) AS rn,
      |         o.*,
      |         m.*
      |  FROM schema.table o
      |  JOIN schema2.table2 m ON
      |    o.external_id = m.external_id
      |)
      |WHERE rn BETWEEN ? and ?""".stripMargin

  val data = new JdbcRDD(sc, createConnection, query, lowerBound = 1, upperBound = 2000,
    numPartitions = 5, mapRow = extractValues)

  val file = "hdfs://localhost:9000/spark/messages"
  data.saveAsTextFile(file)
}
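The Message class referenced in extractValues is not shown above; roughly, it is just a simple container along these lines (a sketch only, the field names here are placeholders, only the types follow the ResultSet getters):

// Sketch of the Message class used in extractValues; field names are placeholders.
case class Message(objectUid: String,
                   id: Long,
                   insertTime: Long,
                   payload: String,
                   status: String)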
I start the DataMerge application first so that it listens on the (not yet existing) directory and can pick the files up as soon as they exist. After running DB2DataExtractor, the data is written to HDFS files, but the DataMerge application does not print any of it to the command line. There is evidence that it found new files:
15/05/06 13:26:28 INFO FileInputDStream: Finding new files took 1 ms
15/05/06 13:26:28 INFO FileInputDStream: New files at time 1430943988000 ms:
15/05/06 13:26:28 INFO JobScheduler: Added jobs for time 1430943988000 ms
15/05/06 13:26:28 INFO JobGenerator: Checkpointing graph for time 1430943988000 ms
15/05/06 13:26:28 INFO DStreamGraph: Updating checkpoint data for time 1430943988000 ms
15/05/06 13:26:28 INFO JobScheduler: Starting job streaming job 1430943988000 ms.0 from job set of time 1430943988000 ms
-------------------------------------------
Time: 1430943988000 ms
-------------------------------------------
So far I do not see any output for the prepared files. Am I missing something?
I am running these applications from the command line.
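For reference, the kind of manual test I have in mind is to drop a file into the watched directory by moving it there after the stream is up, since the Spark Streaming guide says files should be moved or renamed atomically into the monitored directory. A rough sketch using the Hadoop FileSystem API; the paths and file name here are placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Rough sketch only: paths and file name are placeholders.
object DropTestFile extends App {
  val hadoopConf = new Configuration()
  hadoopConf.set("fs.defaultFS", "hdfs://localhost:9000")
  val fs = FileSystem.get(hadoopConf)

  // Write the test file outside the watched directory first...
  val staging = new Path("/spark/staging/test-message.json")
  val out = fs.create(staging)
  out.write("""{"objectUid":"test"}""".getBytes("UTF-8"))
  out.close()

  // ...then move it into the watched directory with a single rename,
  // so the streaming job sees a complete file rather than one still being written.
  fs.rename(staging, new Path("/spark/messages/test-message.json"))
}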
UPDATE
I updated the code to:
val ssc = new StreamingContext(sc, Seconds(5))
val mainStream = ssc.textFileStream("hdfs://localhost:9000/spark/messages")
mainStream.print
mainStream.foreachRDD(_.foreach(println))
ssc.checkpoint("hdfs://localhost:9000/spark/checkpoint")
ssc.start()
ssc.awaitTermination()
I can see that the files are being recognized:
15/05/06 16:32:35 INFO FileInputDStream: New files at time 1430955155000 ms:
hdfs://localhost:9000/spark/messages/_SUCCESS
hdfs://localhost:9000/spark/messages/part-00000
hdfs://localhost:9000/spark/messages/part-00001
hdfs://localhost:9000/spark/messages/part-00002
hdfs://localhost:9000/spark/messages/part-00003
hdfs://localhost:9000/spark/messages/part-00004
But I then see the following stack trace:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/spark/messages/_SUCCESS
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)