I have a simple Kafka-Spark structured streaming program; here is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

object StructuredStreaming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("Spark-Kafka-Integration")
      .master("local")
      .getOrCreate()

    // Schema of the incoming CSV records
    val mySchema = StructType(Array(
      StructField("id", IntegerType),
      StructField("name", StringType),
      StructField("year", IntegerType),
      StructField("rating", DoubleType),
      StructField("duration", IntegerType)
    ))

    // Stream the CSV data and publish each row to Kafka as JSON
    val streamingDataFrame = spark.readStream.schema(mySchema)
      .csv("/home/Desktop/DataFiles/csvDemo.csv")

    streamingDataFrame.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
      .writeStream
      .format("kafka")
      .option("topic", "topicName")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("checkpointLocation", "/home/Desktop/Checkpoint1")
      .start()

    import spark.implicits._

    // Read the same topic back and parse the JSON payload
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topicName")
      .option("startingOffsets", "earliest")
      .load()

    val df1 = df.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)").as[(String, Timestamp)]
      .select(from_json($"value", mySchema).as("data"), $"timestamp")
      .select("data.*", "timestamp")

    df1.writeStream
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
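For reference, each record the first query writes to Kafka should carry a JSON value produced by to_json(struct(*)), with the struct's field order preserved. Below is a hypothetical stand-alone sketch (plain Scala, not Spark output) of what that payload looks like for the first CSV row; the object and method names are made up for illustration:

```scala
// Sketch of the Kafka message value produced by to_json(struct(*)) for one
// CSV row under mySchema. This hand-builds the JSON string; Spark's to_json
// should produce an equivalent object with the same field order.
object JsonValueSketch {
  def toKafkaValue(id: Int, name: String, year: Int, rating: Double, duration: Int): String =
    s"""{"id":$id,"name":"$name","year":$year,"rating":$rating,"duration":$duration}"""

  def main(args: Array[String]): Unit = {
    // First row of the CSV file:
    println(toKafkaValue(1, "The Nightmare Before Christmas", 1993, 3.9, 4568))
  }
}
```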
When I run this code, I don't get any output. It shows the following and keeps running until I kill the process:
-------------------------------------------
Batch: 0
-------------------------------------------
18/08/21 13:37:42 INFO CodeGenerator: Code generated in 53.016751 ms
18/08/21 13:37:42 INFO WriteToDataSourceV2Exec: Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@146f246e committed.
+---+----+----+------+--------+---------+
|id |name|year|rating|duration|timestamp|
+---+----+----+------+--------+---------+
+---+----+----+------+--------+---------+

18/08/21 13:37:42 INFO SparkContext: Starting job: start at StructuredStreaming.scala:60
18/08/21 13:37:42 INFO DAGScheduler: Job 1 finished: start at StructuredStreaming.scala:60, took 0.000047 s
18/08/21 13:37:42 INFO ContextCleaner: Cleaned accumulator 26
18/08/21 13:37:42 INFO MicroBatchExecution: Streaming query made progress: {
  "id" : "9cd8df96-218a-4e1b-a6f0-66d8fcc170de",
  "runId" : "c3ac3df7-0343-4710-8f52-019b59f6fefa",
  "name" : null,
  "timestamp" : "2018-08-21T08:07:37.340Z",
  "batchId" : 0,
  "numInputRows" : 0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "addBatch" : 2702,
    "getBatch" : 681,
    "getOffset" : 868,
    "queryPlanning" : 512,
    "triggerExecution" : 4813,
    "walCommit" : 31
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[topicName]]",
    "startOffset" : null,
    "endOffset" : {
      "topicName" : {
        "0" : 0
      }
    },
    "numInputRows" : 0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@260bf6a5"
  }
}
18/08/21 13:37:42 INFO MicroBatchExecution: Streaming query made progress: {
  "id" : "9cd8df96-218a-4e1b-a6f0-66d8fcc170de",
  "runId" : "c3ac3df7-0343-4710-8f52-019b59f6fefa",
  "name" : null,
  "timestamp" : "2018-08-21T08:07:42.338Z",
  "batchId" : 1,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "getOffset" : 14,
    "triggerExecution" : 15
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[topicName]]",
    "startOffset" : {
      "topicName" : {
        "0" : 0
      }
    },
    "endOffset" : {
      "topicName" : {
        "0" : 0
      }
    },
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@260bf6a5"
  }
}

The same batchId-1 progress message, still with "numInputRows" : 0 and identical start/end offsets ("topicName" : { "0" : 0 }), then repeats roughly every ten seconds (at 08:07:52.355, 08:08:02.366, and so on).
It has been almost half an hour now and still nothing is shown. The CSV file looks like this; it has 50,000 records:
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
11,Broken Blossoms,1919,3.3,5367
12,Big Night,1996,3.6,6561
13,The Birth of a Nation,1915,2.9,12118
14,The Boys from Brazil,1978,3.6,7417
15,Big Doll House,1971,2.9,5696
16,The Breakfast Club,1985,4.0,5823
17,The Bride of Frankenstein,1935,3.7,4485
18,Beautiful Girls,1996,3.5,6755
19,Bustin' Loose,1981,3.7,5598
20,The Beguiled,1971,3.4,6307
21,Born on the Fourth of July,1989,3.4,8646
22,Broadcast News,1987,3.4,7940
23,Swimming with Sharks,1994,3.3,5586
24,Beavis and Butt-head Do America,1996,3.4,4852
25,Brighton Beach Memoirs,1986,3.4,6564
26,The Best of Times,1986,3.4,6247
27,Brassed Off,1996,3.5,6040
28,Last Tango in Paris,1972,3.1,7732
29,Leprechaun 2,1994,3.2,5125
30,Incident at Oglala: The Leonard Peltier Story,1992,3.7,5487
Now, if I push the CSV data through the console producer instead, I get:

+----+----+----+------+--------+-----------------------+
|id  |name|year|rating|duration|timestamp              |
+----+----+----+------+--------+-----------------------+
|null|null|null|null  |null    |2018-08-21 14:57:37.848|
|null|null|null|null  |null    |2018-08-21 14:57:37.856|
|null|null|null|null  |null    |2018-08-21 14:57:37.857|
|null|null|null|null  |null    |2018-08-21 14:57:37.857|
|null|null|null|null  |null    |2018-08-21 14:57:37.857|
|null|null|null|null  |null    |2018-08-21 14:57:37.857|
|null|null|null|null  |null    |2018-08-21 14:57:37.857|
|null|null|null|null  |null    |2018-08-21 14:57:37.857|
|null|null|null|null  |null    |2018-08-21 14:57:37.857|
|null|null|null|null  |null    |2018-08-21 14:57:37.858|
|null|null|null|null  |null    |2018-08-21 14:57:37.858|
|null|null|null|null  |null    |2018-08-21 14:57:37.858|
|null|null|null|null  |null    |2018-08-21 14:57:37.858|
|null|null|null|null  |null    |2018-08-21 14:57:37.858|
|null|null|null|null  |null    |2018-08-21 14:57:37.858|
|null|null|null|null  |null    |2018-08-21 14:57:37.858|
|null|null|null|null  |null    |2018-08-21 14:57:37.858|
|null|null|null|null  |null    |2018-08-21 14:57:37.859|
|null|null|null|null  |null    |2018-08-21 14:57:37.859|
|null|null|null|null  |null    |2018-08-21 14:57:37.859|
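In case it helps diagnosis: from_json yields null for every column when a record is not a valid JSON object, which would match the all-null rows above if the console producer sent raw CSV lines rather than JSON. Below is a toy stand-in (plain Scala, not Spark's actual parser; the looksLikeJsonObject check is a deliberate oversimplification for illustration):

```scala
// Toy illustration of the from_json failure mode: input that is not a JSON
// object fails to parse (here modeled as false; Spark would show null in
// every parsed column for such a record).
object ParseSketch {
  // Naive stand-in for JSON-object detection; Spark's from_json does full
  // parsing against the supplied schema.
  def looksLikeJsonObject(value: String): Boolean = {
    val v = value.trim
    v.startsWith("{") && v.endsWith("}")
  }

  def main(args: Array[String]): Unit = {
    // A JSON payload matching mySchema would parse:
    println(looksLikeJsonObject("""{"id":2,"name":"The Mummy","year":1932,"rating":3.5,"duration":4388}"""))
    // A raw CSV line would not, so every column comes back null:
    println(looksLikeJsonObject("2,The Mummy,1932,3.5,4388"))
  }
}
```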