I am trying to mimic this example from Jacek Laskowski's book to read a CSV file and aggregate the data in the console, but for some reason the output does not show up in the IntelliJ console.
scala> spark.version
res4: String = 2.2.0
I found some references in a few places on SO (1, 2, 3, 4, 5), but I have tried everything and nothing solved the problem.
Here is the code:
package org.sample

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

object App {

  def main(args: Array[String]): Unit = {

    val DIR = new java.io.File(".").getCanonicalPath + "dataset/stream_in"

    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("Spark Structured Streaming Job")

    val spark = SparkSession.builder()
      .appName("Spark Structured Streaming Job")
      .master("local[*]")
      .getOrCreate()

    val reader = spark.readStream
      .format("csv")
      .option("header", true)
      .option("delimiter", ";")
      .option("latestFirst", "true")
      .schema(SchemaDefinition.csvSchema)
      .load(DIR + "/*")

    reader.createOrReplaceTempView("user_records")

    val tranformation = spark.sql(
      """
      SELECT carrier, marital_status, COUNT(1) as num_users
      FROM user_records
      GROUP BY carrier, marital_status
      """
    )

    val consoleStream = tranformation
      .writeStream
      .format("console")
      .option("truncate", false)
      .outputMode("complete")
      .start()

    consoleStream.awaitTermination()
  }
}
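For context, SchemaDefinition.csvSchema just holds the structure of the CSV. A minimal sketch of that object (the exact field list is not important here; only carrier and marital_status are referenced in the query, and the real file has more columns) would be something like:

package org.sample

import org.apache.spark.sql.types._

// Hypothetical sketch: only carrier and marital_status appear in the query;
// the real CSV has more columns, which would also be listed here.
object SchemaDefinition {
  val csvSchema: StructType = StructType(Array(
    StructField("carrier", StringType, true),
    StructField("marital_status", StringType, true)
  ))
}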
My output is only:
18/11/30 15:40:31 INFO StreamExecution: Streaming query made progress: {
  "id" : "9420f826-0daf-40c9-a427-e89ed42ee738",
  "runId" : "991c9085-3425-4ea6-82af-4cef20007a66",
  "name" : null,
  "timestamp" : "2018-11-30T14:40:31.117Z",
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "getOffset" : 2,
    "triggerExecution" : 2
  },
  "eventTime" : {
    "watermark" : "1970-01-01T00:00:00.000Z"
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "FileStreamSource[file:/structured-streamming-taskdataset/stream_in/*]",
    "startOffset" : null,
    "endOffset" : null,
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@6a62e7ef"
  }
}
Answer 0 (score: 0)
I rewrote the file and it works for me now.

Differences:

- conf: with SparkSession we don't need to call conf at all.
- .load() with the /* glob doesn't work; what works is keeping just the folder path dataset/stream_in.
- The data used in tranformation was wrong (the fields didn't match the file).

Final code:
package org.sample

import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}

object StreamCities {

  def main(args: Array[String]): Unit = {

    // Turn off logs in the console
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    val spark = SparkSession.builder()
      .appName("Spark Structured Streaming get CSV and aggregate")
      .master("local[*]")
      .getOrCreate()

    // 01. Schema definition: the structure of our CSV file.
    // This could live in its own class, but for simplicity
    // it is kept here.
    import org.apache.spark.sql.types._
    val csvSchema = StructType(Array(
      StructField("id", StringType, true),
      StructField("name", StringType, true),
      StructField("city", StringType, true)
    ))

    // 02. Read the stream: create a DataFrame representing the
    // stream of CSV files according to our schema. The source is
    // the folder passed to .load()
    val users = spark.readStream
      .format("csv")
      .option("sep", ",")
      .option("header", true)
      .schema(csvSchema)
      .load("dataset/stream_in")

    // 03. Aggregate the stream: to write in complete output mode
    // we must pass an aggregated DataFrame. We can do this with
    // the untyped API or with Spark SQL.

    // 03.1: Aggregation using the untyped API
    //val aggUsers = users.groupBy("city").count()

    // 03.2: Aggregation using Spark SQL
    users.createOrReplaceTempView("user_records")
    val aggUsers = spark.sql(
      """
      SELECT city, COUNT(1) as num_users
      FROM user_records
      GROUP BY city"""
    )

    // Print the schema of our aggregation
    aggUsers.printSchema()

    // 04. Output the stream: write the stream to the console.
    // As new files land in the folder Spark is listening to,
    // the results are updated.
    val consoleStream = aggUsers.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    consoleStream.awaitTermination()
  }
}
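To actually see batches in the console, new CSV files have to appear in dataset/stream_in while the query is running. A minimal way to generate such a file for testing (the file name and rows below are made up; they just have to match the id,name,city schema above):

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

object WriteSampleCsv {
  def main(args: Array[String]): Unit = {
    // Made-up sample rows matching the id,name,city schema used above
    val csv =
      """id,name,city
        |1,Alice,London
        |2,Bob,Paris
        |3,Carol,London
        |""".stripMargin
    Files.createDirectories(Paths.get("dataset/stream_in"))
    Files.write(
      Paths.get("dataset/stream_in/users_batch_1.csv"),
      csv.getBytes(StandardCharsets.UTF_8)
    )
  }
}

Because the output mode is complete, every new file triggers a micro-batch and the console sink reprints the full per-city counts.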