In my Structured Streaming application I am trying to list violating records based on transformations I apply to the dataset. I am reading log files generated by an application and want to run a streaming job over them to list the violations.
Here is my code:
import org.apache.spark.sql.{ForeachWriter, Row}
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._
import sparkSession.implicits._ // for the $"..." column syntax

val customSchema = StructType(
  Array(
    StructField("MessageId", StringType, false),
    StructField("Current_Timestamp", TimestampType, false),
    StructField("msgSeqID", LongType, false),
    StructField("site", StringType, false),
    StructField("msgType", StringType, false)
  )
)
val logFileDF = sparkSession.readStream
  .format("csv")
  .option("delimiter", ",")
  .option("header", false)
  .option("mode", "DROPMALFORMED")
  .schema(customSchema)
  .load(logFilePath)
logFileDF.printSchema()
logFileDF.createOrReplaceTempView("LogData")
val selectQuery: String =
  """SELECT MessageId, Current_Timestamp, site, msgSeqID, msgType,
    |       lag(msgSeqID, 1) OVER (PARTITION BY site, msgType ORDER BY site, msgType, msgSeqID) AS prev_val
    |FROM LogData
    |ORDER BY site, msgType, msgSeqID""".stripMargin
val logFileLagRowNumDF = sparkSession.sql(selectQuery)
logFileLagRowNumDF.printSchema()
logFileLagRowNumDF.createOrReplaceTempView("LogDataUpdated")
val errorRecordQuery: String =
  """SELECT * FROM LogDataUpdated
    |WHERE prev_val IS NOT NULL AND msgSeqID != prev_val + 1
    |ORDER BY site, msgType, msgSeqID""".stripMargin
val errorRecordQueryDF = sparkSession.sql(errorRecordQuery)
errorRecordQueryDF.isStreaming // returns true, this is still a streaming DataFrame
/*
val outputDF = errorRecordQueryDF
  .withWatermark("Current_Timestamp", "5 seconds")
  .groupBy($"site").count()
*/
val outputDF = errorRecordQueryDF
  .groupBy(window($"Current_Timestamp", "5 seconds", "1 second"), $"site")
  .count()

outputDF.writeStream.format("console")
  .option("checkpointLocation", "/tmp/logs/chkpoint")
  .outputMode("complete")
  .option("path", "/tmp/logs/output")
  .trigger(Trigger.Continuous("1 second"))
  .start()
I am getting the following error:
org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;;
What if I do not want any aggregation and just want to use foreach on the continuous stream (say I want to log all the error records to the file system or send them to some external system)? I tried the following:
errorRecordQueryDF.writeStream.foreach(new ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = {
    true
  }
  def process(record: Row): Unit = {
    System.out.println(record)
  }
  def close(errorOrNull: Throwable): Unit = {
    // Close the connection
  }
}).start()
This returns the following error:
Message: Non-time-based windows are not supported on streaming DataFrames/Datasets;;
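I also considered a foreachBatch sink (Spark 2.4+) for dumping the error records to the file system. This is just a rough, untested sketch, and I assume it would hit the same analysis error, since the problem seems to be the lag window rather than the sink; foreachBatch is also micro-batch only, so it would not work with Trigger.Continuous. The output and checkpoint paths here are just placeholders:
import org.apache.spark.sql.DataFrame

errorRecordQueryDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // persist each micro-batch of detected error records
    batchDF.write.mode("append").parquet("/tmp/logs/error-records")
  }
  .option("checkpointLocation", "/tmp/logs/chkpoint-errors")
  .start()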
It seems I need to partition by the timestamp rather than by site and msgType, but that would change the semantics of the query.
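For reference, the stateful alternative I am thinking about instead of the lag window is roughly the sketch below (untested; LogEvent, SeqViolation and the output/checkpoint paths are just names I made up for illustration). The idea is to keep the last msgSeqID per (site, msgType) key as state with flatMapGroupsWithState and emit a record whenever the next sequence number is not lastSeq + 1. Is something like this the recommended way to detect sequence gaps on a stream, or is there a way to keep the lag semantics?
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// helper case classes for this sketch only
case class LogEvent(MessageId: String, Current_Timestamp: java.sql.Timestamp,
                    msgSeqID: Long, site: String, msgType: String)
case class SeqViolation(site: String, msgType: String, expectedSeqID: Long, actualSeqID: Long)

val violationsDS = logFileDF.as[LogEvent]
  .groupByKey(e => (e.site, e.msgType))
  .flatMapGroupsWithState[Long, SeqViolation](OutputMode.Append, GroupStateTimeout.NoTimeout) {
    case ((site, msgType), events, state) =>
      // state holds the last msgSeqID seen for this (site, msgType) key
      var lastSeq = state.getOption.getOrElse(-1L)
      val gaps = events.toSeq.sortBy(_.msgSeqID).flatMap { e =>
        val violation =
          if (lastSeq >= 0 && e.msgSeqID != lastSeq + 1)
            Some(SeqViolation(site, msgType, lastSeq + 1, e.msgSeqID))
          else
            None
        lastSeq = e.msgSeqID
        violation
      }
      state.update(lastSeq)
      gaps.iterator
  }

violationsDS.writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/logs/chkpoint-state")
  .start()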
Thanks.