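I need to aggregate incoming data based on the Spark driver's timestamp, without using watermarks. My data has no timestamp field of its own.
The requirement is to compute the average of the data received in each second (when it was sent does not matter).
In other words, I need to aggregate the data received in each trigger, the way the older RDD-based streaming API did.
Is there a way to do this?
Answer 0: (score: 1)
You can create your own sink and perform the aggregation on each addBatch() call.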
Set outputMode to Update and trigger every X seconds. The sink and its provider look like this:
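import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

// Receives each micro-batch and aggregates only the rows in that batch.
class CustomSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // collect() brings the single aggregate row to the driver, so the output lands in the driver log.
    data.groupBy().agg(sum("age") as "sumAge").collect().foreach(v => println(s"RESULT=$v"))
  }
}

// Lets Spark instantiate the sink when the provider is named in writeStream.format(...).
class CustomSinkProvider extends StreamSinkProvider with DataSourceRegister {
  def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = {
    new CustomSink()
  }
  def shortName(): String = "person"
}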
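To show how the pieces fit together, here is a minimal sketch of a query that writes to this sink with outputMode Update and a one-second trigger. Several things in it are assumptions and not part of the original answer: the package name `com.example`, the use of the built-in `rate` source as stand-in input (its `value` column is renamed to `age` so the sink's `sum("age")` resolves), and the checkpoint path.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object CustomSinkUsage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("custom-sink-usage")
      .master("local[2]")
      .getOrCreate()

    // Stand-in input (assumption): the built-in "rate" source emits
    // (timestamp, value) rows; "value" is renamed to "age" for the sink.
    val people = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .selectExpr("value AS age")

    val query = people.writeStream
      // Hypothetical package; use the real fully qualified name of CustomSinkProvider.
      .format("com.example.CustomSinkProvider")
      .outputMode("update")
      // Fire one micro-batch, and therefore one addBatch() call, per second.
      .trigger(Trigger.ProcessingTime("1 second"))
      // Hypothetical path; a custom sink needs an explicit checkpoint location.
      .option("checkpointLocation", "/tmp/custom-sink-checkpoint")
      .start()

    query.awaitTermination()
  }
}
```

Referring to the provider by its class name avoids extra setup. The short name "person" only resolves if the provider is also listed in a `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister` file on the classpath.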
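Answer 1: (score: 0)
Does a processing-time trigger meet your requirement? A processing-time trigger fires once per interval, and the interval is set in code.
Example trigger code is below.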
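The answer's own snippet is not included here, so the following is only a sketch of what a processing-time trigger typically looks like, not the answerer's code. The `rate` source and the `console` sink are assumptions chosen so the example runs without external data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ProcessingTimeTriggerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("processing-time-trigger")
      .master("local[2]")
      .getOrCreate()

    // Stand-in input (assumption): the built-in "rate" source.
    val input = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val query = input.writeStream
      .format("console")
      .outputMode("append")
      // Start a new micro-batch every second, measured on the driver's clock.
      .trigger(Trigger.ProcessingTime("1 second"))
      .start()

    query.awaitTermination()
  }
}
```

The trigger only controls when each micro-batch starts. A streaming aggregation placed before `writeStream` keeps running state across triggers, so a per-trigger average still has to be computed on each batch on the sink side.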