How do I convert a Spark RDD into Structured Streaming for time-window processing?
For example, I want to query a dataset from Elasticsearch and process it as a structured stream:
conf = {"es.resource": "index/type"}  # assume Elasticsearch is running on localhost defaults
rdd = sc.newAPIHadoopRDD(
    "org.elasticsearch.hadoop.mr.EsInputFormat",
    "org.apache.hadoop.io.NullWritable",
    "org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=conf)
and then something like streamdf = function(rdd), followed by:
streamdf.groupBy(
window(streamdf.event_time, windowDuration, slideDuration),
streamdf.mykey
).count().orderBy('window')
What would that function be?