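For a given scenario, I want to filter a dataset in Structured Streaming using a combination of continuous and batch triggers.
I know this may sound unrealistic or may not be feasible. Here is what I am trying to achieve.
Let the processing-time interval set in the application be 5 minutes. Let the records have the following schema: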
{
"type":"record",
"name":"event",
"fields":[
{ "name":"Student", "type":"string" },
{ "name":"Subject", "type":"string" }
]}
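My streaming application should write results to the sink as soon as either of the following two conditions is met:
A student has more than 5 subjects. (This criterion takes priority.)
The processing time given in the trigger has expired.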
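To make the two conditions concrete, here is a minimal plain-Java sketch of the behaviour I am after, independent of Spark. `SubjectBuffer`, `add`, and `flush` are hypothetical names used only for illustration; they are not Spark APIs. I am also assuming that a student is emitted as soon as the subject count reaches the threshold, which is what my expected output for Test (five subjects) implies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper (not a Spark class): buffers subjects per student. */
class SubjectBuffer {
    private final int threshold;
    // LinkedHashMap keeps students in first-seen order, so flush() is deterministic.
    private final Map<String, List<String>> pending = new LinkedHashMap<>();

    SubjectBuffer(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Condition 1 (priority): buffer one record and, if the student has now
     * reached the threshold, emit that student immediately and clear its buffer.
     */
    Optional<String> add(String student, String subject) {
        List<String> subjects = pending.computeIfAbsent(student, k -> new ArrayList<>());
        subjects.add(subject);
        if (subjects.size() >= threshold) {
            pending.remove(student);
            return Optional.of(format(student, subjects));
        }
        return Optional.empty();
    }

    /**
     * Condition 2: called when the processing-time interval expires.
     * Emits every student still buffered, then clears the buffer.
     */
    List<String> flush() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : pending.entrySet()) {
            out.add(format(e.getKey(), e.getValue()));
        }
        pending.clear();
        return out;
    }

    private static String format(String student, List<String> subjects) {
        return student + ":{" + String.join(",", subjects) + "}";
    }
}
```

With a threshold of 5, calling `add` for each of the sample producer records in order returns a value only on the last record (the fifth subject for Test), and a subsequent `flush()` returns the remaining students, Test1 and Test2, in the order they were first seen.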
private static Injection<GenericRecord, byte[]> recordInjection;
private static StructType type;
public static final String USER_SCHEMA = "{"
+ "\"type\":\"record\","
+ "\"name\":\"alarm\","
+ "\"fields\":["
+ " { \"name\":\"student\", \"type\":\"string\" },"
+ " { \"name\":\"subject\", \"type\":\"string\" }"
+ "]}";
private static Schema.Parser parser = new Schema.Parser();
private static Schema schema = parser.parse(USER_SCHEMA);
static {
recordInjection = GenericAvroCodecs.toBinary(schema);
type = (StructType) SchemaConverters.toSqlType(schema).dataType();
}
sparkSession.udf().register("deserialize", (byte[] data) -> {
GenericRecord record = recordInjection.invert(data).get();
return RowFactory.create(record.get("student").toString(), record.get("subject").toString());
}, DataTypes.createStructType(type.fields()));
Dataset<Row> ds2 = ds1
.select("value").as(Encoders.BINARY())
.selectExpr("deserialize(value) as rows")
.select("rows.*")
.selectExpr("student","subject");
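StreamingQuery query1 = ds2
.writeStream()
.foreachBatch(
new VoidFunction2<Dataset<Row>, Long>() {
@Override
public void call(Dataset<Row> rowDataset, Long batchId) throws Exception {
// Collect each student's subjects within this micro-batch into one string,
// e.g. "Test:{x,y,z}", and write the result to Kafka with the batch writer.
// The Kafka write happens here, inside foreachBatch, rather than via
// format("kafka") on the stream writer.
rowDataset
.groupBy("student")
.agg(org.apache.spark.sql.functions.concat_ws(",",
org.apache.spark.sql.functions.collect_list("subject")).alias("subjects"))
.selectExpr("concat(student, ':{', subjects, '}') as value")
.write()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "new_in")
.save();
}
}
)
.option("checkpointLocation", "checkpoint")
.outputMode("append")
.trigger(Trigger.ProcessingTime(10000)) // fire a micro-batch every 10 seconds
.start();
query1.awaitTermination();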
Kafka producer console:
Student:Test, Subject:x
Student:Test, Subject:y
Student:Test, Subject:z
Student:Test1, Subject:x
Student:Test2, Subject:x
Student:Test, Subject:w
Student:Test1, Subject:y
Student:Test2, Subject:y
Student:Test, Subject:v
In the Kafka consumer console, I expect the following:
Test:{x,y,z,w,v} => this should be the first response
Test1:{x,y} => second
Test2:{x,y} => third, by the end of the processing time