Manipulating the trigger interval in Spark Structured Streaming

Asked: 2019-10-24 14:52:16

Tags: apache-spark apache-kafka apache-spark-sql spark-structured-streaming

For a given scenario, I want to filter datasets in Structured Streaming using a combination of continuous and batch triggers.

I know this sounds unrealistic, and it may not be feasible. Here is what I am trying to achieve.

Let the processing-time interval set in the application be 5 minutes, and let records conform to the following schema:

    {
      "type":"record",
      "name":"event",
      "fields":[
        { "name":"Student", "type":"string" },
        { "name":"Subject", "type":"string" }
      ]
    }

My streaming application should write results to the sink when either of the following two conditions is met:

  1. A student has more than 5 subjects. (This criterion takes priority.)

  2. The processing time specified in the trigger has expired.
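Conceptually, the flush behaviour I want could be sketched in plain Java like this (illustrative only, not Spark code; `FlushSketch`, `SUBJECT_LIMIT` and the flush-at-5 comparison are my own names and choices, chosen to match the consumer output shown further down):

```java
import java.util.*;

// Plain-Java sketch of the two flush rules: buffer subjects per student,
// emit a student's record early once it reaches the subject limit
// (condition 1), and emit everything left when the processing-time
// interval expires (condition 2).
public class FlushSketch {
    static final int SUBJECT_LIMIT = 5; // hypothetical threshold for condition 1

    private final Map<String, List<String>> buffer = new LinkedHashMap<>();
    private final List<String> sink = new ArrayList<>();

    void onRecord(String student, String subject) {
        List<String> subjects = buffer.computeIfAbsent(student, k -> new ArrayList<>());
        subjects.add(subject);
        if (subjects.size() >= SUBJECT_LIMIT) { // condition 1: early flush
            flush(student, subjects);
            buffer.remove(student);
        }
    }

    void onIntervalExpired() {                  // condition 2: time-based flush
        buffer.forEach(this::flush);
        buffer.clear();
    }

    private void flush(String student, List<String> subjects) {
        sink.add(student + ":{" + String.join(",", subjects) + "}");
    }

    List<String> sink() { return sink; }
}
```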

    // Imports needed by this snippet (SchemaConverters comes from the
    // spark-avro module; Injection/GenericAvroCodecs from Twitter Bijection).
    import com.twitter.bijection.Injection;
    import com.twitter.bijection.avro.GenericAvroCodecs;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.spark.api.java.function.VoidFunction2;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.avro.SchemaConverters;
    import org.apache.spark.sql.functions;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import org.apache.spark.sql.streaming.Trigger;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // Avro <-> byte[] codec and the equivalent Spark SQL schema.
    private static Injection<GenericRecord, byte[]> recordInjection;
    private static StructType type;

    public static final String USER_SCHEMA = "{"
            + "\"type\":\"record\","
            + "\"name\":\"alarm\","
            + "\"fields\":["
            + "  { \"name\":\"student\", \"type\":\"string\" },"
            + "  { \"name\":\"subject\", \"type\":\"string\" }"
            + "]}";

    private static Schema.Parser parser = new Schema.Parser();
    private static Schema schema = parser.parse(USER_SCHEMA);

    static {
        recordInjection = GenericAvroCodecs.toBinary(schema);
        type = (StructType) SchemaConverters.toSqlType(schema).dataType();
    }

    // UDF that decodes Kafka's binary "value" column into (student, subject).
    sparkSession.udf().register("deserialize", (byte[] data) -> {
        GenericRecord record = recordInjection.invert(data).get();
        return RowFactory.create(record.get("student").toString(), record.get("subject").toString());
    }, DataTypes.createStructType(type.fields()));

    // ds1 is the streaming Dataset read from Kafka elsewhere in the application.
    Dataset<Row> ds2 = ds1
            .select("value").as(Encoders.BINARY())
            .selectExpr("deserialize(value) as rows")
            .select("rows.*")
            .selectExpr("student", "subject");

    StreamingQuery query1 = ds2
            .writeStream()
            .foreachBatch(
                new VoidFunction2<Dataset<Row>, Long>() {
                  @Override
                  public void call(Dataset<Row> rowDataset, Long batchId) throws Exception {
                    // Collapse each student's subjects into one comma-separated value.
                    rowDataset
                        .groupBy("student")
                        .agg(functions.concat_ws(",", functions.collect_list("subject")).alias("value"));
                    // NOTE: the aggregated Dataset is not written anywhere here, and
                    // foreachBatch plus format("kafka") below configure two sinks,
                    // of which Spark will honour only one.
                  }
                }
            ).format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("topic", "new_in")
            .option("checkpointLocation", "checkpoint")
            .outputMode("append")
            .trigger(Trigger.ProcessingTime(10000)) // 10-second micro-batches
            .start();
    query1.awaitTermination();
    

Kafka producer console:

Student:Test, Subject:x
Student:Test, Subject:y
Student:Test, Subject:z
Student:Test1, Subject:x
Student:Test2, Subject:x
Student:Test, Subject:w
Student:Test1, Subject:y
Student:Test2, Subject:y
Student:Test, Subject:v

On the Kafka consumer console, I expect output like the following:

Test:{x,y,z,w,v} => this should be the first response
Test1:{x,y}      => second
Test2:{x,y}      => third, by the end of the processing time
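For reference, the grouping I expect can be reproduced from the producer lines with plain Java (illustrative only; `GroupSample` is my own helper name, and it ignores the timing aspect):

```java
import java.util.*;

public class GroupSample {
    // Groups "Student:<name>, Subject:<s>" lines into "name:{s1,s2,...}"
    // strings, preserving first-seen student order (not Spark code).
    static List<String> group(List<String> lines) {
        Map<String, List<String>> subjectsByStudent = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",\\s*");
            String student = parts[0].substring("Student:".length());
            String subject = parts[1].substring("Subject:".length());
            subjectsByStudent.computeIfAbsent(student, k -> new ArrayList<>()).add(subject);
        }
        List<String> out = new ArrayList<>();
        subjectsByStudent.forEach((s, subs) -> out.add(s + ":{" + String.join(",", subs) + "}"));
        return out;
    }
}
```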

0 Answers:

No answers yet.