Manipulating the trigger interval in Spark Structured Streaming

Asked: 2019-10-24 14:52:16

Tags: apache-spark apache-kafka apache-spark-sql spark-structured-streaming

For a given scenario, I want to filter datasets in Structured Streaming using a combination of continuous and batch triggers.

I know this sounds unrealistic, and it may not be feasible. Here is what I am trying to achieve.

Let the processing-time interval set in the application be 5 minutes, and let records conform to the following schema:

    {
      "type":"record",
      "name":"event",
      "fields":[
        { "name":"Student", "type":"string" },
        { "name":"Subject", "type":"string" }
      ]
    }

My streaming application should write results to the sink when either of the following two conditions is met:

  1. A student has more than 5 subjects. (This criterion takes priority.)

  2. The processing time specified in the trigger has expired.
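Conceptually, the flush behaviour I want could be sketched in plain Java like this (illustrative only, not Spark code; `FlushSketch`, `SUBJECT_LIMIT` and the flush-at-5 comparison are my own names and choices, chosen to match the consumer output shown further down):

```java
import java.util.*;

// Plain-Java sketch of the two flush rules: buffer subjects per student,
// emit a student's record early once it reaches the subject limit
// (condition 1), and emit everything left when the processing-time
// interval expires (condition 2).
public class FlushSketch {
    static final int SUBJECT_LIMIT = 5; // hypothetical threshold for condition 1

    private final Map<String, List<String>> buffer = new LinkedHashMap<>();
    private final List<String> sink = new ArrayList<>();

    void onRecord(String student, String subject) {
        List<String> subjects = buffer.computeIfAbsent(student, k -> new ArrayList<>());
        subjects.add(subject);
        if (subjects.size() >= SUBJECT_LIMIT) { // condition 1: early flush
            flush(student, subjects);
            buffer.remove(student);
        }
    }

    void onIntervalExpired() {                  // condition 2: time-based flush
        buffer.forEach(this::flush);
        buffer.clear();
    }

    private void flush(String student, List<String> subjects) {
        sink.add(student + ":{" + String.join(",", subjects) + "}");
    }

    List<String> sink() { return sink; }
}
```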

    // Imports needed by this snippet (SchemaConverters comes from the
    // spark-avro module; Injection/GenericAvroCodecs from Twitter Bijection).
    import com.twitter.bijection.Injection;
    import com.twitter.bijection.avro.GenericAvroCodecs;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.spark.api.java.function.VoidFunction2;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.avro.SchemaConverters;
    import org.apache.spark.sql.functions;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import org.apache.spark.sql.streaming.Trigger;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // Avro <-> byte[] codec and the equivalent Spark SQL schema.
    private static Injection<GenericRecord, byte[]> recordInjection;
    private static StructType type;

    public static final String USER_SCHEMA = "{"
            + "\"type\":\"record\","
            + "\"name\":\"alarm\","
            + "\"fields\":["
            + "  { \"name\":\"student\", \"type\":\"string\" },"
            + "  { \"name\":\"subject\", \"type\":\"string\" }"
            + "]}";

    private static Schema.Parser parser = new Schema.Parser();
    private static Schema schema = parser.parse(USER_SCHEMA);

    static {
        recordInjection = GenericAvroCodecs.toBinary(schema);
        type = (StructType) SchemaConverters.toSqlType(schema).dataType();
    }

    // UDF that decodes Kafka's binary "value" column into (student, subject).
    sparkSession.udf().register("deserialize", (byte[] data) -> {
        GenericRecord record = recordInjection.invert(data).get();
        return RowFactory.create(record.get("student").toString(), record.get("subject").toString());
    }, DataTypes.createStructType(type.fields()));

    // ds1 is the streaming Dataset read from Kafka elsewhere in the application.
    Dataset<Row> ds2 = ds1
            .select("value").as(Encoders.BINARY())
            .selectExpr("deserialize(value) as rows")
            .select("rows.*")
            .selectExpr("student", "subject");

    StreamingQuery query1 = ds2
            .writeStream()
            .foreachBatch(
                new VoidFunction2<Dataset<Row>, Long>() {
                  @Override
                  public void call(Dataset<Row> rowDataset, Long batchId) throws Exception {
                    // Collapse each student's subjects into one comma-separated value.
                    rowDataset
                        .groupBy("student")
                        .agg(functions.concat_ws(",", functions.collect_list("subject")).alias("value"));
                    // NOTE: the aggregated Dataset is not written anywhere here, and
                    // foreachBatch plus format("kafka") below configure two sinks,
                    // of which Spark will honour only one.
                  }
                }
            ).format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("topic", "new_in")
            .option("checkpointLocation", "checkpoint")
            .outputMode("append")
            .trigger(Trigger.ProcessingTime(10000)) // 10-second micro-batches
            .start();
    query1.awaitTermination();
    

Kafka producer console:

Student:Test, Subject:x
Student:Test, Subject:y
Student:Test, Subject:z
Student:Test1, Subject:x
Student:Test2, Subject:x
Student:Test, Subject:w
Student:Test1, Subject:y
Student:Test2, Subject:y
Student:Test, Subject:v

On the Kafka consumer console, I expect output like the following:

Test:{x,y,z,w,v} => this should be the first response
Test1:{x,y}      => second
Test2:{x,y}      => third, by the end of the processing time
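For reference, the grouping I expect can be reproduced from the producer lines with plain Java (illustrative only; `GroupSample` is my own helper name, and it ignores the timing aspect):

```java
import java.util.*;

public class GroupSample {
    // Groups "Student:<name>, Subject:<s>" lines into "name:{s1,s2,...}"
    // strings, preserving first-seen student order (not Spark code).
    static List<String> group(List<String> lines) {
        Map<String, List<String>> subjectsByStudent = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",\\s*");
            String student = parts[0].substring("Student:".length());
            String subject = parts[1].substring("Subject:".length());
            subjectsByStudent.computeIfAbsent(student, k -> new ArrayList<>()).add(subject);
        }
        List<String> out = new ArrayList<>();
        subjectsByStudent.forEach((s, subs) -> out.add(s + ":{" + String.join(",", subs) + "}"));
        return out;
    }
}
```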

0 Answers:

No answers yet.