How can I merge two DataFrames with the Java Spark Structured Streaming API without adding an index and joining them?

Asked: 2018-12-13 18:37:05

Tags: spark-structured-streaming

I want to combine columns that come from different DataFrames, without adding an index to each and joining on it.

public static void count() throws StreamingQueryException {
    SparkSession session = SparkSession.builder().appName("streamFromKafka").master("local[*]").getOrCreate();

    Dataset<Row> df = session.readStream().format("kafka")
            .option("group.id","test-consumer-group")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "test").load();

    df.printSchema();

    StructField word = DataTypes.createStructField("word", DataTypes.StringType, true);
    StructField timestamp = DataTypes.createStructField("tstamp", DataTypes.LongType, true);
    StructType schema = DataTypes.createStructType(Arrays.asList(word, timestamp));

    Dataset<Row> df1 = df.selectExpr("CAST(key AS STRING) as KEY", "CAST(value AS STRING) AS value", "CAST(timestamp AS TIMESTAMP) AS timestamp");

    Dataset<Row> df2 = df1.select(functions.from_json(functions.col("value"), schema).as("WORD"), functions.col("timestamp"));

    df2.printSchema();

    Dataset<Row> df3 = df2.selectExpr("WORD.word AS w", "timestamp");

    df3.printSchema();

    //Let's say we have to do some sort of mapping on column "w", for example, the following:

    //Dataset<Boolean> ttt = df3.repartition(1, df3.col("w"), df3.col("timestamp")).map(
    //  line-> {
    //      int number = Integer.parseInt(line.getString(0));
    //      return (number % 2 != 0);
    //  }, Encoders.BOOLEAN());

    //How can I merge ttt with df3["timestamp"] into odds?

    //Dataset<Row> odds = ...

    Dataset<Row> df4 = odds.groupBy(functions.window(odds.col("timestamp"), "10 seconds", "5 seconds"), odds.col("w")).count();

    StreamingQuery query1 = df4.writeStream().format("console").option("truncate", false).outputMode("complete").trigger(Trigger.ProcessingTime(10000)).start();

    query1.awaitTermination();


}

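Obviously, I could add an index and join the DataFrames as described here, or I could register a user-defined function that operates on column "w". However, I don't want to use either of these approaches. I would like to know how to merge two DataFrames in the general case.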
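For the specific mapping in the commented-out code, one way to avoid the merge altogether is to have the mapping return the derived value together with the columns it must stay aligned with, so there is never a second Dataset to line up. The following is only a minimal sketch of that idea, not a confirmed answer: `OddRecord` and `fromRow` are hypothetical names introduced for illustration and are not part of Spark, and the class is plain Java so the per-row logic can be checked without a running stream.

```java
import java.io.Serializable;
import java.sql.Timestamp;

// Hypothetical bean (not part of Spark): one output row holding the derived
// flag together with the columns it was computed from.
// Declare it public in its own file if it is used with a bean encoder.
class OddRecord implements Serializable {
    private String w;
    private Timestamp timestamp;
    private boolean odd;

    // No-argument constructor plus getters/setters, as bean-style encoders expect.
    public OddRecord() { }

    public String getW() { return w; }
    public void setW(String w) { this.w = w; }

    public Timestamp getTimestamp() { return timestamp; }
    public void setTimestamp(Timestamp timestamp) { this.timestamp = timestamp; }

    public boolean isOdd() { return odd; }
    public void setOdd(boolean odd) { this.odd = odd; }

    // Same check as the commented-out lambda in the question, but the result
    // keeps "w" and "timestamp" next to the new value.
    // Integer.parseInt throws NumberFormatException when w is null or not an integer.
    static OddRecord fromRow(String w, Timestamp timestamp) {
        int number = Integer.parseInt(w);
        OddRecord result = new OddRecord();
        result.setW(w);
        result.setTimestamp(timestamp);
        result.setOdd(number % 2 != 0);
        return result;
    }
}
```

In the streaming job, `fromRow` could presumably be called from the lambda passed to `df3.map(...)`, reading `line.getString(0)` and `line.getTimestamp(1)` and using `Encoders.bean(OddRecord.class)` as the encoder. That wiring is an assumption and has not been run against the Kafka source above; it also does not address merging two unrelated DataFrames in the general case.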

I tried using a JavaRDD or an RDD instead, which seemed convenient. However, when I make either of the following calls:

dataframe.toJavaRDD();
dataframe.rdd();

I get the following exception:

  

Queries with streaming sources must be executed with writeStream.start();;

0 answers:

No answers