I want to merge columns that come from different dataframes, without adding an index column and joining on it.
public static void count() throws StreamingQueryException {
    SparkSession session = SparkSession.builder()
            .appName("streamFromKafka")
            .master("local[*]")
            .getOrCreate();

    Dataset<Row> df = session.readStream().format("kafka")
            .option("group.id", "test-consumer-group")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "test")
            .load();
    df.printSchema();

    StructField word = DataTypes.createStructField("word", DataTypes.StringType, true);
    StructField timestamp = DataTypes.createStructField("tstamp", DataTypes.LongType, true);
    StructType schema = DataTypes.createStructType(Arrays.asList(word, timestamp));

    Dataset<Row> df1 = df.selectExpr(
            "CAST(key AS STRING) AS KEY",
            "CAST(value AS STRING) AS value",
            "CAST(timestamp AS TIMESTAMP) AS timestamp");
    Dataset<Row> df2 = df1.select(
            functions.from_json(functions.col("value"), schema).as("WORD"),
            functions.col("timestamp"));
    df2.printSchema();

    Dataset<Row> df3 = df2.selectExpr("WORD.word AS w", "timestamp");
    df3.printSchema();

    // Let's say we have to do some sort of mapping on column "w", for example, the following:
    // Dataset<Boolean> ttt = df3.repartition(1, df3.col("w"), df3.col("timestamp")).map(
    //         line -> {
    //             int number = Integer.parseInt(line.getString(0));
    //             return (number % 2 != 0);
    //         }, Encoders.BOOLEAN());
    // How can I merge ttt with df3["timestamp"] into odds?
    // Dataset<Row> odds = ...
    Dataset<Row> df4 = odds.groupBy(
            functions.window(odds.col("timestamp"), "10 seconds", "5 seconds"),
            odds.col("w")).count();

    StreamingQuery query1 = df4.writeStream()
            .format("console")
            .option("truncate", false)
            .outputMode("complete")
            .trigger(Trigger.ProcessingTime(10000))
            .start();
    query1.awaitTermination();
}
Obviously, I could add an index column and join the dataframes as described here, or I could register a user-defined function that operates on column "w". However, I don't want to use either of those approaches. I want to know how to merge two dataframes in the general case.
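One way to sidestep the merge entirely, sketched below under the assumption that a `scala.Tuple2` result is acceptable, is to carry the timestamp through the `map` itself instead of producing a `Dataset<Boolean>` and trying to stitch it back onto `df3` afterwards (column names here follow the code above; the `toDF` renames are my own choice):

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import scala.Tuple2;

// Map each row of df3 to a (parity, timestamp) pair in one pass,
// so there is no second dataframe to merge or join.
Dataset<Row> odds = df3.map(
        (MapFunction<Row, Tuple2<Boolean, java.sql.Timestamp>>) row -> {
            int number = Integer.parseInt(row.getString(0));   // column "w"
            return new Tuple2<>(number % 2 != 0, row.getTimestamp(1));
        },
        Encoders.tuple(Encoders.BOOLEAN(), Encoders.TIMESTAMP()))
    .toDF("w", "timestamp");  // rename _1/_2 so the later groupBy still works
```

Since the timestamp never leaves the row, the windowed `groupBy` can run on `odds` directly, with no index column and no join.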
I tried using javaRDD or RDD, which seemed like a natural fit. However, when I make either of the following calls:
dataframe.toJavaRDD();
dataframe.rdd();
I get the following exception:
Queries with streaming sources must be executed with writeStream.start();;
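That exception is expected: a streaming Dataset cannot be converted to an RDD directly, because no data exists until a query is started. A possible workaround, sketched here assuming Spark 2.4+ is available, is `foreachBatch`, inside which each micro-batch arrives as an ordinary non-streaming Dataset (the cast disambiguates the Java and Scala overloads):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.StreamingQuery;

// Inside foreachBatch the Dataset is a plain batch Dataset, so
// toJavaRDD()/rdd() no longer throw the streaming-source error.
StreamingQuery query = df3.writeStream()
        .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batch, batchId) -> {
            JavaRDD<Row> rdd = batch.toJavaRDD();  // legal here
            // ... arbitrary per-micro-batch RDD logic ...
        })
        .start();
```

The trade-off is that logic inside `foreachBatch` runs per micro-batch, so stateful operations spanning batches (such as the windowed aggregation above) still have to stay in the streaming plan itself.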