How to limit the result dataset/dataframe table to only the current incoming trigger in Spark?

Time: 2019-01-09 21:43:26

Tags: apache-spark spark-structured-streaming

As per the Spark Structured Streaming documentation at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html, rows derived from intermediate state are appended to the result table as each micro-batch of a streaming query is processed, so the size of the result table keeps growing with every incoming batch.

I can verify this behavior by printing out the size of the result dataset as well.


    private static Map<String, String> map = new HashMap<>();
    static {
        map.put("mode", "FAILFAST");
        map.put("kafka.bootstrap.servers", "localhost:9092");
        map.put("subscribe", "test5");
        map.put("startingOffsets", "earliest");
        map.put("maxOffsetsPerTrigger", "100");
    }

    public void exec(SparkSession sparkSession) {
        Dataset<Row> dataSet = sparkSession.readStream().format("kafka").options(map).load();
        dataSet = dataSet.selectExpr("CAST(key AS STRING)");
        Dataset<Row> countQuery = dataSet.selectExpr("COUNT(key)");
        // "complete" output mode: the whole (cumulative) aggregation result
        // is emitted on every trigger, which is what the output below shows.
        StreamingQuery sq1 = countQuery.writeStream().format("console").outputMode("complete").start();
        try {
            sq1.awaitTermination(10000);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

-------------------------------------------
Batch: 0
-------------------------------------------
+----------+
|count(key)|
+----------+
|        99|
+----------+
-------------------------------------------
Batch: 1
-------------------------------------------
+----------+
|count(key)|
+----------+
|       198|
+----------+

I am working on a project that reads data from a streaming source, applies transformations only to the current batch trigger, and publishes the result. So I want to restrict the result table to the processed output of the current batch only; it should not contain the contents of previous batches, which could also become a problem if the result table outgrows the available memory. How can I achieve this behavior?
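For reference, one way this per-batch behavior can be obtained is with `foreachBatch` (available since Spark 2.4, which predates this question's date), which hands every trigger's data to a function as a plain `Dataset`, so aggregations computed inside the function are scoped to that micro-batch alone and no cross-batch state accumulates. A minimal sketch, reusing the Kafka options from the question; the class name `PerBatchCount` is illustrative only:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class PerBatchCount {

    private static Map<String, String> map = new HashMap<>();
    static {
        map.put("mode", "FAILFAST");
        map.put("kafka.bootstrap.servers", "localhost:9092");
        map.put("subscribe", "test5");
        map.put("startingOffsets", "earliest");
        map.put("maxOffsetsPerTrigger", "100");
    }

    public void exec(SparkSession sparkSession) throws Exception {
        Dataset<Row> dataSet = sparkSession.readStream()
                .format("kafka").options(map).load()
                .selectExpr("CAST(key AS STRING)");

        StreamingQuery sq = dataSet.writeStream()
                .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batchDf, batchId) -> {
                    // batchDf holds only the rows of the current trigger,
                    // so this count is per-batch, not cumulative.
                    batchDf.selectExpr("COUNT(key)").show();
                })
                .start();

        sq.awaitTermination(10000);
    }
}
```

With `maxOffsetsPerTrigger` set to 100 as above, each invocation would see at most 100 rows regardless of how many batches have already run.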

0 Answers:

No answers yet