I'd like to understand the following:
In Spark Structured Streaming, there is the notion of a trigger, which says at what interval Spark will try to read data to start processing. What I would like to know is: how long can the reading operation last? In particular, in the context of Kafka, what exactly happens? Say we have configured Spark to always start from the latest offsets. Does Spark try to read an arbitrary amount of data on each trigger (i.e. from where it last left off up to the latest offset available)? What if the reading operation takes longer than the trigger interval? What is supposed to happen at that point?
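For concreteness, here is a minimal sketch of the kind of setup I am asking about (the broker address, topic name, and 10-second interval are just placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("kafka-trigger-question")
  .getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // placeholder broker
  .option("subscribe", "my-topic")                // placeholder topic
  // On the very first batch, start from the latest available offsets.
  .option("startingOffsets", "latest")
  .load()

val query = df.writeStream
  .format("console")
  // Ask Spark to start a new micro-batch every 10 seconds.
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

query.awaitTermination()
```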
I wonder if there is a reading-time limit that can be set, as in: on every trigger, keep reading for this amount of time. Or is the rate actually controlled in the two following ways: (1) manually, with maxOffsetsPerTrigger, in which case the trigger interval does not really matter, or (2) by choosing a trigger interval that makes sense with respect to how much data may be available and processable between triggers? The second option sounds quite difficult to calibrate.
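And here is option (1) as I understand it, with a per-batch cap set on the source (the 10000 value is an arbitrary placeholder):

```scala
val limited = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "my-topic")
  // Cap each micro-batch at roughly this many offsets,
  // spread proportionally across the topic's partitions.
  .option("maxOffsetsPerTrigger", "10000")
  .load()
```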
If someone could clarify these points for me, it would be much appreciated.
Thanks