How can you handle late arriving data in spark streaming using
Use event-time watermarks with windowed aggregations , and for very late records, add a reprocessing/correction path. In Spark Structured Streaming, that is the standard way to handle late-arriving data because watermarks let Spark keep state only for a bounded delay, while window operations control how long results remain updateable.
Recommended approach
- Include an event-time column.
Late-data handling depends on an actual event timestamp, not just processing time.
- Apply a watermark.
Example:withWatermark("event_time", "10 minutes")tells Spark to accept records that arrive up to 10 minutes late and drop older ones from stateful aggregations.
- Use windowed aggregations if you need time-based results.
For example, group bywindow(event_time, "5 minutes"); Spark will maintain state for those windows until the watermark advances past them.
- Choose the output mode carefully.
appendis common for finalized window results, whileupdateis useful if you want intermediate values to change as more data arrives.
- Handle extremely late data separately.
If records arrive after the watermark, route them to a batch correction job or a recovery stream so you can recompute affected aggregates.
Example
python
from pyspark.sql.functions import col, window
events = stream_df.withWatermark("event_time", "10 minutes")
result = (
events
.groupBy(window(col("event_time"), "5 minutes"), col("event_type"))
.count()
)
Practical guidance
- Set the watermark based on real-world delay patterns, not a guess.
- Keep the watermark as small as possible to control memory usage, but large enough to capture normal delays.
- If correctness matters more than latency, pair streaming with periodic backfill/reconciliation.
If you want, I can turn this into a full PySpark Structured Streaming example with Kafka, watermarks, and late-data reprocessing.