Use event-time watermarks with windowed aggregations , and for very late records, add a reprocessing/correction path. In Spark Structured Streaming, that is the standard way to handle late-arriving data because watermarks let Spark keep state only for a bounded delay, while window operations control how long results remain updateable.

Recommended approach

  1. Include an event-time column.
    Late-data handling depends on an actual event timestamp, not just processing time.
  1. Apply a watermark.
    Example: withWatermark("event_time", "10 minutes") tells Spark to accept records that arrive up to 10 minutes late and drop older ones from stateful aggregations.
  1. Use windowed aggregations if you need time-based results.
    For example, group by window(event_time, "5 minutes"); Spark will maintain state for those windows until the watermark advances past them.
  1. Choose the output mode carefully.
    append is common for finalized window results, while update is useful if you want intermediate values to change as more data arrives.
  1. Handle extremely late data separately.
    If records arrive after the watermark, route them to a batch correction job or a recovery stream so you can recompute affected aggregates.

Example

python

from pyspark.sql.functions import col, window

events = stream_df.withWatermark("event_time", "10 minutes")

result = (
    events
    .groupBy(window(col("event_time"), "5 minutes"), col("event_type"))
    .count()
)

Practical guidance

  • Set the watermark based on real-world delay patterns, not a guess.
  • Keep the watermark as small as possible to control memory usage, but large enough to capture normal delays.
  • If correctness matters more than latency, pair streaming with periodic backfill/reconciliation.

If you want, I can turn this into a full PySpark Structured Streaming example with Kafka, watermarks, and late-data reprocessing.