Reference for all DataStream transformations: map, flatMap, filter, keyBy, reduce, window, union, connect, process, and physical partitioning. Includes Java examples and notes on operator chaining.
Use this file to discover all available pages before exploring further.
Operators transform one or more DataStream instances into new ones. You chain them together to build your processing pipeline. Flink evaluates the entire chain lazily when you call env.execute().
Partitions a stream by key. All elements with the same key are routed to the same parallel task. keyBy returns a KeyedStream, which is required for keyed state and keyed operators like reduce and window.
Groups elements of a KeyedStream into windows. Windows collect elements and trigger computation when a condition is met (time elapsed, element count reached, etc.).
Opens windows on a non-keyed DataStream. All records are collected into a single task — this is non-parallel. Use it only for small-volume aggregations:
ProcessFunction gives you access to low-level runtime features: element timestamps, timers (event-time and processing-time), and side outputs. It is the most powerful but most verbose operator.Apply it to a keyed stream to access keyed state and register timers:
ProcessFunctionExample.java
import org.apache.flink.api.common.state.ValueState;import org.apache.flink.api.common.state.ValueStateDescriptor;import org.apache.flink.streaming.api.functions.KeyedProcessFunction;import org.apache.flink.util.Collector;public class AlertOnInactivity extends KeyedProcessFunction<String, Event, Alert> { private ValueState<Long> lastSeen; @Override public void open(OpenContext ctx) { lastSeen = getRuntimeContext().getState( new ValueStateDescriptor<>("last-seen", Long.class) ); } @Override public void processElement(Event event, Context ctx, Collector<Alert> out) throws Exception { lastSeen.update(ctx.timestamp()); // Register a timer 5 minutes from now ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 300_000L); } @Override public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alert> out) throws Exception { Long ts = lastSeen.value(); // If no new event arrived in the last 5 minutes, emit an alert if (ts != null && ts + 300_000L <= timestamp) { out.collect(new Alert(ctx.getCurrentKey(), "inactive for 5 minutes")); lastSeen.clear(); } }}stream.keyBy(Event::getUserId) .process(new AlertOnInactivity()) .print();
These operators control how data is distributed across parallel tasks without changing the stream’s type:
// Redistribute evenly round-robinstream.rebalance();// Random redistributionstream.shuffle();// Send all records to every downstream subtaskstream.broadcast();// Redistribute round-robin to a subset of downstream tasksstream.rescale();// Custom partitionerstream.partitionCustom( (key, numPartitions) -> key.hashCode() % numPartitions, event -> event.getCategory());
Flink automatically chains operators that have a 1-to-1 connection pattern (e.g., consecutive map calls). Chained operators run in the same thread and pass records directly without network serialization.To control chaining manually:
// Start a new chain at this operator — breaks the chain before itstream.map(f1).startNewChain().map(f2);// Disable chaining for a specific operatorstream.map(f1).disableChaining();// Disable chaining for the entire jobenv.disableOperatorChaining();
Assign names and descriptions to operators for easier debugging in the Flink Web UI:
stream .filter(e -> e.isValid()).name("filter-invalid") .map(e -> e.enrich()).name("enrich") .setDescription("Adds geo and user metadata to each event");