Chapter 1: Foundations in Probabilistic State Estimation
1.1 The Kalman Filter: A New Approach to Linear Filtering and Prediction Problems
The genesis of modern trajectory prediction and state estimation can be traced to a single, foundational publication: Kalman, R. E. (1960). "A New Approach to Linear Filtering and Prediction Problems." It provided an elegant and computationally efficient recursive solution to the discrete-data linear filtering problem. The Kalman filter uses a series of measurements observed over time, containing statistical noise, to produce estimates of unknown variables that are more accurate than those based on a single measurement alone.
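The recursion at the heart of the filter alternates a predict step (propagate the state and grow its uncertainty) and an update step (blend the prediction with a new measurement via the Kalman gain). A minimal scalar sketch, assuming a constant-position motion model and illustrative noise variances:

```python
# Minimal 1D Kalman filter: estimate a position from noisy measurements.
# The constant-position motion model and the noise variances q, r are
# illustrative assumptions, not tied to any particular system.

def kalman_1d(measurements, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Recursively fuse noisy measurements into a state estimate.

    q: process-noise variance, r: measurement-noise variance,
    x0/p0: initial state estimate and its variance.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: constant-position model, so only uncertainty grows.
        p = p + q
        # Update: the Kalman gain k blends prediction and measurement.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return estimates

noisy = [5.1, 4.8, 5.3, 4.9, 5.0, 5.2]
est = kalman_1d(noisy, x0=noisy[0])  # estimates settle near 5.0
```

Because each step needs only the previous estimate and its variance, the filter runs in constant memory per update, which is what made it so computationally attractive.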
1.2 Limitations and the Path Beyond Linearity
Despite its power, the classical Kalman filter's optimality is predicated on restrictive assumptions: the system dynamics and measurement models must be linear, and the noise must follow a Gaussian distribution. Real-world motion is highly non-linear, especially during maneuvers. Extensions such as the Extended and Unscented Kalman Filters relax the linearity assumption through local approximations, but they still depend on a hand-crafted motion model. This limitation made it clear that to achieve the next leap in performance, the field had to pivot from explicitly modeling motion to implicitly learning behavior and interaction directly from data, setting the stage for the deep learning revolution.
Chapter 2: The Advent of Deep Learning and Recurrent Architectures
2.1 Modeling Trajectories as Sequences with LSTMs
The rise of deep learning shifted the paradigm from hand-crafted models to learning temporal patterns from data. Long Short-Term Memory (LSTM) networks were a natural fit, processing a trajectory as a sequence of coordinates to encode the agent's dynamics implicitly.
2.2 Landmark Paper - "Social LSTM: Human Trajectory Prediction in Crowded Spaces"
This new paradigm was crystallized in a highly influential paper by Alahi, A., et al. (2016): "Social LSTM: Human Trajectory Prediction in Crowded Spaces." The paper's central innovation was the "Social Pooling" layer, a novel mechanism for sharing information between the LSTMs of nearby agents. It was among the first deep learning models to demonstrate joint reasoning about the paths of multiple interacting agents in a scene.
2.3 The "Averaging Problem": A Critical Flaw
Social LSTM was trained using L2 loss (Mean Squared Error), which fails in multimodal scenarios. To minimize the average error across multiple possible futures (e.g., turn left, turn right, go straight), the model learns to predict an "average" path that is often physically impossible and socially nonsensical. This failure motivated the next generation of research focused on generative models.
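The failure mode is easy to verify numerically: against two equally likely futures, the prediction that minimizes expected squared error is their midpoint, not either real option. A tiny worked example, with the endpoints chosen purely for illustration:

```python
# Two equally likely future endpoints (turn left vs. turn right).
left, right = (-2.0, 4.0), (2.0, 4.0)

def expected_mse(pred):
    """Average squared error of one prediction over both modes."""
    return sum(
        sum((p - t) ** 2 for p, t in zip(pred, target)) / 2.0
        for target in (left, right)
    )

midpoint = (0.0, 4.0)  # straight ahead, between the two real options
# The midpoint beats either actual mode under expected L2 loss,
# even though no agent ever takes that path.
assert expected_mse(midpoint) < expected_mse(left)
assert expected_mse(midpoint) < expected_mse(right)
```

A single-output model trained with MSE is therefore pushed toward exactly this kind of mode-averaged prediction, which is why later work turned to generative formulations that can represent several futures at once.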
Chapter 3: Tackling Multimodality with Generative Models
3.1 Landmark Paper - "Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks"
Among the earliest and most influential works to apply the GAN framework to trajectory prediction was Gupta, A., et al. (2018): "Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks." It leverages an adversarial process where a Generator creates diverse trajectories and a Discriminator learns to distinguish them from real ones, forcing the generator to learn the distribution of socially acceptable motion.
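Alongside the adversarial objective, Social GAN encourages diversity with a "variety loss": the generator draws several candidate futures but only the one closest to the ground truth is penalized, leaving the other samples free to cover different modes. A minimal sketch, with trajectories as lists of (x, y) points and all shapes chosen for illustration:

```python
# Sketch of a variety (best-of-k) loss: of k sampled futures, only the
# closest to the ground truth contributes to the loss. Pure Python;
# the trajectory shapes and sample values are illustrative.

def l2_error(traj, gt):
    """Summed squared displacement between a sample and the truth."""
    return sum((x - gx) ** 2 + (y - gy) ** 2
               for (x, y), (gx, gy) in zip(traj, gt))

def variety_loss(samples, gt):
    """samples: k candidate trajectories; gt: the observed future."""
    return min(l2_error(s, gt) for s in samples)

gt = [(0.0, 1.0), (0.0, 2.0)]
samples = [
    [(0.1, 1.0), (0.1, 2.0)],    # close to the truth
    [(-2.0, 1.0), (-3.0, 2.0)],  # a different, also-plausible mode
]
loss = variety_loss(samples, gt)  # only the first sample's error counts
```

Because the second sample is never penalized here, the generator is not punished for proposing an alternative maneuver, which is precisely what an averaged L2 loss forbids.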
3.2 Landmark Paper - "DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents"
An alternative generative framework was introduced by Lee, N., et al. (2017) in "DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents." This work presented a "generate-then-refine" architecture, using a Conditional VAE (CVAE) to generate diverse hypotheses which are then ranked and refined by a second module inspired by Inverse Optimal Control.
Chapter 4: Explicit Interaction Modeling with Graph Neural Networks
4.1 Landmark Paper - "Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network"
A pivotal moment came with Mohamed, A., et al. (2020) and their paper, "Social-STGCNN." This work demonstrated that recurrent architectures were not essential, introducing a fully graph-based model using Spatio-Temporal Graph Convolutions to capture both social and motion dynamics. It achieved state-of-the-art accuracy while being dramatically more efficient.
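The social interactions in such models are encoded in a weighted adjacency matrix built per frame. A minimal sketch of one common construction, assuming an inverse-distance kernel for edge weights followed by symmetric normalization with self-loops (the exact kernel is an assumption of this sketch):

```python
import math

# Sketch of a per-frame social graph for a spatio-temporal graph
# convolution: edge weights from an inverse-distance kernel, then
# symmetric normalization of (A + I). Kernel choice is illustrative.

def social_adjacency(positions):
    """positions: list of (x, y) agent coordinates for one frame."""
    n = len(positions)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                a[i][j] = 1.0  # self-loop (the +I term)
            else:
                (xi, yi), (xj, yj) = positions[i], positions[j]
                d = math.hypot(xi - xj, yi - yj)
                # Closer agents influence each other more strongly.
                a[i][j] = 1.0 / d if d > 0 else 0.0
    # Symmetric normalization: D^(-1/2) (A + I) D^(-1/2).
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

adj = social_adjacency([(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)])
```

Stacking one such matrix per timestep gives the spatio-temporal graph over which the convolutions operate, replacing the recurrent state-passing of LSTM-based models.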
Chapter 5: The Transformer Era and the Dominance of Attention
5.1 Landmark Paper - "Scene Transformer: A unified architecture for predicting multiple agent trajectories"
One of the key works establishing the Transformer as the new state-of-the-art was Ngiam, J., et al. (2022) with "Scene Transformer: A unified architecture for predicting multiple agent trajectories." It ingests the entire scene at once—agents and road elements—and uses a novel "factored attention" to model interactions efficiently, producing jointly consistent future trajectories for all agents simultaneously.
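The efficiency argument for factoring attention along the time and agent axes, rather than attending jointly over every (agent, timestep) token, can be seen with a back-of-envelope operation count. The numbers below are illustrative, not from the paper:

```python
# Back-of-envelope attention-matrix sizes for a scene with A agents
# observed over T timesteps. Joint attention over all A*T tokens needs
# (A*T)^2 score entries; factoring into a time-axis pass and an
# agent-axis pass needs A*T^2 + T*A^2. Numbers are illustrative.

def joint_scores(a, t):
    """One attention matrix over all agent-timestep tokens."""
    return (a * t) ** 2

def factored_scores(a, t):
    # Time attention: for each agent, a T x T score matrix.
    # Agent attention: for each timestep, an A x A score matrix.
    return a * t * t + t * a * a

a, t = 64, 20  # e.g. 64 agents over 20 timesteps
print(joint_scores(a, t))     # 1638400 pairwise scores
print(factored_scores(a, t))  # 107520 pairwise scores
```

For this (hypothetical) scene size, factoring cuts the number of pairwise scores by more than an order of magnitude, which is what makes whole-scene, multi-agent attention tractable.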
5.2 The Current Frontier: Motion Transformer (MTR) and Beyond
The dominance of the Transformer was further solidified by models like the Motion Transformer (MTR), which won the motion prediction track of the 2022 Waymo Open Dataset Challenges. Its key insight was to use a set of learnable "motion queries," where each query specializes in proposing a specific motion mode (e.g., left turn, lane change), leading to higher-quality multimodal predictions.