This paper studies the problem of predicting the distribution over multiple possible future paths of people as they move through various visual scenes.
We make two main contributions.
The first contribution is a new dataset, created in a realistic 3D simulator,
which is grounded in real-world trajectory data and then
extrapolated by human annotators to achieve different latent goals.
This provides the first benchmark for quantitative evaluation
of models that predict multi-future trajectories.
The second contribution is a new model for generating multiple plausible future trajectories,
built on two novel design elements: multi-scale location encodings and convolutional RNNs over graphs (sketched below).
We refer to our model as Multiverse.
We show that our model achieves the best results
on our dataset, as well as on the real-world VIRAT/ActEV dataset
(which contains only one possible future per trajectory).
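To make these two design elements concrete, here is a minimal sketch in PyTorch, assuming an illustrative grid discretization of the scene; the helper `encode_location`, the grid sizes, and the channel counts are hypothetical, and this is not the released implementation. Encoding the same location at a coarse and a fine resolution gives the multi-scale location encoding, and the convolutional GRU updates a hidden feature map with convolutional gates, so each grid cell's state depends on its spatial neighbors, i.e., an RNN over the grid graph.

```python
import torch
import torch.nn as nn

def encode_location(xy, grid_hw, extent):
    """Rasterize a ground-plane location (x, y) into a one-hot occupancy grid.

    Calling this at two resolutions yields a coarse-plus-fine
    multi-scale location encoding. (Hypothetical helper for illustration.)
    """
    h, w = grid_hw
    gx = min(int(xy[0] / extent[0] * w), w - 1)
    gy = min(int(xy[1] / extent[1] * h), h - 1)
    grid = torch.zeros(1, 1, h, w)
    grid[0, 0, gy, gx] = 1.0
    return grid

class ConvGRUCell(nn.Module):
    """Convolutional GRU over an H x W grid graph: the gates are convolutions,
    so each cell's hidden state is updated from its spatial neighbors."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h):
        # z: update gate, r: reset gate, both computed convolutionally.
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

# Coarse and fine one-hot encodings of the same location (illustrative sizes).
coarse = encode_location((12.0, 5.0), grid_hw=(18, 9), extent=(36.0, 18.0))
fine = encode_location((12.0, 5.0), grid_hw=(36, 18), extent=(36.0, 18.0))
```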
High Definition Video Demonstration
The Forking Paths Dataset provides high-resolution videos along with accurate bounding box and scene semantic segmentation annotations.
Demo Video
Scene Reconstruction from Real Videos
We recreate the static scene and dynamic trajectories from real-world videos.
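As a rough illustration of how real-world trajectories can be transferred into a reconstructed scene, the sketch below uses a ground-plane homography between image pixels and world coordinates. The correspondence points are hypothetical, and this is only one plausible way to do the mapping, not necessarily the pipeline used here.

```python
import numpy as np
import cv2

# Hypothetical correspondences between image pixels and ground-plane
# world coordinates (e.g. manually matched landmark pairs).
img_pts = np.float32([[100, 400], [520, 410], [480, 120], [140, 130]])
world_pts = np.float32([[0.0, 0.0], [20.0, 0.0], [20.0, 30.0], [0.0, 30.0]])

H, _ = cv2.findHomography(img_pts, world_pts)

# A pixel-space trajectory (one point per frame) from the real video.
traj_px = np.float32([[[300, 350]], [[310, 340]], [[322, 331]]])

# Map it onto the ground plane to drive the recreated agent in the simulator.
traj_world = cv2.perspectiveTransform(traj_px, H)
print(traj_world.reshape(-1, 2))
```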
Editing Interface
This is the editing interface. We can easily add, delete, and edit agent behaviors in each scenario. We also provide pan-tilt-zoom operations so that we can easily find the desired camera view for recording videos.
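For intuition about the pan-tilt-zoom controls, the snippet below is a generic sketch (not the interface's actual code; `ptz_to_view` is a hypothetical helper) of how pan and tilt angles determine a viewing direction and how zoom narrows the field of view.

```python
import numpy as np

def ptz_to_view(pan_deg, tilt_deg, zoom, base_fov_deg=90.0):
    """Convert pan/tilt angles to a unit look direction and zoom to a FOV.

    Pan rotates about the vertical axis, tilt about the horizontal axis.
    (Hypothetical helper for illustration.)
    """
    pan, tilt = np.radians(pan_deg), np.radians(tilt_deg)
    look = np.array([np.cos(tilt) * np.cos(pan),
                     np.cos(tilt) * np.sin(pan),
                     np.sin(tilt)])
    fov = base_fov_deg / zoom  # zooming in narrows the field of view
    return look, fov

look, fov = ptz_to_view(pan_deg=30.0, tilt_deg=-20.0, zoom=2.0)
```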
Annotation Procedure
This is the human annotation procedure. Annotators are first shown a bird's-eye view of the scene and are then asked to control the agent to reach the destination.