CMU Researchers Collect Soccer, Bike Repair, Cooking Videos for Meta Dataset

Ego-Exo Video Data To Help Model Activities From Different Perspectives

Aaron Aupperlee | Thursday, November 30, 2023

SCS researchers captured videos from all angles for Ego-Exo4D, a foundational dataset and benchmark suite to support research on video learning and multimodal perception. As part of the project, RI Ph.D. student Shun Iwase (left) gathered soccer videos from Pittsburgh Riverhounds SC player Marc Ybarra.

From the pitch to the shop to the kitchen, Carnegie Mellon University researchers captured videos from all angles to help better model activities from different perspectives.

The work was part of Ego-Exo4D, a foundational dataset and benchmark suite to support research on video learning and multimodal perception. Ego-Exo4D is the result of a two-year effort by Meta's Fundamental Artificial Intelligence Research (FAIR), its Project Aria and 14 university partners.

Unique to the dataset is its simultaneous capture of both first-person egocentric views from a participant's wearable camera and multiple exocentric views from cameras surrounding the participant. The two perspectives are complementary. While the egocentric perspective reveals what the participant sees and hears, the exocentric views reveal the surrounding scene and context. Together, the two perspectives give AI models a new window into complex human skills. 

"By collecting both egocentric and exocentric views, the dataset can show researchers what activities look like from different perspectives and eventually help them develop computer vision algorithms that can recognize what a person is doing from any perspective," said Kris Kitani, an associate research professor at the Robotics Institute (RI) in CMU's School of Computer Science. "In the future, this footage and the algorithms it informs could help athletes train; create more accurate 3D renders of hands, arms and other body parts; and make human movements more lifelike in virtual reality."

Kitani worked with Shubham Tulsiani, an assistant professor in the RI; Sean Crane, an RI research associate who handled all the recording; and Rawal Khirodkar, a former robotics Ph.D. student who, during his time at CMU, developed algorithms to extract 3D human poses for the dataset.

CMU researchers collected videos of three activities: soccer, bike repairs and cooking. The researchers used a single wearable camera to capture the ego-view footage. Static cameras placed around the subject recorded the exo-view video. 
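To make the idea of time-synchronized ego and exo footage concrete, here is a minimal, hypothetical sketch (not the Ego-Exo4D API) of pairing frames from one wearable camera with frames from several static cameras by nearest capture timestamp. The Frame structure, field names and the 50 ms tolerance are illustrative assumptions.

```python
# Illustrative sketch only: align an egocentric frame stream with several
# exocentric streams by nearest timestamp. All names here are assumptions,
# not part of the released Ego-Exo4D tooling.
from bisect import bisect_left
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float  # seconds since the start of the recording
    path: str         # where the decoded image lives on disk

def nearest(frames: list[Frame], t: float) -> Frame:
    """Return the frame whose timestamp is closest to t (frames sorted by time)."""
    i = bisect_left([f.timestamp for f in frames], t)
    candidates = frames[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda f: abs(f.timestamp - t))

def pair_views(ego: list[Frame], exo_streams: dict[str, list[Frame]], tol: float = 0.05):
    """For each ego frame, gather the closest frame from every exo camera
    that falls within `tol` seconds, yielding one multi-view sample."""
    for ego_frame in ego:
        views = {"ego": ego_frame}
        for cam_id, frames in exo_streams.items():
            match = nearest(frames, ego_frame.timestamp)
            if abs(match.timestamp - ego_frame.timestamp) <= tol:
                views[cam_id] = match
        yield views
```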

The researchers gathered soccer videos from CMU students and members of the Pittsburgh Riverhounds SC, a professional team playing in the USL Championship. Seasoned bike mechanics in Allegheny County contributed to the bike repair videos. A professional chef working in his kitchen recorded the cooking footage.

"The Ego-Exo4D dataset provides researchers with the tools needed to further explore the relationships between ego and exo videos and how they can be used to better develop multimodal activity recognition, virtual and augmented reality technologies, and interactive training and learning opportunities," Kitani said.

Working together as a consortium, FAIR or its university partners captured these perspectives with the help of more than 800 skilled participants in the United States, Japan, Colombia, Singapore, India and Canada. In December, the consortium will open source the data (including more than 1,400 hours of video) and annotations for novel benchmark tasks.

Ego-Exo4D constitutes the largest public dataset of time-synchronized first- and third-person video. Building it required recruiting skilled specialists across many domains and bringing diverse groups of people together to create a multifaceted AI dataset. All scenarios feature real-world experts, where the camera wearer has specific credentials, training or expertise in the skill being demonstrated.

CMU has been part of the Ego4D consortium for the past four years. The consortium, a collection of 14 universities and academic institutions brought together by Meta, focuses on collecting egocentric video data to encourage more computer vision research for wearable cameras.

CMU researchers contributed more than 500 hours of footage to the original Ego4D dataset. The team gave cameras to landscapers, woodworkers, mechanics, artists, dog walkers, painters, contractors and other workers. Participants near CMU's campus in Rwanda wore cameras to record cleaning, cooking, washing dishes, gardening and other tasks.

More information about Ego-Exo4D is available in this blog post from Meta.

For More Information

Aaron Aupperlee | 412-268-9068 | aaupperlee@cmu.edu