Facebook wants AI to find your keys and understand your conversations
- Written by Jumana Abu-Khalaf, Research Fellow in Computing and Security, Edith Cowan University
Facebook has announced a research project that aims to push the “frontier of first-person perception”, and in the process help you remember where you left your keys.
The Ego4D project provides a huge collection of first-person video and related data, plus a set of challenges for researchers to teach computers to understand the data and gather useful information from it.
In September, the social media giant launched a line of “smart glasses” called Ray-Ban Stories, which carry a digital camera and other features. Much like the Google Glass project, which met with mixed reviews in 2013, the new glasses have prompted complaints of privacy invasion.
The Ego4D project aims to develop software that will make smart glasses far more useful, but may in the process enable far greater breaches of privacy.
What is Ego4D?
Facebook describes the heart of the project as:
“a massive-scale, egocentric dataset and benchmark suite collected across 74 worldwide locations and nine countries, with over 3,025 hours of daily-life activity video.”
The “Ego” in Ego4D means egocentric (or “first-person”) video, while “4D” stands for the three dimensions of space plus one more: time. In essence, Ego4D seeks to combine photos, video, geographical information and other data to build a model of the user’s world.
There are two components: a large dataset of first-person photos and videos, and a “benchmark suite” consisting of five challenging tasks that can be used to compare different AI models or algorithms with each other. These benchmarks involve analysing first-person video to remember past events, create diary entries, understand interactions with objects and people, and forecast future events.
The dataset includes more than 3,000 hours of first-person video from 855 participants going about everyday tasks, captured with a variety of devices including GoPro cameras and augmented reality (AR) glasses. The videos cover activities at home, in the workplace, and hundreds of social settings.
What is in the dataset?
Although this is not the first such video dataset to be introduced to the research community, it is 20 times larger than other publicly available datasets of its kind. It includes video, audio, 3D mesh scans of the environment, eye gaze, stereo, and synchronised multi-camera views of the same event.
Most of the recorded footage is unscripted or “in the wild”. The data is also quite diverse as it was collected from 74 locations across nine countries, and those capturing the data have various backgrounds, ages and genders.
What can we do with it?
Commonly, computer vision models are trained and tested on annotated images and videos for a specific task. Facebook argues that current AI datasets and models represent a third-person or a “spectator” view, resulting in limited visual perception. Understanding first-person video will help design robots that better engage with their surroundings.