Ego4DSounds

A diverse egocentric dataset with high action-audio correspondence

About

Ego4DSounds is a subset of Ego4D, an existing large-scale egocentric video dataset. Ego4DSounds contains video clips spanning hundreds of different scenes and actions. The videos have high action-audio correspondence, making Ego4DSounds a high-quality dataset for action-to-sound generation. Clips have time-stamped narrations describing the actions performed by the camera-wearer.

3,000+
scenarios
Ego4DSounds contains scenes spanning a large array of daily activities, including cooking, cleaning, shopping, socializing, and more.
1.2 Million
video clips
All video clips capture camera-wearers performing various actions with high audio-visual correspondence.
950+
hours
Ego4DSounds contains a large quantity of in-the-wild action clips from all over the world, paired with text narrations.

Preview

puts the wheat down

drops the spanner on the table

cuts the grapes from the tree

cuts onion with a knife

opens the drawer

puts the glass in the cabinet

irons the trouser

pulls out the first dish rack from the dish dryer

rinses a pan with running water from a tap

vacuum cleans the stair

puts the egg shell in a sink

folds the cloth

Publications

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Project Page

Download

We provide a GitHub repository containing scripts and metadata for using Ego4DSounds.

The metadata entry for each clip contains the video ID, duration, timestamps, and narrations.
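As a rough illustration of how such an entry might be handled in code, here is a minimal Python sketch. It assumes the metadata is available as CSV text, and the column names (`video_id`, `duration_s`, `start_s`, `end_s`, `narration`) are hypothetical, as is the reading of "duration" as the source video's length; the actual schema is defined by the files in the repository. The sample rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class ClipMetadata:
    # Hypothetical field names; check the repository's metadata for the real schema.
    video_id: str      # ID of the source Ego4D video
    duration_s: float  # duration in seconds (assumed here to be the source video's)
    start_s: float     # clip start timestamp in seconds
    end_s: float       # clip end timestamp in seconds
    narration: str     # text describing the camera-wearer's action


def parse_metadata(csv_text: str) -> List[ClipMetadata]:
    """Parse CSV text with the assumed columns into ClipMetadata records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        ClipMetadata(
            video_id=row["video_id"],
            duration_s=float(row["duration_s"]),
            start_s=float(row["start_s"]),
            end_s=float(row["end_s"]),
            narration=row["narration"],
        )
        for row in reader
    ]


# Made-up rows for illustration only; not real dataset entries.
SAMPLE = """video_id,duration_s,start_s,end_s,narration
example-video-0001,120.0,10.0,13.0,opens the drawer
example-video-0001,120.0,42.5,45.5,folds the cloth
"""

clips = parse_metadata(SAMPLE)
for clip in clips:
    print(clip.video_id, clip.start_s, clip.end_s, clip.narration)
```

Keeping each entry in a small typed record makes it straightforward to filter clips by narration text or by timestamp range before extracting the corresponding video and audio segments.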

Ego4DSounds Data