
A diverse egocentric dataset with high action-audio correspondence


Ego4DSounds is a subset of Ego4D, an existing large-scale egocentric video dataset. It contains video clips spanning hundreds of different scenes and actions. The videos have high action-audio correspondence, making Ego4DSounds a high-quality dataset for action-to-sound generation. Clips have time-stamped narrations describing the actions performed by the camera-wearer.

Ego4DSounds contains scenes spanning a large array of daily activities, including cooking, cleaning, shopping, socializing, and more.
1.2 million video clips
All video clips capture camera-wearers performing various actions with high audio-visual correspondence.
Ego4DSounds contains a large number of in-the-wild action clips from all over the world, each paired with text narrations.


puts the wheat down

drops the spanner on the table

cuts the grapes from the tree

cuts onion with a knife

opens the drawer

puts the glass in the cabinet

irons the trouser

pulls out the first dish rack from the dish dryer

rinses a pan with running water from a tap

vacuum cleans the stair

puts the egg shell in a sink

folds the cloth


Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Project Page


To help you use Ego4DSounds, we provide a GitHub repository containing scripts and metadata.

The metadata entry for each clip contains the video ID, duration, timestamps, and narrations.
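As a rough illustration of how such a metadata entry might be handled in code, the sketch below parses a few rows into typed records. The column names (`video_id`, `duration_s`, `start_s`, `end_s`, `narration`), the CSV layout, and the sample rows are all assumptions made up for this example; the actual files in the repository may use different names and formats.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ClipMetadata:
    # Hypothetical field names; check the repository's metadata for the real ones.
    video_id: str      # ID of the source video
    duration_s: float  # duration in seconds (assumed unit)
    start_s: float     # clip start timestamp in seconds (assumed)
    end_s: float       # clip end timestamp in seconds (assumed)
    narration: str     # narration describing the camera-wearer's action


def load_clips(csv_text: str) -> list[ClipMetadata]:
    """Parse CSV text with the assumed columns into ClipMetadata records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        ClipMetadata(
            video_id=row["video_id"],
            duration_s=float(row["duration_s"]),
            start_s=float(row["start_s"]),
            end_s=float(row["end_s"]),
            narration=row["narration"],
        )
        for row in reader
    ]


# Made-up rows for illustration only; not taken from the dataset.
SAMPLE = """video_id,duration_s,start_s,end_s,narration
example-video-0,600.0,12.0,15.0,cuts onion with a knife
example-video-0,600.0,40.5,43.5,opens the drawer
"""

clips = load_clips(SAMPLE)

# Select clips whose narration mentions a given object.
drawer_clips = [c for c in clips if "drawer" in c.narration]
```

Keeping each entry as a small typed record makes it straightforward to filter clips by narration text or timestamp range before loading any video or audio.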

Ego4DSounds Data