Ego4DSounds

About

Ego4DSounds is a subset of Ego4D, an existing large-scale egocentric video dataset. Ego4DSounds contains video clips spanning hundreds of different scenes and actions. The videos have high action-audio correspondence, which makes Ego4DSounds a high-quality dataset for action-to-sound generation. Clips have time-stamped narrations describing the actions performed by the camera-wearer.

3,000+

scenarios

Ego4DSounds contains scenes spanning a large array of daily activities, including cooking, cleaning, shopping, socializing, and more.

1.2 Million

video clips

All video clips capture camera-wearers performing various actions with high audio-visual correspondence.

950+

hours

Ego4DSounds contains a large quantity of in-the-wild action clips from all over the world, along with text narrations.