
ESC-50: Dataset for Environmental Sound Classification

Overview | Download | Results | Repository content | License | Citing | Caveats | Changelog

\"\"/
\"\"/
\"Download\"/
\"ESC-50

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

The dataset consists of 5-second-long recordings organized into 50 semantic classes (with 40 examples per class) loosely arranged into 5 major categories:

| Animals | Natural soundscapes & water sounds | Human, non-speech sounds | Interior/domestic sounds | Exterior/urban noises |
| --- | --- | --- | --- | --- |
| Dog | Rain | Crying baby | Door knock | Helicopter |
| Rooster | Sea waves | Sneezing | Mouse click | Chainsaw |
| Pig | Crackling fire | Clapping | Keyboard typing | Siren |
| Cow | Crickets | Breathing | Door, wood creaks | Car horn |
| Frog | Chirping birds | Coughing | Can opening | Engine |
| Cat | Water drops | Footsteps | Washing machine | Train |
| Hen | Wind | Laughing | Vacuum cleaner | Church bells |
| Insects (flying) | Pouring water | Brushing teeth | Clock alarm | Airplane |
| Sheep | Toilet flush | Snoring | Clock tick | Fireworks |
| Crow | Thunderstorm | Drinking, sipping | Glass breaking | Hand saw |

Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project. The dataset has been prearranged into 5 folds for comparable cross-validation, making sure that fragments from the same original source file are contained in a single fold.
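The prearranged folds make the evaluation protocol simple: train on four folds, test on the held-out fifth, and rotate. A minimal sketch of that split logic (the short clip list here is illustrative only; the real fold assignment comes from meta/esc50.csv or the leading digit of each filename):

```python
# Illustrative file list: (filename, fold). Note that fragments cut from
# the same original Freesound clip (e.g. 100648) share a single fold.
clips = [
    ("1-100032-A-0.wav", 1),
    ("2-100648-A-43.wav", 2),
    ("2-100648-B-43.wav", 2),
    ("3-101296-A-19.wav", 3),
    ("4-102844-A-9.wav", 4),
    ("5-103416-A-2.wav", 5),
]

def fold_splits(clips, n_folds=5):
    """Yield (held-out fold, train files, test files) for each CV split."""
    for held_out in range(1, n_folds + 1):
        train = [name for name, fold in clips if fold != held_out]
        test = [name for name, fold in clips if fold == held_out]
        yield held_out, train, test

for held_out, train, test in fold_splits(clips):
    print(f"fold {held_out}: {len(train)} train / {len(test)} test")
```

Averaging accuracy over the five held-out folds is what makes results in the table below comparable across papers.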

A more thorough description of the dataset is available in the original paper with some supplementary materials on GitHub: ESC: Dataset for Environmental Sound Classification – paper replication data.

Download

The dataset can be downloaded as a single .zip file (~600 MB):

Download ESC-50 dataset

Results

Numerous machine learning & signal processing approaches have been evaluated on the ESC-50 dataset. Most of them are listed here. If you know of another reference, you can message me or open a Pull Request directly.

Terms used in the table:

• CNN – Convolutional Neural Network
• CRNN – Convolutional Recurrent Neural Network
• GMM – Gaussian Mixture Model
• GTCC – Gammatone Cepstral Coefficients
• GTSC – Gammatone Spectral Coefficients
• k-NN – k-Nearest Neighbors
• MFCC – Mel-Frequency Cepstral Coefficients
• MLP – Multi-Layer Perceptron
• RBM – Restricted Boltzmann Machine
• RNN – Recurrent Neural Network
• SVM – Support Vector Machine
• TEO – Teager Energy Operator
• ZCR – Zero-Crossing Rate

| Title | Notes | Accuracy | Paper | Code |
| --- | --- | --- | --- | --- |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC and mel energies | 86.50% | sailor2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation + Between-Class learning | 84.90% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs), fusion with Mel energies | 84.15% | tak2017 | |
| Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes | CNN pretrained on AudioSet | 83.50% | kumar2017 | :scroll: |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC | 83.00% | sailor2017 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of GTSC & TEO-GTSC with CNN | 81.95% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + Between-Class learning | 81.80% | tokozume2017b | |
| :headphones: Human accuracy | Crowdsourcing experiment in classifying ESC-50 by human listeners | 81.30% | piczak2015a | :scroll: |
| Objects that Sound | Look, Listen and Learn (L3) network (arandjelovic2017a) with stride 2, larger batches and learning rate schedule | 79.80% | arandjelovic2017b | |
| Look, Listen and Learn | 8-layer convolutional subnetwork pretrained on an audio-visual correspondence task | 79.30% | arandjelovic2017a | |
| Learning Environmental Sounds with Multi-scale Convolutional Neural Network | Multi-scale convolutions with feature fusion (waveform + spectrogram) | 79.10% | zhu2018 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | GTSC with CNN | 79.10% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation | 78.80% | tokozume2017b | |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM | 78.45% | sailor2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization + Between-Class learning | 76.90% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTSC with CNN | 74.85% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) | 74.40% | tokozume2017b | |
| Soundnet: Learning sound representations from unlabeled video | 8-layer CNN (raw audio) with transfer learning from unlabeled videos | 74.20% | aytar2016 | :scroll: |
| Learning from Between-class Examples for Deep Sound Recognition | 18-layer CNN on raw waveforms (dai2016) + Between-Class learning | 73.30% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs) | 73.25% | tak2017 | |
| Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, GoogLeNet on spectrograms (40 ms frame length) | 73.20% | boddapati2017 | :scroll: |
| Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization | 72.40% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of MFCC & TEO-GTCC with GMM | 72.25% | agrawal2017 | |
| Learning environmental sounds with end-to-end convolutional neural network (EnvNet) | Combination of spectrogram and raw waveform CNN | 71.00% | tokozume2017a | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTCC with GMM | 68.85% | agrawal2017 | |
| Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 68.70% | boddapati2017 | :scroll: |
| Very Deep Convolutional Neural Networks for Raw Waveforms | 18-layer CNN on raw waveforms | 68.50% | dai2016, tokozume2017b | :scroll: |
| Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, GoogLeNet on spectrograms (30 ms frame length) | 67.80% | boddapati2017 | :scroll: |
| WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 100x model compression | 66.25% | jin2017 | |
| Soundnet: Learning sound representations from unlabeled video | 5-layer CNN (raw audio) with transfer learning from unlabeled videos | 66.10% | aytar2016 | :scroll: |
| WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 180x model compression | 65.80% | jin2017 | |
| Soundnet: Learning sound representations from unlabeled video | 5-layer CNN trained on raw audio of ESC-50 only | 65.00% | aytar2016 | :scroll: |
| :bar_chart: Environmental Sound Classification with Convolutional Neural Networks (CNN baseline) | CNN with 2 convolutional and 2 fully-connected layers, mel-spectrograms as input, vertical filters in the first layer | 64.50% | piczak2015b | :scroll: |
| auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks | MLP classifier on features extracted with an RNN autoencoder | 64.30% | freitag2017 | :scroll: |
| Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 63.20% | boddapati2017 | :scroll: |
| Classifying environmental sounds using image recognition networks | CRNN | 60.30% | boddapati2017 | :scroll: |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 56.37% | huzaifah2017 | |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with square filters on wideband mel-STFT (median accuracy) | 54.00% | huzaifah2017 | |
| Soundnet: Learning sound representations from unlabeled video | 8-layer CNN trained on raw audio of ESC-50 only | 51.10% | aytar2016 | :scroll: |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with square filters on wideband mel-STFT (median accuracy) | 50.87% | huzaifah2017 | |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 46.25% | huzaifah2017 | |
| :bar_chart: Baseline – random forest | Baseline ML approach (MFCC & ZCR + random forest) | 44.30% | piczak2015a | :scroll: |
| Soundnet: Learning sound representations from unlabeled video | Convolutional autoencoder trained on unlabeled videos | 39.90% | aytar2016 | :scroll: |
| :bar_chart: Baseline – SVM | Baseline ML approach (MFCC & ZCR + SVM) | 39.60% | piczak2015a | :scroll: |
| :bar_chart: Baseline – k-NN | Baseline ML approach (MFCC & ZCR + k-NN) | 32.20% | piczak2015a | :scroll: |
| A mixture model-based real-time audio sources classification method | Dictionary of sound models used for classification (accuracy is computed on segments instead of files) | 94.00% | baelde2017 | |
| NELS – Never-Ending Learner of Sounds | Large-scale audio crawling with classifiers trained on AED datasets (including ESC-50) | N/A | elizalde2017 | :scroll: |
| Utilizing Domain Knowledge in End-to-End Audio Processing | End-to-end CNN with learned mel-spectrogram transformation | N/A | tax2017 | :scroll: |
| Deep Neural Network based learning and transferring mid-level audio features for acoustic scene classification | Transfer learning from various datasets, including ESC-50 | N/A | mun2017 | |
| Features and Kernels for Audio Event Recognition | MFCC, GMM, SVM | N/A | kumar2016b | |
| A real-time environmental sound recognition system for the Android OS | Real-time sound recognition for Android evaluated on ESC-10 | N/A | pillos2016 | |
| Comparing Time and Frequency Domain for Audio Event Recognition Using Deep Learning | Discriminatory effectiveness of different signal representations compared on ESC-10 and Freiburg-106 | N/A | hertel2016 | |
| Audio Event and Scene Recognition: A Unified Approach using Strongly and Weakly Labeled Data | Combination of weakly labeled data (YouTube) with strong labeling (ESC-10) for Acoustic Event Detection | N/A | kumar2016a | |

Repository content

  • audio/*.wav – 2000 audio recordings in WAV format (5 seconds, 44.1 kHz, mono) with the following naming convention: {FOLD}-{CLIP_ID}-{TAKE}-{TARGET}.wav
    • {FOLD} – index of the cross-validation fold,
    • {CLIP_ID} – ID of the original Freesound clip,
    • {TAKE} – letter disambiguating between different fragments from the same Freesound clip,
    • {TARGET} – class in numeric format [0, 49].
  • meta/esc50.csv – CSV file with the following columns: filename, fold, target, category, esc10, src_file, take. The esc10 column indicates whether a given file belongs to the ESC-10 subset (10 selected classes, CC BY license).
  • meta/esc50-human.xlsx – additional data pertaining to the crowdsourcing experiment (human classification accuracy).
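Given the naming convention above, each filename can be split back into its metadata fields without consulting the CSV. A minimal sketch (the helper name is hypothetical):

```python
def parse_esc50_name(filename):
    """Split an ESC-50 filename into its four metadata fields.

    Follows the {FOLD}-{CLIP_ID}-{TAKE}-{TARGET}.wav convention, e.g.
    "1-100032-A-0.wav" -> fold 1, Freesound clip 100032, take A, class 0.
    """
    stem = filename.rsplit(".", 1)[0]              # drop the .wav extension
    fold, clip_id, take, target = stem.split("-")  # four dash-separated fields
    return {"fold": int(fold), "clip_id": int(clip_id),
            "take": take, "target": int(target)}

print(parse_esc50_name("1-100032-A-0.wav"))
```

In practice meta/esc50.csv already provides the same fields per file, so parsing filenames is mainly useful when working from the audio/ directory alone.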

License

The dataset is available under the terms of the Creative Commons Attribution Non-Commercial license.

A smaller subset (clips tagged as ESC-10) is distributed under CC BY (Attribution).

Attributions for each clip are available in the LICENSE file.

Citing

\"Download

If you find this dataset useful in an academic setting please cite:

K. J. Piczak. ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd Annual ACM Conference on Multimedia, Brisbane, Australia, 2015.

[DOI: http://dx.doi.org/10.1145/2733373.2806390]

@inproceedings{piczak2015dataset,
  title = {{ESC}: {Dataset} for {Environmental Sound Classification}},
  author = {Piczak, Karol J.},
  booktitle = {Proceedings of the 23rd {Annual ACM Conference} on {Multimedia}},
  date = {2015-10-13},
  url = {http://dl.acm.org/citation.cfm?doid=2733373.2806390},
  doi = {10.1145/2733373.2806390},
  location = {{Brisbane, Australia}},
  isbn = {978-1-4503-3459-4},
  publisher = {{ACM Press}},
  pages = {1015--1018}
}

Caveats

Please be aware of potential information leakage while training models on ESC-50, as some of the original Freesound recordings were already preprocessed in a manner that might be class dependent (mostly bandlimiting). Unfortunately, this issue went unnoticed when creating the original version of the dataset. Due to the number of methods already evaluated on ESC-50, no changes rectifying this issue will be made in order to preserve comparability.

Changelog

v2.0.0 (2017-12-13)

• Change to WAV version as default.

v2.0.0-pre (2016-10-10) (wav-files branch)

• Replace OGG recordings with cropped WAV files for easier loading and frame-level precision (some of the OGG recordings had a slightly different length when loaded).
• Move recordings to a one directory structure with a meta CSV file.

v1.0.0 (2015-04-15)

• Initial version of the dataset (OGG format).