Final Project

Overlapping Audio Features Determine Emotion and Gender Classification in Speech

Emily Graber

Abstract

Cochlear implant users often have difficulty distinguishing the gender and emotion of the person they are talking to on the phone, where no visual cues are present. They feel a lack of control that builds over time. Doctors and cochlear implant manufacturers would like to find a solution to assist their patients and customers. Part of the trouble is that the microchips in the implants can process only very basic features of the audio, whereas the features that encode emotion and gender cues may not be basic. Here we try to understand what might be going wrong on the audio chips. As a first step, we take a computational approach to analyzing several audio files that contain sentences expressed with happiness and sadness by male and female speakers. The most salient pitch and amplitude were extracted from each file. When only these two features are used to sort the sentences into four categories (happy female, happy male, sad female, and sad male), the sentences are not correctly classified. This suggests that more than pitch and amplitude may need to be processed in order to accurately classify gender and emotion. Further development of audio processing algorithms that capture more than pitch and amplitude would likely help untangle the differences in vocal samples and aid cochlear implant users.

Hypothesis

  • The basic audio features of pitch and amplitude are not sufficient to accurately classify audio samples by speaker gender and emotion

Methods

Data

Audio samples containing happy and sad speech from male and female speakers were downloaded from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The full database contains 7356 files from 24 professional actors (12 female, 12 male), each speaking two sentences with a number of different emotions and intensities. Only the happy and sad data from the actors were analyzed. These emotions were selected because they are typically easy for normal-hearing subjects to identify without any visual cues. For cochlear implant users, identification based on audio alone is much more challenging.

Preprocessing Steps

The WAV files from the dataset were downloaded, and audio transformations were applied to extract the most salient amplitude and pitch from each file. One audio file is shown below along with the transformations that facilitated the pitch and amplitude extraction process.

The most salient amplitude was read off of the raw audio waveform (Figure 1) while the most salient pitch was read off of a truncated spectral representation of the audio (Figure 3).
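The extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used for the project: the function name `extract_features`, the 1000 Hz cutoff used to truncate the spectrum, and the use of the largest spectral peak as the "most salient pitch" are all assumptions, and a synthetic tone stands in for a RAVDESS recording.

```python
import numpy as np

def extract_features(signal, sample_rate, max_freq_hz=1000.0):
    """Return (max absolute amplitude, peak frequency in Hz at or below max_freq_hz).

    max_freq_hz is an assumed cutoff for the truncated spectrum.
    """
    signal = np.asarray(signal, dtype=float)

    # Most salient amplitude: largest absolute sample value in the waveform.
    max_abs_amplitude = float(np.max(np.abs(signal)))

    # Magnitude spectrum of the real-valued signal and its frequency axis.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)

    # Truncate the spectrum to the band of interest and skip the DC bin.
    band = (freqs > 0) & (freqs <= max_freq_hz)
    peak_freq = float(freqs[band][np.argmax(spectrum[band])])
    return max_abs_amplitude, peak_freq

# Synthetic stand-in for a recording: a 220 Hz tone plus a weaker 660 Hz overtone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate  # one second of samples
tone = 0.5 * np.sin(2 * np.pi * 220 * t) + 0.1 * np.sin(2 * np.pi * 660 * t)

amplitude, pitch = extract_features(tone, sample_rate)
print(amplitude, pitch)
```

For real speech, the strongest spectral peak is not always the fundamental frequency of the voice, which is one reason a single "most salient pitch" value may be a weak feature.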

File 03.wav

The extracted pitch and amplitude data were stored in a new dataset. The original speaker gender and emotion were also recorded in the new dataset, which was then processed further.

    filenum  max_abs_amplitude  most_common_freq  vocal_gender  vocal_emotion
0         1             0.1270               128             0              1
1         2             0.2402               257             1              1
2         3             0.5960               281             0              1
3         4             0.2620               281             1              1
4         5             0.1550               234             0              1
5         8             0.6000               281             1              1
6         9             0.2000               468             0              1
7        15             0.1000               234             0              0
8        16             0.1900               234             1              0
9        17             0.3240               140             0              0
10       18             0.0420               656             1              0
11       19             0.0250               164             0              0
12       20             0.1430               563             1              0
13       21             0.1600               257             0              0
14       22             0.0680               210             1              0
15       23             0.0570               164             0              0
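As a possible starting point for summarizing this table, the sketch below rebuilds it as a pandas DataFrame, combines the gender and emotion codes into one four-level category, and computes the mean and standard deviation of each feature per category. The variable names and the `category` labels are assumptions made for illustration. The labels keep the numeric codes because the source does not state which code corresponds to which gender or emotion.

```python
import pandas as pd

# Values copied from the table above.
features = pd.DataFrame({
    "filenum": [1, 2, 3, 4, 5, 8, 9, 15, 16, 17, 18, 19, 20, 21, 22, 23],
    "max_abs_amplitude": [0.1270, 0.2402, 0.5960, 0.2620, 0.1550, 0.6000, 0.2000, 0.1000,
                          0.1900, 0.3240, 0.0420, 0.0250, 0.1430, 0.1600, 0.0680, 0.0570],
    "most_common_freq": [128, 257, 281, 281, 234, 281, 468, 234,
                         234, 140, 656, 164, 563, 257, 210, 164],
    "vocal_gender": [0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "vocal_emotion": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})

# Combine the two binary codes into one label per row (four possible categories).
features["category"] = (
    "gender" + features["vocal_gender"].astype(str)
    + "_emotion" + features["vocal_emotion"].astype(str)
)

# Mean, standard deviation, and sample count of each feature within each category.
summary = features.groupby("category")[["max_abs_amplitude", "most_common_freq"]].agg(
    ["mean", "std", "count"]
)
print(summary)
```

With only three to five samples per category, the standard deviations are rough estimates, so any overlap between categories should be read with caution.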

Processing Steps

  • TODO: describe major steps that you do to make sense of the data. Be specific enough so that someone could replicate your work in a different programming language. This means that you should not dictate what functions to use, rather the concepts to use.
  • Ideally this will be a story with multiple steps, e.g. raw data inspection, then filtering, then averaging, then comparing
  • Optional: Don’t forget about the possibility of doing correlation, linear regression, k-means clustering, or hypothesis testing
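One way the hypothesis could be tested with the two extracted features is a leave-one-out classification. The sketch below uses a nearest-centroid rule on standardized features. This classifier is an illustrative choice, not necessarily the method used in this project, and the function name `loo_nearest_centroid` and the label encoding are assumptions.

```python
import numpy as np

# Columns: max_abs_amplitude, most_common_freq, vocal_gender, vocal_emotion
# (values copied from the new dataset shown in the Preprocessing Steps section).
data = np.array([
    [0.1270, 128, 0, 1], [0.2402, 257, 1, 1], [0.5960, 281, 0, 1], [0.2620, 281, 1, 1],
    [0.1550, 234, 0, 1], [0.6000, 281, 1, 1], [0.2000, 468, 0, 1], [0.1000, 234, 0, 0],
    [0.1900, 234, 1, 0], [0.3240, 140, 0, 0], [0.0420, 656, 1, 0], [0.0250, 164, 0, 0],
    [0.1430, 563, 1, 0], [0.1600, 257, 0, 0], [0.0680, 210, 1, 0], [0.0570, 164, 0, 0],
])
X = data[:, :2]
# Encode the four gender/emotion combinations as integers 0..3.
labels = (data[:, 2] * 2 + data[:, 3]).astype(int)

def loo_nearest_centroid(X, labels):
    """Predict each sample's class from centroids fitted on all other samples."""
    predictions = np.empty(len(X), dtype=int)
    for i in range(len(X)):
        train = np.arange(len(X)) != i
        # Standardize with training statistics only, so the held-out sample stays unseen.
        mean = X[train].mean(axis=0)
        std = X[train].std(axis=0)
        z_train = (X[train] - mean) / std
        z_test = (X[i] - mean) / std
        classes = np.unique(labels[train])
        centroids = np.array([z_train[labels[train] == c].mean(axis=0) for c in classes])
        # Assign the held-out sample to the class with the closest centroid.
        distances = np.linalg.norm(centroids - z_test, axis=1)
        predictions[i] = classes[np.argmin(distances)]
    return predictions

predictions = loo_nearest_centroid(X, labels)
accuracy = float(np.mean(predictions == labels))
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

With four categories, random guessing would be correct about a quarter of the time, so an accuracy near that level would be consistent with the hypothesis that pitch and amplitude alone are not sufficient.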

Purpose

  • TODO: Explain why doing the steps above will answer your question

Results and Discussion

  • TODO: visualize as many of the previously mentioned steps as possible. Label everything. Break up this markdown cell as needed with code cells that generate plots. Use Python to generate plots
  • TODO: state the mean and standard deviation in text for each major step that you take. Use Python to compute those statistics
  • TODO: for each plot, either right before or right after, describe the plot without interpretation. Just describe what you see, any relevant colors or categories, etc.
  • TODO: after you describe the plot, interpret it. What does it mean and how does it connect back to your question?
  • TODO: before going to the next step, say why you are going to the next step, and how it connects back to your question or logically follows from the previous step.

Conclusions

  • TODO: repeat your main findings and why they matter
  • TODO: state the answer to your question
  • TODO: state any limitations or problems with the current results
  • TODO: overall, state how your result informs or helps others

Reference

  • TODO: cite one scholarly article that relates to your topic/methods/results

Appendix 1

  • This section documents your technical knowledge that is not part of your research story
  • TODO: describe the key features of your data, the type, the dimensions, and how you know the type and dimension
  • TODO: document the prewritten Python libraries that were used.
  • TODO: describe the key technical elements in your data visualizations, such as the type of plots, settings for the axes, symbol choices, color choices, legend usage, etc.

Appendix 2

  • TODO: explain logical filtering using an example from this course
  • TODO: explain standard deviation using an example from this course

New Data generated from original Data

Source: Overlapping Audio Features Determine Emotion and Gender Classification in Speech