CLARA: ADVANCING MACHINES IN UNDERSTANDING SPEECH NUANCES

by: Knoriy, Christoph, Robert, 16 Oct, 2023


Voices not only carry words but also convey emotion, emphasis, and nuance through qualities such as tone and accent. Yet existing speech technology comprehends these components of human speech only partially.

Introducing CLARA (Multilingual Contrastive Learning for Audio Representation Acquisition), a project designed to improve machines' understanding of the implicit aspects of speech. The project aims to develop a comprehensive pre-trained model dedicated to auditory communication.

Ready to Make a Difference?
Join our Open Empathic Discord server!
Explore our CLARA project on GitHub and its interactive Demo Notebook. The corresponding research article is available on arXiv.

The Driving Force Behind CLARA

Building multilingual models from speech data poses significant challenges. We use contrastive self-supervised learning on large, unlabelled datasets to derive representations that transfer across languages.

We've all encountered stilted and unempathetic interactions with automated systems oblivious to our emotions and expressions. CLARA seeks to transform this by empowering machines to discern the full context of a speaker's voice. We aim to facilitate more natural and emotionally intuitive interactions with conversational AI, like virtual assistants and video game characters. The potential applications range from creating empathetic healthcare assistants to immersive gaming experiences.

Our Approach

Employing recent advances in contrastive learning, CLARA is trained on an extensive dataset of unlabelled multilingual speech recordings. It learns to identify linguistic and paralinguistic attributes by contrasting matched and mismatched pairs of examples.
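In a typical setup of this kind, matching audio-text pairs are pulled together in a shared embedding space while mismatched pairs are pushed apart. Below is a minimal PyTorch sketch of such a symmetric contrastive objective; the embedding shapes and temperature value are illustrative assumptions rather than CLARA's exact implementation.

```python
# Minimal sketch of a symmetric (CLIP-style) contrastive objective.
# Encoder outputs and the temperature are illustrative, not CLARA's exact code.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, text_emb: (batch, dim) embeddings of paired clips and transcripts."""
    # L2-normalise so the dot product becomes a cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; entry (i, j) compares clip i with text j.
    logits = audio_emb @ text_emb.t() / temperature

    # Matching pairs sit on the diagonal, so the target class is the row index.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: audio-to-text and text-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```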

We've amassed over 100,000 hours of speech data across numerous languages and applied augmentation strategies, such as adding background noise and altering pitch, to increase the training dataset's diversity. CLARA's architecture pairs an audio encoder with a text encoder, mapping both modalities into a unified representation space, and we are exploring various configurations to optimise performance.
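As an illustration of the waveform-level augmentations mentioned above, the sketch below mixes in background noise at a target signal-to-noise ratio and shifts the pitch by a couple of semitones using torchaudio; the exact augmentation pipeline and parameter ranges used for CLARA may differ.

```python
# Sketch of noise injection and pitch shifting on raw waveforms.
# Parameter choices (SNR, semitone shift) are illustrative only.
import torch
import torchaudio.functional as AF

def augment(waveform: torch.Tensor, sample_rate: int,
            noise: torch.Tensor, snr_db: float = 10.0,
            pitch_steps: int = 2) -> torch.Tensor:
    """waveform, noise: (channels, time) tensors at the same sample rate.
    Assumes the noise clip is at least as long as the waveform."""
    # Mix in background noise at the requested signal-to-noise ratio.
    noise = noise[..., : waveform.size(-1)]
    signal_power = waveform.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    noisy = waveform + scale * noise

    # Shift the pitch by a few semitones without changing the duration.
    return AF.pitch_shift(noisy, sample_rate, n_steps=pitch_steps)
```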

Early Achievements

Preliminary results demonstrate CLARA's capability in multilingual speech representation learning. In initial tests, the model showed competitive or leading performance on emotion recognition benchmarks such as CREMA-D.

For instance, on CREMA-D, a dataset of emotional English speech, CLARA achieved 88.44% accuracy with a linear probe classifier, approaching the best reported accuracy of 95.17% from a fully supervised model trained on labelled data. This result was obtained through self-supervised pre-training on unlabelled data followed by minimal tuning.

CLARA also generalises well to unfamiliar languages: a linear probe for emotion recognition in French, using the EmoV-DB dataset, achieved 75.2% accuracy. Moreover, its transfer-learning results show rapid adaptation to new languages and tasks by building on its foundational representations.
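For readers unfamiliar with the evaluation protocol, a linear probe keeps the pre-trained encoder frozen and fits only a single linear classifier on its embeddings, so the score mostly reflects the quality of the representations themselves. A minimal sketch follows; the embedding arrays are hypothetical placeholders produced by a frozen encoder, not CLARA's released evaluation code.

```python
# Sketch of a linear-probe evaluation: the pre-trained encoder stays frozen
# and only a single linear classifier is fit on its embeddings.
# The embedding and label arrays are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(train_emb: np.ndarray, train_labels: np.ndarray,
                 test_emb: np.ndarray, test_labels: np.ndarray) -> float:
    """Embeddings: (n_samples, dim) arrays produced by the frozen encoder."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_emb, train_labels)
    return accuracy_score(test_labels, probe.predict(test_emb))
```

Because only the final linear layer is trained, strong probe accuracy indicates that emotion-relevant structure is already present in the frozen representations.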

To fully unlock CLARA's potential, we're concentrating on enhancing the diversity and volume of the training data, especially in languages with limited resources. Your contributions to our multilingual datasets will be pivotal in achieving universal speech understanding.

Be Part of the Open Empathic Mission

Your participation can propel this research forward. We're actively seeking contributions to enrich the project, particularly in diverse emotional styles, accents, and languages.

Open Empathic is a non-profit, open-source initiative. If you wish to contribute or collaborate, please connect with us! Together, we can forge ahead in enabling machines to comprehend and interact with us in more human-like ways.

Are you ready to help machines truly comprehend our speech? Let's make it happen together.

Acknowledgement

We would like to thank Stability AI for their generous support in providing the essential compute resources for this project.