Automatic Affective Virtual Environment Generation
Emotional Machines is an interactive media project that promotes the experience of affective virtual environments created through speech emotion recognition. In response to the limitations of emotion recognition models based on computer vision or electrophysiological activity, whose sources are usually occluded by a head-mounted display, we propose speech as the user interface input. In particular, our system uses two machine learning models to predict three main emotional categories from high-level semantic analysis and low-level acoustic speech features. The predicted emotions are mapped to audiovisual representations by an end-to-end process that encodes emotions in virtual environments. We adopt a generative model of chord progressions, based on the Tonal Interval Space, to transfer speech emotion into music. Images are built using generative adversarial text-to-image synthesis. The generated image is then used as the style image in a style-transfer process onto an equirectangular projection of a spherical panorama selected for each emotional category. The result is an immersive virtual space encapsulating emotions in spheres arranged in a 3D environment. Users can thus create new affective representations or interact with previously encoded instances using the joysticks.
KEYWORDS: Speech emotion recognition, intelligent virtual environments, affective computing, virtual reality, tonal interval space, machine learning.
PROJECT WEBPAGE: www.emotional-machines.com
Jorge Forero. University of Porto, Faculty of Engineering, ITI-LARSyS, INESC TEC.
Gilberto Bernardes. University of Porto, Faculty of Engineering, INESC TEC.
Mónica Mendes. University of Lisbon, Faculty of Arts, ITI-LARSyS.
The intersection between artificial intelligence and virtual environments has been designated as Intelligent Virtual Environments (IVEs) [1]. Interest in developing IVEs has been growing across many fields, including a sustained interest in automatic speech recognition technologies for virtual reality applications [2][3].
Affective systems in VR involve developing at least two components: an emotion detection technique and a virtual environment generator [4].
Our contribution combines machine learning models to predict speech emotion from semantic and acoustic features. Text and speech emotion predictions are mapped onto an audiovisual 3D environment in which spheres encode each user's experience (see Figure 1).
The user interface (UI) presented in Figure 2 comprises a head-mounted display (HMD) and two joysticks. The HMD is equipped with a lens, headphones, and a microphone. The device is connected to a central processing unit (CPU) where speech is processed. The UI also includes a publicly shared stream (on a screen or a projection) showing the user's point of view. When there is no activity, the stream shows a preselected recorded performance.
Figure 3 shows the architecture of our system. It comprises two main modules: the speech emotion recognition system and the affective virtual environment generator.
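As a reading aid, the sketch below outlines the dataflow between the two modules in Python. All names and signatures are illustrative assumptions rather than the project's actual interfaces: `transcribe` stands for the speech-to-text API, `semantic_model` and `acoustic_model` for the two classifiers described below, `fuse` for the merging of their predictions, and `build_scene` for the audiovisual generation stage.

```python
from typing import Callable, Dict


def run_pipeline(audio_path: str,
                 transcribe: Callable[[str], str],
                 semantic_model: Callable[[str], str],
                 acoustic_model: Callable[[str], str],
                 fuse: Callable[[str, str], str],
                 build_scene: Callable[[str, str], Dict]) -> Dict:
    """Module 1 (speech emotion recognition) feeding Module 2 (environment generation)."""
    transcript = transcribe(audio_path)             # speech-to-text
    sentiment = semantic_model(transcript)          # positive / negative / neutral
    acoustic_emotion = acoustic_model(audio_path)   # neutral / happy / angry / sad
    emotion = fuse(sentiment, acoustic_emotion)     # merged emotion category
    return build_scene(emotion, transcript)         # music + stylized panorama
```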
We use Microsoft's speech recognition application programming interface (API) to transcribe speech to text. Through semantic sentiment analysis, the uttered words or sentences are classified as positive, negative, or neutral. The classifier uses the Rocchio algorithm [5], implementing a word-weighting method, and is trained with 100 annotated tweets per class. For the acoustic counterpart, we propose a supervised machine learning model to classify emotions from the recorded audio clips. To train the model, we use three datasets: a spontaneous speech emotion recognition (SER) database, a professionally acted database, and a non-verbal corpus containing annotated audio clips divided into four main emotional categories – neutral, happy, angry, and sad. A combination of acoustic features is extracted to train the model under a multilayer perceptron classifier. The semantic sentiment analysis and the acoustic machine learning models are merged to produce one of four predicted emotion categories. Predictions are then encoded into a 3D virtual environment presented through an HMD.
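For concreteness, the following is a minimal sketch of such an acoustic classifier, assuming MFCC, chroma, and spectral-contrast summary features extracted with librosa and scikit-learn's multilayer perceptron; the exact feature set, network configuration, and fusion rule used in the project are not reproduced here.

```python
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def acoustic_features(path: str) -> np.ndarray:
    """Summarize a clip with per-coefficient means and standard deviations."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    return np.concatenate([np.r_[f.mean(axis=1), f.std(axis=1)]
                           for f in (mfcc, chroma, contrast)])


def train_acoustic_model(paths, labels):
    """Fit an MLP on feature vectors from the annotated training corpora.

    `paths` are audio clip locations, `labels` their emotion annotations
    (neutral / happy / angry / sad).
    """
    X = np.vstack([acoustic_features(p) for p in paths])
    model = make_pipeline(StandardScaler(),
                          MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500))
    model.fit(X, labels)
    return model
```

A trained model labels each saved clip with one of the four categories before its prediction is merged with the semantic polarity.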
To transfer speech emotion to music, we use a computational model that creates chord progressions based on consonance and harmonic dispersion (i.e., distance-to-key) in the Tonal Interval Space [6]. The graphical environment is built using a text-to-image API provided by Scott Ellison Reed from DeepAI. The generated image is scaled and used as the style image in a style-transfer process applied to the texture of an equirectangular projection of a spherical panorama chosen for each emotional category. The goal is to synthesize a texture from a source image while constraining the synthesis to preserve the semantic content of a target image. The artistic result is an immersive virtual space encapsulating emotions in spheres arranged in a 3D world. Users can create new affective representations or interact with previously encoded environments using the joysticks (see video).
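As an illustration of the underlying tonal descriptors, the sketch below computes a simplified Tonal Interval Vector from a 12-bin pitch-class profile and derives the two quantities driving the chord-progression model, consonance and distance-to-key. The DFT weights are set to ones here as a simplification, whereas the published Tonal Interval Space [6] uses empirically derived weights, and the emotion-to-target mapping noted at the end is purely hypothetical.

```python
import numpy as np

# Uniform weights as a simplifying assumption; the Tonal Interval Space [6]
# weights each of the six DFT coefficients empirically.
WEIGHTS = np.ones(6)


def tonal_interval_vector(chroma: np.ndarray) -> np.ndarray:
    """Six complex DFT coefficients of a normalized 12-bin pitch-class profile."""
    c = np.asarray(chroma, dtype=float)
    c = c / c.sum()
    k = np.arange(1, 7)[:, None]   # coefficients 1..6
    n = np.arange(12)[None, :]     # pitch classes 0..11
    return WEIGHTS * (np.exp(-2j * np.pi * k * n / 12) @ c)


def consonance(chroma: np.ndarray) -> float:
    """Larger TIV magnitude ~ more consonant sonority."""
    return float(np.linalg.norm(tonal_interval_vector(chroma)))


def distance_to_key(chord_chroma: np.ndarray, key_chroma: np.ndarray) -> float:
    """Harmonic dispersion: Euclidean distance between chord and key TIVs."""
    return float(np.linalg.norm(tonal_interval_vector(chord_chroma)
                                - tonal_interval_vector(key_chroma)))

# Hypothetical mapping (illustrative only): chord candidates could be ranked by
# how closely their consonance and distance-to-key match per-emotion targets,
# e.g. consonant chords close to the key for "happy", dissonant and distant
# ones for "angry".
```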
Devices | Carried/Required
Computer (minimum requirements) | Carried
Microphone | Carried
VR HMD | Carried
Public streaming (minimum requirements) | Required
The system requires a stable Internet connection to carry out the process.