
Accessible Co-Creativity with AI through Language and Voice Input

Published on Aug 29, 2023


This project introduces a set of tools for humans to co-create music with AI-based architectures. The aim is for human musicians or non-musicians to be able to augment their creative capabilities and guide neural architectures to understand language and musical nuances. This work is akin to learning a new instrument: the human learns how to co-create with the AI system for musical applications. One of the main goals of the project is accessibility. We want people with various abilities and disabilities to be empowered to explore music creation by interacting with machines and using language and/or their voice. The tools we propose can be used for composition, improvisation, and/or performance. 


Language to Music Generation

As a preliminary stage of our work, we presented this project as a tool for intermedia performance during a series of eight telematic theater-music shows that took place at Stanford University in California and unteatru (a theater venue) in Bucharest, Romania, in Fall 2022. The performance, titled Prin pădure cu Maia (In the Forest with Maia), was originally presented as Lost Interferences and premiered at the 2021 Ars Electronica Festival, Kepler’s Gardens in Bucharest. It was conceived by mixed-realities director Alexandru Berceanu in collaboration with composer Constantin Basica. The machine learning implementation was realized by AI researcher Prateek Verma, and it introduced the first tool in our project: language to music generation. We should mention from the beginning that our goal was very different from that of recently developed systems such as Google’s MusicLM or diffusion-based architectures, which aim to generate music in response to text descriptors. In our case, we wanted to customize the music generation to the style of the composer. We also worked with MIDI instead of audio because we wanted to be able to play a Yamaha Disklavier (a MIDI-enabled piano) that was part of the show’s narrative.

We first trained a neural network to understand the contents of a text by learning a vector representation for each word, so that the words of a new text can be matched to the words present in our dataset. We then used the proximity in the latent space of the words from a given new text to retrieve an audio chunk, which acts as a prompt to our music language model (LM) that was trained on improvised music. Thus, words or their synonyms can trigger a melody by a composer that is present in the dataset, and this melody prompts the music LM to start composing. We fed the system twenty one-minute MIDI improvisations made by Basica while thinking of twenty keywords related to loss, the theme of the show (e.g., identity, memory, smell, limb, child).
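The word-matching step can be illustrated with a minimal sketch. It assumes cosine similarity as the proximity measure and uses tiny hand-made vectors; the `EMBEDDINGS` table, the `threshold` value, and the function names are hypothetical and are not taken from the system described here.

```python
import math

# Toy 3-dimensional word vectors (hypothetical); learned embeddings
# would be much larger and come from a trained model.
EMBEDDINGS = {
    "memory": [0.9, 0.1, 0.0],
    "remember": [0.8, 0.2, 0.1],
    "child": [0.1, 0.9, 0.0],
    "kid": [0.2, 0.8, 0.1],
    "smell": [0.0, 0.1, 0.9],
    "table": [0.5, 0.5, 0.5],
}

# A subset of the keywords, each of which would map to one improvisation.
KEYWORDS = ["memory", "child", "smell"]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def match_keywords(text, threshold=0.95):
    """Return, in order, the keyword triggered by each word of the text.

    Words without an embedding, or too far from every keyword, are skipped.
    """
    matches = []
    for word in text.lower().split():
        vec = EMBEDDINGS.get(word.strip(".,!?"))
        if vec is None:
            continue
        best = max(KEYWORDS, key=lambda k: cosine(vec, EMBEDDINGS[k]))
        if cosine(vec, EMBEDDINGS[best]) >= threshold:
            matches.append(best)
    return matches


print(match_keywords("I remember the kid at the table."))
```

In this toy setup, "remember" and "kid" land close enough to "memory" and "child" to trigger them, while "table" is not close to any keyword and is ignored.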

With the system trained and deployed in the performance, we asked audience members at the theater venue in Bucharest to share stories about loss. As they spoke into a microphone, their voice was sent to CCRMA in real time via JackTrip, translated to English live using Google Translate, and then piped into the AI system. The process was fully explained to the spectators, and they could see the recognized text as they were speaking. The system scanned the text for exact matches or synonyms of the twenty keywords and started generating new MIDI melodies based on the styles of improvisation corresponding to the detected words. The result was played by the Disklavier seconds after the end of the story spoken by the audience member and transmitted live back to Bucharest via screens and speakers. The live generated music at Stanford was also accompanied by body movement performed by actor Maia Morgenstern and some of the audience members in Bucharest. The spectators were invited to express emotion with their bodies based on theatrical improvisations on the theme of loss. The purpose of the music improvisation was to facilitate the spectators’ corporeal improvisation. Using AI to build improvisation created a gameful environment, helping to establish the idea that not only highly specialized artists such as Maia Morgenstern or the composer can improvise, but so can the spectators themselves.

An excerpt from one of the shows demonstrating the use of this system can be found here:


Language to Music Generation using MusicVAE

In the second iteration, we wanted to broaden the accessibility of our system as part of professor Patricia Alessandrini’s project Considering Disability in Online Cultural Experiences and the MuseIT project. The idea was to offer everyone with the ability to use language, regardless of their musical experience, a tool that generates music based on their given text. We preserved the idea of a short story as input, but we used the retrieved prompts (direct or indirect word matches) to improvise MIDI music using a multitrack variational autoencoder (VAE) system developed by Google Magenta. MusicVAE can fill the gap between two given MIDI bars by generating interpolations of variable length and temperature.
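The idea of filling the gap between two bars can be sketched in latent space. The example below assumes that each bar has already been encoded into a latent vector and interpolates between the two codes with spherical interpolation, a common choice for VAE latents. It does not call the Magenta library; the function names are hypothetical, and the decoding step, where temperature would apply, is omitted.

```python
import math


def slerp(z0, z1, t):
    """Spherical interpolation between latent vectors z0 and z1 at fraction t."""
    dot = sum(a * b for a, b in zip(z0, z1))
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    s = math.sin(omega)
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]


def interpolation_path(z0, z1, num_steps):
    """Latent codes for num_steps bars (num_steps >= 2), endpoints included."""
    return [slerp(z0, z1, i / (num_steps - 1)) for i in range(num_steps)]


# Three latent codes: the first bar, a midpoint, and the second bar.
for z in interpolation_path([1.0, 0.0], [0.0, 1.0], 3):
    print([round(v, 3) for v in z])
```

Here `num_steps` plays the role of the interpolation length: a larger value yields more intermediate bars between the two given ones.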

Our system was trained on the MTG-Jamendo open dataset, which contains 56,639 audio tracks with 57 “mood” tags such as wedding, angry, happy, ambient, lullaby, Christmas, dance, excited, sad, epic, and narrative. For any new audio file, the system is able to tag it according to the 57 tags. Moving to the MIDI domain (because the VAE architecture operates only in MIDI), we used 7,888 tracks from the open source Lakh MIDI dataset that are matched and aligned to audio tracks from the Million Song Dataset. Using three-second chunks of audio, our system tagged all tracks with the 57 moods, resulting in a library of identically tagged MIDI tracks. We can now use a tag to rank the 7,888 tracks based on a score given by the AI architecture. Upon receiving a prompt, the system selects the top 5% of audio/MIDI track pairs for the corresponding tag and randomly extracts a segment from one of them. The system then moves to the next detected word and repeats the process. Using the first pair of words, we feed MusicVAE two bars of corresponding music, which triggers the MIDI interpolation between them. The detection of each following word creates a new pair with the previous word, so that the MIDI interpolations cascade into each other continuously.

We envision multiple ways in which the user may choose to use this music co-created with the AI system: as a composition tool, it can be used to create material for new music, or it may inspire new ideas for melodies, rhythms, etc.; as a performance tool, it can be used as a standalone music generation system, or it could be used in interdisciplinary contexts (e.g., poets performing with musicians); as an improvisational tool, it may be used by people who cannot play an instrument, or to generate unexpected layers of music, etc. The advantage of MIDI is that it can be used with any synthesizer, VST plugin, MIDI-enabled piano, etc. In our case, we have used the newly generated MIDI material as an input to IRCAM’s Somax2 application to influence live improvisations.
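The selection and pairing steps described above (ranking tracks by tag score, keeping the top 5%, extracting a random segment, and chaining consecutive words into pairs) can be sketched as follows. The data layout, the function names, and the fixed track length of 32 bars are assumptions made for illustration only.

```python
import random


def top_percent(scores, percent=5):
    """Return the track ids in the top `percent` of scores (at least one)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, len(ranked) * percent // 100)
    return ranked[:keep]


def pick_segment(tag_scores, tag, rng, track_bars=32, segment_bars=1):
    """Pick a random segment from one of the top-ranked tracks for a tag.

    Returns (track_id, start_bar). track_bars is a hypothetical fixed
    track length; real tracks would each have their own length.
    """
    track = rng.choice(top_percent(tag_scores[tag]))
    start = rng.randrange(track_bars - segment_bars + 1)
    return track, start


def cascade_pairs(words):
    """Pair each detected word with the next one: (a, b), (b, c), ..."""
    return list(zip(words, words[1:]))


# Hypothetical scores for 100 tracks under a single mood tag.
tag_scores = {"sad": {f"track{i}": i / 100 for i in range(100)}}
rng = random.Random(0)
print(top_percent(tag_scores["sad"]))
print(pick_segment(tag_scores, "sad", rng))
print(cascade_pairs(["sad", "epic", "happy"]))
```

Each pair returned by `cascade_pairs` would correspond to one interpolation, and because consecutive pairs share a word, the end of one interpolation is the start of the next.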

The following link contains two examples of generated MIDI interpolations based on text prompts. It also includes a demonstration of using these MIDI interpolations to further improvise music with audio using Somax2.

As a next step, we want to allow users to add their own tracks to the library so that the music interpolations can be tailored to their musical preferences or even to their personal compositions. The issue that we are facing is that our system requires tagging based on audio tracks that are matched to analogous MIDI tracks. Converting audio to MIDI would be the obvious solution, but the accuracy of current conversion algorithms for multitrack MIDI is unsatisfactory. We are also planning to expand our system to accommodate languages other than English.

Voice to Music Generation using MusicVAE

Expanding on the previous tool, we further considered its accessibility and implemented a different mode of employing the human voice. Instead of words, we invite people to sing or hum to drive the MusicVAE generation. Live audio input from a user is converted into MIDI using a model based on Tiago Fernandes Tavares’s monophonic audio to MIDI converter. Segments of the audio recording are converted to MIDI bars, which are then used to trigger the MIDI interpolations in the same manner as described above. The user may choose which bars are used for interpolation, or they may allow the system to select them randomly. As in the previous tool based on language, the user can adjust the MusicVAE interpolations by setting the number of bars and the temperature.
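One part of any monophonic audio to MIDI conversion is mapping detected pitch to note numbers. The sketch below uses the standard equal-temperament mapping (A4 = 440 Hz = MIDI note 69) and merges consecutive frames of equal pitch into notes. It assumes that a pitch tracker has already produced one fundamental-frequency estimate per frame; it is not the converter named above, and the function names are hypothetical.

```python
import math


def hz_to_midi(freq_hz):
    """Nearest MIDI note number for a frequency, with A4 = 440 Hz = note 69."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


def frames_to_notes(f0_frames):
    """Merge consecutive frames of equal pitch into notes.

    f0_frames holds one frequency estimate in Hz per analysis frame;
    None or non-positive values are treated as silence.
    Returns a list of (midi_note, start_frame, length_in_frames).
    """
    notes = []
    current, start = None, 0
    # A trailing None acts as a sentinel that flushes the last note.
    for i, f0 in enumerate(list(f0_frames) + [None]):
        pitch = hz_to_midi(f0) if f0 is not None and f0 > 0 else None
        if pitch != current:
            if current is not None:
                notes.append((current, start, i - start))
            current, start = pitch, i
    return notes


# A slightly wavering hummed A4, a silent frame, then middle C.
print(frames_to_notes([440.0, 440.0, 442.0, None, 261.63, 261.63]))
```

Rounding to the nearest note absorbs small pitch fluctuations in a hummed tone, so the wavering A4 above becomes a single note rather than several.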

In March 2023, during the event Centering Disability in Online Musical Experiences organized by Patricia Alessandrini at CCRMA, we had the opportunity to collaborate with Alexander Brotzman, a semi-verbal participant with Profound Autism, who performed with and provided feedback on this system. His father Stephen Prutsman, a pianist and composer, facilitated and also participated in our sessions. Despite the participant's limited use of communication through speech, he was able to perform with the system by using his voice to create humming sounds. Our system recorded several seconds of his voice and generated MIDI music that was played back instantaneously on a Yamaha Disklavier. We also played back the audio recording in a loop to elicit more sounds from the participant, whose father joined in with improvisation at the piano. An excerpt of this demonstration can be found here:

Voice to Music Retrieval

The last system in this series of accessible co-creative tools builds, again, on the previous one by using voice as input, but takes a different approach to the output by replacing music generation with retrieval. Extending a query-by-humming system, we allow any vocal sound (singing, humming, speaking, vocal noises) to be matched to a congruous musical segment.

This system is not limited to the human voice and can accept any audio as input, but for the purpose of this paper we limit our description to the voice. For our first demo, we used a library of 41 hours of piano music by various composers. Retrieval is based on the proximity, in the latent space, of the query audio to the segments of the entire piano music library. The proximity is calculated with a simple Euclidean distance matrix over the latent embeddings, which contain a summary of musical elements such as pitch, rhythm, melody, and timbre; non-musical sounds (silence, pedal noise, clapping, etc.) are ignored. We invite the user to interact with the system by recording their voice and then either selecting the name of a composer from the list or searching the entire database. The system then finds the closest match to the vocal recording and plays it back.
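The retrieval step amounts to a nearest-neighbor search under Euclidean distance, optionally restricted to one composer. The sketch below assumes that the query and every library segment have already been embedded; the two-dimensional vectors, the composer labels, and the segment names are made up for illustration.

```python
import math

# Hypothetical library: each entry is one embedded segment of piano music.
LIBRARY = [
    {"composer": "Composer A", "segment": "a_001", "embedding": [0.0, 0.0]},
    {"composer": "Composer A", "segment": "a_002", "embedding": [1.0, 1.0]},
    {"composer": "Composer B", "segment": "b_001", "embedding": [0.2, 0.1]},
]


def retrieve(query, library, composer=None):
    """Return the library entry whose embedding is closest to the query.

    If composer is given, only that composer's segments are searched.
    """
    candidates = [
        entry for entry in library
        if composer is None or entry["composer"] == composer
    ]
    if not candidates:
        raise ValueError("no segments match the requested composer")
    return min(candidates, key=lambda e: math.dist(query, e["embedding"]))


query = [0.25, 0.1]  # stand-in for the embedding of a vocal recording
print(retrieve(query, LIBRARY)["segment"])
print(retrieve(query, LIBRARY, composer="Composer A")["segment"])
```

A linear scan like this is enough for a small demo; a large library would typically need an indexed nearest-neighbor search instead.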

At the event mentioned above, our Autistic participant helped us test this system as well. His utterances were recorded and, with the help of his father, we selected names of composers as a filter for the music retrieval. The exciting part was to hear how some of the piano segments clearly matched the vocal recording, while others indicated only a partial association with it (e.g., rhythmic structure with no pitch correspondence). While the former can be more compelling in some cases, the latter suggests exciting possibilities of employing this system in performances where the machine understands and reflects the human voice in subtle ways, and invites the user to explore music with their voice. An excerpt of this demonstration can be found here:

In its current state, the system only records four seconds of audio and plays back the matched segment of music. In a future version, we want to develop this into a live system that accepts audio input of variable length and offers a simultaneous recording/retrieval mode for continuous music playback. We also want to expand the library of music and add a feature that allows the user to select the type of instrument they want their voice to be matched to. Moreover, the goal for this system in the near future is to synthesize and generate original output in the style of the matched selection instead of simply playing it back, thus turning it into a true tool for co-creation.


Composers, improvisers, and musicians in general have become increasingly interested in co-creating with AI agents. Music improvisation by machines is not a novel topic. There are many cases of algorithmic systems that predate machine learning and allow computers to respond to musicians in real time: for instance, George Lewis’s Voyager, an interactive music software, or IRCAM’s Somax2, an application for musical improvisation and composition based on a generative model. However, with increasingly complex AI models, we can now combine sensory modalities, art disciplines, and performance practices.

We focused on music generation in the MIDI domain because of our project’s initial need to drive a self-playing piano. In subsequent iterations we continued using MIDI as audio-based music generation was not feasible under considerable resource constraints. We also found MIDI to be more flexible in terms of quick prototyping. However, we realize that, while MIDI is a good protocol for skilled musicians who may want to have precise control over the sound, it is also prohibitive to other users since it requires extra steps to achieve satisfactory audio output. We are planning to address this issue in future work to make our system available to even more people.

Our accessible tools for human-machine co-creativity probe the possibilities of bridging language, vocalization, and music using AI architectures. In live performances, we observed that participants, irrespective of their musical experience, were excited to be able to create and play complex music solely by using their voice. Our primary takeaway from this work is the potential of Artificial Intelligence tools to democratize music generation. AI-based co-creative systems can offer anyone—musicians and non-musicians, disabled and non-disabled—the opportunity to express their musical creativity. We also learned that AI architectures are incredibly powerful tools that can be molded into a variety of setups and inputs. We hope that our tools will be used by people with any background to spark new ideas and to aid music creation.


This project has received financial and institutional support from Stanford Human-Centered Artificial Intelligence (HAI); the Stanford Humanities Seed Funding Grant “Considering Disability in Online Cultural Experiences”; the MuseIT Project; the Center for Computer Research in Music and Acoustics (CCRMA); the REACH Project; the Institute for Research and Coordination in Acoustics/Music (IRCAM); and the Romanian Administration of the National Cultural Fund (AFCN).
