
iː ɡoʊ weɪ

An artificially unintelligible exploration of voice

Published on Aug 29, 2023

Abstract

This semi-improvised performance explores abstract voice sounds, in the tradition of dadaist phonetic poetry, following the trajectory of a performer whose voice and verbality are slowly escaping him. The performance begins from the text of possibly the best-known dadaist sound poem, Kurt Schwitters' 'Die Sonata in Urlauten' (Ursonate), and expands upon this text with additional speech-sounds and fragments of language collaged in collaboration with a GPT-family text prediction model. As the performance progresses, the performer's voice is made increasingly distant from his own biological voice, augmented and transformed by real-time voice 'style transfer' models. The performer begins by performing through a clone of his own voice, which soon assimilates fragments of the voice of Jaap Blonk, one of the most celebrated living sound poets and performers of Ursonate (Blonk 2009). The performer's augmented voice moves onward, dissolving and coalescing into multi-human, polyphonic and choral forms and the 'voices' of non-human mammal and bird species.

The interest of this performance is to celebrate the unraveling of voice as a marker of identity, and to open up the concept of voice to whatever it may need to become.

Performance Setup

This performance uses a controllable system of four 'voice masks', performed through a combination of direct voice-to-voice synthesis by similarity and direct manipulation of the voice-to-voice mappings via a control interface. The system is built in SuperCollider, using RAVE real-time variational autoencoder models (Caillon and Esling 2021) for the core voice-to-voice synthesis. These models run within Victor Shepardson's RAVE-SuperCollider UGen (Shepardson [2022] 2023), which allows multiple models to be loaded simultaneously and provides access to the intermediate latent parameters involved in encoding the input voice and decoding the output voice.
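To make the core voice-to-voice operation concrete, the following is a minimal offline Python sketch of the encode/decode round trip that an exported RAVE model performs. The live system itself runs inside SuperCollider through Shepardson's UGen; the file names here ('voice_mask.ts', 'performer_voice.wav') are hypothetical, and the tensor shapes assume a standard TorchScript RAVE export with encode and decode methods.

```python
# Minimal sketch of the voice-to-voice round trip of an exported RAVE model.
# The performance system itself runs in SuperCollider; this offline example only
# illustrates the underlying encode/decode operations. File names are hypothetical.
import torch
import torchaudio

model = torch.jit.load("voice_mask.ts").eval()          # exported TorchScript RAVE model
voice, sr = torchaudio.load("performer_voice.wav")      # assumed to match the model's sample rate
voice = voice.mean(dim=0, keepdim=True).unsqueeze(0)    # mono, shape (batch=1, channels=1, samples)

with torch.no_grad():
    z = model.encode(voice)        # latent trajectory, shape (1, n_latents, n_frames)
    out = model.decode(z)          # resynthesis in the timbre of the model's training corpus

torchaudio.save("masked_voice.wav", out.squeeze(0), sr)
```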

A lightweight set of SuperCollider class abstractions was created in order to perform fluently with these models on stage. The abstractions facilitate spectral morphing between the different models, as well as between the model output and the raw, unprocessed voice of the performer. The abstractions also facilitate control over scaling and bias values in the latent space of the models, making it possible to modulate the voice-to-voice synthesis in ways ranging from subtle to extreme, and even to use a model without any voice input at all, as if it were a stand-alone software synthesis instrument. All of these compositional dimensions are engaged in real time using a combination of live coding and a stand-alone control interface.
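The kinds of manipulations these abstractions expose can be sketched in the same offline Python setting. The per-latent scale and bias values, the plain crossfade standing in for the spectral morph used on stage, and the noise-driven latent walk standing in for input-free 'standalone' synthesis are illustrative assumptions, not the actual SuperCollider implementation; file names are hypothetical.

```python
# Sketch of the latent-space manipulations described above. The scale/bias values,
# the plain crossfade, and the random latent walk are illustrative stand-ins;
# the on-stage version is implemented in SuperCollider. File names are hypothetical.
import torch
import torchaudio

model = torch.jit.load("voice_mask.ts").eval()
voice, sr = torchaudio.load("performer_voice.wav")
voice = voice.mean(dim=0, keepdim=True).unsqueeze(0)    # (1, 1, samples)

with torch.no_grad():
    z = model.encode(voice)                             # (1, n_latents, n_frames)

    # per-dimension scaling and bias in latent space
    scale = torch.ones(1, z.shape[1], 1)
    bias = torch.zeros(1, z.shape[1], 1)
    scale[0, 0, 0] = 2.0            # exaggerate the range of the first latent
    bias[0, 1, 0] = 0.5             # push the second latent off-centre
    wet = model.decode(z * scale + bias)

    # dry/wet morph between the raw voice and the transformed voice
    # (a plain crossfade here; the performance system uses a spectral morph)
    mix = 0.7
    n = min(voice.shape[-1], wet.shape[-1])
    morphed = (1 - mix) * voice[..., :n] + mix * wet[..., :n]

    # "standalone synthesiser" use: decode a slow random walk through latent
    # space, with no voice input at all
    z_free = torch.cumsum(0.01 * torch.randn(1, z.shape[1], 512), dim=-1)
    texture = model.decode(z_free)
```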

Commentary

RAVE models perform, roughly speaking, a kind of synthesis-by-similarity from the performer's voice to a specific audio corpus, resulting in sonic output similar to what one might get from a concatenative synthesis system that analyses incoming audio and matches it against audio from the corpus according to hand-crafted audio features (Schwarz, Cahen, and Britton 2008; Tremblay, Roma, and Green 2021).

There are a few notable advantages to using RAVE models over a fully realised concatenative synthesis system. The first is simplicity of development and use: the RAVE training architecture plus the SuperCollider UGen provides an out-of-the-box synthesis-by-similarity system. It is true that one must still have the time and patience to train a model, and some basic knowledge of machine learning pipelines, but even this hurdle can be overcome by using one of the many pretrained models made available by the open-source neural audio synthesis community. Another advantage of RAVE models is the way their variational autoencoder (VAE) architecture (Kingma and Welling 2013) learns, without supervision, an extremely compressed, information-rich set of audio features: the so-called 'latent space' parameters. The musician gets a small number of control parameters with a large amount of expressive range, making hands-on and intuitive control over the sonic possibility space of the model possible with little need for expertise in audio feature engineering.
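A quick way to see how compact this learned representation is: encode a stretch of audio and compare the number of latent values to the number of raw samples. The sketch below assumes the same hypothetical model export as above and a 48 kHz model rate; the exact counts depend on the model's configuration.

```python
# Sketch: how compressed is the latent representation? Encode one second of audio
# and compare latent values to raw samples. Counts depend on the specific export;
# the 48 kHz model rate and file name are assumptions.
import torch

model = torch.jit.load("voice_mask.ts").eval()
x = torch.zeros(1, 1, 48000)                   # one second of (silent) audio
with torch.no_grad():
    z = model.encode(x)

n_latents, n_frames = z.shape[1], z.shape[2]
ratio = x.shape[-1] / (n_latents * n_frames)
print(f"{n_latents} latent channels x {n_frames} frames "
      f"(~{ratio:.0f}x fewer values than the raw samples)")
```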

As far as this performance is concerned, RAVE models have the disadvantage of higher latency than most concatenative approaches. Because of the high computing power needed for a RAVE model to work in real time, they often require audio buffer sizes upwards of 2048 samples, leading to a latency between vocalising and hearing the transformed result. The latency can easily reach 200ms, which is around the perceptual threshold at which a delayed version of a vocalist's own voice becomes disruptive to speech and language cognition (Stuart et al. 2002). Voice transformations that operate in this latency range can make rational control or pre-planning difficult for a performer without extensive practice, but they can also be used to artistic and poetic effect, as in the notable case of Nancy Holt and Richard Serra's classic psychoacoustic performance piece Boomerang (Langley 2020). In recent months it has become possible to achieve a more usable <200ms latency during performances, as various tricks - such as clever buffering of audio frames and GPU-enabled real-time inference - have been and are being developed by the neural audio synthesis community to mitigate the latency of relatively heavy neural models (ACIDS [2021] 2023; Shepardson [2022] 2023; Qosmo [2022] 2023).
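As a rough illustration of where such figures come from, the arithmetic below converts block sizes into milliseconds of buffering delay; the buffer counts and the 48 kHz sample rate are assumptions for illustration, not measurements from the performance system.

```python
# Back-of-the-envelope latency from block-based buffering alone: each buffered
# block adds block_size / sample_rate seconds before any model compute time.
# Buffer counts and the 48 kHz rate are assumptions, not measurements.
def block_latency_ms(block_size: int, sample_rate: int = 48000) -> float:
    return 1000.0 * block_size / sample_rate

for n_buffers in (1, 2, 4):                    # e.g. input buffer, model hop, output buffer
    total = n_buffers * block_latency_ms(2048)
    print(f"{n_buffers} x 2048-sample buffers at 48 kHz = {total:.0f} ms")
# 1 buffer is ~43 ms, 2 are ~85 ms, 4 are ~171 ms; adding model inference and
# driver latency, 200 ms is easily reached.
```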

Training Data Disclosure

The four vocal masks in this performance involve models trained on the following datasets: (1) a vocal dataset of the author's own voice; (2) a vocal dataset curated from recordings of Dutch sound poet Jaap Blonk's renditions of Kurt Schwitters' 'Die Sonata in Urlauten', provided with Jaap's blessing; (3) a hybrid dataset of choral recordings sourced from the author's own work and two open choral singing research datasets (Cuesta et al. 2018; Rosenzweig et al. 2020); and (4) a dataset of the author's field recordings, including dogs barking, farm animals, monkeys and various bird species.

Name/Affiliation/Bio

Jonathan Chaim Reus is a transdisciplinary musician and artist known for his use of experimental technologies in performance. He was born in New York and thereafter lived in Amsterdam and then Florida, where he became involved in the American “new weird” folk-art movement. He later immigrated to the Netherlands and developed a uniquely intimate electronic sound practice combining improvisational approaches with traditional folk elements and futurist tendencies. He is a co-founder of the instrument inventors initiative [iii] in The Hague and of Netherlands Coding Live [nl_cl], and received a Fulbright Fellowship to research hybrid human-machine performance at the former Studio for Electro-Instrumental Music [STEIM] in Amsterdam.

Reus has received commissions as a composer and performance artist from Stedelijk Museum, Amsterdam, Slagwerk Den Haag percussion ensemble, and Asko-Schönberg contemporary music ensemble. Together with Sissel Marie Tonn he is one half of the artist duo Sensory Cartographies, whose wearable sound installation The Intimate Earthquake Archive won an honorable mention at the Ars Electronica festival in 2020. In 2022 he received the CTM KONTINUUM commission for the year-long generative radio project »In Search of Good Ancestors / Ahnen in Arbeit«, airing on German and Austrian public radio stations throughout 2022.

He is currently a PhD candidate in music composition within the interdisciplinary "Sensation and Perception to Awareness" Leverhulme-funded doctoral programme at the University of Sussex.

Programme Notes

This semi-improvised performance explores abstract voice sounds following the trajectory of a performer whose voice and verbality are slowly escaping him. The performance begins from Kurt Schwitters' Ursonate, and expands upon this text with additional speech-sounds and fragments of language collaged in collaboration with a GPT-family text prediction model. As the performance progresses, the performer's voice is made increasingly distant from his own biological voice, finally dissolving into multi-human, polyphonic and choral forms and the 'voices' of non-human mammal and bird species.

References

ACIDS. (2021) 2023. “RAVE: Realtime Audio Variational AutoEncoder Repository.” Python. https://github.com/acids-ircam/RAVE.

Blonk, Jaap. 2009. “About Kurt Schwitters’ Ursonate.” 2009. http://jaapblonk.com/Texts/ursonatewords.html.
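
Caillon, Antoine, and Philippe Esling. 2021. “RAVE: A Variational Autoencoder for Fast and High-Quality Neural Audio Synthesis.” arXiv. https://doi.org/10.48550/arXiv.2111.05011.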

Cuesta, Helena, Emilia Gómez, Agustín Martorell, and Felipe Loáiciga. 2018. “Choral Singing Dataset.” Zenodo. https://doi.org/10.5281/zenodo.1286570.

Kingma, Diederik P., and Max Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv. https://doi.org/10.48550/arXiv.1312.6114.

Langley, Patrick. 2020. “The Sound of the Self.” ArtReview. 2020. https://artreview.com/the-sound-of-the-self/.

Qosmo. (2022) 2023. “Neutone SDK.” Python. QosmoInc. https://github.com/QosmoInc/neutone_sdk.

Rosenzweig, Sebastian, Helena Cuesta, Christof Weiß, Frank Scherbaum, Emilia Gómez, and Meinard Müller. 2020. “Dagstuhl ChoirSet: A Multitrack Dataset for MIR Research on Choral Singing.” Transactions of the International Society for Music Information Retrieval 3 (1): 98–110. https://doi.org/10.5334/tismir.48.

Schwarz, Diemo, Roland Cahen, and Sam Britton. 2008. “Principles and Applications of Interactive Corpus-Based Concatenative Synthesis.” In Journées d’Informatique Musicale (JIM), 1–1. Albi, France. https://hal.science/hal-01161401.

Shepardson, Victor. (2022) 2023. “RAVE for SuperCollider Repository.” C++. https://github.com/victor-shepardson/rave-supercollider.

Stuart, Andrew, Joseph Kalinowski, Michael P. Rastatter, and Kerry Lynch. 2002. “Effect of Delayed Auditory Feedback on Normal Speakers at Two Speech Rates.” The Journal of the Acoustical Society of America 111 (5): 2237. https://doi.org/10.1121/1.1466868.

Tremblay, Pierre Alexandre, Gerard Roma, and Owen Green. 2021. “Enabling Programmatic Data Mining as Musicking: The Fluid Corpus Manipulation Toolkit.” Computer Music Journal 45 (2): 9–23. https://doi.org/10.1162/comj_a_00600.

Technical Rider

The performer will provide his own laptop + 2x microphones + audio interface. The performance requires a stereo PA system (ideally with subwoofer) and theatrical lighting suitable for an intimate solo performance.

The audio interface has two balanced jack outputs, one for each stereo channel, which should go to the PA system. Additionally required are two microphone stands (standing, with boom arms) for the performance microphones, a small table suitable for holding a laptop, audio interface and MIDI controller, and a stool for sitting. The performer will need one power socket and will bring his own EURO power strip with a EURO-to-UK adapter.

Theme

AI Music Theater or AI Concert


