Interactive Rhythm Generation through Musical Actions
This paper introduces our tool Latent Space Explorer for interactively exploring latent spaces in generative Variational Autoencoders (VAE). We propose a method for musical exploration of the latent space using a separate exploration model to bridge the gap between musical actions and generated rhythms. The tool enables musicians to navigate the position in the latent space through various musical parameters currently derived from live audio input. Initial experiments with robotic percussion instruments and an alto saxophone demonstrate the musical potential of the approach.
Using Variational Autoencoders (VAE) to generate rhythms has become well established in recent years. The underlying architecture allows straightforward access to all rhythms the model can generate via their representation in a latent space. To give more control over the generated sequences, approaches for visualising the latent space distribution have been proposed. While these seem well suited to the degree of control needed in settings such as music production, it may be interesting to gain improvisational access to the model's creative potential through dialogic exploration rather than exact control. We therefore propose a tool for musical exploration of latent spaces, with a separate exploration model used to bridge the gap between musical actions and generated rhythms.
At the current early stage of this ongoing project, our main research question is how a latent space representation for rhythm generation can be explored interactively through the live audio signal of a musical instrument, and which features are relevant for satisfactory collaborative interaction.
Generating rhythms raises questions of temporal resolution, specifically the granularity of the temporal grid used to create rhythmic figures. These figures vary significantly across musical genres. Human performance never aligns perfectly with the prescribed time grid; the resulting small deviations, referred to as micro timing, can be either intentional or characteristic of a particular style. The velocity with which notes are played is another vital factor that strongly influences the rhythmic character. In addition, the number of percussive elements employed contributes significantly to the complexity of rhythms through their interplay.
In the design of AI-based rhythm generation tools, it becomes imperative to consider these aforementioned aspects. Some existing tools partially address these factors. Moreover, in the context of AI-based tools, training and the associated dataset present additional considerations. Determining the requisite computational resources for training a Variational Autoencoder (VAE) for rhythm generation, as well as identifying the necessary dataset size and composition, become pertinent concerns. Consequently, when envisioning a rhythm generation tool intended to be widely accessible while allowing customization, addressing these fundamental questions becomes paramount.
MusicVAE, developed by the Google Magenta team, stands out as one of the pioneering applications of Variational Autoencoders (VAE) in the field of music generation. However, tools like MusicVAE, DrumNet or GrooVAE were not designed to be trained and adapted by musicians themselves. GrooVAE was trained on a large data set comprising 13 hours of MIDI and audio from the Groove data set. MusicVAE was trained on a large data set of about 1.5 million individual MIDI clips.
M4L.RhythmVAE is a Max4Live device (plugin) for Ableton Live, presented by Tokui in 2020. It is based on GrooVAE but uses a simpler network architecture. M4L.RhythmVAE takes onsets, velocity and micro timing into account. Furthermore, the tool can be trained by individual users with comparatively small data sets, and it is optimized for real-time operation.
The distribution of the learned rhythms in the latent space is not visualized. Also, triplets cannot be generated explicitly. However, these can be created through microtiming parameters. For each note that does not have an onset exactly on the grid, an offset from the grid is specified and fed into the neural network. From this point of view a triplet is basically nothing more than a sixteenth note with a certain deviation from the sixteenth grid.
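The view of a triplet as a sixteenth note with a deviation from the grid can be illustrated with a short sketch. It is not taken from M4L.RhythmVAE: the function name to_grid, the measurement of onsets in quarter-note beats, and the expression of the offset as a fraction of one grid step are assumptions made for illustration only.

```python
from fractions import Fraction

# Assumption: onsets are measured in quarter-note beats, so one
# sixteenth-note grid step is a quarter of a beat.
STEP = Fraction(1, 4)


def to_grid(onset_beats):
    """Snap an onset to the nearest sixteenth and return (grid_index, offset).

    The offset is expressed as a fraction of one grid step; this encoding
    is hypothetical and may differ from the one used by the device.
    """
    position = Fraction(onset_beats) / STEP
    index = round(position)      # nearest grid line
    offset = position - index    # signed micro-timing deviation
    return index, offset


# Eighth-note triplet onsets within one beat: 0, 1/3 and 2/3 of a beat.
triplet = [to_grid(Fraction(k, 3)) for k in range(3)]
```

Under these assumptions the second and third triplet notes fall on grid steps 1 and 3, one third of a step late and one third of a step early respectively.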
Vigliensoni et al. released R-VAE in 2022, a tool based on a Variational Autoencoder (VAE) that encodes musical rhythms in a two-dimensional latent space. This tool builds on M4L.RhythmVAE, with improvements for meter and visualization. Thanks to its fine temporal resolution of 32nd-note triplets, R-VAE can explicitly represent simple and compound rhythms, which helps when encoding and decoding different genres. A visualization for the dynamic representation of the latent space was also implemented, so that the location and temporal occurrence of a percussive element inside the latent space can be traced. The possibility of using small data sets increases the human influence on the learning process. This allows artists to better understand how to edit and influence a model, which in turn leads to an increased sense of ownership.
The generated output depends on the size of the data set and the number of epochs in the training process. Vigliensoni et al. conducted experiments with data sets of about a dozen MIDI files. At 10 epochs there is a lot of noise in the latent space. The noise decreases from 100 epochs onward and has completely disappeared at 1000 epochs. However, 1000 epochs result in overfitted models in which the regions of the latent space are cleanly separated and the transitions between them are very abrupt. Transitions also depend on the rhythms themselves: the more similar or related the rhythms are, the smoother the transitions.
Why should artists and composers use tools based on VAE for their work? Yee King proposed a creative approach to latent spaces in the context of working with large datasets of creative material, such as scores, images, or sounds. This involves two key actions: search and generate. Several search methods, including those reliant on metadata, content-based approaches, and feature extraction techniques, exhibit inherent limitations. VAE-based tools overcome these limitations, thus enabling artists and composers to explore and discover meaningful patterns within the vast array of creative material.
In this paper, and in tools like M4L.RhythmVAE or R-VAE, the approach to the latent space is different. Not only is the dataset small and customizable rather than large and impersonal, but the real-time use case also changes the perspective on, and the purpose of, the latent space. Artists engage with the latent space beyond a mere search for inspiring ideas; they actively interact with it, which requires the output to respond immediately. The following section presents an approach to exploring the latent space through musical input.
Our tool Latent Space Explorer is realized as an additional Max4Live device and is intended to accompany a rhythm-generating device. The mapping of the xy-parameters is not restricted to a particular device; they can be individually assigned to the parameters of any other device or plugin.
Part A of Figure 1 shows a rhythm generating VAE encoding its contained rhythmic material into a 2-dimensional latent space representation.
Part B visualizes the decoding part of the VAE, where a new rhythm is generated from the latent space position.
Part C shows the Latent Space Explorer signal flow and where it intervenes in the VAE.
In simple terms, the Latent Space Explorer translates various musical parameters into a two-dimensional space (cf. Figure 1). There are currently 10 input parameters: ambitus, sound density, loudness measured as RMS, and seven spectral parameters extracted from a live audio signal by the FluCoMa object spectralshape.
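How the 10 input parameters might be assembled into one vector is sketched below. This is an illustration rather than the device's implementation: the definitions of ambitus (highest minus lowest pitch) and sound density (onsets per second), as well as all function names, are assumptions, and the seven spectral values are passed in as placeholders instead of being computed.

```python
import math


def rms(samples):
    """Root mean square of a block of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def ambitus(midi_pitches):
    """Pitch range in semitones (assumed definition: highest minus lowest)."""
    return max(midi_pitches) - min(midi_pitches) if midi_pitches else 0


def sound_density(onset_times, window_seconds):
    """Onsets per second within the window (assumed definition)."""
    return len(onset_times) / window_seconds


def feature_vector(samples, midi_pitches, onset_times, window_seconds, spectral):
    """Assemble the 10 inputs: ambitus, density, RMS, then 7 spectral values."""
    if len(spectral) != 7:
        raise ValueError("expected seven spectral descriptors")
    return [
        ambitus(midi_pitches),
        sound_density(onset_times, window_seconds),
        rms(samples),
        *spectral,
    ]


features = feature_vector(
    samples=[0.5, -0.5, 0.5, -0.5],
    midi_pitches=[60, 72, 65],
    onset_times=[0.1, 0.5, 0.9, 1.3],
    window_seconds=2.0,
    spectral=[0.0] * 7,  # placeholder values, not real analysis output
)
```

In practice the values would probably need to be scaled to a common range before being passed on, which this sketch leaves out.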
These input parameters are fed into a neural network, which translates the 10 input parameters into 4 output parameters. The first 2 output parameters are displayed on an XY coordinate system, which in the next step controls the latent space of the M4L.RhythmVAE device. The other two parameters control the noise of the device and the so-called Z-ramp-time. The latter allows control over the transition time between two consecutive parameter values in the latent space.
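The mapping from 10 inputs to 4 outputs can be pictured as a small feed-forward network. The sketch below is an assumption-laden illustration: the hidden layer size of 8, the tanh and sigmoid activations, the 0..1 output range, and the linear shape of the ramp are not specified in this paper, and the weights are random rather than trained.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create a dense layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, inputs, activation):
    """Apply one dense layer followed by an activation function."""
    weights, biases = layer
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def explore(features, hidden_layer, output_layer):
    """Map 10 input parameters to x, y, noise and Z-ramp-time (all in 0..1)."""
    hidden = dense(hidden_layer, features, math.tanh)
    x, y, noise, ramp_time = dense(output_layer, hidden, sigmoid)
    return {"x": x, "y": y, "noise": noise, "ramp_time": ramp_time}


def ramp(start, target, elapsed, ramp_time):
    """Linearly interpolate between two latent positions (assumed ramp shape)."""
    if ramp_time <= 0 or elapsed >= ramp_time:
        return tuple(target)
    t = elapsed / ramp_time
    return tuple(s + t * (g - s) for s, g in zip(start, target))


rng = random.Random(0)                 # untrained, random weights
hidden_layer = make_layer(10, 8, rng)  # hidden size 8 is an assumption
output_layer = make_layer(8, 4, rng)
result = explore([0.5] * 10, hidden_layer, output_layer)
halfway = ramp((0.0, 0.0), (1.0, 1.0), elapsed=0.5, ramp_time=1.0)
```

The ramp function shows the role of the Z-ramp-time: a longer ramp time spreads the move between two consecutive latent positions over more time, while a ramp time of zero jumps to the target immediately.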
The training process consists of two steps:
Collecting training data
Fitting the model
Training data can be collected in two ways. The first, more intuitive approach is for a musician to play an instrument while the 10 input parameters are evaluated in real time. The second is to adjust the input parameter values manually with the cursor; the Max4Live device (cf. Figure 2) displays these parameters in a multi-slider. For the first approach, the two live.tab objects in the training view (cf. Figure 3) must be set to “auto” and “select”. “Auto” stands for automatic parameter adjustment through the audio input signal. “Select” mode indicates that the model is not predicting any output, so the user can select constellations of input and output parameters to add to the dataset.
The output parameters always have to be set manually by the user so that the neural network can learn which output values should correspond to which input values. Once playing has brought the input parameters into a certain state, the current state of the input and output parameters can be added to the model's dataset by pressing the Add-InOut-Pair button.
Once enough In/Out pairs have been added, the model can be trained. The loss value is displayed after each training session, and the train-model button can be clicked repeatedly to reduce the loss further. Since our model is very small, 2–4 training cycles are usually sufficient. If the loss value remains very high, however, the dataset should be adjusted or, if necessary, enlarged. Incorrect datasets can be deleted completely, or single data points can be erased. Preferably, the model and the dataset should be deleted before In/Out pairs are selected for a new model. The weights of a satisfactory model can be saved as a JSON file and recalled later.
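The two training steps, followed by saving and recalling the weights, can be sketched in plain Python. This is not the code of the Max4Live device: the class name ExplorationModel, the hidden layer size, the learning rate, the number of epochs per training cycle, and the example parameter values are all assumptions chosen for illustration.

```python
import json
import math
import random


class ExplorationModel:
    """Tiny regressor: 10 inputs -> tanh hidden layer -> 4 linear outputs."""

    def __init__(self, n_in=10, n_hidden=8, n_out=4, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def _hidden(self, x):
        return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def _output(self, h):
        return [sum(w * v for w, v in zip(row, h)) + b
                for row, b in zip(self.w2, self.b2)]

    def predict(self, x):
        return self._output(self._hidden(x))

    def loss(self, pairs):
        """Mean squared error over all in/out pairs."""
        errors = [(y - t) ** 2
                  for x, target in pairs
                  for y, t in zip(self.predict(x), target)]
        return sum(errors) / len(errors)

    def train(self, pairs, epochs=200, lr=0.05):
        """One training cycle of plain stochastic gradient descent; returns the loss."""
        for _ in range(epochs):
            for x, target in pairs:
                h = self._hidden(x)
                err = [y - t for y, t in zip(self._output(h), target)]
                # Back-propagate through the output layer before updating it.
                grad_h = [sum(e * row[j] for e, row in zip(err, self.w2)) * (1 - hj * hj)
                          for j, hj in enumerate(h)]
                for k, e in enumerate(err):
                    self.w2[k] = [w - lr * e * hj for w, hj in zip(self.w2[k], h)]
                    self.b2[k] -= lr * e
                for j, g in enumerate(grad_h):
                    self.w1[j] = [w - lr * g * xi for w, xi in zip(self.w1[j], x)]
                    self.b1[j] -= lr * g
        return self.loss(pairs)

    def to_json(self):
        return json.dumps({"w1": self.w1, "b1": self.b1, "w2": self.w2, "b2": self.b2})

    @classmethod
    def from_json(cls, text):
        model = cls()
        for name, value in json.loads(text).items():
            setattr(model, name, value)
        return model


# Step 1: collect in/out pairs (values below are made up for illustration).
fast_high = [0.1, 0.9, 0.7] + [0.5] * 7   # 10 input parameters
slow_low = [0.6, 0.2, 0.3] + [0.2] * 7
pairs = [
    (fast_high, [0.8, 0.2, 0.1, 0.5]),    # x, y, noise, Z-ramp-time
    (slow_low, [0.2, 0.8, 0.3, 0.1]),
]

# Step 2: fit the model and compare the loss before and after.
model = ExplorationModel()
loss_before = model.loss(pairs)
loss_after = model.train(pairs)

# Save the weights as JSON and recall them.
restored = ExplorationModel.from_json(model.to_json())
```

Calling train again corresponds to clicking the train-model button once more; with so few pairs the loss typically drops quickly, which matches the observation that a small model needs only a few training cycles.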
At the current stage of this project, we used practice-based evaluation cycles to inform the software prototyping phase, similar to practice-driven approaches in related work. For reasons of compatibility with M1 processors, we decided to use M4L.RhythmVAE instead of more recent alternatives in our first experiments. The apparent limitation caused by the absence of triplets in the M4L.RhythmVAE device is well compensated by its micro timing abilities. Moreover, our focus on joint exploration and improvisation over direct control makes a more precise visualization of the latent space distribution dispensable.
For our first experiments, we recorded different groove elements for each of 6 robotic percussion instruments and merged them into a training dataset for M4L.RhythmVAE consisting of 448 bars. As the partner instrument for the VAE percussion and the input for Latent Space Explorer, one of the authors played an alto saxophone (cf. video 1).
Video 2 shows the training phase with four different musical actions on the saxophone mapped to certain latent space positions.
In this paper we presented a Max4Live device for interactively exploring latent spaces of generative VAE. Our initial experiments offer a glimpse of the musical potential of this approach, although further development steps and evaluations, e.g. regarding the size of the training sets and the relevant features, are needed to draw more reliable conclusions.
For now, the Latent Space Explorer can be trained to generate a specific position for a certain musical action, e.g. playing fast, high notes with low ambitus means position xy. But what about other ways of navigating? For example, the described musical input (fast, high notes) could stand not for a certain position but for a certain way of moving through the latent space. Fast, high notes could then mean moving in the positive x direction.
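The difference between the two ways of navigating can be made concrete with a small sketch of the relative variant, in which a recognised musical action selects a direction instead of a position. Everything here is hypothetical: the function move, the step size, the latent space bounds of -1 to 1, and the action label are assumptions and not part of the current tool.

```python
def move(position, direction, step=0.05, bounds=(-1.0, 1.0)):
    """Advance a latent position along a direction, clamped to the bounds."""
    low, high = bounds
    return tuple(min(high, max(low, p + step * d)) for p, d in zip(position, direction))


# Hypothetical mapping from a recognised musical action to a direction.
DIRECTIONS = {"fast_high_notes": (1.0, 0.0)}

position = (0.0, 0.0)
first = move(position, DIRECTIONS["fast_high_notes"])
for _ in range(30):                      # the action is sustained
    position = move(position, DIRECTIONS["fast_high_notes"])
```

As long as the action is sustained, the position keeps drifting in the positive x direction until it reaches the edge of the assumed latent space, whereas the absolute mapping would hold a single fixed position.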
As in our previous work, we plan to keep the musician’s perspective in order to increase the creative freedom of the AI collaborator as much as possible. We see the addition of further input modalities, such as body posture or EEG data, as a useful extension for future work in this project.
Regarding the generated rhythms, we are planning a feedback mechanism that recognizes newly played rhythms during a performance and integrates them into the dataset. A switch away from VAE to more open network architectures, e.g. using deep reinforcement learning, might also help to achieve these further development steps.
The authors declare that the work presented was conducted in the absence of any conflict of interest (related to either commercial, financial or personal relationships) and in line with Principles & Code of Practice on Ethical Research. This material is based upon work in part supported by a grant from anonymous.