Rolypoly~ is an expressive performance agent designed to anticipate and respond to a human musician in the context of interpreting a symbolic drum part. This demo presents and contextualizes the second version of the system, discussing its use cases and research implications. The software is rebuilt from the ground up as an encoder-decoder Transformer model running inside a Max object directly, rather than communicating with a Python backend. The model is pretrained on the Groove MIDI Dataset and is able to perform any drum score (loaded as a .mid file) with velocity and timing variations that adapt to the human musician. The agent aims to play in a way that lets it accurately predict upcoming onset timings in the human’s audio signal. In turn, users are able to finetune the machine agent iteratively over repeated duet performances, much like how two human musicians would learn and rehearse a piece. Finally, a generative mode, where the drum machine switches to a free “comping” mode, is in an experimental stage and available for testing.
This demo features version 2.0 of rolypoly~ [1]: an interactive agent that modulates the microtiming, or groove, of a given drum part to complement a human musician’s predicted timing. The guiding intuition is that of performing drummers continuously adapting their inner-beat groove in anticipation and reaction to their partners on a moment-to-moment scale [2].
The new version runs entirely within a Max external object, without requiring a Python backend for inference or finetuning. The open-source framework1 now allows for easy integration of new ML (machine learning) models, with the main model being an encoder-decoder Transformer [3]. The system’s performance-finetuning loop is its main contribution to the field of ML-powered music co-creation, enabling a dynamic feedback loop of listening and generation [4]. This strategy employs active divergence [5] through finetuning and domain transformation [6], to continually reconfigure the action space, conditioned by a specific music score.
The area of generative AI for drum performance has flourished in the last three years, enabled in large part by Magenta’s Groove MIDI Dataset (GMD) [7] and similar drum groove corpora [8].
Transformers have surpassed recurrent neural networks not only in language modelling but also in time series prediction [9]. Drum generation systems are following suit, using variational autoencoders [10][11][12][13] and/or Transformers [12][14][15][16][17] for production (beat/loop making [10][11][13][18][15][16] or accompaniment [14][12]) or live performance [11][17], sometimes with human guidance [13][18]. However, handling sequences longer than 2-bar loops in 4/4, as well as prediction-based interaction, both remain largely unexplored, with real-time systems mostly relying on fixed-grid, reactive call-and-response schemas [17].
The system architecture draws from the analogy between natural language translation (the task of the original Transformer [3]) and music performance. The encoder is fed the entire sequence sans expressive information, and the decoder processes the generated timesteps. Fig. 1 shows the data flow diagrams for all three phases involved in the system. The data representation is specified in Appendix 1.
The default Transformer in rolypoly~ has 6 encoder and decoder block layers with 16 attention heads, an internal dimension of 64, and a feed-forward dimension of 256. The decoder receives a maximum of block_size = 16 steps.
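For reference, here is a minimal sketch of a model with these dimensions using PyTorch's built-in nn.Transformer; the actual rolypoly~ model defines its own encoder-decoder stack, so the class and the tensor shapes below are illustrative assumptions only.

import torch
from torch import nn

# Hyperparameters as stated above; argument names follow torch.nn.Transformer,
# not necessarily the rolypoly~ source.
model = nn.Transformer(
    d_model=64,             # internal dimension
    nhead=16,               # attention heads
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=256,
    batch_first=True,
)

block_size = 16                           # maximum number of decoder steps
score = torch.rand(1, 128, 64)            # (batch, score steps, d_model), assumed pre-projected
performed = torch.rand(1, block_size, 64) # (batch, decoder steps, d_model)
out = model(score, performed)             # -> (1, block_size, 64)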
However, other TorchScript modules can be loaded into the Max external instead: see Appendix 2 for a basic example. The modular framework makes swapping in alternative ML models largely frictionless. Different data representations can also be supported by modifying the preprocessing in data.py.
The training and finetuning phases are described in Appendix 3.
The Max external object is written in C++ using the Min DevKit2 and LibTorch3, inspired by the nn~ external [19] which powers the RAVE model [20]4. The main benefits of a standalone Max object are:
ease of use: a unified tool for loading MIDI5, audio input, triggering drums, and finetuning the model;
sample-accuracy, providing sub-millisecond timing resolution;
no timing inconsistencies from Max-Python-Max communication.
The only dependency is an onset detector from the FluCoMa toolkit [21], latency-compensated inside rolypoly~ for sample-accurate consistency.
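To illustrate what latency compensation means here (the actual correction happens inside the C++ external; the sample rate and latency figure below are placeholders, not values from the source):

# An onset detector reports an onset some fixed number of samples after it
# actually occurred; subtracting that analysis latency restores sample-accurate timing.
SR = 48000                 # sample rate (placeholder)
DETECTOR_LATENCY = 512     # detector latency in samples (placeholder)

def compensated_onset_seconds(reported_sample: int) -> float:
    """Convert a reported onset position to a latency-corrected time in seconds."""
    return (reported_sample - DETECTOR_LATENCY) / SR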
The following are three videos of the system in action.6 The first example shows a standard use case. The generated velocities are fairly erratic in the pretrained phase, confirming that Transformers struggle to replicate the velocity distributions7 in the GMD [17]. This issue is alleviated from the very first finetuning.
Video 2 shows an example of trying to force the machine to change from a “swinging” rhythm to “straight”:
Finally, we can obtain drastically out-of-sample behaviour by subverting expectations. Video 3 shows an example of (after a couple of regular play+finetune stages to establish the model’s prediction pattern) systematically playing the downbeats early. The resulting rhythm is nearly unrecognisable:
Most current drum generation models are relatively constrained in scope compared to past expressive performance models [22] or many modern generative systems [23]. Rolypoly~ works against the bias towards reconstruction in generative AI [6] by targeting longer sequences, at varying tempo and meter, over repeated performances.
The system’s iterative workflow and its live feedback loop make systematic evaluation challenging [1]. However, they lend themselves well to a hands-on demo session.
For any music AI, one must consider the ethical dimension. A main benefit of this technology is that it is additive, not zero-sum: it is not designed (nor is it likely) to replace human drummers, but rather to transform existing drum sequencing software. Moreover, rolypoly~ embraces a “small-data mindset” [24]: after pretraining on a public dataset, the model adapts to bespoke, user-produced data. Still, a pretraining scheme on a big dataset of scores+performances is conceivable8, which might improve the system’s salience but would also raise data ownership and privacy concerns, similar to those around OpenAI’s Jukebox [25] et al.
As we continue to develop such systems, we must also attend to issues of implicit bias, enforced patterns, unintended uses, restricted access and so on.
Rolypoly~ 2.0 is currently in a functional, alpha stage. The following work-in-progress features will be ready for testing in time for the demo session in August 2023:
a “free” generative mode, where the machine is allowed to stray away from the score. Preliminary tests indicate that simply decoupling the encoder is not sufficient to produce convincing output. Still, related work in generating continuations [15] and infilling [14] offers hope.
including alternative models such as the Seq2Seq model from version 1.0 [1], which appeared to perform better on the rhythm morphing exercise9 and could provide a useful benchmark.
Looking ahead, several research and development directions remain open.
Currently the system only allows for a binary choice on finetuning between runs. I am investigating solutions for more complex feedback and steering over different musical units and timescales [13][26].
Another intriguing development involves opening up the system to other sources of symbolic material, such as composition or improvisation agents [27], as inputs to rolypoly~’s encoder.
Moving forward my main goal is a stable release, and ports to Pure Data and possibly VST. These will enable a broader community to access and make music with rolypoly~, which has always been the driving goal of the project.
This research has not received any funding. Conference participation is initiated through the Dept. of Animation & Interactivity, funded as a research activity at UNATC “I.L. Caragiale”. There are no conflicts of interest. The dataset used is open and publicly available. The software was developed on consumer hardware and trained on a single-GPU desktop machine. The source code and pretrained model are publicly shared, reproducible and extensible. Other ethical questions pertaining to the project are discussed above.
To present the interactive demo I plan to provide:
laptop PC w/ audio interface
a pair of headphones
a guitar and a microphone
Setup requirements:
a table w/ access to power
optionally, a pair of speakers (using 1/4 inch TRS cables)
The decoder processes vectors containing the following components:
9 played velocities
10 timing values:
9 drum hit offsets (as fractions of a bar)
input audio onset timing prediction (as a fraction of a bar)
3 location features
local tempo, in BPM
local time signature, a float value (e.g. 3/4 and 6/8 resolve to 0.75)
beat phase, between 0 and 1: the relative position in a bar
Meanwhile, the encoder receives just 9 scored velocities and 3 location features, with no microtiming information available.
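As a concrete sketch of this layout (the helper function below and its argument ordering are illustrative assumptions; the canonical indices live in the project's constants and data.py):

import torch

N_DRUMS = 9

def make_decoder_step(velocities, drum_offsets, onset_pred, bpm, timesig, bar_phase):
    """
    Assemble one 22-dimensional decoder timestep:
    9 played velocities + 9 drum hit offsets + 1 audio onset prediction
    + 3 location features (tempo, time signature, beat phase).
    """
    return torch.cat([
        torch.as_tensor(velocities, dtype=torch.float),    # 9 played velocities
        torch.as_tensor(drum_offsets, dtype=torch.float),  # 9 offsets, fractions of a bar
        torch.tensor([onset_pred], dtype=torch.float),     # predicted audio onset timing
        torch.tensor([bpm, timesig, bar_phase], dtype=torch.float),  # location features
    ])

step = make_decoder_step([0.8] * N_DRUMS, [0.0] * N_DRUMS, 0.01, 120.0, 1.0, 0.25)
assert step.shape == (22,)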
While this representation is more compact (22 features versus 27 in [7] et al, and no empty rows), the relative merits of different representations warrant further investigation. One comparative study of rhythmic representations [28] suggests that a basic 16-step grid would have been a better starting point, but it does not address the possible advantage of including location features in the vector.
Example of a TorchScript module that naively nudges off-beat hits.
model.py:
from torch import nn
import constants
import data

class Swing(nn.Module):
    def __init__(self):
        super(Swing, self).__init__()

    def forward(self, _, x):
        # x: (batch, steps, features); the first argument (encoder input) is unused here
        for i in range(x.shape[1]):
            if data.offbeat(x[0, i, constants.INX_BAR_POS]):
                # if we're on an offbeat, nudge the note forward
                nudge = data.bartime_to_ms(0.05, x[0, i, :])
                x[0, i, constants.INX_TAU_D] = nudge
        return x
data.py:
import torch

def offbeat(bartime: torch.Tensor) -> bool:
    """
    Check if a bar-relative time is on an off-beat.
    input:
        bartime = time to be checked (as a fraction of the bar)
    """
    bartime = torch.as_tensor(bartime)
    # quarter-note positions in a 4/4 bar: 0, 0.25, 0.5, 0.75, 1.0
    for i in range(5):
        if torch.isclose(bartime, torch.tensor(i / 4), atol=0.05):
            return False
    return True
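As a usage sketch, such a module can be exported with TorchScript so that the LibTorch-based external can load it; the output file name below is arbitrary, and the exact message for pointing rolypoly~ at a custom model is not shown here.

import torch
from model import Swing

# Script the module and serialize it for LibTorch.
scripted = torch.jit.script(Swing())
scripted.save("swing.pt")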
The default model in rolypoly~ 2.0 is pretrained on the GMD in a similar fashion to [1], with the velocity information removed by averaging the non-zero hits for each drum category. Teacher forcing is applied to the decoder inputs during pretraining.
After a performance is complete, the user has the option to send the train message to the Max object, which triggers a finetuning of the model. The current specification is 20 epochs of a forward-backward pass of the latest data through the model, at a learning rate of 0.003 with the Adam optimizer.
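A minimal sketch of such a finetuning step is given below; the model, data tensors and loss function are placeholders, with only the epoch count, optimizer and learning rate taken from the description above.

import torch

def finetune(model, x_enc, x_dec, loss_fn, epochs=20, lr=0.003):
    """Forward-backward passes of the latest performance data through the model."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        y_hat = model(x_enc, x_dec)     # forward pass over the latest take
        loss = loss_fn(y_hat, x_dec)    # placeholder target; see the loss terms below
        loss.backward()
        optimizer.step()
    return model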
The domain to be optimized is altered from the pretraining stage: since we now have records of the predicted audio timings, the finetuning loss combines four terms, each with a user-adjustable weighting coefficient used both for normalization and steering.
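Schematically, and leaving the concrete terms abstract since they are not reproduced here, the finetuning objective takes the form of a weighted sum:

\mathcal{L}_{\text{finetune}} = \sum_{i=1}^{4} w_i \, \mathcal{L}_i

where the $w_i$ are the user-adjustable coefficients mentioned above.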