
Finetuning Rolypoly~ 2.0: an expressive drum machine that adapts with every performance

Published on Sep 04, 2023


Rolypoly~ is an expressive performance agent designed to anticipate and respond to a human musician in the context of interpreting a symbolic drum part. This demo presents and contextualizes the second version of the system, discussing its use cases and research implications. The software is rebuilt from the ground up as an encoder-decoder Transformer model running inside a Max object directly, rather than communicating with a Python backend. The model is pretrained on the Groove MIDI Dataset and is able to perform any drum score (loaded as a .mid file) with velocity and timing variations that adapt to the human musician. The agent aims to play in a way that lets it accurately predict upcoming onset timings in the human’s audio signal. In turn, users are able to finetune the machine agent iteratively over repeated duet performances, much like how two human musicians would learn and rehearse a piece. Finally, a generative mode, where the drum machine switches to a free “comping” mode, is in an experimental stage and available for testing.


This demo features version 2.0 of rolypoly~ [1]: an interactive agent that modulates the microtiming, or groove, of a given drum part to complement a human musician’s predicted timing. The guiding intuition is that of performing drummers continuously adapting their inner-beat groove in anticipation and reaction to their partners on a moment-to-moment scale [2].

The new version runs entirely within a Max external object, without requiring a Python backend for inference or finetuning. The open-source framework now allows for easy integration of new ML (machine learning) models, with the main model being an encoder-decoder Transformer [3]. The system's performance-finetuning loop is its main contribution to the field of ML-powered music co-creation, enabling a dynamic feedback loop of listening and generation [4]. This strategy employs active divergence [5] through finetuning and domain transformation [6] to continually reconfigure the action space, conditioned on a specific music score.

Related work

The area of generative AI for drum performance has flourished in the last three years, enabled in large part by Magenta’s Groove MIDI Dataset (GMD) [7] and similar drum groove corpora [8].

Transformers have surpassed recurrent neural networks not only in language modelling, but also for time series prediction [9]. Drum generation systems are following suit, using variational autoencoders [10][11][12][13] and/or Transformers [12][14][15][16][17] for production (beat/loop making [10][11][13][18][15][16] or accompaniment [14][12]) or live performance [11][17], sometimes with human guidance [13][18]. However, the handling of sequences longer than 2-bar loops in 4/4, and prediction-based interaction, both remain largely unexplored, with real-time systems mostly relying on fixed-grid, reactive call-and-response schemas [17].


The system architecture draws from the analogy between natural language translation (the task of the original Transformer [3]) and music performance. The encoder is fed the entire sequence sans expressive information, and the decoder processes the generated timesteps. Fig. 1 shows the data flow diagrams for all three phases involved in the system. The data representation is specified in Appendix 1.

Figure 1: Data flow.

Left: pretraining. Offset and velocity information is learned from the GMD dataset. The $v$ values sent to the encoder are averages of non-zero velocities for that particular drum in that particular recording.

Center: live inference. The decoder outputs predicted drum and guitar offsets, but at its input the actual positions and guitar timings ($Pos$ and $g$) are fed in from the score and audio input respectively. Dotted arrows represent online, stepwise communication.

Right: finetuning. The decoder receives all the predicted and realised features, retained from the preceding performance.

The default Transformer in rolypoly~ has 6 encoder and decoder block layers with 16 attention heads, an internal dimension of 64, and a feed-forward dimension of 256. The decoder receives a maximum of block_size = 16 steps.
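One way to picture the block_size limit is as a sliding window over the most recent decoder steps. The sketch below assumes that reading; it is not taken from the external's C++ source, and `DecoderContext` and its methods are hypothetical names used only for illustration.

```python
from collections import deque

BLOCK_SIZE = 16  # maximum number of decoder steps, as stated above


class DecoderContext:
    """Hypothetical rolling buffer holding the last BLOCK_SIZE decoder steps."""

    def __init__(self, block_size=BLOCK_SIZE):
        # A bounded deque discards its oldest item once it is full.
        self.steps = deque(maxlen=block_size)

    def push(self, step_vector):
        # Store a copy so later edits by the caller do not alter the history.
        self.steps.append(list(step_vector))

    def window(self):
        # Oldest step first, newest step last.
        return list(self.steps)


ctx = DecoderContext()
for t in range(20):
    ctx.push([float(t)])  # placeholder one-feature steps
```

After 20 pushes, the window holds only steps 4 to 19, so the decoder input never exceeds 16 steps regardless of the length of the score.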

However, other TorchScript modules can be loaded into the Max external instead: see Appendix 2 for a basic example. The modular framework makes using alternative ML models quite frictionless. Different data representations can also be supported by modifying the preprocessing in

The training and finetuning phases are described in Appendix 3.

The Max external object is written in C++ using the Min DevKit and LibTorch, inspired by the nn~ external [19] which powers the RAVE model [20]. The main benefits of a standalone Max object are:

  • ease of use: a unified tool for loading MIDI, audio input, triggering drums, and finetuning the model;

  • sample-accuracy, providing sub-millisecond timing resolution;

  • no timing inconsistencies from Max-Python-Max communication.

The only dependency is an onset detector from the FluCoMa toolkit [21], latency-compensated inside rolypoly~ for sample-accurate consistency.
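As a rough illustration of latency compensation, the sketch below shifts a reported onset back by a fixed detector latency and expresses it as a fraction of the bar. It is an assumption-laden sketch, not the external's code: the function name, the latency value and the sample rate are made up, BPM is assumed to count quarter notes, and the time signature is taken as the bar length in whole notes (3/4 gives 0.75).

```python
def onset_to_bar_fraction(
    onset_sample,
    detector_latency_samples,
    bar_start_sample,
    bpm,
    time_signature,
    sample_rate=48000,
):
    """Return the latency-compensated onset position as a fraction of the bar."""
    # Shift the reported onset back by the detector's (assumed constant) latency.
    true_onset = onset_sample - detector_latency_samples
    # Bar length: time_signature whole notes = 4 * time_signature quarter notes.
    seconds_per_bar = time_signature * 4 * 60.0 / bpm
    samples_per_bar = seconds_per_bar * sample_rate
    return (true_onset - bar_start_sample) / samples_per_bar


# Hypothetical numbers: 4/4 at 120 BPM, onset reported 512 samples late.
fraction = onset_to_bar_fraction(
    onset_sample=24512,
    detector_latency_samples=512,
    bar_start_sample=0,
    bpm=120.0,
    time_signature=1.0,
)
```

With these numbers a bar lasts 96000 samples, so the compensated onset at sample 24000 falls a quarter of the way through the bar.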


The following are three videos of the system in action. The first example shows a standard use case. The generated velocities are fairly erratic in the pretrained phase, confirming that Transformers struggle to replicate the velocity distributions in the GMD [17]. This issue is alleviated from the very first finetuning.

Video 1: Standard usage. Each finetuning phase contributes expressive nuances.

Video 2 shows an example of trying to force the machine to change from a “swinging” rhythm to “straight”:

Video 2: Rhythm morphing. Attempting to make the model switch from “swing” to “straight”.

Finally, we can obtain drastically out-of-sample behaviour by subverting expectations. Video 3 shows an example of (after a couple of regular play+finetune stages to establish the model’s prediction pattern) systematically playing the downbeats early. The resulting rhythm is nearly unrecognisable:

Video 3: Subverting predictions. Playing downbeats early turns the beat “upside down”.

Discussion and perspectives

Most current drum generation models are relatively constrained in scope compared to past expressive performance models [22] or many modern generative systems [23]. Rolypoly~ works against the bias towards reconstruction in generative AI [6] by targeting longer sequences, at varying tempo and meter, over repeated performances.

The system’s iterative workflow and its live feedback loop make systematic evaluation challenging [1]. However, they lend themselves well to a hands-on demo session.

For any music AI, one must consider the ethical dimension. A main benefit of this technology is that it is additive, not zero-sum. It is not designed (nor is it likely) to replace human drummers, but rather to transform existing drum sequencing software. Moreover, rolypoly~ embraces a "small-data mindset" [24]: after pretraining on a public dataset, the model adapts to bespoke, user-produced data. Still, a pretraining scheme on a big dataset of scores+performances is conceivable, which might improve the system's salience but also raise data ownership and privacy concerns similar to those around OpenAI's Jukebox [25] et al.

As we continue to develop such systems, we must also attend to issues of implicit bias, enforced patterns, unintended uses, restricted access and so on.

Current work

Rolypoly~ 2.0 is currently in a functional, alpha stage. The following work-in-progress features will be ready for testing in time for the demo session in August 2023:

  • a “free” generative mode, where the machine is allowed to stray away from the score. Preliminary tests indicate that simply decoupling the encoder is not sufficient to produce convincing output. Still, related work in generating continuations [15] and infilling [14] offers hope.

  • including alternative models such as the Seq2Seq model from version 1.0 [1], which appeared to perform better on the rhythm morphing exercise and could provide a useful benchmark.

Future work

Looking ahead, several research and development directions remain open.

Currently the system only allows for a binary choice on finetuning between runs. I am investigating solutions for more complex feedback and steering over different musical units and timescales [13][26].

Another intriguing development involves opening up the system to other sources of symbolic material, such as composition or improvisation agents [27], as inputs to rolypoly~ ’s encoder.

Moving forward, my main goals are a stable release and ports to Pure Data and possibly VST. These will enable a broader community to access and make music with rolypoly~, which has always been the driving goal of the project.

Ethical statement

This research has not received any funding. The conference participation is initiated through the Dept. of Animation & Interactivity, funded as research activity at UNATC "I.L. Caragiale". There are no conflicts of interest. The dataset used is open and publicly available. The software was developed on consumer hardware and trained using a single-GPU desktop machine. The source code and pretrained model are publicly shared, reproducible and extensible. Other ethical questions pertaining to the project are discussed above.

Technical rider

To present the interactive demo I plan to provide:

  • laptop PC w/ audio interface

  • a pair of headphones

  • a guitar and a microphone

Setup requirements:

  • a table w/ access to power

  • optionally, a pair of speakers (using 1/4 inch TRS cables)

Appendix 1: Data representation

The decoder processes vectors containing the following components:

  • 9 played velocities $\hat{v}$, for 9 drum categories as in [7]

  • 10 timing values:

    • 9 drum hit offsets (as fraction of bar, $\hat{o}$)

    • input audio onset timing prediction (as fraction of bar, $\hat{g}$)

  • 3 location features $Pos$, specified in the score:

    • local tempo, in BPM

    • local time signature, a float value (e.g. 3/4 and 6/8 resolve to 0.75)

    • beat phase, between 0 and 1: the relative position in a bar

Meanwhile, the encoder receives just 9 scored velocities and 3 location features, with no microtiming information available.
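The components listed above can be sketched as plain Python containers to make the feature counts concrete. The ordering of features within each vector is an assumption, and `Position`, `encoder_vector` and `decoder_vector` are hypothetical names, not identifiers from the rolypoly~ source.

```python
from dataclasses import dataclass

N_DRUMS = 9  # drum categories, as in the text


@dataclass
class Position:
    """The 3 location features read from the score."""
    tempo_bpm: float
    time_signature: float  # numerator / denominator, e.g. 3/4 -> 0.75
    beat_phase: float      # relative position in the bar, between 0 and 1

    def as_list(self):
        return [self.tempo_bpm, self.time_signature, self.beat_phase]


def encoder_vector(scored_velocities, pos):
    """9 scored velocities + 3 location features = 12 values."""
    if len(scored_velocities) != N_DRUMS:
        raise ValueError("expected one velocity per drum category")
    return list(scored_velocities) + pos.as_list()


def decoder_vector(played_velocities, drum_offsets, onset_prediction, pos):
    """9 velocities + 9 offsets + 1 onset prediction + 3 location = 22 values."""
    if len(played_velocities) != N_DRUMS or len(drum_offsets) != N_DRUMS:
        raise ValueError("expected one velocity and one offset per drum category")
    return (list(played_velocities) + list(drum_offsets)
            + [onset_prediction] + pos.as_list())


pos = Position(tempo_bpm=96.0, time_signature=6 / 8, beat_phase=0.5)
enc = encoder_vector([0.0] * N_DRUMS, pos)
dec = decoder_vector([0.0] * N_DRUMS, [0.0] * N_DRUMS, 0.0, pos)
```

Here `enc` has 12 entries and `dec` has 22, matching the counts given in this appendix.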

While this representation is more compact (22 features versus 27 in [7] et al, and no empty rows), the relative merits of different representations warrant further investigation. One comparative study of rhythmic representations [28] suggests that a basic 16-step grid would have been a better starting point, but it does not address the possible advantage of including location features in the vector.

Appendix 2: Alternative model

Example of a TorchScript module that naively nudges off-beat hits.


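import torch
from torch import nn

# `data` and `constants` are assumed to be helper modules from the rolypoly~
# framework; `offbeat` below is the helper called as `data.offbeat`.


class Swing(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, _, x):
        # x is indexed as (batch, step, feature); the first argument is unused
        for i in range(x.shape[1]):
            if data.offbeat(x[0, i, constants.INX_BAR_POS]):
                # if we're on an offbeat, nudge the note forward
                nudge = data.bartime_to_ms(0.05, x[0, i, :])
                x[0, i, constants.INX_TAU_D] = nudge
        return x


def offbeat(bartime: torch.Tensor) -> bool:
    """
    Check if a bar-relative time is on an off-beat.
        bartime = time to be checked
    """
    bartime = torch.as_tensor(bartime)
    for i in range(5):
        # positions within 0.05 of a quarter-bar point (0, 0.25, ..., 1) are on-beat
        if torch.isclose(bartime, torch.tensor(i / 4), atol=0.05):
            return False
    return True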
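To see which bar positions the check above treats as off-beats, here is a torch-free mirror of the same logic. It is only an illustrative sketch: `is_offbeat` and its `beats_per_bar` and `tol` parameters are hypothetical, and it uses a purely absolute tolerance.

```python
import math


def is_offbeat(bartime, beats_per_bar=4, tol=0.05):
    """Return True when bartime is not within tol of any beat position."""
    for i in range(beats_per_bar + 1):
        # Beat positions are 0, 1/4, 2/4, 3/4 and 1 for the default of 4 beats.
        if math.isclose(bartime, i / beats_per_bar, rel_tol=0.0, abs_tol=tol):
            return False
    return True


positions = [0.0, 0.125, 0.25, 0.375, 0.98]
flags = [is_offbeat(p) for p in positions]
```

The eighth-note positions 0.125 and 0.375 are flagged as off-beats, while 0.98 counts as on-beat because it lies within the tolerance of the next downbeat.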

Appendix 3: Training

The default model in rolypoly~ 2.0 is pretrained on the GMD in a similar fashion to [1], with the velocity info removed by averaging the non-zero hits for each drum category. Teacher forcing is used for the $Pos$ features fed into the decoder. After pretraining, the model is saved so that it can be run and further finetuned in Max.

After a performance is complete, the user has the option to send the train message to the Max object, which triggers a finetuning of the model. The current specifications are 20 epochs of a forward-backward pass of the latest data through the model, at a learning rate of 0.003 with the Adam optimizer.

The domain to be optimized is altered from the pretraining stage. Since we now have records of the predicted audio timings $\hat{g}$ and the actuals $g$, as well as the realised drum offsets $\hat{o}$ and velocities $\hat{v}$, we can formulate the loss as:

$$L = r + \mathrm{MSE}(\hat{v}, v) + \mathrm{MSE}(\hat{o}, g) + \mathrm{MSE}(\hat{g}, g),$$

where:

  • $r$ is a regularization term minimizing the squared distance from the average and standard deviation of $\hat{o}$ to those in the GMD

  • $v$ are the scored velocities, as fed into the encoder

  • each of the 4 terms has a user-adjustable weighting coefficient, used both for normalization and steering.
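The loss can be sketched numerically in plain Python. This is a simplified illustration under stated assumptions, not the training code: each sequence is flattened to one value per step, the GMD statistics `gmd_mean` and `gmd_std` are placeholder inputs, the population standard deviation is used, and `finetune_loss` and `weights` are hypothetical names.

```python
from statistics import fmean, pstdev


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return fmean([(p - t) ** 2 for p, t in zip(pred, target)])


def finetune_loss(v_hat, v, o_hat, g_hat, g, gmd_mean, gmd_std,
                  weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the regularizer r and the three MSE terms."""
    w_r, w_v, w_o, w_g = weights
    # r: squared distance of the offsets' mean and std from the reference stats.
    r = (fmean(o_hat) - gmd_mean) ** 2 + (pstdev(o_hat) - gmd_std) ** 2
    return (w_r * r
            + w_v * mse(v_hat, v)      # played vs scored velocities
            + w_o * mse(o_hat, g)      # drum offsets vs actual audio timings
            + w_g * mse(g_hat, g))     # predicted vs actual audio timings


# Toy two-step example with made-up values.
loss = finetune_loss(
    v_hat=[0.5, 0.7], v=[0.5, 0.5],
    o_hat=[0.0, 0.02], g_hat=[0.01, 0.03], g=[0.01, 0.01],
    gmd_mean=0.01, gmd_std=0.01,
)
```

In this toy case the regularizer vanishes and the three MSE terms contribute 0.02, 0.0001 and 0.0002, so the velocity term dominates until its weight is lowered.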
