This collaboration emerged out of informal conversation between the authors about improvisation. Ben-Tal is a composer/researcher who has been using Music Information Retrieval (MIR) techniques and AI as tools for composition. Dolan is a performer/improviser and researcher on improvisation, creativity and expressive performance with little knowledge of music technology. Dolan became intrigued but also highly sceptical about Ben-Tal’s ideas of musical dialogues between human and computer as a basis for co-creation. They agreed to meet and trial the possibility of real-time improvisation between piano and computer. By his own admission, Dolan came to this first session assuming he would prove the inadequacy of such a set-up for joint improvisation based on an extended tonal music idiom. He found himself equally surprised and alarmed when he experienced moments that felt, to him, like real dialogue with the machine. This proof-of-concept session provided the starting point for an ongoing collaboration: developing a unique duo-improvisation within the context of computationally creative tools, real-time interaction, tonal music and human-computer interaction. Central to this work are musical dialogues between Dolan on the piano and Ben-Tal’s computing system as they improvise together. These are surrounded and complemented by conversations between the authors about the system, about improvisation, composition, performance, music and AI.
This presentation starts with a description of the current improvisation set-up and the development process that allowed us to arrive at this stage. The following section re-enacts some of the conversations the authors engaged in, illuminating the learning and discovery process they underwent together. We end by drawing out important themes emerging from the musical and meta-musical conversations in relation to current debates around music and AI.
We describe the role of the computer in these performances as an artificial improviser, even though the term has no clear definition and the range of approaches used over the years is wide[1]. Some systems aim to work within a particular musical style[2]. Others operate as an extension of an improviser’s own practice[3]. This artificial improviser was developed by Ben-Tal with the explicit intention of creating a musical dialogue between an AI system and Dolan. Two fundamental design decisions informed its development. First, Ben-Tal is a composer: he did not want to develop a musical instrument. His preference was for a system that handles the moment-to-moment sound production automatically. His role – on stage – would be as a supervisor or executive: monitoring the system’s behaviour and influencing its responses, but without the need for constant action. The second important design decision was to try to minimise hard-coded assumptions about the input. Most improvisation systems include such assumptions, which impose restrictions on the human performer. Somax, for example, uses tempo when analysing the corpus and the audio during a performance[4]. This tempo can be adjusted, but relying on a regular tempo has strong implications for the operation of the system. The Reflexive Looper[5] and Shimon the robot[6] both need a lead sheet as a frame when accompanying a human performer. While Dolan improvises within the realm of expanded tonality, the basic conventions of tonality are not coded. The computer does not look for a regular pulse, major-minor chords, cadences, or other stylistic conventions.
The initial development phase of the system used recordings of Dolan improvising solo on the piano. Using these as simulations, Ben-Tal developed a modular system that uses machine listening techniques and offers a range of responses[7]. The system uses MIR tools to extract pitch information, rhythmic information, and some timbral features from the piano’s acoustic signal. The system does not try to fit rhythmic information into predefined categories (pulse; duple/triple meter; bars; phrases), which would require hard-coded assumptions. Instead, the Kernel Density Estimation (KDE) method[8] is used to construct a statistical distribution of inter-onset interval (IOI) values. The peaks of this distribution are used to estimate the prevalent rhythmic categories currently used by the pianist. This list is updated (by running KDE again and identifying peaks) if too many IOI values fall outside the current categories. Notably, this approach does not provide metric information such as bars, weak and strong beats, pulse and subdivisions, which means the system does not have access to such factors even though they are important in tonal music.
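To make the idea concrete, the following is a minimal offline sketch of this rhythm-category estimation, assuming IOI values are measured in seconds; the bandwidth, grid range, thresholds and function names are illustrative choices for this example, not the actual implementation.

```python
# Sketch: estimate prevalent rhythmic categories from inter-onset intervals (IOIs).
# Assumes IOIs are in seconds; grid range and thresholds are illustrative only.
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

def rhythmic_categories(iois, grid=None):
    """Return the IOI values at the peaks of a kernel density estimate."""
    iois = np.asarray(iois, dtype=float)
    if grid is None:
        grid = np.linspace(0.05, 2.0, 500)   # plausible IOI range in seconds
    density = gaussian_kde(iois)(grid)
    peaks, _ = find_peaks(density, prominence=0.05 * density.max())
    return grid[peaks]

def needs_update(iois, categories, tolerance=0.05, max_outliers=0.3):
    """Re-run the KDE if too many recent IOIs fall outside the current categories."""
    iois = np.asarray(iois, dtype=float)
    distances = np.min(np.abs(iois[:, None] - categories[None, :]), axis=1)
    return np.mean(distances > tolerance) > max_outliers

# Example: a stream of IOIs clustered around 0.25 s and 0.5 s
iois = np.concatenate([np.random.normal(0.25, 0.02, 50),
                       np.random.normal(0.5, 0.03, 30)])
cats = rhythmic_categories(iois)
print(cats)                                                   # peaks near 0.25 and 0.5
print(needs_update(np.random.normal(0.75, 0.02, 20), cats))   # likely True
```

In the live system the same idea would run incrementally on a rolling window of recent onsets rather than on a fixed batch.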
Chromagrams are used as a proxy for harmony – again an oversimplification compared to the deep knowledge of the language of tonal music that human musicians possess, but a simplification that avoids the need for complex analysis (difficult in real-time) and for hard-coded assumptions (e.g. look for triads; relate harmony to some putative tonic). The strongest n notes of the Chromagram are collected every second. This collection is the basis for generating much of the pitch material in the sounds the computer contributes to the dialogue. The system also uses the Prediction by Partial Matching (PPM) method[9] on spectral centroid data to estimate the predictability/surprise of the current event.
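As a rough illustration of the pitch-pool idea (not the actual real-time code), one could gather the n strongest chroma bins roughly once per second; librosa is assumed here purely for offline chroma extraction, and the function name is invented for the example.

```python
# Illustrative only: collect the n strongest chroma bins about once per second
# as a pool of pitch classes (0 = C ... 11 = B) for the generative modules.
import numpy as np
import librosa

def pitch_pool(y, sr, n=4):
    """Return, for each ~1 second frame, the n strongest pitch classes."""
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=sr)  # ~1 column per second
    pools = []
    for frame in chroma.T:
        strongest = np.argsort(frame)[-n:][::-1]   # indices of the n largest bins
        pools.append(strongest.tolist())
    return pools

# Example with a synthetic middle-C tone: the strongest bin should be pitch class 0 (C).
sr = 22050
y = librosa.tone(261.63, sr=sr, duration=3.0)
print(pitch_pool(y, sr, n=3))
```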
The sound production is handled by a growing collection of independent modules. In performance, Ben-Tal activates and deactivates these modules. At any given moment a number of them - usually 3-5 - will be active; in some moments only one module is operating, but as many as 8 have been used simultaneously in performance. Most modules are defined by (1) their use of the data extracted from the performer and (2) an internal musical-logic that defines their dynamic responses. Below is the outline of one such module, followed by an illustrative code sketch:
1. Take the list of rhythmic categories from the current estimate.
2. Append a rhythmic value 1.5 times the longest value.
3. Append a rhythmic value of 0.1 seconds (‘grace note’).
4. Randomly select a pitch from the n strongest bins in the Chromagram (according to the current value of n set by Ben-Tal).
5. Use a plucked-string physical model to synthesise a note with the selected pitch.
6. Use a weighted random choice from the rhythmic value list created in steps 1-3 for the IOI to the next note.
7. Repeat.
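A possible rendering of this outline in code might look as follows; this is an illustrative sketch rather than the actual module, with a simple Karplus-Strong algorithm standing in for the plucked-string model and invented weights for the rhythmic choice.

```python
# Illustrative sketch of the module outlined above (not the actual implementation).
# The rhythmic categories and chroma pool are assumed to come from the listening stage.
import numpy as np

def karplus_strong(freq, dur, sr=44100):
    """Simple Karplus-Strong synthesis, standing in for the plucked-string model."""
    n = max(2, int(sr / freq))
    buf = np.random.uniform(-1.0, 1.0, n)
    out = np.empty(int(sr * dur))
    for i in range(len(out)):
        out[i] = buf[i % n]
        buf[i % n] = 0.5 * (buf[i % n] + buf[(i + 1) % n])
    return out

def module_step(rhythm_categories, chroma_pool, n_strongest, rng, octave=4):
    """One pass through steps 1-7: synthesise a note and choose the IOI to the next one."""
    # Steps 1-3: extend the current rhythmic categories with a longer value and a 'grace note'.
    values = list(rhythm_categories)
    values.append(1.5 * max(values))
    values.append(0.1)
    # Step 4: pick a pitch class at random from the n strongest Chromagram bins.
    pitch_class = int(rng.choice(chroma_pool[:n_strongest]))
    midi = pitch_class + 12 * (octave + 1)          # e.g. pitch class 0, octave 4 -> MIDI 60
    freq = 440.0 * 2 ** ((midi - 69) / 12)
    # Step 5: synthesise the note.
    note = karplus_strong(freq, dur=0.5)
    # Step 6: weighted random choice of IOI (these weights are invented for the sketch).
    weights = np.ones(len(values))
    weights[-2:] = 0.5                              # make the added long/short values rarer
    ioi = float(rng.choice(values, p=weights / weights.sum()))
    return note, ioi

rng = np.random.default_rng(0)
note, ioi = module_step([0.25, 0.5, 1.0], chroma_pool=[0, 7, 4, 9], n_strongest=3, rng=rng)
print(len(note), round(ioi, 2))
```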
In this case both the rhythm and the pitch parameters are derived from the current piano context. The algorithm was shaped through an iterative process of listening to the output, in conjunction with the recorded improvisations, until Ben-Tal was satisfied with the result. For instance, initially the generated patterns lacked shape and sounded like idle noodling. The addition of the longer and shorter IOI categories created local gestures which added focus and shape. The timbral qualities of the synthesis, the weighting of the rhythmic value choice and the octave transposition from the Chromagram to the synthesis are similarly important factors that shape the result and were derived gradually through the process. In other words, the musical-logic of the algorithm encodes Ben-Tal’s compositional aesthetics in dialogue with Dolan’s musical idiom.
If we consider music as a combination of the ‘What’ dimension - compositional content - with the ‘How’ dimension - shaping and interpreting this content - improvisation is the fusion of both dimensions - in real-time. The ‘What’ dimension includes the creation of melody, harmony, gestures, phrases, texture, etc. The ‘How’ dimension encompasses the real-time choices regarding performance-related parameters (e.g. dynamics/intensities, durations/timing/rhythms [rubati], changes of timbre and elaborations of thematic elements). Fusing both dimensions at the same time requires drawing simultaneously from different pools of know-how, and dealing with long-term structures (phrases, sections, movements’ overall form) as well as short-term ones (motives, gestures). While performing improvisations, Dolan functions at several structural levels/layers simultaneously, including awareness of directionality (from departure points to goal-points) within larger time units and dealing with shorter-term expressive gestures. In his structural thinking he relates to deep structural pulses of longer durations (whole-bar or double-bar, for example) underlying the shorter time units and values above these inner pulses[10]. For example, he conceptualises deeper harmonic movements framing motivic gestures on the foreground level. These inner pulses function as a tactus, but instead of being fixed, they can grow longer or shorter according to the improvising performer’s needs in relation to what happens on the surface (the actual fully elaborated content performed).
The art of improvisation was an integral part of European music-making culture up until the end of the 19th century, both in performing repertoire (elaborated/embellished repeats of themes, Eingang and fermata points, cadenzas and interludes) and in improvising fantasies, preludes and other forms independently of specific repertoire works[11]. While this phenomenon was mostly applied in the context of solo performance, Dolan's method and practice place significant emphasis on ensemble improvisation in various styles and aesthetic contexts, including Baroque, Classical, Romantic, post-Romantic, extended tonality, and tonally-free stylistic languages.
Improvising within such a tonal, modal, or extended tonal idiom requires the performer to adhere to stylistic/aesthetic constraints. This has important implications for the recently established link between improvisation, characterised by spontaneous responses to the unexpected[12], and states of flow[13]. The spontaneous aspect of behaviour most of us know from daily life is one of the elements in the teaching and practice method developed by Dolan, which focuses on fusing real-time spontaneous creative decision-making with a deep assimilation of knowledge that allows for its application in real-time, while in the flow of performing the improvisation. This is described by Ben-Tal as an Improvisational State of Mind. When improvising with others, the challenge is twofold: maintaining one’s own flow, expression and coherence within the relevant aesthetic/stylistic references, while integrating the partners’ contributions. How could a mindless machine operate under these conditions?
(Two audio examples available online: example1, example2)
Ben-Tal (A1): How was the process of improvising with the computer different to developing improvisation with a new human performer?
Dolan (A2): Improvising, or indeed performing, with human musicians benefits from various extra-musical or quasi-musical cues: the way they breathe and move, body language, facial expressions, etc. The complete absence of such cues makes the fact that there is a meaningful musical exchange between me and that ‘thing’ (the computer) feel somewhat awkward or strange. Another major difference is the sonic landscape when the sounds I hear are all coming from the speakers, not from an acoustic source.
A1 This relates to one of the differences between your piano and my computer. The piano has an identity as a mechanical source for sound-making. It offers a limited range of sounds. In contrast, the computer has few limitations. The speakers, as a sound source, are one limitation, and maybe processing power to a degree. But everything else is down to my choices and my ability to imagine and implement sound generation processes.
A2 Indeed, but there is another dimension to this identity of the instrument, which emerges when I am improvising as part of an ensemble. With a string quartet, for example, I become a part of a piano quintet. I am aware of how the piano is part of this ensemble, and it is a slightly different instrument than when I am improvising solo or with another instrument like the trombone or the oud.
A1 It is interesting that you mention both a conventional ensemble like the piano quintet as well as improvising with an oud player. The long history of the former informs the improvisation and provides context for it – for the players as well as the audience. But with an oud player, you don’t have this context to draw on.
A2 That is true, and the main difference between these was in terms of musical language. With the oud player, we wanted to explore the possibility of our individual musical idioms together: a kind of intercultural exchange. So the challenge, for me as a pianist, was to dialogue with music grounded in classical Arabic art-music language. While with this ‘thing’ – your AI programme – the musical dialogue is flowing in one combined language. Which is what I found so surprising and somewhat uncanny.
A1 Can you explain what you mean?
A2 I hear and feel a dialogue, but I don’t know who exactly I am in dialogue with. And I don’t understand how this sense of dialogue is achieved. When I perform with others the shared experience - musical, expressive, emotional - is a large part of it. But here there is no shared experience. Yet there is, somehow, shared music making. Which is uncanny almost in the literal sense of the word: unknown and possibly unknowable.
A1 The system is designed for musical dialogue. It tries to listen to you and produce musical responses that are appropriate to the context. Appropriate is, of course, based on my aesthetic choice as a composer. So the system encodes an approach to musical listening as well as aspects of my compositional practice. At the same time, it was designed to work with you – I developed it with your improvisations in my mind and my ears. It is tailor-made. Which leads to another question: do you find that you need to listen to the system differently compared to human performers?
A2 It is difficult to pinpoint exactly how it is different, but it is. I noticed that there are fewer gradations or continuous shaping of tone/sound, timbre and dynamics generated by the computer compared to human musicians playing their instruments or singing.
A1 My system is modular with individual components which I activate and deactivate. Each activation brings in new sounds or new musical ideas. Is that what you are referring to? Or is it even with individual processes – with their distinct timbres – where there is less shaping of ‘notes’ and gestures?
A2 The “otherness” I am talking about relates also to individual processes, although the “entry” and “departure” of processes are obviously a significant part of the phenomenon. I think that when it comes to individual processes, the evolution of sound and the relations between how curves of intensity, timbre, pitch and durations unfold are different compared with acoustic instruments. At the same time, your system does not respond well to dramatic changes on my part, unlike a human partner who would recognise a drastic shift into a new section.
A1 My system ‘listens’ to the local context: the pitch content, the time intervals between onsets, some timbral aspects. It does not ‘listen’ to longer time-spans to identify these contours you describe. This problem of structure in generative music (more accurately, generating music with coherent or plausible structure over longer durations) is not unique to my system[14]. And automatic segmentation of music - including identifying these section boundaries - is another perennial challenge even when processing off-line[15]. The processes I designed do have local structure - I think you refer to it as directedness - a shaping of the sounds and the events generated into gestures or patterns with focus. But the larger scale shaping is done by me and I only have crude controls at the moment, not nuanced continual shaping.
A2 Perhaps related to that, compared with performing with human musicians, I find myself more surprised by what I hear coming out of the speakers and I am not sure it is just because of the more varied timbres. I worked with extraordinary and very versatile musicians but this AI system “manages” to surprise me more often.
A1 Is this part of what you find rewarding about this collaboration?
A2 In part. Initially it was curiosity – I generally can’t resist new openings. And I find some of the sounds you create inspiring. After a few sessions together, an additional rewarding aspect is that the good moments become better each time. You remember that at one point out of nowhere I started a fugato texture. This is not something we tried before and yet your system followed me and contributed an appropriate fourth part to the texture, which felt like magic.
A1 Yes, that was a wonderful and very unexpected moment. My impression was that even you didn’t know you would embark on such a Baroque-style prelude with complex three-part counterpoint. And yet the system followed you. First, some of the processes I implemented transform, in non-trivial ways, the sounds of the piano. And as your texture became louder and denser the transformed version also increased in similar ways thus matching you. But the synthesised responses - where new sounds are generated by the computer based on the data extracted from your performance - also worked. There is enough connection in the way I designed the machine listening part and the way it recombines this information to produce new, but related, material.
A2 At the same time I also find it difficult to understand why this dialogue sometimes works but at other times it doesn’t. As if I and that ‘thing’ are at cross purposes. And it goes back to this uncanny feeling. With other improvisers I can feel, in the moment, when one of us is not sufficiently attuned and I know what I can do to regain this tuning-in. But not with the AI.
A1 This does sound like the uncanny valley phenomenon[16] - we find it difficult to deal with an AI that appears too human in some respects while failing in others. Perhaps the relative success of the system in responding to you raises your expectations of it - expectations it sometimes fails to meet. In some ways working with this AI is neither like working with another highly experienced and proficient human improviser, nor is it like the relationship you have with your students. It has elements of both and elements that are unique to the system.
A2 In some ways it is like a student in that I notice improvement from session to session and I wonder how much is down to you fine-tuning the code between sessions, how much is down to what you do live in the performance, and to what extent I am learning - not necessarily in a conscious way - to adapt to the patterns that the computer creates.
A1 I do make small changes in the code between our sessions based on what happens in them. It could be about the sound qualities and it could be about the way the local processes work - what I described as the musical-logic programmed algorithmically. And our discussion about how you construct your improvisation or how you conceptualise the music informs those changes. Not that I am trying to mimic your approach but I try to find ways of integrating the AI even better. But another element of this is the way we listen to each other. You learn to listen and respond to what you hear and I, in turn, learn how the computer responds to your input and how these sound together. I will give you an example. One of the modules I designed grabs a snippet of your sound, and then starts a short process of repetition and transformation of it. And it has been waiting there in the code and I rarely used it because I didn’t actually like the way it sounded. And then you were playing a motif of staccato notes and I realised this should work together. What I heard coming out when I turned this process on at that moment was nothing like what I heard when I developed this code but it worked. Sonically and musically. So it surprised me too though, when I turned this process on at that particular moment, I was expecting a good surprise.
A2 Am I right in saying that you are learning to perform your instrument - the computer - during this process?
A1 I am learning what the system is capable of doing, but it isn’t a musical instrument: I do not shape sounds and material. I off-load those moment-to-moment decisions to the computational process. I learn how they sound in relation to what you do, and develop them (between sessions; I don’t alter the code live in performance) in relation to what I hear and what I would like to hear.
A2 In other words, this performance is a trio not a duo: you, me and your computer.
A1 In a sense, yes. But it is an unequal trio. The computer and myself are partners in the creative process. It is a co-creative system where I have overall executive control. And while the sounds we contribute are more varied than yours, I think your contribution carries more musical weight. What I mean is that you mostly lead, even when you take inspiration from the system. You also provide the context within which we hear the electronic sounds and judge whether the dialogue works or not.
A2 An important question is what our listeners get from our duo improvisation performances in terms of emotional and aesthetic engagement. How do they respond to the extra-human aspect of it? Since such research exists in the context of ensemble improvisation performed by humans, it will be fascinating to compare. It begs for an audience research component, don’t you think?
A1 It is a good question and is related to my presence at the computer on stage. We can tell the audience that my computer is not a musical instrument but what they see are two humans, each working on a machine. So your (and my own) understanding of the extra-human aspect could be different to the audience’s. So if we want to study what is taking place we need to consider very carefully how we present the work and what to measure.
A2 I am wondering: where is Ben-Tal in this improvisation? In what ways would it be different if another composer created such an artificial improviser?
A1 I embed my aesthetics and compositional ideas into the code. The code is an extension of me as a composer and it enacts what I want to hear in conjunction with your performance. The system tries to listen to you, and the way it ‘listens’ reflects my priorities as a composer, heavily constrained by the limitations of MIR techniques and my ability as a programmer. More significantly, the responses the system generates use the data that comes out of this listening process, but recombine this data in ways that reflect my ideas as a composer. Finally, during the performance I adjust aspects of the system in real-time based on what I would like to hear. The system is modular - I have about a dozen independent processes that I can turn on or off. Each one takes something from you and uses it in particular ways. I listen to the current music and imagine what adding or removing a module would do. Or I imagine the outcome of any adjustments to the parameters I have control over. For example, for one of the modules I can change the register (octave transposition) of the resulting sounds. I can decide to place those sounds in the same register I hear from you or shift them to a different one. Of course, I can’t predict whether you will continue in a similar manner or decide to change register in response.
A2 Can you explain what you mean that the code reflects your ideas as a composer?
A1 My introduction to computer music was through algorithmic composition - finding ways of enacting the sounds, materials, processes and time-structures I was after through lines of code. This is an iterative process: run code, listen to the result, change the code to improve the result. After a few iterations, hearing fatigue makes it difficult to listen critically and evaluate whether the outcome is better. Another challenge is that for every sound you want to create you have to specify many parameters. Depending on which technique you use for the synthesis, you might need dozens of parameters for each sound. My solution to both these challenges was to include constrained randomness in the generation. I would directly specify some parameters - those that are important in this particular moment. Others would be chosen randomly each time I evaluate the code, though the choice would be constrained. The result would be slightly different each time, allowing me to listen critically. And I would adjust the parameters until the result was almost always satisfactory. This approach is still there in this AI system. The responses it generates are musically appropriate and predictable on a statistical level. The precise events are not predictable, but the overall effect is. And the system is able to both follow your lead and surprise you at times because it is computationally stable, the machine listening embedded in it makes it responsive, and it encodes musical ideas that you find engaging.
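As a toy illustration of this constrained-randomness idea (the parameter names and ranges below are invented for the example, not taken from the actual system):

```python
# Toy illustration of constrained randomness: a few parameters are fixed by hand,
# the rest are drawn at random within hand-set bounds, so each evaluation produces
# a slightly different result that stays within the intended character.
import random

FIXED = {"pitch_midi": 62, "duration_s": 1.5}      # parameters that matter right now
CONSTRAINTS = {                                     # the rest: bounded randomness
    "attack_s":   (0.005, 0.05),
    "brightness": (0.2, 0.6),
    "vibrato_hz": (4.0, 7.0),
}

def draw_parameters():
    params = dict(FIXED)
    for name, (low, high) in CONSTRAINTS.items():
        params[name] = random.uniform(low, high)
    return params

for _ in range(3):   # each run differs in detail but not in overall character
    print(draw_parameters())
```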
“Developing computer systems that collaborate in musical improvisation is a complex and challenging task due to the spontaneous and immediate nature of the musical dialogue required.”[17] In this paper, we describe the first stages of a successful development of such a system. The system is specific and tailor-made: it was designed to work with a specific human performer on a specific instrument (the piano). It also encodes specific compositional ideas, encompassing the type of material generated, the sonic properties, as well as the nature of the dialogue with the human improviser. Improvisation is a powerful test-bed for music AI because it is such a complex and challenging task, but also because it encompasses many of the key components of musicality: listening, discerning, imagining.
Our system learns from the human performer, but does not employ the latest machine learning techniques with their reliance on extensive data and high-end computational power. Instead, the system deploys older approaches to AI (e.g. rule-based systems; Markov-based models) in inventive ways. The learning aspect of the system is based on machine listening coupled with musically sophisticated uses of the data extracted - live and in real-time from the performer - to generate material that is both novel and relevant to the musical context. In other words, while we see impressive improvements in the capabilities of AI systems through the combination of large data-sets and ever more powerful computing, other approaches have not been exhaustively explored. And these approaches have the added bonus of being explainable and transparent: unlike with black-box models, Ben-Tal is able to modify the system to implement his musical ideas.
This AI system analyses data and then generates new data based on that analysis. However, while most recent music machine learning research invests the most effort - in terms of time, computing power, data acquisition and innovation - into the learning (i.e. analysis) part, Ben-Tal’s approach is the opposite: he invests time, creativity and thought in the music generation side of the pipeline. Prioritising the learning phase is based on an implicit assumption that constructing better models will yield better outputs. However, when the definition of a successful model is not clear - as is the case for creative AI - this might not be the best route[18].
Our research benefited from the integrated and collaborative development process. The AI system was (and still is) constantly tested on the basis of music-making. We trial the system during regular rehearsals where both authors hear the combined results (we also recorded some of the rehearsals, allowing for repeated listening) and discuss them. We also invited friends and colleagues to some rehearsals and performed in front of audiences in concert situations. In addition, the fact that Dolan came into this process with little knowledge of AI or music technology required Ben-Tal to find ways of explaining what the system does and how it does it in musical and non-technical terms. We hope to see more research in AI music creativity that integrates the musical tasks and musical questions as part of the research methodology in similar ways.
Finally, our system benefited from a slow development process with gradual improvements and refinement. As Professor Sageev Oore - Canada CIFAR AI Chair and visiting research scientist at Google Brain - highlighted[19], successful music-making tools - such as the violin or the piano - were developed slowly over time. Both performers and composers contributed directly and indirectly to this development process. Therefore, it is unrealistic to expect music AI technology to emerge fully formed from a lab. We would hope to see a more sustained effort to involve musicians actively in the development process and to allow time for mastering the creative potential of music AI applications.
Two important aspects of improvisation are missing from the AI at the moment. One is the ability to consider the larger musical context - what we might describe as narrative or structure. This relates to both the listening and the generation parts of the system. We want to expand the machine listening beyond the local context and try to extract musical trajectories and identify sections and transitions. This should involve integration of different musical parameters, since harmony, melody, tempo and texture are often interleaved in defining these larger-scale aspects. On the generative side, the system should, like a good improviser, be able to generate a plan for the longer term while updating it constantly based on the developing context.
The second aspect still missing - which also relates to both the listening and generating sides - is nuance. A more nuanced listening to the performer would aim to identify expressive intent achieved through means such as articulation, prosody or rubato. But, in the absence of a score to provide a framework, there is no obvious way of identifying these computationally. Applying such performance nuance to the generated material is a less daunting task technically, but the question of what would be the machine equivalent of expressive intent remains open, though it could potentially be linked to the previous aspect of musical trajectories or structure.
Finally, we want to add the dimension of audiences’ evaluation of and responses to the musical dialogue on stage. To what extent does the audience share our perception of musical dialogue between human and AI? Is this perception germane to the audiences’ engagement with the music? If we want AI technology to become part of the future of music, we should consider its development in relation to both musicians and audiences.