This paper introduces an AI-driven, animated, interactive musician system. Current AI-driven systems that interact with a human musician often lack visual representation or require building robots to play acoustic instruments. Our system provides an infrastructure for creating custom interactive musicians on a personal computer. Throughout the paper we introduce the system pipeline, our design process, feedback from semi-structured interviews with musicians, and how we addressed that feedback to inform improvements and limitations for interaction. Findings suggest that many musicians desire interactive AI-driven improvisation systems that are explicitly controlled by the player, rather than free-form musical improvisation with autonomous virtual musicians. We share the interactive and animated AI musician system for public use and collective improvement, complete with an associated repository.
Music systems powered by artificial intelligence (AI) have emerged as a promising area of research in the field of music technology. These systems are capable of generating music in real-time, adapting to different musical styles, and even improvising in response to human input. However, many AI-driven music systems lack a visual representation of the performance, which limits their potential usefulness for musicians and audiences alike.
Visual representations during AI-driven musical performances can provide valuable insights into performance expression, enhancing our understanding and appreciation of the music. They can provide a new means for musicians to collaborate and explore new musical ideas, and can also provide valuable tools for music educators to teach and engage students. Moreover, the lack of visual feedback can make it challenging for musicians to interact with and control AI-driven music systems, and may hinder their creative potential.
Building robots as visual representations of AI-generated music can be costly and technically complex. As an alternative, some researchers are exploring the development of virtual visual representations that can provide a more accessible and cost-effective means of visualizing AI-generated music.
In this paper, we survey the state of the art in AI musicianship, examining the key challenges, limitations, and opportunities in this field. We discuss a virtual musician (VM) system that we built using a laptop, open source software, and Ableton Live 11 + Max for Live. We believe it is important to incorporate feedback from domain experts to inspire new directions and start to understand the unexplored potential of Human-AI collaborative systems. Therefore, we report on semi-structured interviews conducted with musicians and outline how we incorporated feedback into our final system, as well as limitations we encountered.
The development of AI musicianship is a growing field that has garnered significant attention in recent years. In this section, we provide an overview of the existing works in the field of AI musicianship, highlighting the key contributions and limitations in the areas of real-time robotic and virtual musicianship.
Recent advancements in music technology have led to the development of AI-driven systems that can assist in musical performance and composition. This section discusses studies that explore the use of machine learning techniques in musical performance and composition.
Several systems have been developed to assist real-time performance specifically. Notably, Smailis et al. reflect on the musicality of machine learning-based music generators in real-time jazz improvisation. The authors use the OMax-ImproteK-Djazz system, which uses machine learning to generate improvisational music in response to live jazz performances. Similarly, Trump et al. present the "Spirio Sessions," which explore human-machine improvisation with a digital player piano.
Other systems have focused on accompanying players in a wider range of musical styles. Haki et al. present a real-time drum accompaniment system that uses a transformer architecture and a diverse database of recorded drum beats. The system is designed to respond to a live musician's playing by generating complementary drum patterns in real-time. The authors use machine learning techniques to analyze and learn from a database of drum patterns, and the system is able to adapt to the specific nuances of the live performance. The study demonstrates the potential for AI-driven systems to create responsive and dynamic musical accompaniment.
Google Magenta's AI Duet is another example of an AI-driven musical performance system. The system uses machine learning to generate a musical response to a user's input in real-time. Users can play a melody on a MIDI keyboard and the AI system will generate a musical accompaniment that responds to the user's input. The Magenta team encourages players to train the model on custom datasets and has built custom libraries to support the exploration of various musical styles and instruments.
Overall, these studies demonstrate the potential of AI-driven systems to augment human musical performance and composition. The development of such systems opens up new avenues for musical exploration and creativity, and suggests exciting possibilities for the future of music technology. The use of visual representations could provide additional feedback to the player on the output generated by the AI system. For example, a VM could display visual cues that indicate which notes are being played or which instruments are being used.
This could help players better understand the output generated by the AI system, and make it easier to identify potential issues or areas for improvement. The use of visual representations such as VMs can potentially enhance the usability, control, and engagement of AI-driven musical systems, ultimately leading to more creative and innovative musical tools and performances.
Musical and automated robots have a long history of use in musical performance. However, only within the last few decades have researchers focused on creating adaptive, improvisational robots that display musicianship.
One of the earliest works in the field of robotic musicianship was conducted by Weinberg et al. and Eigenfeldt et al., who independently introduced robotic drumming systems that used sensors and actuators to control a robotic drummer. The authors showed that the system was capable of playing complex rhythms and adapting to different musical styles. However, the system was limited in its ability to generate music and was only able to improvise with human input in a call-and-response manner.
Another notable work in this field is Barton’s HARMI (Human and Robotic Improvisation) system, a software and hardware system that generates music and responds to human input in real-time. The authors showed that the system was able to generate high-quality music that was comparable to human-played pieces, and that it was able to respond to human input in real-time.
More recently, a number of works have focused on developing robotic musicianship systems that are capable of playing a wider range of instruments and generating music in real-time. For example, Hoffman proposed a robotic marimba player capable of call and response, overlay improvisation, and phrase-matching. The authors showed that the system was able to generate accompaniments that were consistent with the human player and had a high degree of musical coherence. Though a high degree of musicality has been demonstrated, and notable work has gone into expressive gesturing with the robots, there is still progress to be made in making the robots more expressive. This expressivity is also directly limited by the form and capabilities of the robotic movement.
Overall, the development of robotic musicianship has seen a great deal of progress in recent years, with numerous works aimed at improving the capabilities of these systems and making them more accessible to a wider audience. However, there is still much work to be done to make them more user-friendly and accessible, a role VMs are poised to fill more effectively.
Early works in the field of virtual musicianship began by investigating animations that represent musical action. One of the earliest works in this field was conducted by Kragtwijk et al., who introduced a virtual drummer system that used computer graphics and animation techniques to simulate the performance of a human drummer. The authors showed that the system was capable of animating a virtual drummer based on audio events played by a human drummer. However, the system did not generate music; it simply animated a virtual avatar based on musical information.
More recently, a number of works have focused on developing virtual AI-driven musicians that are capable of playing a wider range of instruments and generating music in real-time. The authors showed that these systems were able to generate high-quality music using deep learning techniques that was comparable to human-played pieces, and that they were able to respond to human input in real-time. McCormack et al. pioneered the use of facial expressions for an improvising AI, providing feedback via an emoticon that expressed confidence between performer and AI. However, many of these systems lack visual representations of a VM, limiting their ability to communicate through gesture and other forms of visual communication.
In an effort to address the limitations of early VMs, several works have focused on developing VMs that are more expressive. For example, Schmitz proposed a performance, UnStumn - Artificial Liveness, that focuses on communication between human and machine actors using an AI-driven VM and an AI video artist in extended reality. Though the VM was not anthropomorphic, the interaction between agents was visualized through virtual artistic renderings. Similarly, performances by Saffiotti, Thorn, and others have shown that human-AI collaborative musical performances occur in much the same way as robotic musicianship, but need not be limited by mechanical components.
Overall, the development of VMs has seen a great deal of progress in recent years, with numerous works aimed at improving the realism and expressiveness of these systems, which has proven to be a major challenge. However open-ended and challenging, the use of visual representations in AI music such as VMs has the potential to enhance the usability and effectiveness of AI-driven musicianship systems, and could help address some of the limitations of current approaches. Still, much work remains to improve the musical capabilities of these systems and to make them even more user-friendly. Our proposed system provides an open-source framework that can be customized and improved upon in various ways to support the musicality and expressiveness of VMs.
This section describes the software components of our AI musician system, how they are connected, and how a player can interact with the VM. The repository and setup instructions for the system can be found in the appendix.
Music generation is accomplished by recording and saving files to the local machine and continuously using the updated files as input for trained neural network models. We did this in real-time using Ableton Live 11 with a custom Max for Live instrument. The output from the model is then read and played via virtual ports. Musicians could use any MIDI instrument of their choice for input; we tested with Jamstik Studio MIDI guitars, AKM322 MIDI keyboards, and a KAT KTMP1 MIDI drum.
We leveraged models created by the Google Magenta team to minimize the complexity of the system. This enables people at almost any level of software development experience to use complex and effective neural network models to generate music based on music databases of their choice.
Any model created by the Google Magenta team can be used to create the AI-driven interactions discussed in this paper. We leveraged several models in this pipeline to create reactive guitar, piano, drums, percussion, and bass players. Most models use a Long Short-Term Memory (LSTM) recurrent neural network (RNN) architecture to generate music. The drum model, for example, was trained using the Groove Database, which consists of 13.6 hours of drum music in MIDI, curated by the Google Magenta team. Other models, MelodyRNN and PolyphonyRNN, were used for guitar, bass, or piano and were trained on sample databases provided by Magenta for simplicity.
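As a concrete illustration, generation with a pre-trained Magenta model can be driven through its command-line tools. The sketch below assembles such a call for the drums model; the flag names follow Magenta's generator scripts, but the bundle path, config name, and step count here are placeholder assumptions for illustration, not the exact values used in our pipeline.

```python
def build_generate_command(primer_midi, output_dir,
                           bundle_file="drum_kit_rnn.mag",
                           num_steps=128):
    """Assemble an argument list for Magenta's drums_rnn_generate tool.

    primer_midi is the player's recorded MIDI file; output_dir is where
    the generated response is written. Paths are placeholders.
    """
    return [
        "drums_rnn_generate",
        "--config=drum_kit",
        f"--bundle_file={bundle_file}",
        f"--primer_midi={primer_midi}",
        f"--output_dir={output_dir}",
        "--num_outputs=1",
        f"--num_steps={num_steps}",
    ]

# The pipeline's first script could then invoke it, e.g.:
# import subprocess
# subprocess.run(build_generate_command("recorded.mid", "generated/"), check=True)
```

Keeping generation behind a plain CLI call is what lets the same pipeline swap in MelodyRNN or PolyphonyRNN without touching the surrounding scripts.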
Two Python scripts are implemented in the pipeline: 1) one to manage the continual generation of new music, and 2) one to play the generated music through a virtual port. The first script gathers a recorded MIDI file created by the player and passes it to a pre-trained RNN model, which renders novel MIDI music and saves it to the local machine. The second script uses a Python package called mido (already present in the Google Magenta environment) to read the MIDI file and send it through a virtual MIDI port.
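A minimal sketch of the second script's core logic follows, assuming a watched output directory and a virtual port name of our choosing (both hypothetical); the mido calls mirror the package's documented API.

```python
import os

def newest_midi(directory):
    """Return the most recently modified .mid file in a directory,
    or None if the directory holds no MIDI files yet."""
    midis = [os.path.join(directory, f) for f in os.listdir(directory)
             if f.lower().endswith(".mid")]
    return max(midis, key=os.path.getmtime) if midis else None

def play_file(path, port_name="VM Out"):
    """Stream a generated MIDI file through a virtual port with mido.

    The port name is a placeholder; it must match the virtual MIDI
    port the DAW track is listening on.
    """
    import mido  # available in the Magenta environment
    with mido.open_output(port_name, virtual=True) as port:
        # MidiFile.play() yields messages in real time, sleeping
        # between them according to the file's tempo map.
        for msg in mido.MidiFile(path).play():
            port.send(msg)
```

In use, the script would poll `newest_midi("generated/")` in a loop and hand any fresh file to `play_file`, so the VM always voices the latest model output.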
All audio is managed by Ableton Live 11, a widely used digital audio workstation (DAW). Managing audio through a DAW such as Ableton Live 11 has several benefits.
One such benefit is that DAWs can play and record through virtual ports, reducing the number of scripts that need to be run to play or record audio. Virtual ports also allow the data to be circulated internally, enabling further customization using Max patches or other software.
Another is the ease of customizing tracks and associated sound software. Two tracks in Ableton are used to manage the input and output of the AI-driven musician. One track uses a Max patch that records the incoming MIDI information from the player. The second track holds drum (or other instrument) sounds and receives the generated MIDI output through a virtual MIDI port (see Figure 1).
Animation of a VM is accomplished using Unity (v2020.3.16f1). Characters were chosen from the Adobe Mixamo database, and were animated using the Animation Rigging package provided by Unity (see Figure 2, 3, and 4). The animations are triggered by incoming OSC messages that correspond to the MIDI notes being generated in real time.
MIDI information in a virtual MIDI port is converted to OSC messages using ofxOscMidi, free software developed by Andreas Fischer that can be downloaded from his GitHub page. The OSC messages are sent via the localhost IP address (127.0.0.1) to Unity, which receives the messages and triggers an animation.
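For readers without openFrameworks, the same MIDI-to-OSC bridge can be approximated in Python. The address scheme below is an illustrative assumption (ofxOscMidi's actual scheme may differ); Unity only needs the sender and receiver to agree on one convention.

```python
def midi_to_osc(status, note, velocity):
    """Map a raw MIDI note message to an (address, arguments) pair.

    A note-on with velocity 0 is treated as a note-off, per MIDI
    convention. The "/midi/..." address layout is our own choice.
    """
    is_on = (status & 0xF0) == 0x90 and velocity > 0
    kind = "noteon" if is_on else "noteoff"
    channel = status & 0x0F  # low nibble carries the MIDI channel
    return (f"/midi/{kind}/{channel}", [note, velocity])

# Sending to Unity on localhost could then look like this
# (assuming the python-osc package):
# from pythonosc.udp_client import SimpleUDPClient
# client = SimpleUDPClient("127.0.0.1", 9000)  # port is arbitrary
# address, args = midi_to_osc(0x99, 38, 100)   # channel-10 snare hit
# client.send_message(address, args)
```

On the Unity side, an OSC receiver matching on the same address prefix can then fire the corresponding animation trigger.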
Specific animations depend on the instrument being played by the VM. For drum animations, an arm motion is triggered, while idle animations include slight swaying of the body and natural-looking idle head motion. More complex movement, such as guitar fingering and picking, poses a significant challenge for animation. We are interested in exploring more advanced methods for nuanced animation control in future work, and we are currently working towards a machine learning algorithm dedicated to guiding the motion of the avatar.
There are two main interaction paradigms made possible by the system. The first is a continuously listening VM. Continual listening, or free-form improvisation, is the most utilized interaction technique in intelligent interaction to date. This strategy continuously records player input and continuously generates output based on the music produced by the player. Before we interviewed musicians, this was the only interaction paradigm implemented. The second was implemented to address the desires of those interviewed.
The second main interaction paradigm entails the player dictating to the VM when to begin and stop listening. To accomplish this, players were given a foot pedal (iKKEGOL USB Triple Foot Switch, model FS2020U1IR) with "start" and "stop" buttons programmed to provide key commands (see Figure 5). Players are instructed to press "start" when the VM should begin listening and "stop" when it should stop. A "play" button is included to ensure the VM begins playing (or stops playing) when the human player desires.
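The pedal logic amounts to a small state machine. The sketch below is illustrative: the button labels mirror the pedal described above, while the mechanism delivering key presses to the handler (e.g., a keyboard hook) is left out.

```python
class PedalControl:
    """Minimal state machine for the start/stop/play pedal commands."""

    def __init__(self):
        self.listening = False  # is the VM recording the player?
        self.playing = False    # is the VM currently playing back?

    def press(self, button):
        """Handle one pedal press and return the resulting state."""
        if button == "start":
            self.listening = True       # begin recording player input
        elif button == "stop":
            self.listening = False      # stop recording; generation can run
        elif button == "play":
            self.playing = not self.playing  # toggle VM playback
        return self.listening, self.playing
```

Keeping this state explicit is what gives the player the sense of control the interviews called for: the VM only ever learns from material the player bracketed with "start" and "stop".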
Domain experts (average years playing instrument = 20.2 years; 8 Male, 2 Female; Age: M=38, SD=15.8; Professional = 5, Non-Professional = 5) were interviewed using a semi-structured interview approach. Interviews were conducted in three steps: 1) discussion of their personal musical practice, 2) interaction with a VM (continuous listening), and 3) speculation on how the use of AI may aid in their musical practice. This strategy enabled the interviewers to gain insight into the musical practice of the players before demonstrating the VM technology and gave context to how players could utilize AI-driven technology in their personal practice. The participants were informed that the objective of the study was to gain insight into their overall attitudes and preferences. They were also advised that there were no right or wrong answers, and that the researchers were interested in obtaining their authentic opinions and reflections.
Ten musicians were interviewed independently for approximately two hours each. All participants were given the option to end the interview whenever they desired. Following the three-step interview process, the interviewer asked three basic questions to guide discussion: 1) describe your musical practice, 2) please provide your thoughts on the VM system, and 3) how could AI be used to improve your musical practice? All interviewed musicians either play in a band, teach music professionally, or maintain a consistent personal musical practice (several times per week).
Interviews were recorded and transcribed for post-interview thematic analysis. We calculated and sorted the most used words and phrases by frequency to inform our decisions for selecting general themes. We then located the words and phrases within the responses for context and decided on main themes that adequately described all of the written content. We applied the themes and clustered the data for the following discussions.
Several insights were gleaned from interviewing practicing musicians that largely depended on their level of play.
Non-professional practicing musicians expressed interest in tools that analyze their playing, for example, by letting them know when they are off beat. One player stated, "I think I am often behind the beat, so it’s really nice to have some feedback from another musician or even a computer program".
Professional musicians were mostly interested in musical accompaniment. They suggested practice techniques that included playing at faster tempos with a backing track (pre-recorded instrumental track) or having accompaniment that they could control to challenge their style and abilities on the instrument. This was true for practice, composition, and musical improvisation.
Professional musicians also indicated using a DAW to aid in composition, which enabled them to program musical parts they themselves could not play. Short (one minute or less) compositions were also indicated as a consistent musical practice for creating ideas without the pressure of a full composition.
The non-professional musicians raised the desire for someone to play with while practicing. The social aspect of music was highly motivating for players who did not play with others regularly. Therefore, practicing with others or using technology to accomplish common musical practices that one could not otherwise do alone (e.g., playing chord progressions or generating new musical ideas) was of interest. Personified representations, though not as satisfying as playing with another human, presented an opportunity for enhancing engagement and co-presence with the VM.
Both professional and non-professional musicians encountered several issues when using the VM system.
The most frequent issue found when using the VM system was not knowing when the VM was listening and deriving information from the musician. This finding occurred only during the continuous free-form play interaction paradigm, in which the continuous listening provided players with no feedback. One player asked, "is it responding to me?". Others suggested that it was likely listening and responding to them, but stated that it was unclear: "It sounds like it is listening to me! But, I’m not sure when". This seemed to arise from a lack of visual feedback to the player: the audio-reactive VM animations did not indicate when the VM was listening or what exactly it was responding to. This might have been remedied by including obvious repeated musical phrases learned from the player or a form of visual communication. We also see this as a shortcoming of the current animations, and further investigation into the types of animation and communicative information present in gesture will be conducted in future research.
Other usability issues concerned the timing of the VM. One participant claimed, "I think it’s coming in off-time, like behind or ahead of the beat when I’m playing". This concern stemmed from the time it took to generate the music, as well as slight variations in the human player's tempo after the VM had begun generating music.
Additionally, one of the professional musicians was not interested in additional visuals at all and desired an audio-only experience. However, the musician felt that a more expressive VM would have been more engaging and could provide additional information in the presence of multiple VMs. For example, if two VMs were used, being able to tell which was playing by the motion or instrument of the VM would be helpful for controlling the VMs explicitly. It is worth noting that this musician was more interested in controlling the VM through pedals or a UI to practice improvisation or aid in composition, a trend that emerged repeatedly when discussing the incorporation of AI into musical practice.
Other musicians desired the use of a VM representation, suggesting it is “interesting technologically and helps me feel as though I’m playing with others…like more connected and invested in playing”. This helped some non-professional musicians feel more motivated to use the system and suggested it could be fun to use with others. These and other musicians commented that the current animations suffice for interaction but desired improvements to enhance engagement.
Several players were enthusiastic about incorporating AI-driven technologies into their musical practice. The vast majority of players suggested using the AI to generate backing tracks to accompany them, rather than having an AI partner to play with. One player illustrated this by saying, "I don’t know that I need another jam partner taking solos. I think I’d rather have an AI-driven jam partner that pushes me musically, but only when I want or need it". This theme became clear as most players reported spending their practice time playing complex melodies (relative to their skill level) rather than perfecting their ability to support other players.
It was also clear that players, regardless of skill level, wanted an aid to practice or composing when using AI. One musician stated, "I can’t play every instrument, and I definitely can’t play every instrument at once. So, it would be nice to have AI that could match my style and play really cool supporting parts". This particular musician was reflecting on the composition process and made it clear that it is a hassle to come up with parts for instruments they don’t practice or play while composing using a DAW.
Almost all musicians wanted to use the VM’s intelligent capabilities as practice and compositional aids. Non-professional musicians were also enthusiastic about the potential for engagement and the beneficial social aspect of personified representations. Many suggested using VMs to feel less alone when practicing, to be politely corrected (without fear of social recourse or embarrassment), and to enhance practice sessions between friends by incorporating additional instrumentation in a natural and engaging way.
After reflecting on feedback from the musicians, several changes were made to the system. Namely, we were interested in fixing two critical usability issues: 1) ensuring players felt the VM was responding to them, and 2) addressing musical timing issues (see Figure 6). We then began addressing players' desire for more explicit control of the AI for the purposes of training or composing.
To address the usability issues, we began by creating a foot-pedal-controlled AI. This ensured that players felt the VM was listening to them because they were explicitly controlling it. Players pressed "start" and "stop" buttons on a foot pedal to control which part of their playing they wanted the VM to respond to.
We saw an opportunity to use animations to mitigate uncertainty about whether the VM was listening. Explicit cues from the VM, such as indicating that it is about to play or stop (e.g., by lifting its instrument), or that it is listening to the beat (e.g., by looking at the player and leaning in), could be implemented to address this.
In a follow-up showcase, explicit control of the VM using the foot pedal alleviated much of the uncertainty about whether the VM was listening. The errors associated with tempo were also greatly reduced after this was implemented; however, a more formal follow-up study using the pedal is needed.
This additionally complements findings in themes I, II, and III (Practice/Education Tools, Social Practice, and Confirmation of Listening, respectively), where strategies can be implemented to enhance visual feedback for the player. For example, some players suggested gestural indicators such as foot tapping or head nodding, or other UI features, to indicate timing errors and goals for improvement.
This was also recognized by those who felt more engaged by the use of a VM. We see further opportunities for interfaces (e.g., foot pedals) and visual cues (e.g., VM animations) to address the lack of communication between the player and the VM.
There are several inherent limitations in the proposed AI system. The most prohibitive surround timing. The processes underpinning the AI take time to compute, presumably due to conversion of the MIDI data input to the system. This is an issue in many real-time systems, as many need to listen and then respond, limiting the amount of attention that can be represented in the model and the type of system that can be implemented.
Shortening the length of the output made music generation faster; however, substantial work can still be done to make this process more efficient. Because the system uses the Google Magenta platform to convert and process the information, and that platform was not optimized for speed, this is an inherent limitation of the system. The issue could be mitigated by using another method of music generation, but doing so may compromise the accessibility of the system as a whole. Future work will prioritize custom models optimized for speed.
Other timing issues are caused by differences in tempo over time. This can be fixed in future iterations by making the tempo dynamic, which can be achieved in several ways; the easiest is likely handling it in a Max for Live patch, within the DAW itself, or perhaps in a separate module built to further control timing.
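As a rough illustration of what dynamic tempo could mean, the sketch below estimates BPM from recent note-onset timestamps using the median inter-onset interval. The assumption that onsets roughly align with beats is ours; a production version would need genuine beat induction.

```python
from statistics import median

def estimate_bpm(onset_times, beats_per_onset=1.0):
    """Estimate tempo from a list of note-onset timestamps (seconds).

    The median inter-onset interval is used because it is robust to
    the odd rushed or dropped note. Returns None if there are too
    few onsets to form an interval.
    """
    if len(onset_times) < 2:
        return None
    intervals = [b - a for a, b in zip(onset_times, onset_times[1:])]
    return 60.0 * beats_per_onset / median(intervals)
```

Feeding such an estimate back into the DAW's transport (or a Max for Live patch) on each listening window would let the VM track the player's drift instead of holding a fixed tempo.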
Further customization of the animations can be made inside Unity using animation rigging or manipulating captured motion. There are several methods that can be used to make animations more expressive and we encourage developers to explore these options. A future exploration implementing gestures as a means of providing feedback and communication with the human player is of great interest to the research team.
Per musician suggestion, we will begin by exploring gestures related to listening, as an aid for timing, and for social engagement. Ultimately, this capability will afford the greatest benefit for VMs, and need not be limited to the VMs themselves. For example, graphical user interfaces and XR tools can be used to leverage the potential benefits of a more visual system.
Future work will include making the system work with any DAW, furthering the aim of making the system less expensive. Though there are more possibilities for interaction while using a DAW like Ableton, it may be cost-prohibitive for some; the same functionality could in the future be accomplished using only Python packages, music-handling software such as Pure Data, or open-source DAWs.
Future work will also focus on increasing the number of low-cost solutions for human interaction, as the goal of a user-friendly system is widespread use. These may include, but are not limited to, computer vision and other low-cost sensing techniques that can inform the AI. Implementing these solutions will require careful attention to the load placed on the CPU, GPU, and RAM, but may greatly enhance the system as a whole.
In this paper we introduced an AI-driven, animated VM pipeline aimed at making AI musicianship more accessible to musicians, educators, and researchers. We used semi-structured interviews to incorporate changes into a human-in-the-loop system so that it benefits domain experts. Interviews with musicians and thematic analysis revealed several insights into how AI-driven music interaction systems can be incorporated into musical practice. Interview findings suggested that many musicians desire interactive AI-driven systems that are explicitly controlled by the player rather than autonomous VMs. Several challenges and recommendations for improvement are outlined, as well as potential solutions for future development.
Python Backend Code
Ableton Set and Max for Live Script
The authors would like to thank Rishi Vanukuru for his major contributions and development of the virtual musician system in Unity. Suibi Weng, for his continued support in Unity and assistance with virtual musicians. Chad Tobin for assistance in data collection and training neural networks. The authors would also like to thank the Ericsson Research team who continuously supported this work as part of a grant and internship project. We also acknowledge notable support from Gunilla Berndtsson, Per Karlsson, Per-Erik Brodin, and Amir Gomroki. A special thanks to the study participants who donated their time and expertise, and the reviewers for their thoughtful contributions.
The authors declare that the work presented was conducted in the absence of any conflict of interest (related to either commercial, financial or personal relationships) and in line with Principles & Code of Practice on Ethical Research. This material is based upon work in part supported by a grant from Ericsson Research.