Until recently, the claim that AI technology is ubiquitous in society has meant only that the technology has permeated the social infrastructure: its use was deemed limited to big companies or academics. However, the situation is changing with the rise of ‘generative AI’. Through the wide distribution of ‘apps’, pieces of software oriented towards numerous mundane objectives, people can now easily harness the technology for their everyday tasks, including artistic creation.1
Some ambitious artists adopted AI well before this shift. They have claimed a special status for the new technology, one that transcends conventional technological ‘tools’, calling it a ‘collaborator’ or ‘meta-instrument’, for instance.2 However, the popularity of generative AI seems to have brought attention back to its role as a ‘tool’. Expecting creative potential, people have experimented with such apps. It is this expectation that this paper questions. In order to consider how to incorporate such tools, this paper explores their potential and limitations from a sociological perspective, asking whether they have a creative potential that would subvert the critical views that thinkers have long held regarding technology’s involvement in music (re)production.
First, I briefly describe the concept of ‘generative AI’ and introduce related terms. Following this, I investigate recent AI tools from two sociological perspectives: the industrialisation of music and the abundance of music. Reflecting on these, the final part considers the possibility of human intervention and argues for the importance of locating the human creator within the generation process as part of the system, the ‘Intelligent Performance System’.
The term ‘generative AI’ is widely used in recent discourse on AI, ‘causing a buzz’.3 According to Google Trends, the term’s popularity has skyrocketed since November 2022, coinciding with ChatGPT’s release [Image 1].
Image 1. Screenshot of the Google Trends result for ‘generative AI’ over the past five years4
The term is generally understood through its association with the generation of ‘content’ of any kind.5 However, it has no academic definition or clear origin, although it appears sparsely in academic papers. In technical circles today, the term itself seems rather eschewed. For instance, when asked about the definition of generative AI, Douglas Eck replies, ‘A generative model can take what it has learned from the examples it’s been shown and create something entirely new based on that information. Hence the word “generative!”’.6 As this reply suggests, ‘generative’ is conventionally associated with generative modelling, ‘a branch of machine learning that involves training a model to produce new data that is similar to a given dataset’, unlike ‘discriminative modeling’, a technique used to analyse the characteristics of given data.7 As generative modelling is increasingly used in commercial services that involve the production of content, it relates to generative AI despite their conceptual difference.8
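To make the distinction concrete, the difference between discriminative and generative modelling can be sketched with a deliberately toy example. The dataset, the threshold classifier, and the Gaussian model below are my own illustrative assumptions, not drawn from any of the tools discussed in this paper:

```python
import random
import statistics

# Toy dataset: durations (in seconds) of short music clips.
data = [2.9, 3.1, 3.0, 2.8, 3.2, 3.1, 2.9, 3.0]

# Discriminative modelling: learns to label given data.
# Here, a mean threshold classifies a clip as 'long' or 'short'.
threshold = statistics.mean(data)

def discriminate(x):
    return "long" if x > threshold else "short"

# Generative modelling: learns the data's distribution and draws
# new samples resembling it. Here, a Gaussian fitted to the data.
mu = statistics.mean(data)
sigma = statistics.stdev(data)

def generate(rng):
    return rng.gauss(mu, sigma)

rng = random.Random(0)
new_clip = generate(rng)        # a new value 'similar to' the dataset
label = discriminate(new_clip)  # the discriminative model can only label it
```

The asymmetry is the point: the discriminative function can only characterise data it is given, whereas the generative function produces new data resembling what it was trained on.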
An exceptional study that discusses the concept of generative artificial intelligence (GAI) before its buzz is Tijn van der Zant’s PhD dissertation, published in 2010.9 For him, GAI, a concept heavily influenced by neo-cybernetics, is distinguished from conventional AI by its dynamic, self-organising system, which continuously interacts with the environment and adapts itself to conditions inside and outside: ‘classical AI’ is static and fixed, and its execution has little or no impact on its internal system, whereas GAI’s system continuously changes ‘to structure the next level of adaptive processes which can structure the next level of adaptive processes ad infinitum’.10 In short, whereas the recent understanding of the term attributes its ‘generative’ capacity to the generation of content, his idea attributes it to the generation of the system itself. In addition, because his idea places importance on the system’s relationship to the surrounding environment, it opens up the possibility of considering how to locate the user in relation to the machine, in contrast to the current idea of GAI. In the final part, I return to his idea, comparing it with the contemporary AI tools discussed below.
In popular music studies, one of the main interests has been the industrialisation of music. Through industrialisation, music ‘creation’ has become music ‘production’, incorporated into the domain of industry and business, and music has become a commodity produced, packaged, and distributed according to that domain’s logic and technical infrastructure. Scholars have been concerned about the consequences for the practices and values of music culture.
Adorno’s critique of popular music is one of the earliest expressions of such concern.11 Two key concepts in Adorno’s criticism are standardisation and pseudo-individualism. He asserted that popular music was ‘standardised’, like other commercial goods, given the simplicity of its musical form and lyrical themes.12 In his view, standardisation was ‘at odds’ with individualisation, which he considered ‘a hallmark of genuine art, which always speaks with an individualized voice’, because standardised popular music did not ‘speak with anyone’s voice’.13 Popular music disguises this inability through pseudo-individualism, adopting various techniques by which the industry makes listeners believe that the music expresses an individual voice.14 Although Adorno’s approach has been continuously criticised, these fundamental concepts remain relevant today because they recur in different guises.15
Adorno’s concern resonates with John Philip Sousa’s criticism of ‘canned music’ in the 1900s.16 While it is mostly known as a criticism of recording, its scope is broader, as the target is ‘the mechanical device to sing for us a song or play for us a piano, in substitute for human skill, intelligence, and soul’.17 Although the criticism literally reads as a disapproval of the technological mediation of music, it also concerns the industrialisation and commodification of music, as it can be interpreted as anxiety about human alienation through mechanical automation.18 Like Adorno’s, Sousa’s idea connotes music produced in the factory as a commodity, standardised and pseudo-individualised like canned food.
Discourse around AI music generation apps often employs terms that evoke such concepts. For example, the promotional images for the app MusicStar.AI cite user comments (although their credibility is doubtful): ‘Best app for music production, gives a new depth to creativity!’; ‘MusicStar.AI offers limitless music generation within a streamlined app.’; and ‘The perfect way for someone to make music. Easy to use and so effective’. ‘Production’ in the first comment deserves attention here, because the term is most commonly used in relation to the music industry.19 In addition, as the other comments suggest, the production process is ‘easy’ and ‘effective’ because it is ‘streamlined’: ease of manual labour, effectiveness of production, and streamlining are components of industrial mass production, as represented by Ford’s system.
Examined in this respect, MusicStar.AI’s functioning indicates its connection with standardisation and pseudo-individualism, though in a different sense.20 The app has four main functions: music generation, lyrics generation, text-to-speech, and voice changer. The music generation offers a backing track, the lyrics generation provides lyrics, and the voice changer allows the user to have the app sing, or to sing themselves, the generated lyrics over the backing track.21 The parameters it relies on for user input are artists and genres: it allows the user to create lyrics in the style of famous artists and to sing in their voices (Drake, Dua Lipa, Doja Cat, etc.), and the lyrics and backing track can be made according to a genre (pop, hip hop, rap, etc.). Judging from these functions, the generation process seems standardised, since the only way the user controls the generation is by specifying genres and musicians. Yet, in terms of (pseudo-)individualisation, the sense of individuality is not given to the user; rather, it is attributed to the artists. Considering that genres and artists are often the reference points for AI music generation, these observations can apply to other apps and techniques.
In a sense, an AI music generation app is a black box: given material, the box processes it and spits out a product. This is a metaphor for the conventional manufacturing factory, whose production process is a prefixed procedure. In these terms, the principle of standardisation is a procedure that transforms material into products in a determined, maximally efficient way. Generative modelling seems to follow this principle in that it centres on a model, a set of rules for transforming the training dataset into new data, although it also accepts external noise [Image 2].
Image 2. The generative modelling process (created by the author from an image in Foster (2022))22
However, considering how the model itself is generated disrupts this association with the manufacturing factory. In principle, although the model becomes fixed once established, its generation is a more flexible process, as a model based on neural networks shapes itself in its own way. In this sense, if one takes the model as a meta-model, it is not fixed. Thus, as long as the translation process remains opaque, hidden in the black box, the generation looks like that of a conventional factory; but the inside of the factory is dynamic, as van der Zant dreamed, constantly changing its structure.
One consequence of industrialisation is the abundance of music commodities, and this tendency has only accelerated with online music distribution. Ironically, although the phonograph rendered ephemeral sound tangible, sound has become ephemeral again, as is evident on streaming platforms.23
This abundance is represented not only by streaming but also by music distributed as material for other content, such as royalty-free music for videos and games. When AI tools cater to such demand, they often distinguish themselves from premade tracks by emphasising the user’s control over the generation process. The most successful example is Amper Music. Founded by the film composer Drew Silverstein in 2014, it was originally developed as a web-browser-based application for the AI-assisted production of soundtracks.24 It may sound strange that a film composer would raise money for a service automating his own job. Yet he expects the service to be used in a slightly different area: what he calls ‘functional’ projects, such as ‘commercials’.25 He considers that creators in this area struggle because, given their budgets, they have no choice but to rely on ‘pre-written stock music’, although they want tracks tailored to their works.26 His service fills this gap as ‘a fast, affordable and royalty-free way to create the music’.27 It should also be noted that he emphasises the importance of human–machine collaboration in the generation process.28
However, the company was not free from the influence of ‘stock music’. In November 2020, the company was bought out by Shutterstock, a US-based firm founded by the programmer and photographer Jon Oringer.29 Shutterstock initially distributed only images but later expanded to include videos, website templates, 3D models, and music.30 The acquisition of Amper Music was part of this expansion.31 As a result, Shutterstock now declares that ‘Amper Music has moved to Shutterstock’, and the webpage gives customers access to ‘exclusive tracks pre-generated by Amper Music directly from Shutterstock’ [Image 3].32 Customers can search for tracks by keywords as well as by musical parameters [Image 4].33
Image 3. Screenshot taken by the author34
Image 4. Screenshot taken by the author35
This shift suggests the duality of the AI technology’s function. On the one hand, Amper Music has utilised the technology to create soundtracks through collaboration between the human creator and the machine: the human instructs the machine to incorporate certain elements, the machine creates music accordingly, and the human reviews and edits the output. On the other hand, Shutterstock utilises Amper Music’s AI technology to expand its music catalogue to the point where it can, hopefully, satisfy any need of its customers. In short, the technology was originally oriented towards each specific need, but Shutterstock has turned it towards any need.
Yet the duality also holds an ambiguity. From these cases, the AI technology’s advantages can be summarised in three components: velocity, malleability, and plausibility. The technology makes music faster than a solo human (velocity), in various styles (malleability), that sounds as if it were made by a human creator (plausibility). Assuming that both cases take advantage of these elements, the only difference between the two is the customer’s action for their requirement to be met. The former is intervention: the user inputs the requirement during the generation process. The latter is discovery: the user enters the requirement in a search box. However, despite the fundamental difference in their meaning, they share a commonality: in both, the user’s actual actions consist of inputting parameters and reviewing the result. Of course, intervention can tailor music to more detailed contexts than discovery can, for example a specific point of emotional climax in a movie.36 However, the latter can replace the former if the customer does not need such detailed adjustments, which is likely when their main content circulates quickly. Amper Music’s changing position illustrates this from an economic point of view.
The duality and ambiguity are also suggested by a user’s comment on Amper Music’s software:
“With Amper Score™, it is easy for me to find the music I need for the videos that we create,” said Anna Green, a video producer at Minute Media. “It takes me less than five minutes to create music for a short-form video, and the track is the exact mood I’m looking for. At past jobs, it could take me as long as two hours to search for the right track and edit it for the video. The tracks that I make with Amper Score have variation, and I like being able to spot the intro to add a gradual build. I can’t imagine a better process than this.”37
In this comment, she expresses how the software helped her to ‘find the music [she needed]’ and how the generated track had ‘the exact mood [she was] looking for’. This suggests that when one creates music with such tools, one does not always have a clear vision of what to make, and one discovers it upon being given an output. The generation process is also an act of discovery.38 In addition, comparing her use of the software with her prior experience of ‘search[ing] for the right track’ by other means, she emphasises how effectively the AI tool worked as a discovery tool as well as an editor. That is in part due to the opacity of AI: because the inner workings of the music generation system are opaque, the distinction between generating content from an instruction and presenting pre-generated content that matches search conditions is not always clear to the user. Thus, even when the tool actually generates music, it resembles a discovery tool in her perception, and the only difference is efficiency. Given these observations, critically speaking, AI music generation tools differ little from pre-made tracks, echoing Sousa’s criticism of canned music, despite the public expectation of their generative capacity.
So far, in investigating recent AI music generation tools, I have highlighted their limitations from sociological perspectives, while acknowledging their deviation from the conventional reproduction technologies that critical thinkers have problematised. It seems to me that their failure to take full advantage of these capacities derives from the fundamental narrowness of the concept of the ‘generative’. In what follows, I briefly discuss how van der Zant’s position on this concept presents a different path for the development of systems incorporating the generative capacity.
As mentioned earlier, van der Zant’s idea of GAI centres on the generation of the model itself, and its key is flexibility in responding to a changing environment. This flexibility is, in a sense, a capacity for interaction. As Michael Murtaugh argues by comparing Wegner and Goldin’s ‘interactive computing’ with the Turing machine, true interaction occurs when the machine accepts external inputs that would change the course of its functioning. Although the aforementioned tools allow users to input something, their mode of intervention is not one of interaction but of instruction.
Interaction is a core element of another type of generative AI tool: chatbots such as ChatGPT. Although recent tools have demonstrated remarkable advances in performance over predecessors such as ELIZA, their competency compared to humans remains questionable for some scholars. For example, Paul Pangaro criticises ChatGPT, drawing on the cybernetician Gordon Pask’s research on human conversation, arguing that the chatbot does not offer any new ground for subsequent conversation.39 Not surprisingly, this idea resonates with van der Zant’s GAI.
In my view, Pangaro’s criticism of the AI’s interactive capacity is valid because ChatGPT deals with discursive information that has a fixed meaning. The situation can differ when the information dealt with has an ambiguous meaning, open to various interpretations, such as music. Because music determines a response less rigidly than words do, the machine’s output can function as a ground on which a new course of interaction emerges. That possibility has been shown by Google Magenta’s software for the ‘AI Jam Session’.40 The basic function of the system is simple, generating improvisational phrases in response to the user’s input over a fixed beat; nonetheless, the generated phrases seem to provoke new ideas in the user, thanks to their lack of rigid meaning. Indeed, this kind of interaction is commonplace in jazz improvisation, where the banal phrases of one player often elicit an unpredictable response from another.
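The contrast between a one-shot instruction and such a feedback loop can be sketched minimally as follows. The scales, the memory mechanism, and all names here are hypothetical illustrations of the principle, not Magenta’s actual code:

```python
import random

# One-shot instruction: the user's input selects a preset recipe;
# it does not alter how future outputs will be produced.
def instruct(genre, rng):
    scales = {"pop": [0, 2, 4, 5, 7, 9, 11], "blues": [0, 3, 5, 6, 7, 10]}
    return [rng.choice(scales[genre]) for _ in range(4)]

# Interaction: each human phrase is folded into the machine's state,
# so every exchange can change the course of later responses.
class JamPartner:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.memory = []

    def respond(self, human_phrase):
        self.memory.extend(human_phrase)
        # Draw the reply from the whole shared history so far.
        return [self.rng.choice(self.memory) for _ in range(len(human_phrase))]

partner = JamPartner()
reply1 = partner.respond([0, 4, 7])  # reply built only from the first phrase
reply2 = partner.respond([2, 5, 9])  # reply may echo earlier material too
```

In the first function the user’s input leaves the machine unchanged; in the second, every exchange reshapes the ground on which later exchanges take place, which is the point of Murtaugh’s and van der Zant’s sense of interaction.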
Compared with this type of interaction, the shortcoming of the tools discussed earlier is that they do not incorporate the user’s input into the generative process. Whereas those tools accept user input only as instruction, Google’s software takes advantage of the feedback loop between machine and human, much as happens in conventional jazz. In conclusion, to pursue truly creative results with generative AI, we need to develop systems that integrate the machine and the human in the generation process. In other words, we should not rely on an automatic tool as bystanders; rather, we should become part of the system together with the tool.
There are no potential conflicts of interest (financial or non-financial). This study is part of my PhD research, approved by the ethics committee of Goldsmiths, University of London. It has no potentially detrimental societal, social, or environmental impact.