How AI is Transforming Music Composition


Music is central to modern entertainment and culture. It serves as stand-alone art, as a backdrop to daily life, and as an important feature in industries like film and advertising. What if the music you hear on the radio was not composed by a human at all, but by a machine? 

The application of AI in creative spaces like music and art is not new, but recent years have seen a dramatic expansion of machine learning capabilities in music composition. Researchers and startups are using AI to compose soundtracks and soundscapes, and to create original songs in the style of specific genres and artists.

Just as human journalists' work now regularly appears alongside articles produced by robot authors in sports, financial, and other news coverage, human musical artists are increasingly sharing the stage with artificial intelligence (1).

AI composition models work by recognizing and reproducing musical patterns learned from large quantities of downloaded music. OpenAI’s Jukebox, a project begun in August of 2019, compiled 1.2 million songs for use in training its model. Jukebox differs from other algorithmic music composers in that it composes its music as raw audio. The model does this by first dramatically decreasing the sampling rate of the audio: it uses a Vector Quantized Variational AutoEncoder (VQ-VAE) to downsample the original audio from the standard sampling rate of 44.1kHz down to 344Hz, and the trained model then composes a new song based on the compressed audio. Similar to the way that language models are trained to predict words given a trailing history of text, the music composition model is trained to predict future audio samples based on the preceding audio. Finally, the new song is upsampled back into waveform audio (1).
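The toy sketch below illustrates these two ideas in miniature: heavy compression of raw audio, followed by next-sample prediction from the preceding context. The block-averaging downsampler and the training-pair helper are stand-ins invented for this post, not Jukebox's actual VQ-VAE or transformer.

```python
import numpy as np

ORIGINAL_RATE = 44_100   # standard sampling rate (Hz)
COMPRESSED_RATE = 344    # rough rate of Jukebox's most-compressed level (Hz)
FACTOR = ORIGINAL_RATE // COMPRESSED_RATE   # ~128x compression

def naive_downsample(audio: np.ndarray) -> np.ndarray:
    """Crudely compress audio by averaging blocks of samples.
    Jukebox learns this compression with a VQ-VAE instead."""
    usable = len(audio) - len(audio) % FACTOR
    return audio[:usable].reshape(-1, FACTOR).mean(axis=1)

def next_sample_pairs(codes: np.ndarray, context: int = 64):
    """Yield (history, target) training pairs: predict the next compressed
    sample from the preceding `context` samples, analogous to a language
    model predicting the next word from the preceding text."""
    for t in range(context, len(codes)):
        yield codes[t - context:t], codes[t]

one_second = np.random.randn(ORIGINAL_RATE)   # one second of fake audio
compressed = naive_downsample(one_second)
print(len(compressed))                        # ~344 compressed samples
```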

To create music that most closely resembles human compositions, the auto-encoder downsamples the audio at three discrete levels and combines the information from each of them to create the new composition. The top level is the most compressed and is used to train the model on long-range features of the target musical genre. Because this level retains the least data per time frame, the model is able to learn structural elements such as long melodies. The other two levels are less compressed, so more detail is retained; they enable the model to learn “local structure”, such as timbre and overall audio quality. Note, however, that the model is not yet able to reproduce long-range structures such as repeating chorus and verse forms, so its compositions lack some of the characteristic structural elements of human-created music (1).
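To make the three-level idea concrete, the sketch below compresses the same second of audio at three different ratios. The specific ratios are assumptions chosen only so the top level lands near the 344Hz figure mentioned above, and the block-averaging function stands in for the learned VQ-VAE encoder.

```python
import numpy as np

SAMPLE_RATE = 44_100
LEVELS = {"top": 128, "middle": 32, "bottom": 8}   # assumed compression factors

def compress(audio: np.ndarray, factor: int) -> np.ndarray:
    """Stand-in for the learned VQ-VAE encoder at one level."""
    usable = len(audio) - len(audio) % factor
    return audio[:usable].reshape(-1, factor).mean(axis=1)

one_second = np.random.randn(SAMPLE_RATE)
for name, factor in LEVELS.items():
    codes = compress(one_second, factor)
    print(f"{name:>6}: {len(codes):5d} values/second ({SAMPLE_RATE // factor} Hz)")

# The heavily compressed top level sees the least data per time frame, so it can
# model long-range structure (melody); the bottom level keeps fine detail (timbre).
```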


Jukebox is the first AI composer to include vocals in its compositions. Users can supply their own lyrics, or the original lyrics of the song being modeled, while the model is trained to provide more context and information, or they can simply allow the model to compose its own “unseen” lyrics (3). With either option, the model still struggles to align the words correctly with the non-lyrical components of the song, relying on a separate auto-encoder to match the lyrics with the correct audio (1).
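As a minimal sketch of what lyric conditioning can look like in principle, metadata and lyric tokens are prepended to the sequence the model is asked to continue. The token format and helper function below are hypothetical illustrations, not Jukebox's actual interface.

```python
# Hypothetical conditioning prompt; Jukebox's real tokenization differs.
def build_conditioned_prompt(lyrics, artist, genre):
    """Prepend artist, genre, and optional lyric tokens to the sequence the
    model continues; with no lyrics, the model invents its own words."""
    tokens = [f"<artist:{artist}>", f"<genre:{genre}>"]
    if lyrics:
        tokens += lyrics.lower().split()
    tokens.append("<start-of-audio>")
    return tokens

print(build_conditioned_prompt("hello from the machine", "unknown artist", "pop"))
```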

While Jukebox has made impressive strides and represents the cutting edge of algorithmic music composition, its model is remarkably complex and has yet to fulfill a practical purpose beyond providing inspiration and entertainment to its users. Fortunately, OpenAI is far from alone in its ventures in music composition. Other companies and research projects have been more successful at applying their generated compositions by relying on simpler, more refined algorithms. Notably, AI-generated music has been successfully applied in the video and film industries to compose soundtracks for visual content (4).

“While Jukebox represents a step forward in musical quality, coherence, length of audio sample, and ability to condition on artist, genre, and lyrics, there is a significant gap between these generations and human-created music.” -OpenAI 

Original compositions for movie, podcast, commercial, and video game soundtracks are in high demand and come with high price points and long waiting periods (5). As a result, content creators (like our folks here at Xyonix) are forced to turn to catalogues of stock music. These catalogues, such as Artlist.io and Epidemic Sound, offer large libraries of music that can be downloaded and used in audiovisual content. However, creators must work around copyright limitations, and the compositions lack originality and customizability (5). Thanks to recent developments in AI technology, machine learning is now being used to compose original and highly personalized music for creators, offering a solution that requires fewer resources and less time.

Founded in 2016 and funded by the European Union, Artificial Intelligence Virtual Artist (AIVA) is one of the pioneering projects in AI-driven soundtrack composition. Some of AIVA’s compositions have received praise internationally, and AIVA is the first artificial intelligence system to be awarded the status of composer by France’s authors’ rights society (4).

AIVA’s model was trained on 30,000 classical compositions and relies on a mathematical algorithm representing patterns in music theory. AIVA allows its users to specify metrics such as genre (the website lists twelve genres currently available), tempo, and length. AIVA also composes picture-synched music based on visual information from a video game. In picture-synched composition, data such as gameplay intensity, colors, and action are interpreted in real time and the soundtrack is composed to match the events on-screen.

The result is an auditory experience unlike previous human soundtracking efforts, with dynamic soundscapes that attempt to complement the action and emotion of the gameplay itself (7).

This tool has made soundtracking video games accessible to independent startups with a wide range of budgets, and it takes only a fraction of the usual time: composing and recording a commercial soundtrack can take up to six months when a professional orchestra is involved (4).
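A hypothetical sketch of the picture-synched idea: gameplay signals are read each frame and translated into musical parameters. The signal names and mapping rules below are invented for illustration and are not AIVA's (or anyone's) actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class GameplayFrame:
    intensity: float         # 0.0 (calm) .. 1.0 (frantic)
    warm_color_ratio: float  # share of warm pixels on screen, 0.0 .. 1.0
    in_combat: bool

def music_parameters(frame: GameplayFrame) -> dict:
    """Map on-screen events to tempo, mode, and dynamics in real time."""
    return {
        "tempo_bpm": int(70 + 90 * frame.intensity),  # faster as the action heats up
        "mode": "major" if frame.warm_color_ratio > 0.5 else "minor",
        "dynamics": "fortissimo" if frame.in_combat else "mezzo-piano",
    }

print(music_parameters(GameplayFrame(intensity=0.8, warm_color_ratio=0.3, in_combat=True)))
```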

In 2019, a group of researchers at the University of Bristol created a similar system for composing picture-synched musical soundtracks. The research project “combines automated video analysis and computer-generated music-composition techniques to create unique soundtracks in response to the video input”. The researchers believe that their work could one day be used to render human composition of soundtracks obsolete (8). 

Not all systems for soundtrack and background music are picture-synched. Amper AI is a startup that composes background music based on a set of user-specified metrics. The website allows users to choose the length, instrumentation, bpm, and mood of their composition, and then download an original audio file free of copyright restrictions (6).
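In practice such a request boils down to a handful of parameters. The field names and values below are a hypothetical illustration of that kind of specification, not Amper's actual API.

```python
# Hypothetical composition request in the spirit described above.
request = {
    "length_seconds": 90,
    "instrumentation": ["piano", "strings", "light percussion"],
    "bpm": 96,
    "mood": "uplifting",
    "output_format": "wav",   # delivered as an original, copyright-safe audio file
}
print(request)
```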

Another growing field in AI music composition is “generative” music (not to be confused with generative transformer models), a term that refers to music generated by a computer algorithm in real time so that it is adaptive and ever-changing (9). The first generative music software, Koan, was released by SSEYO in 1994 (10). The project has continued to develop, and in 2018 it was packaged as the Wotja application. Wotja auto-mixes music without human input, but it also allows users to craft their own soundscapes, a feature that represents the broadening intersection of human and machine-learning composition (11).
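The following toy loop illustrates the generative-music concept in its simplest form: a few rules plus randomness produce a stream of notes that differs on every run and could continue indefinitely. Real systems like Wotja are vastly more sophisticated; the scale, random walk, and durations here are assumptions for illustration only.

```python
import random
import time

C_MAJOR_PENTATONIC = ["C4", "D4", "E4", "G4", "A4", "C5"]

def generate(bars=4, seed=None):
    """Yield (note, duration-in-beats) pairs from a random walk over a scale."""
    rng = random.Random(seed if seed is not None else time.time_ns())
    position = 2  # start mid-scale
    for _ in range(bars * 4):          # four beats per bar
        step = rng.choice([-1, 0, 1])  # wander up, down, or stay put
        position = max(0, min(len(C_MAJOR_PENTATONIC) - 1, position + step))
        yield C_MAJOR_PENTATONIC[position], rng.choice([0.5, 1.0])

print(list(generate(bars=2)))
```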


Endel is an application that creates soundscapes based on data from its users and their environment, including time, heart rate, weather, and other inputs. Endel uses light and visuals to accompany its musical compositions, creating a completely personalized environment for users to focus, sleep, or wake up to (12). 
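As a rough sketch of that kind of personalization, simple rules might turn time of day, heart rate, and weather into soundscape settings. The inputs and mappings below are assumptions for illustration, not Endel's actual model.

```python
def soundscape_settings(hour: int, heart_rate_bpm: int, weather: str) -> dict:
    """Turn simple user and environment signals into soundscape parameters."""
    mode = "sleep" if hour >= 22 or hour < 6 else "focus"
    # Slow the soundscape down when the listener's heart rate is elevated.
    tempo_bpm = max(50, 90 - max(0, heart_rate_bpm - 60) // 2)
    texture = "rain on glass" if weather == "rainy" else "warm pads"
    return {"mode": mode, "tempo_bpm": tempo_bpm, "texture": texture}

print(soundscape_settings(hour=23, heart_rate_bpm=78, weather="rainy"))
```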

Both “generative” music and AI generated soundtrack compositions are examples of human-machine collaboration, in which human ideas are empowered and made more accessible with automation.

In an era of skepticism and concern surrounding the ability of artificial intelligence to displace human art and creativity, projects like Amper AI, Jukebox, Wotja, and Endel assert that their advancements are not at the expense of human art. OpenAI’s scientists “expect human and model collaborations to be an increasingly exciting creative space” (2).

Additionally, AI’s contributions in music are not only making content creation more convenient and accessible, but also creating new spaces for music to fill. Music that responds in real time to factors like the weather is simply not feasible with human composers, so AI is allowing music to fill roles that could not previously have existed.

Thanks to innovations in music composition by AI, we can expect human-robot collaboration in soundtracking and similar industries to continue. In a world saturated with audiovisual content, the demand for music has never been higher, and AI is rising to meet it.

References

  1. Jose David Fernandez, Francisco Vico, “AI Methods in Algorithmic Composition: A Comprehensive Survey”. Journal of Artificial Intelligence Research, 2013.

  2. https://openai.com/blog/jukebox/

  3. https://medium.com/analytics-vidhya/make-music-with-artificial-intelligence-openai-jukebox-6677928bd186

  4. https://cordis.europa.eu/article/id/421438-ai-composers-create-music-for-video-games

  5. Jono Buchanan, “Sound for the screen: A practical masterclass on soundtracking for film and television”. MusicTech, https://www.musictech.net/guides/essential-guide/masterclass-soundtracking-film-television/.

  6. https://www.ampermusic.com/

  7. https://www.aiva.ai/

  8. Vansh Dassani, Jon Bird, Dave Cliff, “Automated Composition of Picture-Synched Music Soundtracks for Movies”. arXiv, 2019.

  9. “Generative Music”, Wikipedia. https://en.wikipedia.org/wiki/Generative_music.

  10. “Koan (program)”, Wikipedia. https://en.wikipedia.org/wiki/Koan_(program).

  11. https://intermorphic.com/sseyo/koan/

  12. https://endel.io/