In recent decades, engineers have increasingly turned to artificial intelligence (AI) to create tools that assist creative professionals by improving and accelerating content production. These advances include models capable of generating musical tracks to aid in music production.

Researchers at Sony CSL are developing a variety of AI-based tools to support musicians, music producers, and enthusiasts in their creative pursuits. They recently introduced Diff-A-Riff, a model that generates high-quality single-instrument accompaniments for any musical composition, described in a paper on the arXiv preprint server. “Our recent paper is a follow-up to our previous work on bass accompaniment,” the Sony CSL Paris music team said. “We previously focused on generating basslines to enhance existing tracks. Now, with Diff-A-Riff, we’ve broadened that focus to create single-instrument accompaniments for any type of instrument. The evolution of this tool was driven by the practical needs of music producers and artists looking for ways to enrich their compositions by adding a variety of instruments while maintaining flexibility in instrument types and timbres.”

Sony Introduces an AI Tool for Single-Instrument Accompaniment in Music Production
The main goal of the music team at Sony CSL Paris was to develop a general-purpose AI system capable of creating high-quality instrumental accompaniment that fits seamlessly into a given musical context, one instrument at a time. The tool combines two deep learning techniques: latent diffusion models and consistency autoencoders. “Diff-A-Riff uses the power of latent diffusion models and consistency autoencoders to generate instrumental accompaniments that match the style and tonal quality of the music,” they explained. “The system first compresses the input audio into a latent representation using a pre-trained consistency autoencoder, a codec we developed in-house that delivers high-quality decoding through a generative decoder. This compressed latent representation is then fed into our latent diffusion model, which generates new audio conditioned on the input context and on optional style references in the form of text or audio embeddings.”
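To make the two-stage pipeline concrete, here is a minimal Python sketch of the flow the team describes: encode the musical context with the pre-trained autoencoder, sample a new latent with the diffusion model, and decode back to audio. All class and method names here (AccompanimentGenerator, encode, sample, decode) are illustrative assumptions, not Sony CSL’s actual API.

```python
import torch

class AccompanimentGenerator:
    """Hypothetical wrapper around the two models described above."""

    def __init__(self, autoencoder, diffusion_model):
        self.autoencoder = autoencoder          # pre-trained consistency autoencoder (codec)
        self.diffusion_model = diffusion_model  # latent diffusion model

    @torch.no_grad()
    def generate(self, context_audio, style_embedding=None, steps=30):
        # 1. Compress the musical context into the latent space
        #    (the team reports a 64x compression ratio).
        context_latent = self.autoencoder.encode(context_audio)

        # 2. Sample a new accompaniment latent, conditioned on the context
        #    and an optional style reference (a text or audio embedding).
        accompaniment_latent = self.diffusion_model.sample(
            context=context_latent,
            style=style_embedding,
            num_steps=steps,
        )

        # 3. Decode the generated latent back to 48 kHz audio.
        return self.autoencoder.decode(accompaniment_latent)
```

Working in this heavily compressed latent space, rather than on raw waveforms, is what drives the inference speed and memory savings discussed below.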

Diff-A-Riff offers several advantages over existing instrumental accompaniment tools. One is its versatile control: users can optionally steer generation with audio prompts, text prompts, or both, giving them greater flexibility over the resulting accompaniment. In addition, Diff-A-Riff produces high-quality output, rendering 48 kHz pseudo-stereo audio. “Diff-A-Riff also significantly reduces inference time and memory usage compared to previous systems by operating at a 64x compression ratio,” the team noted. “It can create an accompaniment that suits any musical context, making it an invaluable tool for music producers and artists. Additional controls include the ability to interpolate between prompts, set the stereo width, and create smooth transitions for loops.” A sketch of such prompt interpolation appears below.
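Prompt interpolation like this is typically implemented by blending the conditioning embeddings before sampling. The snippet below is a minimal, self-contained sketch of spherical interpolation (slerp) between two style embeddings; the 512-dimensional vectors and their role as diffusion conditioning are assumptions for illustration, as the article does not specify the model’s exact interface.

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherically interpolate between two embedding vectors.

    t = 0.0 returns `a`, t = 1.0 returns `b`; intermediate values
    blend the two styles. Vectors are assumed to be non-zero.
    """
    a_n = a / np.linalg.norm(a)
    b_n = b / np.linalg.norm(b)
    # Angle between the normalized embeddings.
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1.0 - t) * a + t * b  # nearly parallel: fall back to lerp
    s = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / s) * a + (np.sin(t * omega) / s) * b

# Example: blend a hypothetical audio-derived embedding with a
# hypothetical text-derived embedding at a 30/70 ratio.
emb_audio = np.random.randn(512)
emb_text = np.random.randn(512)
style = slerp(emb_audio, emb_text, t=0.7)
```

Slerp is preferred over plain linear interpolation for embeddings of this kind because it keeps intermediate vectors at a comparable magnitude, so blended prompts remain in the region of embedding space the model was trained on.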

The Sony CSL team evaluated the model through a series of tests and found that it produced high-quality instrumental accompaniments for a variety of tracks, which listeners struggled to distinguish from accompaniments played by human musicians. “A generation speed of three seconds for one minute of audio is unprecedented and is achieved thanks to the high compression ratio of the consistency autoencoder,” they said. “In real-world applications, Diff-A-Riff can be used for music production, creative collaboration, and sound design.” That speed corresponds to roughly twenty times faster than real time.

Sony CSL’s instrumental accompaniment creation tool will be a valuable asset to music producers worldwide, allowing them to create additional instrumental tracks for their compositions. Artists can use Diff-A-Riff to explore new musical ideas, and sound engineers can quickly test different timbres and playing styles. “Our future research aims to expand the capabilities of Diff-A-Riff by improving control mechanisms and exploring integration into the music production workflow,” the team said. “We strive to make the model more intuitive and accessible to both amateurs and professionals. In addition, we plan to work with musicians and composers to further improve our model, ensuring that it meets the practical needs of users in the music industry.”

 
