NVIDIA has just revealed a new generative AI model that it claims is the “Swiss Army knife for sound,” with capabilities that go well beyond the song composition and modification most audio AI models offer today. Team Green named it “Fugatto”, short for Foundational Generative Audio Transformer Opus 1.
NVIDIA Fugatto AI
In NVIDIA’s words, “[Fugatto] generates or transforms any mix of music, voices and sounds described with prompts using any combination of text and audio files.” One potential use is for music producers to quickly prototype a song; another is for advertising agencies to produce localized ad content with accents tailored to each target audience.
Then there’s the part the public at large is less comfortable with: using someone’s voice for the sake of personalization. “Imagine an online course spoken in the voice of any family member or friend.” NVIDIA’s words, not mine – and it’s pretty clear by now that people are more concerned about scammers abusing such technology to sharpen their tactics than about the use case described here.
Back to the tech bit. The secret sauce of Fugatto is its ability to generate combinations of sounds that are about as mismatched as you’d imagine, like a barking trumpet or a meowing saxophone. NVIDIA says the model can “handle tasks it was not pretrained on,” which should open up quite a lot of possibilities for what sounds this AI model can produce. Speaking of training, Fugatto was trained on “a bank of NVIDIA DGX systems” with 32 NVIDIA H100 GPUs doing the heavy lifting, and the full version features 2.5 billion parameters.
NVIDIA made no mention of whether Fugatto will be available to the general public. If we’ve learned anything, though, it is that generative AI models are big double-edged swords that can be especially problematic if they fall into the wrong hands. Google had to be especially careful when it eventually opened up its equivalent, MusicLM, to the public, with some features restricted, likely due to copyright concerns.
Pokdepinion: This technology certainly can go both ways.