Abstract. Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities. We introduce a series of training techniques, architecture design, and data strategies to enhance our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art benchmarks.
💡 Audio Flamingo has strong audio understanding capability
💡 Audio Flamingo supports in-context few-shot learning and retrieval-augmented generation
💡 Audio Flamingo can have multi-turn dialogues with users under complex context
Audio Flamingo not only achieves state-of-the-art results on audio understanding benchmarks, but also understands audio in depth.
Below we show that Audio Flamingo can understand (1) occurrences in audio, (2) the order of sounds, (3) the quality of audio, (4) loudness and its change over time, and (5) the distance of sounds.
Instruction | Ground-truth | Baseline | Audio Flamingo |
---|---|---|---|
how many different voices are there? | three | one | three |
how many times is there a loud screeching sound? | three, many | two | three |
how many musical instruments were heard throughout the video? | two | one | two |
how many different objects are being struck? | three | one | three |

Instruction | Ground-truth | Baseline | Audio Flamingo |
---|---|---|---|
Describe this sound | a man is speaking followed by applause | a crowd of people are speaking and applauding | a man is speaking and then the audience claps |
Describe this sound | a person burps followed by laughter | someone is laughing and then they are laughing | a burp is followed by laughter and speech |
Describe this sound | a high pitched sneeze followed by sniffling | a person is sniffing and breathing heavily | a person sneezes and then breathes heavily |
Describe this sound | a man talking then whistling | a man is speaking and a woman is speaking | a man is speaking and then whistles |

Instruction | Ground-truth | Baseline | Audio Flamingo |
---|---|---|---|
What is the overall sound quality of the audio? | the sound quality of the audio is clear and well-produced, with a focus on the instruments and vocals | The overall sound quality of the audio is loud and intense, with a variety of different sounds that create a sense of chaos and activity in the environment | The overall sound quality of the audio is clear and crisp |
how would one describe the quality of the conversation? | bad | radio | bad |
What is the atmospheric quality of the audio? | the atmospheric quality of the audio is its ability to create a sense of space and depth, with a focus on creating a sense of mystery and wonder | The audio has a rich and dynamic quality, which suggests a lively and engaging atmosphere, possibly associated with music or a musical performance, but it is not specified in the given audio events. | the audio has a melancholic and introspective atmosphere |
describe the audio in detail. | the audio is a unique blend of different genres that creates a distinct sound. the ambient and trance elements give it a calming and meditative feel | The audio clip contains a ringtone and a telephone | the music is slow and has a soothing quality to it |

Instruction | Ground-truth | Baseline | Audio Flamingo |
---|---|---|---|
What is the most striking characteristic of the audio? | The most striking characteristic of the audio is its powerful and uplifting nature, which makes it a great choice for anyone looking for an energetic and inspiring listen | The audio clip features a guitar playing a fast and aggressive riff with drums in the background | The most striking characteristic of the audio is its loudness and intensity |
Describe the sound in a sentence. | a man briefly talks followed by a loud burst then laughter | a person is laughing and a gun is fired | a man is speaking and then a loud bang occurs |
How does the soundtrack build tension and drama? | The soundtrack uses dynamic shifts in tempo, instrumentation, and volume to build tension and create a sense of dramatic urgency | The soundtrack builds tension and drama by using a combination of different musical elements, such as crescendos and sudden changes in tempo | The soundtrack gradually builds tension and drama through its use of orchestral instruments, such as violins and cellos, which create a sense of drama and tension |
Describe the sound in a sentence. | a loud distorted belch followed by a series of burping | a loud burp and a loud burp | a person burps loudly and repeatedly |

Instruction | Ground-truth | Baseline | Audio Flamingo |
---|---|---|---|
Describe this sound. | a man talking followed by another man laughing as rain falls and thunder roars in the distance | a man is talking and rain is falling. | a man is talking in the rain with thunder in the background. |
is a vehicle moving in the distance? | yes | no | yes |
are there children talking in the distance? | yes | no | yes |
Describe this sound. | a bus engine driving in the distance then nearby followed by compressed air releasing while a woman and a child talk in the distance | a truck is idling and idling | a truck idles in the distance and a child screams |

Audio Flamingo can be adapted to new tasks via in-context few-shot learning and retrieval-augmented generation, without task-specific fine-tuning.
Below we show how Audio Flamingo can (1) borrow key words from retrieved samples on audio captioning and (2)
disregard retrieved samples if they are noisy and ineffective.
Instruction | Ground-truth | Baseline | Audio Flamingo |
---|---|---|---|
Describe this sound | a person snoring repeatedly | someone is snoring loudly | a person is snoring in a loop |
Describe this sound | paper is crinkling and crumpling, and thumping occurs | movement of a person in a kitchen with pots and pans on a counter | a person crumples up a sheet of paper and crinkles it |
Describe this sound | pigeons are cooing flapping and a man talks | pigeons coo and a man speaks | a man speaks while pigeons coo and flap their wings |
Describe this sound | a woman speaks, and faucet water runs | a woman is talking to a child | a woman speaks while water runs from a faucet. |
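
As a rough illustration of the retrieval-augmented few-shot setup described above, the sketch below picks the captioned clips closest to a query clip and interleaves them as in-context examples ahead of the query. It is only a minimal sketch under stated assumptions: the names `Sample`, `retrieve`, and `build_prompt`, the `<audio>` placeholder, and the toy 2-D embeddings are hypothetical and are not taken from the Audio Flamingo implementation, which would rely on learned audio embeddings and its own prompt template.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical datastore entry: an audio embedding plus its caption."""
    embedding: list[float]
    caption: str


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: list[float], store: list[Sample], k: int = 2) -> list[Sample]:
    """Return the k store entries most similar to the query embedding."""
    ranked = sorted(store, key=lambda s: cosine(query, s.embedding), reverse=True)
    return ranked[:k]


def build_prompt(retrieved: list[Sample], instruction: str) -> str:
    """Place retrieved (audio, caption) pairs before the query audio.

    "<audio>" is an assumed placeholder marking where a clip's features go.
    """
    shots = [f"<audio> {instruction} {s.caption}" for s in retrieved]
    return "\n".join(shots + [f"<audio> {instruction}"])


# Toy datastore with made-up 2-D embeddings.
store = [
    Sample([1.0, 0.0], "a person is snoring"),
    Sample([0.0, 1.0], "pigeons coo and flap their wings"),
    Sample([0.9, 0.1], "someone snores loudly"),
]
prompt = build_prompt(retrieve([1.0, 0.05], store, k=2), "Describe this sound:")
print(prompt)
```

In this toy setup the two snoring captions are retrieved and placed before the query, so a model could borrow their key words. Whether to use or ignore noisy retrieved samples is left to the model, as in the examples above; the sketch does no filtering of its own.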
Audio Flamingo can hold multi-turn dialogues with the user.
It understands complex context and captures correlations between rounds; for example, it resolves pronouns correctly.
Dialogue | User | Ground-truth | Baseline | Audio Flamingo |
---|---|---|---|---|
1 | Can you briefly explain what you hear in the audio? | The audio primarily consists of bird sounds, including chirping and bird vocalizations. There are also sounds of other animals. Additionally, there is a woman speaking at certain intervals. | Birds are chirping. | The audio features a variety of sounds including bird vocalizations, mechanical sounds, impact sounds, a man speaking, and a ticking sound. |
1 | Are the bird sounds continuous? | The bird sounds are not continuous. They occur at different intervals throughout the audio, with some overlap with the woman's speech. | Yes, the bird sounds are continuous. | No, the bird sounds are not continuous. They occur at different intervals throughout the audio. |
1 | Do the bird sounds and the person's speech overlap? | Yes, there are instances where the bird sounds and the woman's speech overlap. | Yes, the bird sounds and the woman's speech overlap. | Yes, the bird sounds and the person's speech do overlap at certain points in the audio. |
2 | Is the male voice rapping or singing? | The male voice is doing both, rapping and singing in the audio. | The male voice is rapping. | The male voice is rapping. |
2 | What kind of music is it? | The type of music isn't specified, but it's accompanied by rapping and male singing. | The kind of music is hip-hop. | The music is a rap song. |
2 | Is the male voice the only one present? | Yes, the male voice is the only one specified in the audio, both rapping and singing. | Yes, the male voice is the only one present. | Yes, the male voice is the only one present in the audio. |
3 | Is the rattle sound followed immediately by the spray sound? | Yes, the rattle sound is immediately followed by the spray sound with a very short gap in between. | Yes, the rattle sound is followed immediately by the spray sound. | Yes, the rattle sound is followed immediately by the spray sound. |
3 | Does the spray sound occur only once? | No, the spray sound occurs twice in the audio. | Yes, the spray sound occurs only once. | No, the spray sound occurs twice in the audio. |
3 | Is the second spray sound longer than the first one? | Yes, the second spray sound lasts longer than the first one. | Yes, the second spray sound is longer than the first one. | Yes, the second spray sound is longer than the first one. |
4 | What genre does this music belong to? | This music belongs to the R&B/Disco genre. | It is not specified in the given audio clip. | This music belongs to the R&B genre. |
4 | Can you tell me more about the vocals in this track? | Sure, the vocals in this track are from a male singer who is singing in a higher register. His voice has been modified with effects, making it sound like a voice sample. | The vocals in this track are described as rich and full-bodied, but no further details are given about the specific singer or the lyrics being sung in the audio clip. | The vocals in this track are by a male singer. His voice is deep, resonant, and powerful, adding to the overall emotional impact of the song. |