Audio Flamingo

Key Features

💡 Audio Flamingo has strong audio understanding capabilities
💡 Audio Flamingo supports in-context few-shot learning and retrieval augmented generation
💡 Audio Flamingo can hold multi-turn dialogues with users in complex contexts

Audio	Instruction	Ground-truth	Baseline	Audio Flamingo
	how many different voices are there?	three	one	three
	how many times is there a loud screeching sound?	three, many	two	three
	how many musical instruments were heard throughout the video?	two	one	two
	how many different objects are being struck?	three	one	three

Audio	Ground-truth	Baseline	Audio Flamingo
Instruction: Describe this sound	a man is speaking followed by applause	a crowd of people are speaking and applauding	a man is speaking and then the audience claps
Instruction: Describe this sound	a person burps followed by laughter	someone is laughing and then they are laughing	a burp is followed by laughter and speech
Instruction: Describe this sound	a high pitched sneeze followed by sniffling	a person is sniffing and breathing heavily	a person sneezes and then breathes heavily
Instruction: Describe this sound	a man talking then whistling	a man is speaking and a woman is speaking	a man is speaking and then whistles

Audio	Instruction	Ground-truth	Baseline	Audio Flamingo
	What is the overall sound quality of the audio?	the sound quality of the audio is clear and well-produced, with a focus on the instruments and vocals	The overall sound quality of the audio is loud and intense, with a variety of different sounds that create a sense of chaos and activity in the environment	The overall sound quality of the audio is clear and crisp
	how would one describe the quality of the conversation?	bad	radio	bad
	What is the atmospheric quality of the audio?	the atmospheric quality of the audio is its ability to create a sense of space and depth, with a focus on creating a sense of mystery and wonder	The audio has a rich and dynamic quality, which suggests a lively and engaging atmosphere, possibly associated with music or a musical performance, but it is not specified in the given audio events.	the audio has a melancholic and introspective atmosphere
	describe the audio in detail.	the audio is a unique blend of different genres that creates a distinct sound. the ambient and trance elements give it a calming and meditative feel	The audio clip contains a ringtone and a telephone	the music is slow and has a soothing quality to it

Audio	Instruction	Ground-truth	Baseline	Audio Flamingo
	What is the most striking characteristic of the audio?	The most striking characteristic of the audio is its powerful and uplifting nature, which makes it a great choice for anyone looking for an energetic and inspiring listen	The audio clip features a guitar playing a fast and aggressive riff with drums in the background	The most striking characteristic of the audio is its loudness and intensity
	Describe the sound in a sentence.	a man briefly talks followed by a loud burst then laughter	a person is laughing and a gun is fired	a man is speaking and then a loud bang occurs
	How does the soundtrack build tension and drama?	The soundtrack uses dynamic shifts in tempo, instrumentation, and volume to build tension and create a sense of dramatic urgency	The soundtrack builds tension and drama by using a combination of different musical elements, such as crescendos and sudden changes in tempo	The soundtrack gradually builds tension and drama through its use of orchestral instruments, such as violins and cellos, which create a sense of drama and tension
	Describe the sound in a sentence.	a loud distorted belch followed by a series of burping	a loud burp and a loud burp	a person burps loudly and repeatedly

Audio	Instruction	Ground-truth	Baseline	Audio Flamingo
	Describe this sound.	a man talking followed by another man laughing as rain falls and thunder roars in the distance	a man is talking and rain is falling.	a man is talking in the rain with thunder in the background.
	is a vehicle moving in the distance?	yes
	are there children talking in the distance?	yes
	Describe this sound.	a bus engine driving in the distance then nearby followed by compressed air releasing while a woman and a child talk in the distance	a truck is idling and idling	a truck idles in the distance and a child screams

Audio	Ground-truth	Baseline	Audio Flamingo	Retrieved Samples
Instruction: Describe this sound	a person snoring repeatedly	someone is snoring loudly	a person is snoring in a loop	a person snoring almost wheezing continuously; someone snores through the nose and makes funny noises; continuous snoring
Instruction: Describe this sound	paper is crinkling and crumpling, and thumping occurs	movement of a person in a kitchen with pots and pans on a counter	a person crumples up a sheet of paper and crinkles it	Some material is crumpled; plastic is being crumpled; crinkling of a paper with a small step
Instruction: Describe this sound	pigeons are cooing flapping and a man talks	pigeons coo and a man speaks	a man speaks while pigeons coo and flap their wings	A man talks while pigeons coo and flap wings; A person talks nearby as pigeons coo; Man speaking, doves cooing
Instruction: Describe this sound	a woman speaks, and faucet water runs	a woman is talking to a child	a woman speaks while water runs from a faucet.	a woman speaking and water running; a women voice followed by flowing water from a faucet; water sprays from a faucet, a person speaks


Audio	Dialogue
	User: Can you briefly explain what you hear in the audio?
	Ground-truth: The audio primarily consists of bird sounds, including chirping and bird vocalizations. There are also sounds of other animals. Additionally, there is a woman speaking at certain intervals.
	Baseline: Birds are chirping.
	Audio Flamingo: The audio features a variety of sounds including bird vocalizations, mechanical sounds, impact sounds, a man speaking, and a ticking sound.
	User: Are the bird sounds continuous?
	Ground-truth: The bird sounds are not continuous. They occur at different intervals throughout the audio, with some overlap with the woman's speech.
	Baseline: Yes, the bird sounds are continuous.
	Audio Flamingo: No, the bird sounds are not continuous. They occur at different intervals throughout the audio.
	User: Do the bird sounds and the person's speech overlap?
	Ground-truth: Yes, there are instances where the bird sounds and the woman's speech overlap.
	Baseline: Yes, the bird sounds and the woman's speech overlap.
	Audio Flamingo: Yes, the bird sounds and the person's speech do overlap at certain points in the audio.

	User: Is the male voice rapping or singing?
	Ground-truth: The male voice is doing both, rapping and singing in the audio.
	Baseline: The male voice is rapping.
	Audio Flamingo: The male voice is rapping.
	User: What kind of music is it?
	Ground-truth: The type of music isn't specified, but it's accompanied by rapping and male singing.
	Baseline: The kind of music is hip-hop.
	Audio Flamingo: The music is a rap song.
	User: Is the male voice the only one present?
	Ground-truth: Yes, the male voice is the only one specified in the audio, both rapping and singing.
	Baseline: Yes, the male voice is the only one present.
	Audio Flamingo: Yes, the male voice is the only one present in the audio.

	User: Is the rattle sound followed immediately by the spray sound?
	Ground-truth: Yes, the rattle sound is immediately followed by the spray sound with a very short gap in between.
	Baseline: Yes, the rattle sound is followed immediately by the spray sound.
	Audio Flamingo: Yes, the rattle sound is followed immediately by the spray sound.
	User: Does the spray sound occur only once?
	Ground-truth: No, the spray sound occurs twice in the audio.
	Baseline: Yes, the spray sound occurs only once.
	Audio Flamingo: No, the spray sound occurs twice in the audio.
	User: Is the second spray sound longer than the first one?
	Ground-truth: Yes, the second spray sound lasts longer than the first one.
	Baseline: Yes, the second spray sound is longer than the first one.
	Audio Flamingo: Yes, the second spray sound is longer than the first one.

	User: What genre does this music belong to?
	Ground-truth: This music belongs to the R&B/Disco genre.
	Baseline: It is not specified in the given audio clip.
	Audio Flamingo: This music belongs to the R&B genre.
	User: Can you tell me more about the vocals in this track?
	Ground-truth: Sure, the vocals in this track are from a male singer who is singing in a higher register. His voice has been modified with effects, making it sound like a voice sample.
	Baseline: The vocals in this track are described as rich and full-bodied, but no further details are given about the specific singer or the lyrics being sung in the audio clip.
	Audio Flamingo: The vocals in this track are by a male singer. His voice is deep, resonant, and powerful, adding to the overall emotional impact of the song.

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

Key Features

Audio Understanding

🚀 Audio Flamingo can understand occurrences in audio

In-Context Learning and Retrieval Augmented Generation

🚀 Audio Flamingo can borrow keywords from retrieved samples

Multi-turn dialogue

Instruction	Ground-truth	Baseline	Audio Flamingo	Retrieved Samples
What is the genre of this music?	pop	pop	pop	reggae jazz disco
What is the genre of this music?	jazz	jazz	jazz	classical classical classical
What is the instrument of this music?	flute	flute	flute	clarinet clarinet clarinet
What is the emotion of this speech?	sad	sad	sad	neutral fearful calm
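
The "Retrieved Samples" column above lists labels from samples that were retrieved for each query. This page does not describe how retrieval or prompting is implemented, so the following is only a minimal sketch of the general idea, assuming cosine similarity over audio embeddings and a plain-text few-shot prompt. The embeddings, the `retrieve` and `build_prompt` helpers, and the `<audio:...>` placeholder tokens are all hypothetical and are not taken from Audio Flamingo.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vec, store, k=3):
    """Return the k stored samples most similar to the query embedding."""
    ranked = sorted(store, key=lambda s: cosine(query_vec, s["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(instruction, retrieved):
    """Place retrieved (audio, answer) pairs before the query as few-shot examples.

    The <audio:...> tokens are hypothetical placeholders for audio inputs.
    """
    lines = [f"<audio:{s['id']}> {instruction} {s['label']}" for s in retrieved]
    lines.append(f"<audio:query> {instruction}")
    return "\n".join(lines)

# Toy, made-up embeddings and labels, for illustration only.
store = [
    {"id": "s1", "vec": [0.9, 0.1, 0.0], "label": "clarinet"},
    {"id": "s2", "vec": [0.8, 0.2, 0.1], "label": "clarinet"},
    {"id": "s3", "vec": [0.0, 0.1, 0.9], "label": "drums"},
    {"id": "s4", "vec": [0.7, 0.3, 0.0], "label": "flute"},
]
query_vec = [1.0, 0.1, 0.0]

retrieved = retrieve(query_vec, store, k=3)
prompt = build_prompt("What is the instrument of this music?", retrieved)
print(prompt)
```

In this toy setup the three nearest samples are placed in front of the query, so the model can draw on their labels when answering, which is the behavior the "borrow keywords from retrieved samples" feature refers to.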
