ABSTRACT
An increasing amount of research has shed light on machine perception of audio events, most of which concerns detection and classification tasks. However, human-like perception of audio scenes involves not only detecting and classifying audio sounds but also summarizing the relationship between different audio events. Comparable research, such as image captioning, has been conducted, yet the audio field remains largely unexplored. This paper introduces a manually annotated dataset for audio captioning. The purpose is to automatically generate natural sentences for audio scene description and to bridge the gap between machine perception of audio and that of images. The whole dataset is labelled in Mandarin, and we also include translated English annotations. A baseline encoder-decoder model is provided for both English and Mandarin. Similar BLEU scores are obtained for both languages: our model can generate understandable and data-related captions based on the dataset.
Index Terms— Audio Caption, Audio Databases, Natural Language Generation, Recurrent Neural Networks
1. INTRODUCTION
Current audio databases for audio perception (e.g. AudioSet [1], the TUT Acoustic Scenes database [2], the UrbanSound dataset [3]) are mainly segmented and annotated with individual labels, either free of choice or fixed. These datasets provide a systematic paradigm of sound taxonomy and yield relatively strong results for both classification [4] and detection tasks [5]. However, audio scenes in the real world usually consist of multiple overlapping sound events. When humans perceive and describe an audio scene, we not only detect and classify audio sounds but, more importantly, figure out the inner relationship between individual sounds and summarize it in natural language. This is a higher-level ability and consequently more challenging for machine perception.
A similar phenomenon is present with images: describing an image requires more than classification and object recognition. To achieve human-like perception, using natural language to describe images [6, 7, 8] and videos [9, 10, 11] has attracted much attention. Yet little research has addressed audio scenes [12], which we think is due to the difference between visual and auditory perception. In visual perception, spatial information is processed, and we can describe a visual object by its shape, colour, size and its position relative to other objects. For audio, however, such descriptive traits are yet to be established. Auditory perception mainly involves temporal information processing, and the overlap of multiple sound events is the norm. Describing an audio scene therefore requires a large amount of cognitive computation. A preliminary step is to discriminate the foreground and background sound events, and to process the relationship of different sounds in temporal order. Secondly, we need to draw on common knowledge to fully understand each sound event. For instance, a 3-year-old child might not entirely understand the meaning of a siren: they could not infer that an ambulance or fire engine is coming. Our common knowledge accumulates as we age. Lastly, for most audio scenes involving speech, we need access to the semantic information behind the speech signals to fully understand the whole audio scene. Sometimes further reasoning based on the speech information is also needed. For example, from a conversation concerning treatment options, we could speculate that it takes place between a doctor and a patient.
Heinrich Dinkel is the co-first author. Kai Yu and Mengyue Wu are the corresponding authors. This work has been supported by the Major Program of the National Social Science Foundation of China (No. 18ZDA293). Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University.
Nevertheless, current machine perception tasks are mostly classification and detection. In order to help machines understand audio events in a more human-like way, we need a dataset that enables automatic audio captioning, whose aim is to automatically generate natural language to describe an audio scene. Broad applications can be expected: hearing-impaired people can understand the content of an audio scene, and detailed audio surveillance becomes possible. Just as humans need both audio and visual information for comprehensive understanding, the combination of these two channels for machine perception is inevitable. It is therefore practical and essential to have a tool that helps process audio information. Further, English is quite a dominant language in the captioning field. Among the limited attempts to caption images in other languages, a translation method is usually used (e.g. [13]) and, to our knowledge, there is no audio dataset specifically set up for Mandarin captioning.
Section 2 will introduce the Audio Caption Dataset and detailed data analysis can be found in Section 3. The baseline model description is provided in Section 4. We present human evaluation results of the model-generated sentences in Section 5.
2. THE AUDIOCAPTION DATASET
Previous datasets (e.g. AudioSet [1]) mostly concentrate on individual sound classes with single-word labels like music, speech, vehicle, etc. However, these labels are insufficient to probe the relationship between sound classes. For instance, given the labels ‘speech’ and ‘music’ for one sound clip, the exact content of the audio remains unclear. Further, although AudioSet contains as many as 527 sound labels, it still cannot cover all sound classes, especially some scene-specific sounds.
This database differs from previous audio datasets in four aspects: 1) the composition of more complicated sound events in one audio clip; 2) a new annotation method that enables audio captioning; 3) the segmentation of specific scenes for scene-specific audio processing; 4) the use of Mandarin Chinese as the natural language. We also include translated English annotations for broader use of this dataset. We identify five scenes that might be of most interest for audio captioning: hospital, banking ATMs, car, home and conference room. We first release our 10 h labelled dataset for the hospital scene and will keep publishing the other scenes.
Source As audio-only information is not sufficient for determining a sound event, we included video clips with sound. All video data were extracted from Youku, Iqiyi and Tencent movies, which are video websites equivalent to Youtube. They also hold exclusive authorization of TV shows, thus some of the video clips were from TV shows and interviews. When sampling the data, we limited background music and maximized real-life similarity. Each video was 10 s in length, with no segmentation of sound classes; thus our sound clips contain richer information than current mainstream datasets. The hospital scene consists of 3710 video clips (about 10 h duration in total, see Table 3).
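The clipping procedure described above can be reproduced with standard tools. The following is a minimal sketch, assuming ffmpeg is installed and the source videos have already been downloaded locally; the file names, start offset and 16 kHz sample rate are illustrative assumptions, not part of the released dataset.

import subprocess
from pathlib import Path

def extract_audio_clip(video_path: Path, start_sec: float, out_path: Path,
                       duration: float = 10.0, sample_rate: int = 16000) -> None:
    """Cut a fixed-length mono audio clip from a source video with ffmpeg."""
    cmd = [
        "ffmpeg", "-y",
        "-ss", str(start_sec),    # seek to the clip start
        "-t", str(duration),      # keep a 10-second segment
        "-i", str(video_path),
        "-vn",                    # drop the video stream, keep audio only
        "-ac", "1",               # mono
        "-ar", str(sample_rate),  # resample (16 kHz is an assumption)
        str(out_path),
    ]
    subprocess.run(cmd, check=True)

# Illustrative usage: carve one 10 s clip out of a downloaded hospital-scene video.
extract_audio_clip(Path("raw/hospital_0001.mp4"), 35.0, Path("clips/hospital_0001.wav"))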
Annotation We consider four traits to describe the sound events in an audio clip: its definition (what sound is it), its owner (what is making the sound), its attribute (what does it sound like) and its location (where is the sound happening). Almost every audio scene can be understood via these four traits, e.g. “Two dogs are barking acutely”, “A bunch of firemen are putting out a fire and there are people screaming”.
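To make the annotation scheme concrete, one labelled clip could be represented by a structure along the following lines. This is a hypothetical sketch of a record layout, not the released file format; all field names and values are invented for illustration.

# Hypothetical annotation record for one 10-second clip (field names and values are illustrative).
annotation = {
    "clip_id": "hospital_0001",
    "scene": "hospital",
    "sounds": ["speech", "footsteps", "door closing"],        # what sounds are heard
    "owners": ["a nurse and a patient", "people walking"],    # who/what is making them
    "attributes": ["calm conversation", "distant, echoing"],  # what they sound like
    "location": "a hospital corridor",                        # where the sound is happening
    # Free-form natural-language caption (originally in Mandarin),
    # shown here in its translated English form.
    "caption": "A nurse talks with a patient while people walk past a closing door.",
}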
Each video was labelled by three raters to ensure some variance in the dataset. All human raters received undergraduate education and were instructed to focus on the sound of the video while labelling. They were asked to label in two steps:
1. Answer the following four questions:
1) list all the sounds you’ve heard;
2) who/what is making these sounds;
3) what are these sounds like;
4) where is the sound happening.
2. Use natural language to describe the audio events;
The questions in Step 1 are intended to help generate a sentence like the description in Step 2, which serves as the reference in our current task. The labelling language was chosen freely, as there is great subjective variability in human perception of an audio scene.
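Since each clip receives three free-form descriptions, these can be used directly as multiple references when scoring a generated caption, e.g. with the BLEU metric reported in the abstract. Below is a minimal sketch, assuming NLTK is available and that Mandarin captions are tokenized per character (an assumption, not necessarily the paper's setup); the captions are invented examples, not taken from the dataset.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Three human captions for the same clip act as references (invented examples).
references = [
    list("医生在和病人说话"),    # tokenized per character (an assumption)
    list("一位医生与病人交谈"),
    list("病人正在咨询医生"),
]
hypothesis = list("医生正在和病人交谈")  # candidate caption from the model

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")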