
Location: LISN, Belvédère site

STL

On the Use of Semantically-Aligned Speech Representations for Spoken Language Understanding, with Cross-lingual and Cross-modal Specialization

Speaker: Gaëlle Laperrière (LIA)

Spoken language understanding (SLU) refers to natural language processing tasks related to semantic extraction from speech. Several tasks can be addressed as SLU tasks, such as named entity recognition from speech, call routing, or slot filling in the context of human-machine dialogue.

End-to-end neural approaches have been proposed to extract semantics directly from the speech signal with a single neural model, instead of applying the classical cascade approach, in which an automatic speech recognition (ASR) system is followed by a natural language understanding (NLU) module. A main issue with these approaches is the lack of bimodal annotated data (speech recordings with manual semantic annotation). Self-supervised learning (SSL), which benefits from unlabelled data, has been successfully applied to several SLU tasks, especially through cascade approaches.
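
To make the contrast concrete, here is a minimal sketch of such a cascade pipeline, assuming the Hugging Face transformers library; the checkpoints named below are illustrative placeholders, not the systems used in this work.

```python
# Minimal sketch of a cascade SLU pipeline (ASR -> NLU).
# Checkpoints are illustrative, not the ones from this talk.
from transformers import pipeline

# Stage 1: automatic speech recognition turns audio into a transcript.
asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-large-960h")

# Stage 2: a text-only NLU model extracts semantics from the transcript;
# here, named entity recognition stands in for slot filling.
nlu = pipeline("token-classification", model="dslim/bert-base-NER")

transcript = asr("utterance.wav")["text"]
slots = nlu(transcript)  # ASR errors propagate into this stage
print(transcript, slots)
```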

The use of an end-to-end approach that directly exploits both speech and text SSL models is limited by the difficulty of unifying the speech and textual representation spaces, in addition to the complexity of managing a huge number of model parameters.

Earlier in 2022, a promising new model was introduced: SAMU-XLSR (https://arxiv.org/abs/2205.08180), a Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. The model combines the state-of-the-art multilingual frame-level acoustic speech representation model XLS-R with the Language-Agnostic BERT Sentence Embedding (LaBSE) model to create an utterance-level, multimodal, multilingual speech encoder.
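
As a rough illustration of the idea (not the authors' actual implementation), the sketch below pools XLS-R frame-level features into a single utterance vector and pulls it towards the frozen LaBSE embedding of the transcript with a cosine loss. The mean pooling, linear projection, and checkpoint names are simplifying assumptions based on the paper's description.

```python
# Hedged sketch of the SAMU-XLSR training signal: align an utterance-level
# speech embedding (student) with a frozen LaBSE text embedding (teacher).
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Model
from sentence_transformers import SentenceTransformer

speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
labse = SentenceTransformer("sentence-transformers/LaBSE")  # frozen teacher
# Project speech features into LaBSE's 768-dimensional sentence space.
proj = torch.nn.Linear(speech_encoder.config.hidden_size, 768)

def samu_loss(waveform: torch.Tensor, transcript: str) -> torch.Tensor:
    # Frame-level features: (batch, frames, hidden)
    frames = speech_encoder(waveform).last_hidden_state
    # Utterance-level speech embedding via mean pooling + projection
    # (the actual model uses a more elaborate pooling scheme).
    speech_emb = proj(frames.mean(dim=1))
    with torch.no_grad():  # the teacher is never updated
        text_emb = torch.tensor(labse.encode([transcript]))
    # Cosine distance between the two embeddings is the training signal.
    return 1.0 - F.cosine_similarity(speech_emb, text_emb).mean()
```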

We analyzed the performance and behavior of the SAMU-XLSR model on the French MEDIA benchmark dataset, which is considered a very challenging benchmark for SLU. Moreover, using the Italian PortMEDIA corpus, we also investigated the potential of porting an existing end-to-end SLU model from one language (French) to another (Italian) in two scenarios: zero-shot and low-resource learning.

We then investigated the use of cross-lingual and cross-modal representations for our SLU task. We searched for a way to exploit the sentence-level embeddings produced by SAMU-XLSR in order to improve semantic extraction, and we proposed an efficient specialization of the SAMU-XLSR model to sentences related to our task. In addition, we presented the benefits of the proposed approach for language portability.
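
The specialization mentioned above can be pictured as continuing the same speech-to-text alignment training, but restricted to task-related sentences (e.g. MEDIA transcripts), so that the utterance embeddings better reflect the task's semantics. The hedged sketch below reuses speech_encoder, proj, and samu_loss from the earlier sketch; the optimizer settings and data handling are illustrative assumptions, not the actual recipe.

```python
# Hedged sketch of task specialization: continue the alignment training
# on in-domain (task) transcript/audio pairs only. Hyperparameters are
# illustrative assumptions.
import torch

optimizer = torch.optim.AdamW(
    list(speech_encoder.parameters()) + list(proj.parameters()), lr=1e-5)

def specialize(task_pairs):
    # task_pairs: iterable of (waveform tensor, transcript) drawn from
    # the target SLU corpus, e.g. MEDIA.
    for waveform, transcript in task_pairs:
        loss = samu_loss(waveform, transcript)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```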
