Internship

Benchmarking Parameter-Efficient Fine-Tuning Methods for the Spoken Language Understanding Task

Position type: Language Science and Technology (Sciences et Technologies des langues)

M2 / Engineer internship


Applicant Profile

  • Education: You are pursuing an engineering degree or a master’s program specializing in machine learning or data science, with experience in the following areas:
    – Natural Language Processing
    – Speech processing
    – Machine learning / Deep learning
  • Technical Skills:
    – Proficiency in Python and familiarity with machine learning libraries such as TensorFlow, PyTorch,
    or Keras.
    – Experience with data analysis and speech processing tools.
  • Other Skills:
    – Strong analytical abilities.
    – Ability to work both independently and collaboratively in a research environment.

Objectives

The goal of this internship is to explore the most effective Parameter-Efficient Fine-Tuning (PEFT) methods specifically designed for speech-related tasks, such as Spoken Language Understanding (SLU). The SLU task aims to extract semantic information, such as weather type, date, or location, from the speech signal.
Recent approaches rely on end-to-end architectures that combine a self-supervised learning (SSL) speech encoder with a few additional layers [Laperriere et al., 2023, Dinarelli et al., 2022, Laperriere et al., 2024].
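
To make the target architecture concrete, below is a minimal, illustrative sketch (not the internship's actual codebase) of such an end-to-end SLU model: a pretrained SSL speech encoder topped with a few additional layers predicting frame-level semantic tags. It assumes the HuggingFace transformers package; the model name, head size, and number of labels are placeholder assumptions.

```python
# Illustrative sketch only: an SSL speech encoder with a small task head for SLU.
# Assumes the HuggingFace `transformers` package; model name, hidden sizes, and the
# number of semantic labels are placeholders, not the internship's actual setup.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLForSLU(nn.Module):
    def __init__(self, ssl_name="facebook/wav2vec2-base", num_labels=10):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(ssl_name)   # pretrained SSL encoder
        hidden = self.encoder.config.hidden_size
        # "A few additional layers": a small bidirectional GRU plus a linear classifier
        # producing one semantic tag per frame (BIO-style slot labels).
        self.rnn = nn.GRU(hidden, 256, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, num_labels)

    def forward(self, waveform):
        feats = self.encoder(waveform).last_hidden_state   # (batch, frames, hidden)
        out, _ = self.rnn(feats)                            # (batch, frames, 512)
        return self.classifier(out)                         # frame-level logits

model = SSLForSLU()
logits = model(torch.randn(1, 16000))  # one second of dummy 16 kHz audio
```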

The different tasks of the internship are summarized as follows:

  • Literature review: getting familiar with PEFT methods and identifying the most effective ones for speech tasks.
  • Implementation: ensuring the compatibility of evaluation setups across existing implementations and developing any missing functionalities. We will use the SpeechBrain implementation as a starting point, extending and refining it to meet our objectives.
  • Benchmark: conducting a systematic evaluation of the selected PEFT methods on a variety of multilingual and multi-domain SLU benchmarks, utilizing cutting-edge SSL models to explore their generalization and robustness.
  • Improvements: assessing the capabilities of PEFT methods by evaluating their performance in a few-shot scenario, where only a limited number of task-specific examples are available.

The experimental evaluation will be conducted on multiple SLU benchmarks, namely Speech-MASSIVE (multilingual and multi-domain) [Lee et al., 2024], TARIC (Tunisian dialect, transport domain) [Mdhaffar et al., 2024], and MEDIA (French, hotel booking). The study will consider several state-of-the-art SSL models, including Best-RQ [Chiu et al., 2022], HuBERT [Hsu et al., 2021], and Wav2Vec2 [Baevski et al., 2020]. A sketch of the resulting benchmark grid is given below.
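
As an illustration of the scale of this evaluation, the sketch below enumerates the benchmark grid implied by the list above. The dataset loaders, training routine, and the k-examples-per-intent sampling policy for the few-shot scenario are hypothetical placeholders, not a prescribed protocol.

```python
# Illustrative benchmark-grid skeleton; loaders, training, and metrics are placeholders.
import random
from collections import defaultdict

SSL_MODELS = ["best-rq", "hubert", "wav2vec2"]            # SSL encoders under study
SLU_BENCHMARKS = ["speech-massive", "taric", "media"]     # multilingual / multi-domain SLU
PEFT_METHODS = ["adapter", "bitfit", "lora", "mam"]       # one representative per PEFT family

def sample_few_shot(examples, k=16, seed=0):
    """Hypothetical few-shot setup: keep at most k labeled examples per intent."""
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for ex in examples:
        by_intent[ex["intent"]].append(ex)
    subset = []
    for items in by_intent.values():
        rng.shuffle(items)
        subset.extend(items[:k])
    return subset

for ssl in SSL_MODELS:
    for benchmark in SLU_BENCHMARKS:
        for peft in PEFT_METHODS:
            # Placeholder: load `benchmark`, attach `peft` to the `ssl` encoder,
            # fine-tune (full or few-shot), and report SLU metrics such as
            # intent accuracy or concept error rate.
            print(f"run: encoder={ssl}  peft={peft}  benchmark={benchmark}")
```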

Context

Speech technologies have advanced rapidly, powering applications such as virtual assistants and voice-controlled systems; however, adapting these models to new domains remains a significant challenge, particularly in low-resource settings. Self-supervised learning (SSL) improves generalization by learning from unlabeled data, yet fine-tuning large SSL models is costly and risks overfitting with limited labeled data. This has driven interest in Parameter-Efficient Fine-Tuning (PEFT) methods, which adapt models by updating only a small set of task-specific parameters.

Compared to full fine-tuning, PEFT significantly reduces memory usage, computation, and the risk of overfitting, making it a promising solution for efficient and flexible model adaptation. By updating only a small subset of task-specific parameters while keeping most of the SSL model frozen, PEFT methods also help prevent catastrophic forgetting. They fall into four main categories [Han et al., 2024]: Additive PEFT (e.g., adapters), which introduces new modules [Lester et al., 2021]; Selective PEFT (e.g., BitFit), which fine-tunes a subset of existing parameters [Zaken et al., 2022]; Re-parameterized PEFT (e.g., LoRA), which uses low-rank updates [Hu et al., 2021]; and Hybrid PEFT (e.g., MAM Adapter), which combines multiple strategies [He et al., 2022].
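
As a concrete illustration of two of these families, the sketch below contrasts a Selective method (BitFit-style bias-only training) with a Re-parameterized method (LoRA) on an SSL encoder. It assumes the HuggingFace transformers and peft packages; the model name and LoRA hyper-parameters are arbitrary choices for illustration, not recommended settings.

```python
# Illustrative sketch of two PEFT families on an SSL speech encoder.
# Assumes the HuggingFace `transformers` and `peft` packages; model name and
# hyper-parameters are arbitrary examples, not recommended settings.
from transformers import Wav2Vec2Model
from peft import LoraConfig, get_peft_model

def bitfit(model):
    """Selective PEFT: freeze everything except the bias terms."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith(".bias")
    return model

def lora(model):
    """Re-parameterized PEFT: inject low-rank adapters into the attention projections."""
    config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                        target_modules=["q_proj", "v_proj"])
    return get_peft_model(model, config)

def trainable_fraction(model):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total

for label, wrap in [("BitFit", bitfit), ("LoRA", lora)]:
    encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    print(f"{label}: {100 * trainable_fraction(wrap(encoder)):.2f}% of parameters trainable")
```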

PEFT methods, initially explored for Large Language Models (LLMs), are now gaining attention in speech processing, especially for automatic speech recognition (ASR). Sparse and shared LoRA variants have been used to adapt Whisper [Radford et al., 2023] for child speech recognition and multilingual ASR with reduced memory usage [Liu et al., 2024, Song et al., 2024]. PEFT approaches have also been applied to federated learning for ASR and shown to improve performance in speech emotion recognition, spoken language understanding, and multilingual text-to-speech [Ali et al., 2025, Lashkarashvili et al., 2024, Kim et al., 2023, Kwon et al., 2025].

Within this framework, the internship aims to assess the performance of various PEFT methods on the Spoken Language Understanding (SLU) task, through systematic exploration of different SSL models.

Keywords

Parameter-Efficient Fine-Tuning, Self-Supervised Learning, Automatic Speech Processing.

References

  • [Ali et al., 2025] Ali, M. N., Falavigna, D., and Brutti, A. (2025). EFL-PEFT: A communication efficient federated learning framework using PEFT sparsification for ASR. In ICASSP.
  • [Baevski et al., 2020] Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework
    for self-supervised learning of speech representations. NeurIPS.
  • [Chiu et al., 2022] Chiu, C.-C., Qin, J., Zhang, Y., Yu, J., and Wu, Y. (2022). Self-supervised learning with
    random-projection quantizer for speech recognition. In ICML.
  • [Dinarelli et al., 2022] Dinarelli, M., Naguib, M., and Portet, F. (2022). Toward low-cost end-to-end spoken
    language understanding. In Interspeech.
  • [Han et al., 2024] Han, Z., Gao, C., Liu, J., Zhang, J., and Zhang, S. Q. (2024). Parameter-efficient fine-tuning for large models: A comprehensive survey. TMLR.
  • [He et al., 2022] He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., and Neubig, G. (2022). Towards a unified view of parameter-efficient transfer learning. In ICLR.
  • [Hsu et al., 2021] Hsu, W., Bolte, B., Tsai, Y., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM TASLP.
  • [Hu et al., 2021] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • [Kim et al., 2023] Kim, E., Jajodia, A., Tseng, C., Neelagiri, D., Ki, T., and Apsingekar, V. (2023). Efficient
    adaptation of spoken language understanding based on end-to-end automatic speech recognition. In
    Interspeech.
  • [Kwon et al., 2025] Kwon, K.-J., So, J.-H., and Lee, S.-H. (2025). Parameter-efficient fine-tuning for low-resource text-to-speech via cross-lingual continual learning. In Interspeech.
  • [Laperriere et al., 2024] Laperriere, G., Ghannay, S., Jabaian, B., and Esteve, Y. (2024). A dual task learning approach to fine-tune a multilingual semantic speech encoder for spoken language understanding. In Interspeech.
  • [Laperriere et al., 2023] Laperriere, G., Pelloin, V., Rouvier, M., Stafylakis, T., and Esteve, Y. (2023). On the use of semantically-aligned speech representations for spoken language understanding. In SLT.
  • [Lashkarashvili et al., 2024] Lashkarashvili, N., Wu, W., Sun, G., and Woodland, P. C. (2024). Parameter efficient finetuning for speech emotion recognition and domain adaptation. In ICASSP.
  • [Lee et al., 2024] Lee, B., Calapodescu, I., Gaido, M., Negri, M., and Besacier, L. (2024). Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond. In Interspeech.
  • [Lester et al., 2021] Lester, B., Al-Rfou, R., and Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. In EMNLP.
  • [Liu et al., 2024] Liu, W., Qin, Y., Peng, Z., and Lee, T. (2024). Sparsely shared LoRA on Whisper for child speech recognition. In ICASSP.
  • [Mdhaffar et al., 2024] Mdhaffar, S., Bougares, F., De Mori, R., Zaiem, S., Ravanelli, M., and Esteve, Y. (2024). TARIC-SLU: A Tunisian benchmark dataset for spoken language understanding. In COLING.
  • [Radford et al., 2023] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023).
    Robust speech recognition via large-scale weak supervision. In ICML.
  • [Song et al., 2024] Song, Z., Zhuo, J., Yang, Y., Ma, Z., Zhang, S., and Chen, X. (2024). LoRA-Whisper: Parameter-efficient and extensible multilingual ASR. In Interspeech.
  • [Zaken et al., 2022] Zaken, E. B., Goldberg, Y., and Ravfogel, S. (2022). BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In ACL.

Contacts

Stipends

Around €680/month, plus access to staff dining facilities and the campus’s many sports facilities.