Models, Methods, and Multilingualism (M3)

Coordination: Albert RILLIARD

The M3 team develops models and methods for the fundamental description and automatic processing of natural language in its multilingual dimension. This involves building models and methods that can process, extract, and describe relevant characteristics of linguistic systems from datasets. We consider languages in all their uses, with a special focus on under-resourced languages and on comparative (multilingual) approaches to language description. The team investigates the interactions between models and systems: How does a model help our understanding of languages? How do sets of languages differ, and at what levels? How can generative models be used to produce controlled instances of given phenomena? Implementing explainable, well-adapted models enables exchange between disciplines and promotes interdisciplinary work (computer science, language sciences, sociology, psychology). The team's research is organized along three main axes:

A- Models and Methods

This axis focuses on learning paradigms: developing data models and algorithms (with many or few parameters, generative or not) for application to automatic language processing and to languages as structured objects. These models are typically applied to the objects studied in the other two axes. Particular attention is paid to accessibility, that is, to the specific methods needed to develop inclusive technologies for society. Effective, frugal models adapted to the representation of specific data are particularly sought, in order to promote explainable and responsible approaches to the data studied. Such models make it possible to build hybrid solutions that aim to keep generative AI under control, and they offer approaches for building more ethical systems.

B- Typology, Variation, and Universals in Languages

This axis focuses on the corpus-based description of the characteristics of linguistic systems. This involves automatically applying typological schemes and comparing languages according to these characteristics. Variation is a central concern, with work on diatopic, diastratic, diaphasic, and diachronic variation within languages, or on variation linked to language contact, for under-resourced as well as well-described languages. Syntactic, phonological, phonetic, articulatory, and prosodic systems are considered. Another focus is the representation of the differences between the systems studied, expressed as distances or projected onto an atlas.

C- Contextualized Behaviors for Interaction

This axis models and describes performance in situated communicative interactions at the para- and extra-linguistic levels: pragmatic functions (speech acts, attitudinal nuances), emotions (affective interaction, social emotions), nudges (gentle manipulation), vocal effort (Lombard speech, voice strength), etc. It aims to propose models of the behavioral changes linked to these phenomena in order to detect them, measure their variation or dynamics, and categorize them. Analyzing the acoustic-linguistic parameters of the voice (parameters derived from models, glottal source, articulatory choices, etc.) makes it possible to link performances to functions.

Publications

  • Conference paper

    Diandra Fabre, Julie Lascar, Julie Halbout, Yanis Ouakrim, Annelies Braffort, et al. Exploring Sign-level Strategies to Enhance Automatic Translation of French Sign Language. ACM International Conference on Intelligent Virtual Agents, Sep 2025, Berlin, Germany. ⟨10.1145/3742886.3756733⟩. ⟨hal-05280328⟩

    AMI (Architectures et modèles pour l'Interaction), STL

    Available in free access

  • Thesis

    Marco Naguib. Extraction d’information clinique : méthodes et ressources pour l’adaptation en domaine. Computer Science [cs]. Université Paris-Saclay, 2025. French. ⟨NNT: 2025UPASG054⟩. ⟨tel-05289152⟩

    STL

    Available in free access

  • Conference paper

    Armand Stricker, Patrick Paroubek. Chitchat as Interference: Adding User Backstories to Task-Oriented Dialogues. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA; ICCL, May 2024, Torino, Italy. pp.3203–3214. ⟨hal-05242362⟩

    STL

    Available in free access

  • Conference paper

    Fanny Ducel, Jeffrey André, Aurélie Névéol, Karën Fort. Introducing MascuLead: the First Gender Bias Leaderboard. EALM 2025 – Ethic and Alignment of (Large) Language Models, Jun 2025, Marseille, France. pp.12-19. ⟨hal-05282981⟩

    STL

    Available in free access

  • Conference paper

    Fanny Ducel, Nicolas Hiebel, Olivier Ferret, Karën Fort, Aurélie Névéol. « Les femmes ne font pas de crise cardiaque ! » Étude des biais de genre dans les cas cliniques synthétiques en français. 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2025), Jul 2025, Marseille, France. pp.1. ⟨hal-05282965⟩

    STL

    Available in free access

  • Conference paper

    Clémentine Bleuze, Fanny Ducel, Maxime Amblard, Karën Fort. « De nos jours, ce sont les résultats qui comptent » : création et étude diachronique d’un corpus de revendications issues d’articles de TAL. TALN 2025 – 32ème Conférence sur le Traitement Automatique des Langues Naturelles, Jul 2025, Marseille, France. ⟨hal-05282966⟩

    STL

    Available in free access

  • Thesis

    Yajing Feng. Continuous Recognition of Client Emotions from Speech and Text in Real-World Call Center Conversations: a Context-Aware Dataset and Empirical Study. Artificial Intelligence [cs.AI]. Université Paris-Saclay, 2025. English. ⟨NNT: 2025UPASG042⟩. ⟨tel-05241382⟩

    STL

    Available in free access

  • Preprint, working paper

    Alexander Goldberg, Ihsan Ullah, Thanh Gia Hieu Khuong, Benedictus Kent Rachmat, Zhen Xu, et al. Usefulness of LLMs as an Author Checklist Assistant for Scientific Papers: NeurIPS’24 Experiment. 2025. ⟨hal-05230379⟩

    AO, STL

    Available in free access

  • Conference paper

    Floris Thiant, Olivia Penas, Yann Leroy, Anne-Laure Ligozat. System analysis of digital service system perimeter and its interdependencies in Life Cycle Assessment. 2025 IEEE International Symposium on Systems Engineering (ISSE), Oct 2025, Palaiseau, France. ⟨hal-05240543⟩

    STL

  • Journal article

    Thomas Gerald, Louis Tamames, Sofiane Ettayeb, Ha-Quang Le, Patrick Paroubek, et al. CQuAE: A new Contextualized QUestion Answering corpus on Education domain. Data and Knowledge Engineering, 2024, 151, pp.102305. ⟨10.1016/j.datak.2024.102305⟩. ⟨hal-05242257⟩

    STL

  • Book chapter

    Tommaso Raso, Saulo Mendes Santos, Albert Rilliard, João A. Moraes. Defining and Identifying Discourse Markers in Spontaneous Speech. Miguel Oliveira, Jr. Prosodic Interfaces – Interdisciplinary Perspectives on Sound Patterns and Human Interaction, De Gruyter, pp.65-102, 2025, 978-3-11-105990-7. ⟨10.1515/9783111060309-003⟩. ⟨hal-05230528⟩

    STL

    Available in free access

  • Conference paper

    Clémence Sebe, Sarah Cohen-Boulakia, Olivier Ferret, Aurélie Névéol. Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows. Symposium on Intelligent Data Analysis (IDA 2025), May 2025, Konstanz, Germany. pp.274-287. ⟨10.1007/978-3-031-91398-3_21⟩. ⟨hal-05244222⟩

    BioInfo, STL

    Available in free access

  • Journal article

    Philippe Boula de Mareüil, Paolo Roseano. A speaking atlas of the languages of the Iberian Peninsula: focus on rhythm and varieties in contact. Dialectologia, 2025, 35, pp.27-54. ⟨10.1344/dialectologia.35.2⟩. ⟨hal-05263043⟩

    STL

    Available in free access

  • Conference paper

    Gaël Guennebaud, Anne-Laure Ligozat, Anne-Cécile Orgerie, Matthieu Simonin. Evaluating and Reporting the Carbon Footprint of Shared Computing Platforms: Choices and Limits. ISPDC 2025 – 24th IEEE International Symposium on Parallel and Distributed Computing, Jul 2025, Rennes, France. pp.1-7. ⟨hal-05195576⟩

    STL

    Available in free access

  • Conference paper

    Haohua Dong, Ana Manzano Rodríguez, Camille Guinaudeau, Shin’ichi Satoh. Fairness Without Labels: Pseudo-Balancing for Bias Mitigation in Face Gender Classification. Second workshop on Fairness and ethics towards transparent AI: facing the chalLEnge through model Debiasing (FAILED) at the 2025 International Conference on Computer Vision, Oct 2025, Honolulu, HI, United States. ⟨hal-05210445⟩

    STL

    Available in free access

  • Thesis

    Nicolas Hiebel. Création éthique de données textuelles artificielles : application au domaine biomédical. Document and Text Processing. Université Paris-Saclay, 2025. French. ⟨NNT: 2025UPASG033⟩. ⟨tel-05185326⟩

    STL

    Available in free access

  • Journal article

    Philippe Boula de Mareüil, Alexis Pierrard, Albert Rilliard. Acoustic study of /r/ front fricatives in Bolivian Highland Spanish. Estudios de Fonética Experimental, 2025, 34, pp.41-56. ⟨10.1344/efe-2025-34-41-56⟩. ⟨hal-05157171⟩

    STL

    Available in free access

  • Conference paper

    Ana Manzano Rodríguez, Camille Guinaudeau, Shin’ichi Satoh. Uncovering Gender Biases in Gender Identification Models for Japanese Data Analysis. Workshop on Demographic Diversity in Computer Vision @ CVPR 2025, Jun 2025, Nashville (Tennessee), United States. ⟨hal-05154054⟩

    STL

    Available in free access

  • Thesis

    Jiahui Hu. Granular Insights into Financial Discourse: Fine-Grained Opinion Analysis of Expert Texts. Document and Text Processing. Université Paris-Saclay, 2023. English. ⟨NNT: 2023UPASG110⟩. ⟨tel-05153905⟩

    AO, STL

    Available in free access

  • Journal article

    Philippe Boula de Mareüil, Marc Evrard, Alexandre François, Antonio Romano. Computer modelling of innovations relative to Latin in contemporary Romance dialects. Isogloss. Open Journal of Romance Linguistics, 2025, 11 (3), pp.1-31. ⟨10.5565/rev/isogloss.423⟩. ⟨hal-05144863⟩

    STL

    Available in free access

  • Journal article

    Anne Baillot, Anne-Laure Ligozat. Introduction. Sobriété numérique. Humanités numériques, 2025, 11. ⟨10.4000/1498x⟩. ⟨hal-05143071⟩

    STL

    Available in free access

  • Conference paper

    Pierre Lepagnol, Sahar Ghannay, Thomas Gerald, Christophe Servan, Sophie Rosset. Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot Learning. Interspeech 2025, Aug 2025, Rotterdam, Netherlands. ⟨10.21437/Interspeech.2025-175⟩. ⟨hal-05095796⟩

    STL

    Available in free access

  • Journal article

    Agata Savary. NLP-based Study of Universals of Linguistic Idiosyncrasy. Dagstuhl Reports, 2023, 13 (5), pp.64-67. ⟨hal-04323075⟩

    ILES, STL

  • Thesis

    Mathieu Laï-King. Qualité des articles de recherche et modèles de langue neuronaux : applications au domaine biomédical. Artificial Intelligence [cs.AI]. Université Paris-Saclay, 2025. French. ⟨NNT: 2025UPASG031⟩. ⟨tel-05079724⟩

    STL

    Available in free access

  • Preprint, working paper

    Clément Morand, Anne-Laure Ligozat, Aurélie Névéol. Characterizing Goals and Impacts of Digitalization: The Case of Promises in French Healthcare Policies. 2025. ⟨hal-05066176⟩

    STL

    Available in free access

  • Conference paper

    Luc Mottin, Julien Gobeill, Jeevanthi Liyana Pathirana, Nona Naderi, Anaïs Mottaz, et al. Manuscript Classification to Support the Analysis of Biases in Publication Opportunities. The 35th Medical Informatics Europe Conference, May 2025, Glasgow, United Kingdom. ⟨10.3233/SHTI250475⟩. ⟨hal-05070636⟩

    STL

    Available in free access

  • Report

    Karin Dassas, Cyrille Bonamy, Bruno Bzeznik, Romaric David, Emmanuelle Frenoux, et al. Estimer l’impact carbone des activités numériques de l’Observatoire de Paris. EcoInfo. 2025, pp.1-47. ⟨hal-05068666⟩

    STL

    Available in free access

  • Journal article

    Nicolas Hiebel, Olivier Ferret, Karën Fort, Aurélie Névéol. Clinical text generation: Are we there yet? Annual Review of Biomedical Data Science, 2025, 8, pp.173-198. ⟨10.1146/annurev-biodatasci-103123-095202⟩. ⟨hal-05055957⟩

    STL

    Available in free access

  • Journal article

    Arezoo Saedi, Afsaneh Fatemi, Mohammad Ali Nematbakhsh, Sophie Rosset, Anne Vilnat. Entity search based on consumer preferences leveraging user reviews. Expert Systems with Applications, 2025, 275, pp.126990. ⟨10.1016/j.eswa.2025.126990⟩. ⟨hal-05047109⟩

    STL

    Available in free access

  • Conference paper

    Foucauld Estignard, Sahar Ghannay, Julien Girard-Satabin, Nicolas Hiebel, Aurélie Névéol. Evaluating the Confidentiality of Synthetic Clinical Texts Generated by Language Models. 23rd International Conference on Artificial Intelligence in Medicine (AIME), Jun 2025, Pavia, Italy. ⟨hal-05046326v2⟩

    STL

    Available in free access