The SEME (semantics and information extraction) team studies how to access the meaning conveyed by language productions, for the purposes of analysis, comprehension, modeling, or production. We work on written language, whatever the original medium (text produced in electronic format, speech transcripts, or optical recognition output), in both open and specialized domains such as the medical field. We use linguistic approaches as well as statistical and neural learning approaches. We are particularly interested in the latter and in the environmental costs they generate in natural language processing, both during production and during use.
Information extraction
Corpus and modeling
Semantics, multiword expressions
The team comprises 10 permanent members (CNRS researchers and lecturers at Université Paris-Saclay, ENSIIE, and Université Sorbonne Paris-Nord), 14 PhD students, and 3 postdoctoral or fixed-term researchers. We maintain links with industry (theses under CIFRE contracts, research projects) and regularly organize scientific events (TALN conference, scientific workshops, etc.).
Julie Halbout, Diandra Fabre. Corpus bilingue sous-titrage et Langue des Signes Française : la problématique de l’alignement automatique des données. 20e Conférence en Recherche d’Information et Applications (CORIA) 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN) 27ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL) Les 18e Rencontres Jeunes Chercheurs en RI (RJCRI), 2025, Marseille, France. pp.91-103. ⟨hal-05330660⟩
Cyril Grouin, Aurélie Névéol, Pierre Zweigenbaum, Christophe Servan. Comment évaluer un grand modèle de langue dans le domaine médical en français ?. 20e Conférence en Recherche d’Information et Applications (CORIA) 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN) 27ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL) Les 18e Rencontres Jeunes Chercheurs en RI (RJCRI), 2025, Marseille, France. pp.51-67. ⟨hal-05329783⟩
Omar Adjali, Olivier Ferret, Sahar Ghannay, Hervé Le Borgne. Génération augmentée de récupération multi-niveau pour répondre à des questions visuelles. 20e Conférence en Recherche d’Information et Applications (CORIA) 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN) 27ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL) Les 18e Rencontres Jeunes Chercheurs en RI (RJCRI), 2025, Marseille, France. pp.128-130. ⟨hal-05330645⟩
Anne-Laure Ligozat. Côté obscur de l’IA : quels bénéfices réels de l’IA pour faire face aux crises environnementales ?. GreenDays 2023, Mar 2023, Lyon, France. ⟨hal-05317071⟩
Armand Stricker, Patrick Paroubek. Chitchat as Interference: Adding User Backstories to Task-Oriented Dialogues. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA; ICCL, May 2024, Torino, Italy. pp.3203–3214. ⟨hal-05242362⟩
Fanny Ducel, Jeffrey André, Aurélie Névéol, Karën Fort. Introducing MascuLead: the First Gender Bias Leaderboard. EALM 2025 – Ethic and Alignment of (Large) Language Models, Jun 2025, Marseille, France. pp.12-19. ⟨hal-05282981⟩
Fanny Ducel, Nicolas Hiebel, Olivier Ferret, Karën Fort, Aurélie Névéol. « Les femmes ne font pas de crise cardiaque ! » Étude des biais de genre dans les cas cliniques synthétiques en français. 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2025), Jul 2025, Marseille, France. pp.1. ⟨hal-05282965⟩
Clémentine Bleuze, Fanny Ducel, Maxime Amblard, Karën Fort. « De nos jours, ce sont les résultats qui comptent » : création et étude diachronique d’un corpus de revendications issues d’articles de TAL. TALN 2025 – 32ème Conférence sur le Traitement Automatique des Langues Naturelles, Jul 2025, Marseille, France. ⟨hal-05282966⟩
Yajing Feng. Continuous Recognition of Client Emotions from Speech and Text in Real-World Call Center Conversations : a Context-Aware Dataset and Empirical Study. Artificial Intelligence [cs.AI]. Université Paris-Saclay, 2025. English. ⟨NNT : 2025UPASG042⟩. ⟨tel-05241382⟩
Alexander Goldberg, Ihsan Ullah, Thanh Gia Hieu Khuong, Benedictus Kent Rachmat, Zhen Xu, et al.. Usefulness of LLMs as an Author Checklist Assistant for Scientific Papers: NeurIPS’24 Experiment. 2025. ⟨hal-05230379⟩
Floris Thiant, Olivia Penas, Yann Leroy, Anne-Laure Ligozat. System analysis of digital service system perimeter and its interdependencies in Life Cycle Assessment. 2025 IEEE International Symposium on Systems Engineering (ISSE), Oct 2025, Palaiseau, France. ⟨hal-05240543⟩
Thomas Gerald, Louis Tamames, Sofiane Ettayeb, Ha-Quang Le, Patrick Paroubek, et al.. CQuAE: A new Contextualized QUestion Answering corpus on Education domain. Data and Knowledge Engineering, 2024, 151, pp.102305. ⟨10.1016/j.datak.2024.102305⟩. ⟨hal-05242257⟩
Tommaso Raso, Saulo Mendes Santos, Albert Rilliard, João A. Moraes. Defining and Identifying Discourse Markers in Spontaneous Speech. Miguel Oliveira, Jr. Prosodic Interfaces – Interdisciplinary Perspectives on Sound Patterns and Human Interaction, De Gruyter, pp.65-102, 2025, 978-3-11-105990-7. ⟨10.1515/9783111060309-003⟩. ⟨hal-05230528⟩
Clémence Sebe, Sarah Cohen-Boulakia, Olivier Ferret, Aurélie Névéol. Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows. Symposium on Intelligent Data Analysis (IDA 2025), May 2025, Konstanz, Germany. pp.274-287, ⟨10.1007/978-3-031-91398-3_21⟩. ⟨hal-05244222⟩
Philippe Boula de Mareüil, Paolo Roseano. A speaking atlas of the languages of the Iberian Peninsula: focus on rhythm and varieties in contact. Dialectologia, 2025, 35, pp.27-54. ⟨10.1344/dialectologia.35.2⟩. ⟨hal-05263043⟩
Gaël Guennebaud, Anne-Laure Ligozat, Anne-Cécile Orgerie, Matthieu Simonin. Evaluating and Reporting the Carbon Footprint of Shared Computing Platforms: Choices and Limits. ISPDC 2025 – 24th IEEE International Symposium on Parallel and Distributed Computing, Jul 2025, Rennes, France. pp.1-7. ⟨hal-05195576⟩
Haohua Dong, Ana Manzano Rodríguez, Camille Guinaudeau, Shin’Ichi Satoh. Fairness Without Labels: Pseudo-Balancing for Bias Mitigation in Face Gender Classification. Second workshop on Fairness and ethics towards transparent AI: facing the chalLEnge through model Debiasing (FAILED) at the 2025 International Conference on Computer Vision, Oct 2025, Honolulu, HI, United States. ⟨hal-05210445⟩
Nicolas Hiebel. Création éthique de données textuelles artificielles : application au domaine biomédical. Traitement du texte et du document. Université Paris-Saclay, 2025. Français. ⟨NNT : 2025UPASG033⟩. ⟨tel-05185326⟩
Philippe Boula de Mareüil, Alexis Pierrard, Albert Rilliard. Acoustic study of /r/ front fricatives in Bolivian Highland Spanish. Estudios de Fonética Experimental, 2025, 34, pp.41 – 56. ⟨10.1344/efe-2025-34-41-56⟩. ⟨hal-05157171⟩
Ana Manzano Rodríguez, Camille Guinaudeau, Shin Ichi Satoh. Uncovering Gender Biases in Gender Identification Models for Japanese Data Analysis. Workshop on Demographic Diversity in Computer Vision @ CVPR 2025, Jun 2025, Nashville (Tennessee), United States. ⟨hal-05154054⟩
Philippe Boula de Mareüil, Marc Evrard, Alexandre François, Antonio Romano. Computer modelling of innovations relative to Latin in contemporary Romance dialects. Isogloss. Open Journal of Romance Linguistics, 2025, 11 (3), pp.1 – 31. ⟨10.5565/rev/isogloss.423⟩. ⟨hal-05144863⟩
Pierre Lepagnol, Sahar Ghannay, Thomas Gerald, Christophe Servan, Sophie Rosset. Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot Learning. Interspeech 2025, Aug 2025, Rotterdam, Netherlands. ⟨10.21437/Interspeech.2025-175⟩. ⟨hal-05095796⟩
Mathieu Laï-King. Qualité des articles de recherche et modèles de langue neuronaux : applications au domaine biomédical. Intelligence artificielle [cs.AI]. Université Paris-Saclay, 2025. Français. ⟨NNT : 2025UPASG031⟩. ⟨tel-05079724⟩
Clément Morand, Anne-Laure Ligozat, Aurélie Névéol. Characterizing Goals and Impacts of Digitalization: The Case of Promises in French Healthcare Policies. 2025. ⟨hal-05066176⟩
Luc Mottin, Julien Gobeill, Jeevanthi Liyana Pathirana, Nona Naderi, Anaïs Mottaz, et al.. Manuscript Classification to Support the Analysis of Biases in Publication Opportunities. The 35th Medical Informatics Europe Conference, May 2025, Glasgow, United Kingdom. ⟨10.3233/SHTI250475⟩. ⟨hal-05070636⟩
Karin Dassas, Cyrille Bonamy, Bruno Bzeznik, Romaric David, Emmanuelle Frenoux, et al.. Estimer l’impact carbone des activités numériques de l’Observatoire de Paris. EcoInfo. 2025, pp.1-47. ⟨hal-05068666⟩
Nicolas Hiebel, Olivier Ferret, Karën Fort, Aurélie Névéol. Clinical text generation: Are we there yet?. Annual Review of Biomedical Data Science, 2025, 8, pp.173-198. ⟨10.1146/annurev-biodatasci-103123-095202⟩. ⟨hal-05055957⟩
Arezoo Saedi, Afsaneh Fatemi, Mohammad Ali Nematbakhsh, Sophie Rosset, Anne Vilnat. Entity search based on consumer preferences leveraging user reviews. Expert Systems with Applications, 2025, 275, pp.126990. ⟨10.1016/j.eswa.2025.126990⟩. ⟨hal-05047109⟩
Foucauld Estignard, Sahar Ghannay, Julien Girard-Satabin, Nicolas Hiebel, Aurélie Névéol. Evaluating the Confidentiality of Synthetic Clinical Texts Generated by Language Models. 23rd International Conference on Artificial Intelligence in Medicine (AIME), Jun 2025, Pavia, Italy. ⟨hal-05046326v2⟩
Lisa Raithel, Philippe Thomas, Bhuvanesh Verma, Roland Roller, Hui-Syuan Yeh, et al.. Overview of #SMM4H 2024 – Task 2: Cross-Lingual Few-Shot Relation Extraction for Pharmacovigilance in French, German, and Japanese. The 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks, Association for Computational Linguistics, Aug 2024, Bangkok, Thailand. pp.170-182. ⟨hal-04781015⟩
Mathilde Aguiar, Pierre Zweigenbaum, Nona Naderi. Am I eligible? Natural Language Inference for Clinical Trial Patient Recruitment: the Patient’s Point of View. 2025. ⟨hal-04992084⟩
Mathieu Constant, Marie Candito, Yannick Parmentier, Carlos Ramisch, Agata Savary. Construction, exploitation et exploration de ressources linguistiques pour le traitement automatique des expressions polylexicales en français : le projet PARSEME-FR. Lidia Becker; Julia Kuhn; Christina Ossenkop; Claudia Polzin-Haumann; Elton Prifti. Digitale romanistische Sprachwissenschaft: Stand und Perspektiven, Narr Francke Attempto Verlag GmbH + Co. KG, pp.219-250, 2023, Romanistisches Kolloquium, 978-3-8233-8506-6. ⟨hal-04995189⟩
Rémi Uro. Détection et caractérisation des interruptions dans les interactions orales pour la description du comportement des femmes et des hommes dans les contenus audiovisuels. Informatique et langage [cs.CL]. Université Paris-Saclay, 2024. Français. ⟨NNT : 2024UPASG055⟩. ⟨tel-04994439⟩
Amel Fraisse, Patrick Paroubek, Ramit Goyal, Nassreddine Znaidi. Measuring Multilingualism in Online Public Access Catalogs. The ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Dec 2024, Hong Kong, China. ⟨10.1145/3677389.3702544⟩. ⟨hal-04986773⟩
Manon Scholivet, Agata Savary, Louis Estève, Marie Candito, Carlos Ramisch. SELEXINI – a large and diverse automatically parsed corpus of French. Building and Using Comparable Corpora (BUCC), Jan 2025, Abu Dhabi, United Arab Emirates. ⟨hal-04978746⟩
Camille Challant. Représentation formelle avec AZee et contraintes grammaticales pour la langue des signes française. Théorie et langage formel [cs.FL]. Université Paris-Saclay, 2024. Français. ⟨NNT : 2024UPASG086⟩. ⟨tel-04957486⟩
Zheng Zhang, Brian Denton, Xiaolan Xie. Branch and Price for Chance-Constrained Bin Packing. INFORMS Journal on Computing, 2020, 32 (3), pp.547-564. ⟨10.1287/ijoc.2019.0894⟩. ⟨hal-04941861⟩
Simon Devauchelle, David Doukhan, Lucas Ondel Yang, Benjamin Élie, Albert Rilliard. Estimation automatique de caractéristiques acoustiques pour l’étude diachronique du français oral dans les médias. Atelier DAHLIA: DigitAl Humanities and cuLtural herItAge: data and knowledge management and analysis, Claudia Marinica; Fabrice Guillet; Florent Laroche, Jan 2025, Strasbourg, France. ⟨hal-04938377⟩
Rémi Uro, David Doukhan. Pendant le confinement, le temps de parole des femmes a baissé à la télévision et à la radio. La revue des médias, 2020. ⟨hal-04906221⟩
Fanny Ducel, Nicolas Hiebel, Olivier Ferret, Karën Fort, Aurélie Névéol. “Women do not have heart attacks!” Gender Biases in Automatically Generated Clinical Cases in French. NAACL 2025 – Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, Apr 2025, Albuquerque, United States. pp.7145-7159, ⟨10.18653/v1/2025.findings-naacl.398⟩. ⟨hal-04938811⟩
Marion Ficher, Tom Bauer, Anne-Laure Ligozat. A comprehensive review of the end-of-life modeling in LCAs of digital equipment. International Journal of Life Cycle Assessment, 2024, 30 (1), pp.20-42. ⟨10.1007/s11367-024-02367-x⟩. ⟨hal-04924691⟩