The SEME (semantics and information extraction) team is interested in the problems of accessing the meaning contained in language productions, for the purposes of analysis, comprehension, modeling or production. We apply our research to the written modality, without restriction on the original medium (text produced in electronic format, or from a speech transcription, or from optical recognition) and work on productions in open or specialized domains such as the medical field. We use both linguistic and statistical or neural learning approaches. We are particularly interested in the latter type of approach, and in the environmental costs they generate in automatic language processing, both during production and during use.
Information extraction
Corpus and modeling
Sémantics, poly-lexical expressions
The team comprises 10 permanent members (CNRS researchers, lecturers at Université Paris-Saclay, ENSIIE, and Université Sorbonne Paris-Nord), 14 PhD students, and 3 post-docs or fixed-term contracts. We maintain links with industry (theses under CIFRE contracts, research projects) and regularly organize scientific events (TALN conference, scientific workshops, etc.).
Arezoo Saedi, Afsaneh Fatemi, Mohammad Ali Nematbakhsh, Sophie Rosset, Anne Vilnat. Entity search based on consumer preferences leveraging user reviews. Expert Systems with Applications, 2025, 275, pp.126990. ⟨10.1016/j.eswa.2025.126990⟩. ⟨hal-05047109⟩
Foucauld Estignard, Sahar Ghannay, Julien Girard-Satabin, Nicolas Hiebel, Aurélie Névéol. Evaluating the Confidentiality of Synthetic Clinical Texts Generated by Language Models. 23rd International Conference on Artificial Intelligence in Medicine (AIME), Jun 2025, Pavie, Italy. ⟨hal-05046326⟩
Lisa Raithel, Philippe Thomas, Bhuvanesh Verma, Roland Roller, Hui-Syuan Yeh, et al.. Overview of #SMM4H 2024 – Task 2: Cross-Lingual Few-Shot Relation Extraction for Pharmacovigilance in French, German, and Japanese. The 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks, Association for Computational Linguistics, Aug 2024, Bangkok, Thailand. pp.170-182. ⟨hal-04781015⟩
Mathilde Aguiar, Pierre Zweigenbaum, Nona Naderi. Am I eligible? Natural Language Inference for Clinical Trial Patient Recruitment: the Patient’s Point of View. 2025. ⟨hal-04992084⟩
Mathieu Constant, Marie Candito, Yannick Parmentier, Carlos Ramisch, Agata Savary. Construction, exploitation et exploration de ressources linguistiques pour le traitement automatique des expressions polylexicales en français : le projet PARSEME-FR. Lidia Becker; Julia Kuhn; Christina Ossenkop; Claudia Polzin-Haumann; Elton Prifti. Digitale romanistische Sprachwissenschaft: Stand und Perspektiven, Narr Francke Attempto Verlag GmbH + Co. KG, pp.219-250, 2023, Romanistisches Kolloquium, 978-3-8233-8506-6. ⟨hal-04995189⟩
Rémi Uro. Détection et caractérisation des interruptions dans les interactions orales pour la description du comportement des femmes et des hommes dans les contenus audiovisuels. Informatique et langage [cs.CL]. Université Paris-Saclay, 2024. Français. ⟨NNT : 2024UPASG055⟩. ⟨tel-04994439⟩
Amel Fraisse, Patrick Paroubek, Ramit Goyal, Nassreddine Znaidi. Measuring Multilingualism in Online Public Access Catalogs. The ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Dec 2024, Hong Kong, China. ⟨hal-04986773⟩
Manon Scholivet, Agata Savary, Louis Estève, Marie Candito, Carlos Ramisch. SELEXINI – a large and diverse automatically parsed corpus of French. Building and Using Comparable Corpora (BUCC), Jan 2025, Abu DHABI, United Arab Emirates. ⟨hal-04978746⟩
Camille Challant. Représentation formelle avec AZee et contraintes grammaticales pour la langue des signes française. Théorie et langage formel [cs.FL]. Université Paris-Saclay, 2024. Français. ⟨NNT : 2024UPASG086⟩. ⟨tel-04957486⟩
Zheng Zhang, Brian Denton, Xiaolan Xie. Branch and Price for Chance-Constrained Bin Packing. INFORMS Journal on Computing, 2020, 32 (3), pp.547-564. ⟨10.1287/ijoc.2019.0894⟩. ⟨hal-04941861⟩
Simon Devauchelle, David Doukhan, Lucas Ondel Yang, Benjamin Élie, Albert Rilliard. Estimation automatique de caractéristiques acoustiques pour l’étude diachronique du français oral dans les médias. Atelier DAHLIA: DigitAl Humanities and cuLtural herItAge: data and knowledge management and analysis, Claudia Marinica; Fabrice Guillet; Florent Laroche, Jan 2025, Strasbourg, France. ⟨hal-04938377⟩
Rémi Uro, David Doukhan. Pendant le confinement, le temps de parole des femmes a baissé à la télévision et à la radio. La revue des médias, 2020. ⟨hal-04906221⟩
Fanny Ducel, Nicolas Hiebel, Olivier Ferret, Karën Fort, Aurélie Névéol. “Women do not have heart attacks!” Gender Biases in Automatically Generated Clinical Cases in French. NAACL 2025 – Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, Apr 2025, Albuquerque, United States. ⟨hal-04938811⟩
Marion Ficher, Tom Bauer, Anne-Laure Ligozat. A comprehensive review of the end-of-life modeling in LCAs of digital equipment. International Journal of Life Cycle Assessment, 2024, 30 (1), pp.20-42. ⟨10.1007/s11367-024-02367-x⟩. ⟨hal-04924691⟩
Léa-Marie Lam-Yee-Mui. Modélisations pour la reconnaissance de la parole à données contraintes. Traitement du signal et de l’image [eess.SP]. Université Paris-Saclay, 2024. Français. ⟨NNT : 2024UPASG075⟩. ⟨tel-04918814⟩
Philippe Boula de Mareüil, Plínio A. Barbosa. Picos melódicos pretônicos em final de enunciado no português brasileiro: um estudo quantitativo. Dermeval da Hora; Ángela Helmer. Interseções Linguísticas: Estudos Diversos, Líquido Editorial, pp.71-85, 2023, ALFAL, 9786599924804. ⟨hal-04893646⟩
Douglas Teodoro, Nona Naderi, Anthony Yazdani, Boya Zhang, Alban Bornet. A Scoping Review of Artificial Intelligence Applications in Clinical Trial Risk Assessment. 2025. ⟨hal-04913991⟩
Paritosh Sharma. Sign Language synthesis by a decreasing granularity system from AZee. Computation and Language [cs.CL]. Université Paris-Saclay, 2024. English. ⟨NNT : 2024UPASG092⟩. ⟨tel-04908078⟩
Laetitia Biscarrat, David Doukhan, Cyril Grouin. De Loft Story aux Marseillais à Dubaï : apport des méthodes d’analyse automatique pour la description des évolutions du dispositif télévisuel. Colloque ”La téléréalité, entre média, événement et société”, part of 89e Congrès de l’Association canadienne-française pour l’avancement des sciences (ACFAS), Association canadienne-française pour l’avancement des sciences (ACFAS), 2022, Montreal, Canada. ⟨hal-04906923⟩
Laetitia Biscarrat, David Doukhan, Cyril Grouin. De Loft Story aux Marseillais à Dubaï : 20 ans de télé-réalité, 20 ans de sexisme ? Apport des méthodes d’analyse automatique pour une approche comparative. Première journée d’études de l’Arcom, ARCOM, Nov 2022, Paris, France. ⟨hal-04905959⟩
Rémi Uro, Marie Tahon, David Doukhan, Albert Rilliard. Comprendre les phénomènes permettant la gestion des tours de parole dans les contenus de médias audiovisuels. Journée commune AFIA-TLH / AFCP – “Extraction de connaissances interprétables pour l’étude de la communication parlée”, Corinne Fredouille; Maëva Garnier; Olivier Perrotin; Marie Tahon, Dec 2023, Avignon, France. ⟨hal-04906679⟩
Leticia Rebollo Couto, Albert Rilliard. Variação Pragmática e Diminutivização: intensificação e atenuação de atos expressivos e diretivos para a dublagem de animação em português, espanhol e francês. IV Colloque International VariaR 2024, Université Paul-Valéry Montpellier 3, Jun 2024, Montpellier, France. pp.43-44, ⟨10.3726/978-3-0351-0740-1⟩. ⟨hal-04874595⟩
Sofiya Kobylyanskaya. Towards multimodal assessment of L2 level : speech and eye tracking features in a cross-cultural setting. Computation and Language [cs.CL]. Université Paris-Saclay, 2024. English. ⟨NNT : 2024UPASG111⟩. ⟨tel-04900961⟩
Leticia Rebollo Couto, Albert Rilliard. Variación pragmática y expresividad negativa: análisis multimodal en datos de doblaje. LingCor2024: Workshop on Spoken Corpus Linguistics, Jul 2024, Vienna, Austria. . ⟨hal-04874470⟩
Clémentine Bleuze, Fanny Ducel, Karën Fort, Maxime Amblard. Vers la création d’une super-intelligence » : un corpus pour étudier les revendications des articles de TALTraitement Automatique des langues. Journées de lancement LIFT 2, Nov 2024, Orléans, France. ⟨hal-04880335⟩
Ayoub Hammal, Benno Uthayasooriyar, Caio Corro. Few-Shot Domain Adaptation for Named-Entity Recognition via Joint Constrained k-Means and Subspace Selection. COLING 2025 – 31st International Conference on Computational Linguistics, Jan 2025, Abu Dhabi, United Arab Emirates. pp.1-15. ⟨hal-04877776⟩
Simon Devauchelle, Albert Rilliard, David Doukhan, Lucas Ondel Yang. Describing voice in French media archives: age and gender effects on pitch and articulation characteristics. XX Convegno Nazionale AISV, LFSAG (Laboratorio di Fonetica Sperimentale “Arturo Genre”) Dipartimento di Lingue e Letterature Straniere e Culture Moderne Università degli Studi di Torino, Feb 2024, Turin, Italy. ⟨hal-04874662⟩
Donna Erickson, João Antônio De Moraes, Albert Rilliard. Dimensões das atitudes prosódicas entre culturas. V Seminário Internacional de Fonologia, Universidade Federal do Rio de Janeiro, Nov 2024, Rio de Janeiro, Brazil. ⟨hal-04874627⟩
Delphine Bernhard, Myriam Bras, Anne-Laure Ligozat, Aleksandra Miletic, Jean Sibille, et al.. L’avenir numérique des langues minoritaires : bilan du projet RESTAURE pour l’alsacien, l’occitan et le picard. Colloque « Langues minoritaires » : quels acteurs pour quel avenir ?, Groupe d’Etudes sur le Plurilinguisme européen (EA1339 LiLPa), Nov 2019, Strasbourg, France. ⟨hal-04864670⟩
Cyril Grouin, Natalia Grabar. Year 2023 in Biomedical Natural Language Processing: A Tribute to Large Language Models and Generative AI. IMIA Yearbook of Medical Informatics, 2024. ⟨hal-04865083⟩
Natalia Grabar, Thierry Hamon. Study of the propaganda techniques occurring in Russian newspaper titles in 2022. METAPOL, université de Liège, Nov 2024, Liège, Belgium. ⟨hal-04865074⟩
Angèle Gayet-Ageron, Khaoula Ben Messaoud, Mark Richards, Cyril Jaksic, Julien Gobeill, et al.. Gender and geographical bias in the editorial decision-making process of biomedical journals: a case-control study. BMJ Evidence-Based Medicine, 2024, pp.bmjebm-2024-113083. ⟨10.1136/bmjebm-2024-113083⟩. ⟨hal-04865134⟩
Omar Adjali, Olivier Ferret, Sahar Ghannay, Hervé Le Borgne. Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering. The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Nov 2024, Miami, United States. pp.16499-16513, ⟨10.18653/v1/2024.emnlp-main.922⟩. ⟨hal-04852275⟩
Aurélie Bugeau, Anne-Laure Ligozat. L’informatique en temps de crises environnementales : comment adapter la recherche et l’enseignement ?. 2024. ⟨hal-04850517⟩
Donna Erickson, Albert Rilliard, Ela Thurgood, João Antônio de Moraes, Takaaki Shochi. Acoustic and perceptual profiles of american english social affective expressions. Journal of Speech Sciences, 2024, 13, pp.e024004. ⟨10.20396/joss.v13i00.20015⟩. ⟨hal-04850040⟩
Clément Morand, Anne-Laure Ligozat, Aurélie Névéol. How Green Can AI Be? A Study of Trends in Machine Learning Environmental Impacts. 2024. ⟨hal-04839926v3⟩
Lucie Gianola. Traitement automatique des langues et linguistique de corpus pour la reconnaissance d’entités en analyse criminelle. Revue internationale de criminologie et de police technique et scientifique, 2021, LXXIV (3), pp.363-382. ⟨hal-04833123⟩
Mathilde Aguiar, Ying Lai, Pierre Zweigenbaum, Nona Naderi. Constituting a dataset for applying Natural Language Inference to Chinese Clinical Trials: possible approaches and challenges. Junior Conference on Data Sciences and Engineering, Sep 2024, Gif-sur-Yvette, France. ⟨hal-04837721⟩
Ilia Kuznetsov, Osama Mohammed Afzal, Koen Dercksen, Nils Dycke, Alexander Goldberg, et al.. What Can Natural Language Processing Do for Peer Review?. 2024. ⟨hal-04797652⟩