SEME

SEMantics and Information Extraction (SEME)

Coordination: Cyril GROUIN

The SEME (Semantics and Information Extraction) team studies how to access the meaning contained in language productions, for the purposes of analysis, comprehension, modeling, or production. We apply our research to written language, regardless of the original medium (text produced electronically, transcribed from speech, or obtained through optical character recognition), and we work on productions in open or specialized domains, such as the medical field. We use linguistic approaches as well as statistical and neural machine learning approaches. We are particularly interested in the latter and in the environmental costs they generate in natural language processing, both when models are produced and when they are used.

  • Information extraction
  • Corpus and modeling
  • Semantics, multiword expressions

The team comprises 10 permanent members (CNRS researchers, lecturers at Université Paris-Saclay, ENSIIE, and Université Sorbonne Paris-Nord), 14 PhD students, and 3 postdoctoral researchers or fixed-term staff. We maintain links with industry (theses under CIFRE contracts, research projects) and regularly organize scientific events (the TALN conference, scientific workshops, etc.).

Coordination

  • Sciences et Technologies des Langues

    SEME

    Grouin Cyril

    Research Engineer

    Head of SEME

Members

Publications

  • Conference paper

    Maxime Fily, Guillaume Wisniewski, Séverine Guillaume, Gilles Adda, Alexis Michaud. Establishing degrees of closeness between audio recordings along different dimensions using large-scale cross-lingual models. Findings of the Association for Computational Linguistics: EACL 2024, Association for Computational Linguistics, Mar 2024, St. Julian’s, Malta. ⟨hal-04561819⟩

    STL

    Available in free access

  • Conference paper

    Hugo Boulanger, Nicolas Hiebel, Olivier Ferret, Karën Fort, Aurélie Névéol. Using Structured Health Information for Controlled Generation of Clinical Cases in French. The 6th Clinical Natural Language Processing Workshop at NAACL 2024 (ClinicalNLP 2024), Jun 2024, Mexico City, Mexico. ⟨hal-04558890⟩

    STL

    Available in free access

  • Conference paper

    Nicolas Hiebel, Bertrand Remy, Bruno Guillaume, Olivier Ferret, Aurélie Névéol, et al. Hostomytho: A GWAP for Synthetic Clinical Texts Evaluation and Annotation. Games and Natural Language Processing Workshop at LREC-COLING 2024, May 2024, Turin, Italy. ⟨hal-04555052⟩

    STL

    Available in free access

  • Thesis

    Oralie Cattan. Systèmes de questions-réponses interactifs à grande échelle. Computer Science [cs]. Université Paris-Saclay (2020-..), 2022. French. ⟨tel-04551072⟩

    STL

  • Journal article

    Luma da Silva Miranda, João Antônio de Moraes, Albert Rilliard. Visual channel facilitates the comprehension of the intonation of Brazilian Portuguese wh-questions and wh-exclamations: evidence from congruent and incongruent stimuli. Language and Cognition, 2024, pp.1-21. ⟨10.1017/langcog.2024.16⟩. ⟨hal-04538371⟩

    STL

    Available in free access

  • Preprint, working paper

    Mathilde Aguiar, Pierre Zweigenbaum, Nona Naderi. SEME at SemEval-2024 Task 2: Comparing Masked and Generative Language Models on Natural Language Inference for Clinical Trials. 2024. ⟨hal-04536273⟩

    STL

    Available in free access

  • Conference paper

    Djegdjiga Amazouz, Martine Adda-Decker, Lori Lamel. Variation du voisement des occlusives orales en code-switching: analyses par ABX automatique et mesures acoustiques. Journées d’Études sur la Parole – JEP2022, Jun 2022, Noirmoutier, France. ⟨hal-03703081⟩

    STL

    Available in free access

  • Preprint, working paper

    Mathilde Aguiar, Pierre Zweigenbaum, Nona Naderi. SEME at SemEval-2024 Task 2: Comparing Masked and Generative Language Models on Natural Language Inference for Clinical Trials. 2024. ⟨hal-04536600⟩

    STL

  • Conference paper

    Karën Fort, Laura Alonso Alemany, Luciana Benotti, Julien Bezançon, Claudia Borg, et al. Your Stereotypical Mileage may Vary: Practical Challenges of Evaluating Biases in Multiple Languages and Cultural Contexts. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, May 2024, Turin, Italy. ⟨hal-04537096⟩

    STL

    Available in free access

  • Conference paper

    Paul Lerner, Cyril Grouin. INCLURE: a Dataset and Toolkit for Inclusive French Translation. The 17th Workshop on Building and Using Comparable Corpora (BUCC @ LREC 2024), 2024, Turin, Italy. ⟨hal-04531938⟩

    STL

    Available in free access

  • Proceedings

    Karën Fort, Aurélie Névéol. Ethics and NLP: 10 years after. Journée d’études ATALA “éthique et TAL : 10 ans après”, 2024. ⟨hal-04533870⟩

    STL

    Available in free access

  • Conference paper

    Paul Lerner, Olivier Ferret, Camille Guinaudeau. Cross-modal Retrieval for Knowledge-based Visual Question Answering. 46th European Conference on Information Retrieval (ECIR 2024), 2024, Glasgow, United Kingdom. ⟨hal-04384431⟩

    STL

    Available in free access

  • Conference paper

    Tomohiro Nishiyama, Lisa Raithel, Roland Roller, Pierre Zweigenbaum, Eiji Aramaki. Assessing Authenticity and Anonymity of Synthetic User-generated Content in the Medical Domain. Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo), Mar 2024, St. Julian’s, Malta. pp.8-17. ⟨hal-04528240⟩

    STL

    Available in free access

  • Conference paper

    Nadège Alavoine, Gaëlle Laperriere, Christophe Servan, Sahar Ghannay, Sophie Rosset. New Semantic Task for the French Spoken Language Understanding MEDIA Benchmark. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024, Torino, Italy. ⟨hal-04523286⟩

    STL

    Available in free access

  • Conference paper

    Nesrine Bannour, Christophe Servan, Aurélie Névéol, Xavier Tannier. A Benchmark Evaluation of Clinical Named Entity Recognition in French. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024, Torino, Italy. ⟨hal-04523267⟩

    STL

    Available in free access

  • Conference paper

    Christophe Servan, Sahar Ghannay, Sophie Rosset. mALBERT: Is a Compact Multilingual BERT Model Still Worth It? The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, May 2024, Torino, Italy. ⟨hal-04520797⟩

    STL

    Available in free access

  • Conference paper

    Aaron Boussidan, Fanny Ducel, Aurélie Névéol, Karën Fort. What ChatGPT tells us about ourselves. Journée d’étude Éthique et TAL 2024, Apr 2024, Nancy, France. ⟨hal-04521121⟩

    STL

    Available in free access

  • Conference paper

    Thierry Hamon, Natalia Grabar. Automatic Prediction of Semantic Labels for French Medical Terms. Medical Informatics Europe conference (MIE2022), May 2022, Nice, France. pp.868-869, ⟨10.3233/SHTI220610⟩. ⟨hal-04519905⟩

    STL

    Available in free access

  • Conference paper

    Pierre Lepagnol, Thomas Gerald, Sahar Ghannay, Christophe Servan, Sophie Rosset. Small Language Models are Good Too: An Empirical Study of Zero-Shot Classification. LREC-COLING 2024, May 2024, Turin, Italy. ⟨hal-04519930v2⟩

    ILES, STL

    Available in free access

  • Conference paper

    Mérième Bouhandi, Emmanuel Morin, Thierry Hamon. Graph Neural Networks for Adapting Off-the-shelf General Domain Language Models to Low-Resource Specialised Domains. 2nd Workshop on Deep Learning on Graphs for Natural Language Processing (DLG4NLP 2022), ACL, Jul 2022, Seattle, Washington, United States. pp.36-42, ⟨10.18653/v1/2022.dlg4nlp-1.5⟩. ⟨hal-04517190⟩

    STL

    Available in free access

  • Conference poster

    Elise Lincker, Léa Pacini, Olivier Pons, Camille Guinaudeau, Jérôme Dupire, et al. MALIN : MAnuels scoLaires INclusifs : Accessibilité numérique des manuels scolaires. Colloque Handiversité 2023 – L’innovation pour le partage, Apr 2023, Gif-sur-Yvette, France. ⟨hal-04410349⟩

    STL

    Available in free access

  • Journal article

    Angèle Gayet-Ageron, Khaoula Ben Messaoud, Mark Oliver Richards, Cyril Jaksic, Julien Gobeill, et al. Assessment of gender and geographical bias in the editorial decision-making process of biomedical journals: A Case-Control study. medRxiv: the Preprint Server for Health Sciences, 2024, ⟨10.1101/2024.03.15.24304220⟩. ⟨hal-04510221⟩

    STL

    Available in free access

  • Journal article

    Clement Bernard, Guillaume Postic, Sahar Ghannay, Fariza Tahi. RNAdvisor: a comprehensive benchmarking tool for the measure and prediction of RNA structural model quality. Briefings in Bioinformatics, 2024, 25 (2), pp.bbae064. ⟨10.1093/bib/bbae064⟩. ⟨hal-04508073⟩

    STL

    Available in free access

  • Journal article

    Anne-Laure Ligozat, Christophe Brun, Benjamin Demirdjian, Guillaume Gouget, Emilie Jardé, et al. Setting Climate Targets: The Case of Higher Education and Research. bioRxiv, 2024, ⟨10.1101/2024.03.11.584380⟩. ⟨hal-04505199⟩

    STL

    Available in free access

  • Conference paper

    Yanis Ouakrim, Hannah Bull, Michèle Gouiffès, Denis Beautemps, Thomas Hueber, et al. Mediapi-RGB: Enabling Technological Breakthroughs in French Sign Language (LSF) Research through an Extensive Video-Text Corpus. VISAPP 2024 – 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Feb 2024, Rome, Italy. ⟨10.5220/0012372600003660⟩. ⟨hal-04494094⟩

    AMI (Architectures et modèles pour l'Interaction), STL

    Available in free access

  • Conference paper

    Aurélie Bugeau, Anne-Laure Ligozat. Analysing ICT in prospective scenarios to help reveal undone computer science. Undone Computer Science conference, Feb 2024, Nantes, France. ⟨hal-04486589⟩

    STL

  • Journal article

    Julien Lefevre, Aurélie Bugeau, Jacques Combaz, Laurent Lefèvre, Anne-Laure Ligozat, et al. Impacts environnementaux de l’IA : quels réels bénéfices ? Collection numérique de l’AMUE, Agence de mutualisation des universités et établissements d’enseignement supérieur, 2023. ⟨hal-04486682⟩

    STL

    Available in free access

  • Book chapter

    Nicholas Asher, Pierre Zweigenbaum. Artificial Intelligence and Language. Pierre Marquis; Odile Papini; Henri Prade. A Guided Tour of Artificial Intelligence Research, III: Interfaces and Applications of Artificial Intelligence (chapter 4), Springer International Publishing, pp.117-145, 2020, 978-3-030-06169-2. ⟨10.1007/978-3-030-06170-8_4⟩. ⟨hal-04483086⟩

    ILES, STL

  • Proceedings

    Reinhard Rapp, Pierre Zweigenbaum, Serge Sharoff. Proceedings of the 13th Workshop on Building and Using Comparable Corpora. LREC 2020, 2020, 979-10-95546-42-9. ⟨hal-04482188⟩

    ILES, STL

  • Conference paper

    Rabab Alkhalifa, Iman Bilal, Hsuvas Borkakoty, Jose Camacho-Collados, Romain Deveaud, et al. Overview of the CLEF-2023 LongEval Lab on Longitudinal Evaluation of Model Performance. CLEF 2023: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Sep 2023, Thessaloniki, Greece. pp.440-458, ⟨10.1007/978-3-031-42448-9_28⟩. ⟨hal-04475726⟩

    ILES, STL

  • Conference paper

    Fatima Hamlaoui, Emmanuel-Moselly Makasso, Markus Müller, Jonas Engelmann, Gilles Adda, et al. BULBasaa: A Bilingual Bàsàá-French Speech Corpus for the Evaluation of Language Documentation Tools. LREC 2018, European Language Resources Association (ELRA), May 2018, Miyazaki, Japan. ⟨hal-04466108⟩

    STL

    Available in free access

  • Conference paper

    Yuming Zhai, Gabriel Illouz, Anne Vilnat. Detecting Non-literal Translations by Fine-tuning Cross-lingual Pre-trained Language Models. 28th International Conference on Computational Linguistics (COLING), Dec 2020, Barcelona (online), Spain. pp.5944-5956, ⟨10.18653/v1/2020.coling-main.522⟩. ⟨hal-04468022⟩

    ILES, STL

    Available in free access

  • Journal article

    Surya Roca, Sophie Rosset, José García, Álvaro Alesanco. A Study on the Impacts of Slot Types and Training Data on Joint Natural Language Understanding in a Spanish Medication Management Assistant Scenario. Sensors, 2022, 22 (6), pp.2364. ⟨10.3390/s22062364⟩. ⟨hal-04465686⟩

    STL

    Available in free access

  • Conference paper

    Laura Spinu, Ioana Vasilescu, Lori Lamel, Jason Lilley. Voicing neutralization in Romanian fricatives across different speech styles. Interspeech, ISCA, Sep 2022, Incheon, South Korea. pp.1342-1346, ⟨10.21437/interspeech.2022-10716⟩. ⟨hal-04465920⟩

    STL, TLP

    Available in free access

  • Proceedings

    Christophe Servan, Anne Vilnat. Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN) : volume 5 : démonstrations. CORIA – TALN 2023, 2023. ⟨hal-04462998⟩

    STL

    Available in free access

  • Proceedings

    Christophe Servan, Anne Vilnat. Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN) : volume 4 : articles déjà soumis ou acceptés en conférence internationale. CORIA – TALN 2023, 2023. ⟨hal-04462975⟩

    STL

    Available in free access

  • Proceedings

    Christophe Servan, Anne Vilnat. Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN) : volume 6 : projets. CORIA – TALN 2023, 2023. ⟨hal-04463005⟩

    STL

    Available in free access

  • Proceedings

    Christophe Servan, Anne Vilnat. Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN) : volume 3 : prises de position en TAL. CORIA – TALN 2023, 2023. ⟨hal-04462921⟩

    STL

    Available in free access

  • Proceedings

    Christophe Servan, Anne Vilnat. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN) : volume 2 : travaux de recherche originaux – articles courts. CORIA – TALN 2023, 2023. ⟨hal-04462841⟩

    STL

    Available in free access

  • Proceedings

    Christophe Servan, Anne Vilnat. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN) : volume 1 : travaux de recherche originaux – articles longs. CORIA – TALN 2023, 2023. ⟨hal-04462825⟩

    STL

    Available in free access

  • Book

    Serge Sharoff, Reinhard Rapp, Pierre Zweigenbaum. Building and Using Comparable Corpora for Multilingual Natural Language Processing. Springer Cham, 2023, Synthesis Lectures on Human Language Technologies, Graeme Hirst, 978-3-031-31383-7. ⟨10.1007/978-3-031-31384-4⟩. ⟨hal-04470213⟩

    STL

  • Book chapter

    Serge Sharoff, Reinhard Rapp, Pierre Zweigenbaum. Basic Principles of Cross-Lingual Models. Building and Using Comparable Corpora for Multilingual Natural Language Processing, Springer International Publishing, pp.9-16, 2023, Synthesis Lectures on Human Language Technologies, ⟨10.1007/978-3-031-31384-4_2⟩. ⟨hal-04465490⟩

    STL

  • Conference paper

    Bernd Dudzik, Tiffany Matej Hrkalovic, Dennis Küster, David St-Onge, Felix Putze, et al. The 5th Workshop on Modeling Socio-Emotional and Cognitive Processes from Multimodal Data in the Wild (MSECP-Wild). ICMI ’23: International Conference On Multimodal Interaction, Oct 2023, Paris, France. pp.828-829, ⟨10.1145/3577190.3616883⟩. ⟨hal-04465456⟩

    STL

    Available in free access

  • Conference paper

    John McDonald, Michael Filhol, Camille Challant. Geometric Modifications of Gestures in Sign Languages. International Society for Gesture Studies conference, Jul 2022, Chicago, United States. ⟨hal-03721241⟩

    STL

  • Conference paper

    Adam Lion-Bouton, Yağmur Öztürk, Agata Savary, Jean-Yves Antoine. Evaluating Diversity of Multiword Expressions in Annotated Text. International Conference on Computational Linguistics (COLING), International Committee on Computational Linguistics, Oct 2022, Gyeongju, South Korea. pp.3285-3295. ⟨hal-04468662⟩

    STL

    Available in free access

  • Journal article

    S Cardoso, X Aimé, V Meininger, D Grabli, Lf Melo Mora, et al. A Modular Ontology for Modeling Service Provision in a Communication Network for Coordination of Care. Studies in Health Technology and Informatics, 2018, 247, pp.890-894. ⟨hal-02481869⟩

    STL

  • Conference paper

    Plinio Barbosa, Philippe Boula de Mareüil. Imitating Broadcast News Style: Commonalities and Differences Between French and Brazilian Professionals. International Conference on Computational Processing of the Portuguese Language (PROPOR 2018), Sep 2018, Canela, Brazil. pp.419-428, ⟨10.1007/978-3-319-99722-3_42⟩. ⟨hal-04466213⟩

    STL, TLP

  • Journal article

    Christopher Norman, Elizabeth Gargon, Mariska Leeflang, Aurélie Névéol, Paula Williamson. Evaluation of an automatic article selection method for timelier updates of the Comet Core Outcome Set database. Database – The journal of Biological Databases and Curation, 2019, 2019, ⟨10.1093/database/baz109⟩. ⟨hal-04466023⟩

    ILES, STL

    Available in free access

  • Conference paper

    Patricia Chiril, Farah Benamara, Véronique Moriceau, Kumar Abhishek. The binary trio at SemEval-2019 Task 5: Multitarget Hate Speech Detection in Tweets. 13th International Workshop on Semantic Evaluation (SemEval 2019), Jun 2019, Minneapolis, United States. pp.489-493, ⟨10.18653/v1/S19-2087⟩. ⟨hal-02951036⟩

    STL

    Available in free access

  • Preprint, working paper

    Marcely Zanon, Bolaji Yusuf, Lucas Ondel, Aline Villavicencio, Laurent Besacier. Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings. 2021. ⟨hal-03477951⟩

    STL
