Small Generative Models in Industrial Context: Adaptation by Prompting with Limited Data

PhD defense. Thesis supervised by Sophie Rosset (Research Director, LISN) and co-supervised by Christophe Séjourné (Data/AI Director, SCIAM), with additional supervision from Sahar Ghannay (Associate Professor, LISN). The PhD was conducted under the CIFRE program in collaboration with the company SCIAM.

Speaker: Pierre Lepagnol

Jury

François Portet – Professor, LIG (Reviewer and Examiner)
Frédéric Bechet – Professor, LIS (Reviewer and Examiner)
Agata Savary – Professor, LISN/SEME (Examiner)
Nathalie Camelin – Associate Professor, LIA (Examiner)
Benoît Favre – Professor, LIS (Examiner)

Abstract

Natural language processing enables automated text analysis for classification and information extraction. However, developing such systems faces major challenges, including the scarcity of annotated data and limited computational resources. This industrial thesis, conducted with the company SCIAM, explores small generative models for utterance classification, slot filling, and named entity recognition in an industrial setting with limited annotations.

Our work is structured around four axes. The first concerns model selection without annotated data: a zero-shot evaluation of 72 models on 15 classification datasets shows that size is not the determining factor; architecture (seq2seq vs. decoder-only) and instruction tuning play a more significant role, enabling models with 1–3B parameters to compete with those exceeding 7B. The second axis focuses on context optimization with few examples: a dynamic prompting approach based on information retrieval (BM25) improves slot-filling performance by up to 21 F1 points compared to intent-based or random example selection, across four datasets (ATIS, SNIPS, SLURP, MEDIA), with no inference overhead. The third axis analyzes the impact of the structured output format: a study of three formats (Key-Value, JSON, XML) on 13 models and 7 datasets reveals gaps of 2 to 46 F1 points, and we propose an automatic selection method that identifies the optimal format at lower cost. Finally, the fourth axis exposes the risks of benchmark contamination and proposes detection methods based on similarity and extraction, making it possible to assess the reliability of evaluations.
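To make the second axis concrete, the sketch below illustrates the general idea of dynamic prompting with BM25: for each incoming utterance, the most lexically similar annotated utterances are retrieved and used as few-shot examples. It is a minimal illustration under stated assumptions, not the thesis implementation: the `BM25` class, the `build_prompt` helper, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the prompt template, and the four-utterance pool with its slot labels are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization.
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 scorer over a small in-memory corpus."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in corpus]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for d in self.docs for term in set(d))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}
        self.tfs = [Counter(d) for d in self.docs]

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            f = tf[t]
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        # Indices of the k highest-scoring documents, best first.
        ranked = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


def build_prompt(query, pool, k=2):
    """Select k examples from the annotated pool with BM25 and format a prompt."""
    bm25 = BM25([utterance for utterance, _ in pool])
    lines = []
    for i in bm25.top_k(query, k):
        utterance, slots = pool[i]
        lines.append(f"Utterance: {utterance}\nSlots: {slots}\n")
    lines.append(f"Utterance: {query}\nSlots:")
    return "\n".join(lines)


# Hypothetical annotated pool (utterance, slot annotation).
pool = [
    ("book a flight from boston to denver", "from_city=boston; to_city=denver"),
    ("play some jazz music", "genre=jazz"),
    ("what is the weather in paris tomorrow", "city=paris; date=tomorrow"),
    ("show flights from dallas to seattle on monday", "from_city=dallas; to_city=seattle; day=monday"),
]

query = "find a flight from chicago to boston"
selected = BM25([u for u, _ in pool]).top_k(query, 2)
prompt = build_prompt(query, pool, k=2)
print(prompt)
```

In this toy pool, the two flight-related utterances are retrieved and the unrelated music and weather examples are left out of the prompt. Retrieval happens before generation, so the prompt keeps the same number of examples as a random or intent-based selection would.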

This work, accompanied by public code and reproducible protocols, establishes methodological foundations for efficient, auditable NLP systems suited to industrial constraints.

Publications

Conference paper

Pierre Lepagnol, Thomas Gerald, Sahar Ghannay, Christophe Servan, Sophie Rosset. Les petits modèles sont bons : une étude empirique de classification dans un contexte zero-shot. 31ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2024), Jul 2024, Toulouse, France. pp.113-129. ⟨hal-04623012v2⟩

STL

Year of publication 2024

Open access

HAL publication

Conference paper

Pierre Lepagnol, Sahar Ghannay, Thomas Gerald, Christophe Servan, Sophie Rosset. Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot Learning. Interspeech 2025, Aug 2025, Rotterdam, Netherlands. ⟨10.21437/Interspeech.2025-175⟩. ⟨hal-05095796⟩

STL

Year of publication 2025

Open access

HAL publication

Conference paper

Pierre Lepagnol, Thomas Gerald, Sahar Ghannay, Christophe Servan, Sophie Rosset. Détection des contaminations de LLM par extraction de données : une revue de littérature pratique. 20e Conférence en Recherche d’Information et Applications (CORIA), 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), 27ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL), 18e Rencontres Jeunes Chercheurs en RI (RJCRI), Jul 2025, Marseille, France. pp.233-251. ⟨hal-05330614⟩

Year of publication 2025

Open access

HAL publication
