Location: LISN, Site Belvédère

STL, Thesis

Small Generative Models in Industrial Context: Adaptation by Prompting with Limited Data

PhD defense. Thesis supervised by Sophie Rosset (Research Director, LISN) and co-supervised by Christophe Séjourné (Data/AI Director, SCIAM), with additional supervision from Sahar Ghannay (Associate Professor, LISN). The PhD was conducted under the CIFRE program in collaboration with the company SCIAM.

Speaker: Pierre Lepagnol

Jury

  • François Portet – Professor, LIG (Reviewer and Examiner)
  • Frédéric Bechet – Professor, LIS (Reviewer and Examiner)
  • Agata Savary – Professor, LISN/SEME (Examiner)
  • Nathalie Camelin – Associate Professor, LIA (Examiner)
  • Benoît Favre – Professor, LIS (Examiner)

Abstract

Natural language processing enables automated text analysis for classification and information extraction. However, developing such systems faces major challenges, including the scarcity of annotated data and limited computational resources. This industrial thesis, conducted with the company SCIAM, explores small generative models for utterance classification, slot filling, and named entity recognition in an industrial setting with limited annotations.

Our work is structured around four axes. The first concerns model selection without annotated data: a zero-shot evaluation of 72 models on 15 classification datasets shows that size is not the determining factor. Architecture (seq2seq vs decoder-only) and instruction tuning play a more significant role, enabling models with 1–3B parameters to compete with those exceeding 7B. The second axis focuses on context optimization with few examples: a dynamic prompting approach based on information retrieval (BM25) improves slot-filling performance by up to 21 F1 points compared to intent-based or random example selection, across four datasets (ATIS, SNIPS, SLURP, MEDIA), with no inference overhead. The third axis analyzes the impact of the structured output format: a study of three formats (Key-Value, JSON, XML) on 13 models and 7 datasets reveals gaps of 2 to 46 F1 points, and we propose an automatic selection method that identifies the optimal format at lower cost. Finally, the fourth axis exposes the risks of benchmark contamination and proposes detection methods based on similarity and extraction, making it possible to assess how reliable an evaluation is.
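As an illustration of the second axis, here is a minimal sketch of BM25-based example selection for a few-shot slot-filling prompt. It is not the thesis implementation: the example pool, the prompt wording, the `BM25` class, the `build_prompt` helper, and the smoothed IDF variant are all assumptions made for this sketch, and the utterances are invented rather than taken from ATIS, SNIPS, SLURP, or MEDIA.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; a real system may differ).
    return text.lower().split()


class BM25:
    """Small Okapi BM25 index over a list of strings (illustrative only)."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in corpus]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tfs = [Counter(d) for d in self.docs]
        df = Counter(t for d in self.docs for t in set(d))
        n = len(self.docs)
        # Smoothed IDF (log(1 + ...)) so that weights stay non-negative.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t in tf:
                norm = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
                total += self.idf[t] * tf[t] * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


# Hypothetical annotated pool: (utterance, slots in a key-value format).
POOL = [
    ("show flights from boston to denver", "fromloc=boston; toloc=denver"),
    ("play some jazz music", "genre=jazz"),
    ("book a table for two in paris", "party_size=two; city=paris"),
    ("list flights from dallas to seattle on monday", "fromloc=dallas; toloc=seattle; day=monday"),
]
INDEX = BM25([utterance for utterance, _ in POOL])


def build_prompt(query, k=2):
    # Retrieve the k pool examples most similar to the query, then format them as demonstrations.
    parts = ["Extract the slots from the utterance."]
    for i in INDEX.top_k(query, k):
        utterance, slots = POOL[i]
        parts.append(f"Utterance: {utterance}\nSlots: {slots}")
    parts.append(f"Utterance: {query}\nSlots:")
    return "\n\n".join(parts)


print(build_prompt("flights from denver to dallas"))
```

With this toy pool, the two flight-related utterances are retrieved as demonstrations, while the unrelated music and restaurant examples are left out of the prompt.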

This work, accompanied by public code and reproducible protocols, establishes methodological foundations for efficient, auditable NLP systems suited to industrial constraints.

Publications

  • Conference paper

    Pierre Lepagnol, Thomas Gerald, Sahar Ghannay, Christophe Servan, Sophie Rosset. Les petits modèles sont bons : une étude empirique de classification dans un contexte zero-shot. 31ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2024), Jul 2024, Toulouse, France. pp.113-129. ⟨hal-04623012v2⟩

    STL

    Available in open access

  • Conference paper

    Pierre Lepagnol, Sahar Ghannay, Thomas Gerald, Christophe Servan, Sophie Rosset. Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot Learning. Interspeech 2025, Aug 2025, Rotterdam, Netherlands. ⟨10.21437/Interspeech.2025-175⟩. ⟨hal-05095796⟩

    STL

    Available in open access

  • Conference paper

    Pierre Lepagnol, Thomas Gerald, Sahar Ghannay, Christophe Servan, S. Rosset. Détection des contaminations de LLM par extraction de données : une revue de littérature pratique. 20e Conférence en Recherche d’Information et Applications (CORIA), 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), 27ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL), Les 18e Rencontres Jeunes Chercheurs en RI (RJCRI), Jul 2025, Marseille, France. pp.233-251. ⟨hal-05330614⟩

    Available in open access

Virtual room: here