Location: LISN, Site Belvédère

Categories: Sciences and technology, Thesis

Weakly Supervised Models for Computational Language Documentation

Thesis supervised by François YVON

Speaker: Shu OKABE

Thesis committee

  • Claire Gardent, senior researcher at LORIA (CNRS, Université de Lorraine);
  • Alexis Nasr, professor at LIS (CNRS, UTLN, Université Aix Marseille);
  • Roland Kuhn, Principal Research Officer at the National Research Council Canada;
  • François Pellegrino, senior researcher at DDL – ISH (CNRS, Université de Lyon 2);
  • Agata Savary, professor at LISN (CNRS, Université Paris-Saclay);
  • Laurent Besacier, Principal Scientist at Naver Labs Europe and professor at Université Grenoble-Alpes;
  • François Yvon, doctoral advisor, senior researcher at ISIR (CNRS, Sorbonne Université).

Keywords

Computational language documentation, word segmentation, Bayesian non-parametric model, interlinear gloss generation, weak supervision, field linguistics

Abstract

With half of the languages spoken today threatened with extinction by the end of the century, language documentation is a field of linguistics dedicated notably to the recording, annotation, and archiving of linguistic data. In this context, computational language documentation aims to devise tools that ease several steps of the documentation process for linguists through natural language processing. As part of the CLD2025 computational language documentation project, this thesis focuses on two tasks: word segmentation, which identifies word boundaries in the unsegmented transcription of a recorded sentence, and automatic interlinear glossing, which predicts linguistic annotations for each unit of the sentence. For the first task, we improve the performance of the Bayesian non-parametric models used until now through weak supervision, leveraging resources that are realistically available during documentation, such as already-segmented sentences or dictionaries. Since these models still tend to over-segment, we introduce a second segmentation level: morphemes. Our experiments with various types of two-level segmentation models indicate a slight improvement in segmentation quality, but they also expose the limits of distinguishing words from morphemes with statistical cues only. The second task concerns the generation of either grammatical or lexical glosses. As the latter cannot be predicted from the training data alone, our statistical sequence-labelling model adapts the set of possible labels to each sentence, providing a competitive alternative to the most recent neural models.
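
For readers curious about the modelling behind the first task, below is a minimal sketch of a unigram Dirichlet-process word segmentation model in the style such Bayesian non-parametric approaches typically take, trained by Gibbs sampling over boundary positions. Everything here is illustrative rather than the thesis implementation: the class name `UnigramDPSegmenter`, the hyperparameters, and the dictionary-reweighted base distribution (one simple way to picture weak supervision from dictionaries) are assumptions, and the sampler scores the two halves of a split independently, a simplification of the exact collapsed sampler.

```python
import random
from collections import Counter

class UnigramDPSegmenter:
    """Unigram Dirichlet-process word segmentation, trained by Gibbs
    sampling over boundary variables. A small dictionary, when given,
    reshapes the base distribution P0 -- one simple way to inject weak
    supervision (hypothetical simplification, not the thesis model)."""

    def __init__(self, sentences, alpha=20.0, p_stop=0.5,
                 dictionary=None, dict_weight=0.5):
        self.sentences = sentences            # unsegmented strings
        self.alpha = alpha                    # DP concentration
        self.p_stop = p_stop                  # geometric length prior
        self.dictionary = set(dictionary or [])
        self.dict_weight = dict_weight if self.dictionary else 0.0
        self.alphabet = {c for s in sentences for c in s}
        # Start from random cut positions strictly inside each sentence.
        self.boundaries = [sorted(j for j in range(1, len(s))
                                  if random.random() < 0.3)
                           for s in sentences]
        self.counts = Counter()
        for s, cuts in zip(sentences, self.boundaries):
            self.counts.update(self._words(s, cuts))

    def _words(self, s, cuts):
        edges = [0] + list(cuts) + [len(s)]
        return [s[a:b] for a, b in zip(edges, edges[1:])]

    def _p0(self, w):
        # Base measure: geometric length times uniform characters,
        # mixed with a uniform distribution over dictionary entries.
        chars = (self.p_stop * (1 - self.p_stop) ** (len(w) - 1)
                 * (1.0 / len(self.alphabet)) ** len(w))
        in_dict = 1.0 / len(self.dictionary) if w in self.dictionary else 0.0
        return (1 - self.dict_weight) * chars + self.dict_weight * in_dict

    def _p_word(self, w):
        # Chinese-restaurant-process predictive probability.
        total = sum(self.counts.values())
        return (self.counts[w] + self.alpha * self._p0(w)) / (total + self.alpha)

    def gibbs_sweep(self):
        for i, s in enumerate(self.sentences):
            for pos in range(1, len(s)):
                cuts = set(self.boundaries[i])
                left = max((c for c in cuts if c < pos), default=0)
                right = min((c for c in cuts if c > pos), default=len(s))
                whole, w1, w2 = s[left:right], s[left:pos], s[pos:right]
                # Remove the word(s) currently covering this position.
                self.counts.subtract([w1, w2] if pos in cuts else [whole])
                # Score both hypotheses (the split scores its two halves
                # independently -- a simplification of the exact sampler).
                p_split = self._p_word(w1) * self._p_word(w2)
                p_join = self._p_word(whole)
                if random.random() < p_split / (p_split + p_join):
                    cuts.add(pos)
                    self.counts.update([w1, w2])
                else:
                    cuts.discard(pos)
                    self.counts.update([whole])
                self.boundaries[i] = sorted(cuts)

    def segment(self, i):
        return self._words(self.sentences[i], self.boundaries[i])

# Toy usage on a trivially repetitive corpus.
corpus = ["iwantthis", "thisiswhatiwant", "iwantthat"]
model = UnigramDPSegmenter(corpus, dictionary={"i", "want", "this"})
for _ in range(100):
    model.gibbs_sweep()
print([model.segment(i) for i in range(len(corpus))])
```

Mixing dictionary entries into the base distribution nudges the sampler towards attested words without forbidding novel ones, which is the spirit of weak supervision: constrain, do not dictate.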
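
The second task can be sketched in the same hedged spirit: glossing as sequence labelling where the label set is rebuilt for every sentence, combining a closed inventory of grammatical glosses observed in training with lexical glosses harvested from the sentence's translation. The class, the capitalisation convention for grammatical glosses, and the count-plus-overlap scoring below are hypothetical toys, not the statistical model evaluated in the thesis.

```python
from collections import Counter, defaultdict

class AdaptiveGlosser:
    """Toy gloss labeller with a sentence-specific label set: a closed
    inventory of grammatical glosses learned in training, plus lexical
    glosses harvested from the sentence's translation (all hypothetical)."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # counts[morpheme][gloss]
        self.grammatical = set()             # e.g. PST, 3SG, PL

    def train(self, corpus):
        # corpus: list of (morphemes, glosses) pairs from glossed data.
        for morphemes, glosses in corpus:
            for m, g in zip(morphemes, glosses):
                self.counts[m][g] += 1
                if g.isupper():              # convention: grammatical glosses in caps
                    self.grammatical.add(g)

    def candidates(self, translation):
        # The label set is rebuilt for every sentence, so lexical glosses
        # never seen in training remain reachable via the translation.
        lexical = {w.lower().strip(".,;!?") for w in translation.split()}
        return self.grammatical | lexical

    def gloss(self, morphemes, translation):
        cands = self.candidates(translation)
        output = []
        for m in morphemes:
            # Score: training co-occurrence count, plus a small character
            # overlap bonus so unseen lexical glosses can still win.
            def score(g):
                overlap = len(set(m) & set(g)) / max(len(g), 1)
                return self.counts[m][g] + 0.1 * overlap
            output.append(max(cands, key=score) if cands else m)
        return output

# Toy usage: "s" was glossed PL in training; "walked" is recovered from
# the translation even though it never occurred in the training data.
glosser = AdaptiveGlosser()
glosser.train([(["dog", "s", "run", "PST"], ["dog", "PL", "run", "PST"])])
print(glosser.gloss(["cat", "s", "walk", "PST"], "The cats walked."))
```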