Location LISN Site Belvédère
Languages Sciences and technology, Thesis
Speaker: Shu OKABE
Keywords: computational language documentation, word segmentation, Bayesian non-parametric model, interlinear gloss generation, weak supervision, field linguistics
With half of the languages spoken today threatened with extinction by the end of the century, language documentation is a field of linguistics dedicated notably to the recording, annotation, and archiving of data. In this context, computational language documentation aims to devise tools that help linguists with several documentation steps through natural language processing approaches. As part of the CLD2025 computational language documentation project, this thesis focuses mainly on two tasks: word segmentation, which identifies word boundaries in an unsegmented transcription of a recorded sentence, and automatic interlinear glossing, which predicts linguistic annotations for each unit of the sentence. For the first task, we improve the performance of the Bayesian non-parametric models used until now through weak supervision. For this purpose, we leverage resources that are realistically available during documentation, such as already-segmented sentences or dictionaries. Since we still observe an over-segmenting tendency in our models, we introduce a second segmentation level: morphemes. Our experiments with various types of two-level segmentation models indicate a slight improvement in segmentation quality. Still, we also face limitations in differentiating words from morphemes using statistical cues only. The second task concerns the generation of either grammatical or lexical glosses. As the latter cannot be predicted from training data alone, our statistical sequence-labelling model adapts the set of possible labels for each sentence and provides a competitive alternative to the most recent neural models.
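To make the over-segmenting tendency mentioned above concrete, here is a minimal sketch of how a predicted segmentation can be compared with a reference one using boundary precision and recall. This is a common way to score word segmentation, but it is an assumption here, not necessarily the evaluation used in the thesis. The function names and the English toy sentence are hypothetical and serve only as illustration.

```python
from __future__ import annotations


def boundary_positions(segmented: str) -> set[int]:
    """Character offsets, in the unsegmented string, where a word boundary falls."""
    positions = set()
    offset = 0
    words = segmented.split()
    # The end of the last word is not counted as a boundary.
    for word in words[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions


def boundary_scores(gold: str, predicted: str) -> tuple[float, float, float]:
    """Boundary precision, recall, and F1 of `predicted` against `gold`."""
    if "".join(gold.split()) != "".join(predicted.split()):
        raise ValueError("both segmentations must cover the same characters")
    gold_b = boundary_positions(gold)
    pred_b = boundary_positions(predicted)
    correct = len(gold_b & pred_b)
    precision = correct / len(pred_b) if pred_b else 0.0
    recall = correct / len(gold_b) if gold_b else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy example: the prediction splits words into morpheme-like pieces.
precision, recall, f1 = boundary_scores("the dogs barked", "the dog s bark ed")
print(precision, recall, f1)
```

In this toy case every reference boundary is found, so recall is perfect, but half of the predicted boundaries are spurious, so precision drops. High recall with low precision is the typical signature of an over-segmenting model, and the extra boundaries often fall at morpheme edges, which is consistent with the motivation for a second, morpheme-level segmentation.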