corpus

PARSEME corpora annotated for verbal multiword expressions

Parseme corpus

Created: 01-20-2017

Updated: 05-10-2023

Agata Savary

Under maintenance

CC, GNU GPL

This multilingual resource contains corpora in which verbal multiword expressions (VMWEs) have been manually annotated. The data cover 26 languages and combine the corpora from all three previous editions (1.0, 1.1 and 1.2). VMWEs were annotated according to the universal guidelines. The corpora are provided in the cupt format, inspired by the CoNLL-U format. Morphological and syntactic information, including parts of speech, lemmas, morphological features and/or syntactic dependencies, is also provided. Depending on the language, this information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). All corpora are split into training, development and test data, following the splitting strategy adopted for the PARSEME Shared Task 1.2. More info: https://gitlab.com/parseme/corpora/-/wikis/home
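
For readers who want to inspect the annotations programmatically, here is a minimal Python sketch of reading cupt data. It rests on assumptions not stated in this record: that each token line is tab-separated in CoNLL-U style with the form in the second column, and that the last column carries the VMWE annotation, where `*` or `_` means no annotation, `1:LVC.full` opens expression 1 with its category, a bare `1` continues it, and `;` separates several expressions on one token. The function name `read_cupt` and the sample sentence are invented for illustration; check the format description linked above before relying on this.

```python
from collections import defaultdict


def read_cupt(text):
    """Yield (forms, vmwes) for each sentence in a cupt-formatted string.

    forms is the list of token forms; vmwes is a list of
    (category, [member forms]) pairs. Column layout is an assumption.
    """
    forms, members, categories = [], defaultdict(list), {}
    # The extra empty string flushes the last sentence if the text
    # does not end with a blank line.
    for line in text.splitlines() + [""]:
        if line.startswith("#"):  # comment / metadata line
            continue
        if not line.strip():  # blank line closes a sentence
            if forms:
                vmwes = [(categories.get(mwe_id), [forms[i] for i in idxs])
                         for mwe_id, idxs in members.items()]
                yield forms, vmwes
            forms, members, categories = [], defaultdict(list), {}
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip range or empty-node IDs (e.g. 1-2, 1.1)
            continue
        forms.append(cols[1])
        tag = cols[-1]  # assumed VMWE column
        if tag in ("*", "_"):
            continue
        for part in tag.split(";"):
            mwe_id, _, category = part.partition(":")
            members[mwe_id].append(len(forms) - 1)
            if category:  # only the first token of a VMWE names its category
                categories[mwe_id] = category


# Invented sample sentence, not taken from the corpora.
SAMPLE = "\n".join([
    "# text = He took a walk",
    "\t".join(["1", "He", "he", "PRON", "_", "_", "2", "nsubj", "_", "_", "*"]),
    "\t".join(["2", "took", "take", "VERB", "_", "_", "0", "root", "_", "_", "1:LVC.full"]),
    "\t".join(["3", "a", "a", "DET", "_", "_", "4", "det", "_", "_", "*"]),
    "\t".join(["4", "walk", "walk", "NOUN", "_", "_", "2", "obj", "_", "_", "1"]),
    "",
])

if __name__ == "__main__":
    for forms, vmwes in read_cupt(SAMPLE):
        print(forms, vmwes)
```

To apply the same function to a downloaded file, pass its contents as a string, for example `read_cupt(open(path, encoding="utf-8").read())`, where `path` points to one of the training, development or test files.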

Notes:

release 1.0: http://hdl.handle.net/11372/LRT-2282
release 1.1: http://hdl.handle.net/11372/LRT-2842
release 1.2: http://hdl.handle.net/11234/1-3367