Location LISN Site Belvédère

Language Sciences and Technology, Thesis

Information Extraction from Electronic Health Records: Studies on temporal ordering, privacy and environmental impact

Speaker: Nesrine Bannour

Jury

  • Maxime AMBLARD (Professor, Université de Lorraine): Reviewer & Examiner
  • Timothy MILLER (Assistant Professor, Harvard University, Boston Children’s Hospital): Reviewer & Examiner
  • Fleur MOUGIN (Professor, Université de Bordeaux): Examiner
  • Fatiha SAIS (Professor, Université Paris-Saclay): Examiner
  • Bastien RANCE (Associate Professor – Hospital Practitioner, HEGP, Université Paris Cité): Co-supervisor
  • Xavier TANNIER (Professor, LIMICS, Sorbonne Université): Co-supervisor
  • Aurélie NÉVÉOL (Research Director, LISN, CNRS): Thesis supervisor

Abstract

Automatically extracting the rich information contained in Electronic Health Records (EHRs) is crucial for improving clinical research. However, most of this information is in the form of unstructured text. The complexity and sensitive nature of clinical text pose further challenges: in practice, sharing data is difficult and is governed by regulations. Neural models have shown impressive results for Information Extraction, but they need large amounts of manually annotated data, which are often scarce, particularly for languages other than English. As a result, performance is still not sufficient for practical use. In addition to privacy issues, using deep learning models has a significant environmental impact. In this thesis, we develop methods and resources for clinical Named Entity Recognition (NER) and Temporal Relation Extraction (TRE) in French clinical narratives.

Specifically, we propose a privacy-preserving mimic-model architecture that uses mimic learning to transfer knowledge from a teacher model trained on a private corpus to a student model. This student model can be publicly shared without disclosing the original sensitive data or the private teacher model from which it learned. Our strategy offers a good compromise between performance and data privacy preservation.
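To make the teacher-to-student transfer concrete, here is a minimal sketch of the general mimic learning idea. It is an illustration only, not the architecture developed in the thesis: the teacher is a hypothetical lexicon lookup standing in for a model trained on private data, the student is a simple most-frequent-label tagger, and all names, labels and sentences are invented for the example.

```python
from collections import Counter, defaultdict

def teacher_tag(tokens):
    """Hypothetical stand-in for a private teacher NER model (never shared)."""
    private_lexicon = {"aspirine": "DRUG", "fièvre": "PROBLEM"}
    return [private_lexicon.get(t.lower(), "O") for t in tokens]

def train_student(public_sentences):
    """Fit a student on silver labels that the teacher assigns to public text."""
    counts = defaultdict(Counter)
    for tokens in public_sentences:
        for token, label in zip(tokens, teacher_tag(tokens)):
            counts[token.lower()][label] += 1
    # The student stores only token -> most frequent silver label.
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

def student_tag(student, tokens):
    """Tag tokens with the student; unseen tokens default to 'O'."""
    return [student.get(t.lower(), "O") for t in tokens]

# Public, non-sensitive sentences (invented for illustration).
public_sentences = [
    ["Le", "patient", "prend", "de", "l'", "aspirine"],
    ["Pas", "de", "fièvre", "ce", "matin"],
]
student = train_student(public_sentences)
print(student_tag(student, ["Aspirine", "et", "fièvre"]))
```

In this sketch the student only ever sees public text and the teacher's predictions on it, which is what would allow it to be released without releasing the private corpus or the teacher.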
Then, we introduce a novel event- and task-independent representation of temporal relations. Our representation makes it possible to identify text portions that are homogeneous from a temporal standpoint and to classify the relation between each text portion and the document creation time. This makes the annotation and extraction of temporal relations easier and reproducible across different event types, since events do not need to be defined and extracted beforehand.
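One way to picture such a representation is as a list of labeled character spans. The sketch below is an assumption-laden illustration, not the thesis annotation scheme: the `TemporalPortion` class, the `relation_at` helper, the label set (BEFORE, OVERLAP, AFTER) and the example note are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TemporalPortion:
    start: int     # character offset, inclusive
    end: int       # character offset, exclusive
    relation: str  # relation to the document creation time (assumed label set)

def relation_at(portions: List[TemporalPortion], offset: int) -> Optional[str]:
    """Return the relation of the portion covering a character offset, if any."""
    for p in portions:
        if p.start <= offset < p.end:
            return p.relation
    return None

# Invented note, split into temporally homogeneous portions.
segments = [
    ("Chimiothérapie en 2019.", "BEFORE"),
    ("Douleur ce jour.", "OVERLAP"),
    ("Scanner prévu dans un mois.", "AFTER"),
]
text = " ".join(s for s, _ in segments)

portions = []
cursor = 0
for segment, relation in segments:
    start = text.index(segment, cursor)
    portions.append(TemporalPortion(start, start + len(segment), relation))
    cursor = start + len(segment)

# Any mention can be situated in time without a prior event definition.
print(relation_at(portions, text.index("Scanner")))
```

Under these assumptions, anything mentioned inside a portion can be read against that portion's relation to the document creation time, whatever type of event it is.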
Finally, we conduct a comparative analysis of existing tools for measuring the carbon emissions of NLP models. We adopt one of the studied tools to calculate the carbon footprint of all the models created during the thesis, as we consider this a first step toward increasing awareness and control of their environmental impact.
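As a rough illustration of what such tools estimate, the sketch below applies a common back-of-the-envelope approach: energy consumed multiplied by the carbon intensity of the electricity. It does not reproduce the tool adopted in the thesis; the function name, its parameters and the numeric values are assumptions chosen for the example.

```python
def estimate_emissions_kg(avg_power_watts, hours, pue, carbon_intensity_kg_per_kwh):
    """Estimate CO2-equivalent emissions (kg) for one training run.

    avg_power_watts: average hardware power draw (assumed measured or looked up)
    hours: run duration
    pue: power usage effectiveness of the data center (overhead factor, >= 1)
    carbon_intensity_kg_per_kwh: kg CO2e emitted per kWh of electricity
    """
    energy_kwh = avg_power_watts * hours / 1000.0 * pue
    return energy_kwh * carbon_intensity_kg_per_kwh

# Illustrative values only: a 250 W GPU running for 10 hours.
emissions = estimate_emissions_kg(
    avg_power_watts=250, hours=10, pue=1.5, carbon_intensity_kg_per_kwh=0.05
)
print(f"{emissions:.4f} kg CO2e")
```

The result is very sensitive to the carbon intensity and overhead factor, which differ between countries and data centers, so the same run can yield very different estimates depending on the values assumed.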

To summarize, we generate shareable privacy-preserving NER models that clinicians can use efficiently. We also demonstrate that the TRE task can be tackled independently of the application domain and that good results can be obtained on real-world oncology clinical notes.