Axis : IID
Subject : IID-5 Natural Language Processing
Supervisors : Claire Nedellec, Pierre Zweigenbaum, Louise Deléger
Managing laboratory : MaIAGE-INRAE
Other partners : LISN
PhD student : Anfu Tang
Starting year : 2020
Abstract : This thesis addresses the extraction of relational information from scientific documents in Life Sciences, i.e. transforming unstructured text into machine-readable structured information. Extracting the semantic relationships between entities detected in text makes the underlying structures explicit and formalizes them. Current state-of-the-art methods rely on supervised machine learning. Supervised learning, and even more so recent deep learning methods, requires many training examples that are costly to produce, all the more so in specialized domains such as Life Sciences. We hypothesize that combining the information and knowledge available in specific domains with the latest deep learning word embedding models can offset the absence or limited amount of annotated training data. For this purpose, the thesis will design a rich representation of texts that draws on both linguistic information obtained from syntactic parsing and domain knowledge obtained from knowledge graphs such as ontologies. Integrating ontologies into the information extraction process will additionally facilitate information integration with other data, such as experimental or analytical data.
Participation in DigiCosme ResearchDays 2020
Anfu Tang presented his first research results online on November 10, 2020. His talk, entitled “Extraction of relational information from text in the biological domain”, introduced the thesis objectives and the preliminary results obtained by combining global alignment of shortest dependency paths with an SVM.
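
As an illustration only, the global alignment step mentioned above can be sketched with the classic Needleman-Wunsch recurrence applied to two dependency paths written as token sequences. This page does not describe the method used in the thesis, so everything here is an assumption: the function name `global_alignment_score`, the scoring values (match, mismatch, gap) and the example paths are hypothetical. A similarity of this kind could in principle feed an SVM, which is not shown.

```python
def global_alignment_score(path_a, path_b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Needleman-Wunsch global alignment score between two token sequences.

    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    rows, cols = len(path_a) + 1, len(path_b) + 1
    # score[i][j] = best score aligning path_a[:i] with path_b[:j]
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # prefix of path_a aligned only to gaps
    for j in range(1, cols):
        score[0][j] = j * gap  # prefix of path_b aligned only to gaps
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if path_a[i - 1] == path_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align the two tokens
                score[i - 1][j] + gap,               # gap in path_b
                score[i][j - 1] + gap,               # gap in path_a
            )
    return score[-1][-1]


# Hypothetical shortest dependency paths between two entity mentions,
# written as alternating words and dependency labels.
path_1 = ["BACTERIUM", "nsubj", "colonize", "dobj", "HABITAT"]
path_2 = ["BACTERIUM", "nsubj", "live", "prep_in", "HABITAT"]
similarity = global_alignment_score(path_1, path_2)
```

Because the alignment is global, both paths are compared end to end, so a difference in length is penalized through the gap cost rather than ignored.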