Lilia Berrahou PhD defense

Lilia Berrahou's PhD defense took place on 29 September 2015 at LIRMM, Montpellier.

The title of the thesis is "N-ary relation arguments extraction from texts guided by a domain OTR".

Today, a huge amount of data is made available to the research community through several
web-based libraries. Enhancing data collected from scientific documents is a major
challenge if domain knowledge is to be analyzed and reused efficiently. To be enhanced,
data need to be extracted from documents and structured in a common representation
using a controlled vocabulary as in ontologies. Our research deals with knowledge engineering
issues of experimental data, extracted from scientific articles, in order to reuse
them in decision support systems. Experimental data can be represented by n-ary relations
which link a studied object (e.g. food packaging, transformation process) with its
features (e.g. oxygen permeability in packaging, biomass grinding), and stored in an
Ontological and Terminological Resource (OTR). An OTR associates an ontology with
a terminological and/or a linguistic part in order to establish a clear distinction between
the term and the notion it denotes (the concept).
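
As an illustration of the n-ary relation described above, the following minimal Python sketch models one relation instance linking a studied object to several feature arguments. The class names, field names, concept labels and numeric values are all hypothetical and chosen only for illustration; they are not taken from the thesis or from any actual OTR.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Argument:
    """One argument of an n-ary relation: a quantity concept with a value and a unit."""
    concept: str   # name of the OTR concept this argument instantiates (hypothetical label)
    value: float   # numeric value reported in the document
    unit: str      # unit of measure as written in the OTR


@dataclass
class NaryRelationInstance:
    """A relation instance linking a studied object to its feature arguments."""
    relation: str
    studied_object: str
    arguments: List[Argument] = field(default_factory=list)

    def argument(self, concept: str) -> Optional[Argument]:
        """Return the first argument for the given concept, or None if absent."""
        for arg in self.arguments:
            if arg.concept == concept:
                return arg
        return None


# Made-up example values, for illustration only.
instance = NaryRelationInstance(
    relation="O2 permeability relation",
    studied_object="packaging film (hypothetical sample)",
    arguments=[
        Argument("Thickness", 50.0, "µm"),
        Argument("Temperature", 23.0, "°C"),
        Argument("O2 permeability", 1.2e-15, "mol.m-1.s-1.Pa-1"),
    ],
)

print(instance.argument("Temperature"))
```

Populating an OTR with new instances then amounts to filling such records with arguments found in text.
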
Our work focuses on n-ary relation extraction from scientific documents in order to populate
a domain OTR with new instances. Our contributions are based on Natural Language
Processing (NLP) together with data mining approaches guided by the domain
OTR. More precisely, we first focus on the extraction of units of measure, which
are known to be difficult to identify because of their typographic variations. We propose
to rely on automatic classification of texts, using supervised learning methods, to reduce
the search space of variants of units, and then, we propose a new similarity measure that
identifies them, taking into account their syntactic properties. Secondly, we propose to
adapt and combine data mining methods (sequential pattern and rule mining) and syntactic
analysis in order to overcome the challenging process of identifying and extracting
n-ary relation instances buried in unstructured texts.
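
To give a rough idea of how typographic variants of a unit might be matched against units already known to an OTR, the sketch below uses a plain normalized Levenshtein (edit distance) similarity. This is a generic stand-in, not the similarity measure proposed in the thesis, which is not detailed here. The function names, the threshold of 0.6 and the example unit strings are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (cost 0 if characters match)
            ))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1.0 for identical strings, lower as edits accumulate."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def closest_unit(candidate: str, known_units, threshold: float = 0.6):
    """Return the known unit most similar to the candidate, or None below the threshold."""
    best = max(known_units, key=lambda u: similarity(candidate, u))
    return best if similarity(candidate, best) >= threshold else None


# Hypothetical unit strings: one OTR-style spelling and a variant as it might appear in text.
known = ["cm3.m-2.day-1", "kg.m-3"]
print(closest_unit("cm3 m-2 day-1", known))
```

A character-level measure like this ignores the internal structure of units (prefixes, exponents, separators), which is one reason a dedicated measure is needed in practice.
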