Reference ontologies play an essential role in organising knowledge in
the life sciences and other domains. They are built and maintained
manually. Since this is an expensive process, many reference ontologies
only cover a small fraction of their domain. The goal of this project is
to develop techniques that automatically extend the coverage of a
reference ontology with classes that have not yet been added manually.
The extension shall be faithful to the
(often implicit) design decisions by the developers of the reference
ontology. While this is a generic problem, our use case addresses the
automatic extension of the Chemical Entities of Biological Interest
(ChEBI) ontology with classes of molecules, since the chemical domain is
particularly suited to our approach. We achieve our goal by using the
leaf classes of the manually curated reference ontology to train a
system to predict subclass relationships between mid-level classes and
new classes. Thus, our method uses machine learning techniques but, in
contrast to other approaches, relies on the content of the ontology
itself rather than on text corpora as input. Class annotations that
provide information relevant to the classification of a given entity
within the ontology play a key role in this learning task. In the case
of ChEBI, for example, these are annotations that represent the
structure of chemical entities (e.g., molecules and functional groups).
In
addition, the axioms of the ontology are represented as logical neural
networks, which are used during the training of prediction models. Thus,
our approach to ontology extension uses neural-symbolic integration.
In our previous work, we established the feasibility of the approach
by comparing the performance of a number of machine learning approaches
at subclass prediction. In spite of the limitations of this initial
work, the performance of some of our models compares favourably with
ClassyFire. The latter is a rule-based system representing the state of
the art for this task, and is already being used in the development of
ChEBI. Furthermore, our results show that different machine learning
approaches are suited to different kinds of chemical entities. Thus, we
plan to use an ensemble approach in our project. The outcomes of this
project will be (a) a benchmark data set for training models for
chemical ontology extension, and (b) a system that, when provided with a
set of new chemical entities as input, will automatically generate a
new ontology that extends ChEBI to cover these entities. The main
benefit of this work is a novel methodology for extending the coverage
of existing reference ontologies. If adopted, it will improve
interoperability and knowledge integration for the communities that use
these reference ontologies. Another benefit will be a novel
neural-symbolic architecture, integrating graph neural networks,
transformers and logical neural networks. We will also explore methods
for explainability of neural networks using neural-symbolic approaches.
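
To illustrate how the content of an ontology itself can serve as training
data, the following is a minimal sketch, not the project's implementation.
It assumes a toy subclass hierarchy and SMILES strings as structure
annotations; the dictionaries SUBCLASS_OF and SMILES, the label list, and
all function names are hypothetical and are not taken from ChEBI. Each
annotated leaf class becomes one training example whose target marks which
of the chosen mid-level classes are among its superclasses.

```python
# Hypothetical toy hierarchy: class -> list of direct superclasses.
SUBCLASS_OF = {
    "ethanol": ["primary alcohol"],
    "primary alcohol": ["alcohol"],
    "alcohol": ["organic molecular entity"],
    "acetic acid": ["monocarboxylic acid"],
    "monocarboxylic acid": ["carboxylic acid"],
    "carboxylic acid": ["organic molecular entity"],
    "organic molecular entity": [],
}

# Assumed structure annotations (SMILES strings) for the leaf classes.
SMILES = {"ethanol": "CCO", "acetic acid": "CC(O)=O"}


def ancestors(cls, subclass_of):
    """Return all direct and indirect superclasses of cls."""
    seen = set()
    stack = list(subclass_of.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(subclass_of.get(parent, []))
    return seen


def leaf_classes(subclass_of):
    """Return the classes that are not a superclass of any other class."""
    parents = {p for supers in subclass_of.values() for p in supers}
    return sorted(c for c in subclass_of if c not in parents)


def build_training_set(subclass_of, annotations, label_classes):
    """Pair each annotated leaf with a 0/1 vector over label_classes."""
    rows = []
    for leaf in leaf_classes(subclass_of):
        if leaf not in annotations:
            continue  # leaves without a structure annotation are skipped
        supers = ancestors(leaf, subclass_of)
        target = [1 if label in supers else 0 for label in label_classes]
        rows.append((annotations[leaf], target))
    return rows


LABELS = ["alcohol", "primary alcohol", "carboxylic acid", "monocarboxylic acid"]
training_set = build_training_set(SUBCLASS_OF, SMILES, LABELS)
print(training_set)
# [('CC(O)=O', [0, 0, 1, 1]), ('CCO', [1, 1, 0, 0])]
```

A model trained on such pairs could then be asked to predict the target
vector for the structure of a class that is not yet in the ontology, which
corresponds to predicting its subclass relationships to the mid-level
classes.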