Learning Representations of Abstraction in Text

Social media has great democratising potential, but the vast diversity of opinions makes it difficult to understand what everyone really thinks. If we could automatically extract the consensus opinions and major current issues from everyone's opinions, and broadcast that summary to everyone, it would be a powerful new channel of social communication. The overall objective of this project is to solve the fundamental technical challenges required for this large-scale opinion summarisation, including how to model semantic abstraction in text. The impressive recent advances in natural language understanding brought by deep learning and related latent vector models put us in a position to solve these challenges. In particular, work by the PI on modelling abstraction in a vector space has already improved models of abstraction between words. The specific aims of this project are to extend this research to models of semantic abstraction between texts and models of summarising large sets of opinions. Detecting when one text is an abstraction of another (also known as textual entailment) is a fundamental problem in natural language semantics. Task 1 of this project will develop models of textual entailment, including associated machine learning methods for modelling abstraction and unification. Extending the PI's model of abstraction between words (hyponymy, or lexical entailment) will require modelling both semantic composition (unification) and complex non-compositional reasoning. This work will leverage a long-standing theme of the PI's research, non-parametric vector space representations (where the number of vectors grows with the complexity of the semantics). Using a bag of entailment vectors to represent the semantics of a text will allow attention-based deep learning architectures to model both the semantic structure of language and the complex reasoning needed for the general case of textual entailment.
Evaluation will be on benchmark textual entailment datasets, such as the Stanford Natural Language Inference corpus, using semi-supervised learning. Such models of abstraction in text will enable advances in summarising sets of opinions. Task 2 will develop models of opinion summarisation, including associated machine learning methods for modelling abstraction and intersection. Abstraction is crucial for controlling the complexity of summaries, given that everyone's opinion is different. The summary should include the consensus opinions and major dimensions of disagreement, which are abstract statements entailed by large proportions of the opinions (intersection). We will develop entailment-based methods for clustering, and for generating statements from bag-of-entailment-vector representations. As well as exploiting the relevant available corpora, evaluation will involve an initial stage of data collection, annotation and analysis, and establishing new benchmark measures and results for opinion summarisation, using unsupervised and semi-supervised learning. A third aim is to develop the above models even for languages where we do not have the data. Task 3 will develop multi-lingual and cross-lingual models of textual entailment and opinion summarisation, through multi-task learning with machine translation. This work will extend attention-based neural machine translation models to induce shared representations between multi-lingual neural machine translation and the above models of textual entailment and opinion summarisation. Evaluation will be done on the same data as above, where some statements have been translated. This project will result in both fundamental contributions to machine learning and natural language understanding, and enabling technology for large-scale opinion summarisation.
We anticipate related projects that turn these advances into deployed tools for large group communication, thereby advancing the tradition of direct democracy in which Switzerland is a world leader.
Idiap Research Institute
Swiss National Science Foundation
Nov 01, 2018
Oct 31, 2022