Embeddings for texts, graphs, and relations
Abstract:
Currently, the most successful machine learning methods are numeric, e.g., deep neural networks or SVMs. On the other hand, many important real-world problems involve symbolic representations, e.g., graphs, relations, texts, or electronic health records. If we are to harness the power of successful numeric deep learning approaches for these learning problems, the symbolic data has to be embedded into a numeric vector form suitable for numeric algorithms. The embeddings should preserve the information contained in the original data, in the form of similarities and relations, by encoding it into distances and directions in the numeric space. For example, in graphs, nodes that represent similar entities or that are connected to similar other nodes should have similar numerical representations.
In the tutorial, we are going to present embeddings of unstructured data, such as texts, graphs, and relations. We will use text to introduce the main ideas exploited in successful embeddings: transfer learning and unsupervised approaches. More specifically, we will cover matrix-factorization-based LSA and language-model-based word2vec. As these embeddings do not handle the ambiguity of language well, we will present modern contextual embeddings such as ELMo and BERT. For graphs, we will first present random-walk-based embeddings such as node2vec and HINMINE, and also touch on recent graph convolutional networks.
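To make the matrix-factorization idea behind LSA concrete, here is a minimal sketch in Python with NumPy. The toy term-document counts, the vocabulary, and the helper names lsa_embeddings and cosine are invented for illustration and are not taken from the tutorial; a real application would typically use a large, weighted (e.g., tf-idf) matrix.

```python
import numpy as np

# Toy term-document count matrix (rows: words, columns: documents).
# Vocabulary and counts are made up for illustration only.
vocab = ["graph", "node", "edge", "word", "sentence", "text"]
counts = np.array([
    [2, 1, 0, 0],   # graph
    [1, 2, 0, 0],   # node
    [1, 1, 0, 0],   # edge
    [0, 0, 3, 1],   # word
    [0, 0, 1, 3],   # sentence
    [0, 0, 2, 2],   # text
], dtype=float)


def lsa_embeddings(matrix, k):
    """Return k-dimensional row (word) embeddings from a truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular directions and scale them by their weights.
    return u[:, :k] * s[:k]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


emb = lsa_embeddings(counts, k=2)
word_vec = dict(zip(vocab, emb))

# Words that occur in the same documents should end up closer together
# than words that never co-occur.
sim_related = cosine(word_vec["graph"], word_vec["node"])
sim_unrelated = cosine(word_vec["graph"], word_vec["word"])
print(f"graph~node: {sim_related:.3f}, graph~word: {sim_unrelated:.3f}")
```

In this sketch, similarity in the original data (shared documents) is turned into direction in the embedding space, which is the property the abstract asks embeddings to preserve.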
The most general form of embeddings can use any similarity-based function to embed different entities. We will describe the idea behind the StarSpace embedding technique and show how to adapt it for relations.
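As a rough illustration of similarity-based embedding of relations, the sketch below trains entity and relation vectors with a margin ranking loss, in the spirit of StarSpace. It is not the StarSpace implementation and not necessarily the adaptation presented in the tutorial: representing a query as the sum of head and relation vectors, the dot-product similarity, the toy triples, and all hyperparameters are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy knowledge base; all names are invented for illustration.
entities = ["paris", "france", "berlin", "germany"]
relations = ["capital_of"]
triples = [("paris", "capital_of", "france"),
           ("berlin", "capital_of", "germany")]

# Assumed hyperparameters for this sketch.
dim, margin, lr, epochs = 8, 1.0, 0.1, 200
ent = {e: rng.normal(scale=0.1, size=dim) for e in entities}
rel = {r: rng.normal(scale=0.1, size=dim) for r in relations}


def score(h, r, t):
    """Dot-product similarity between the (head + relation) query and a tail."""
    return float((ent[h] + rel[r]) @ ent[t])


def train_epoch():
    """One SGD pass over all (triple, negative tail) pairs.

    Returns the summed hinge loss observed during the pass.
    """
    total = 0.0
    for h, r, t in triples:
        for neg in entities:
            if neg in (h, t):
                continue  # only use other entities as corrupted tails
            loss = margin - score(h, r, t) + score(h, r, neg)
            if loss <= 0.0:
                continue  # margin already satisfied, no update needed
            total += loss
            # Gradients of the hinge loss, computed before any update.
            query = ent[h] + rel[r]
            grad_query = ent[neg] - ent[t]
            ent[h] -= lr * grad_query
            rel[r] -= lr * grad_query
            ent[t] += lr * query      # pull the true tail toward the query
            ent[neg] -= lr * query    # push the corrupted tail away
    return total


losses = [train_epoch() for _ in range(epochs)]
print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

After training, calling score("paris", "capital_of", "france") and comparing it with the score of a corrupted tail such as "germany" shows whether the learned similarity ranks the true relation higher. The same loss works for any pair of entity types once a similarity function between them is chosen.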
Biographical note:
Marko Robnik-Sikonja is Professor of Computer Science and Informatics and Head of the Artificial Intelligence Chair at the University of Ljubljana, Faculty of Computer and Information Science. His research interests span machine learning, data mining, natural language processing, network analytics, and applications of data science techniques. He is (co)author of over 150 scientific publications that have been cited more than 4,500 times. He is the author and maintainer of three open-source R data mining packages.