Graphs, Computation, and Language (GCL) is a series of tutorials by Dmitry Ustalov dedicated to human-computer interaction systems, graph theory, and computational linguistics. The tutorials discuss their real-world applications and elaborately go through the corresponding algorithms step-by-step. The slides are licensed under a CC BY-NC-SA 4.0 license and are available via the DOI link below.

GCL was offered in person at the 33rd European Summer School in Logic, Language and Information (ESSLLI 2022)!

Graphs, Computation, and Language (doi:10.5281/zenodo.6667766)

Language Phenomena and Graphs

This lecture will revise the basic concepts from graph theory and computer science to define the graph as a mathematical object and as a data structure for the rest of the course. We will present essential graph algorithms, such as breadth- and depth-first search, shortest path, random walks, etc. Then, we will focus on statistical properties of language networks, such as degree distribution and centrality, with examples from co-occurrence, syntactic, and semantic networks derived from language resources. We will present essential algorithms for link analysis, including PageRank and HITS, often used in NLP and Information Retrieval (IR). Finally, we will show successful applications of these algorithms for keyword extraction and text summarization.

Graph Clustering for Natural Language Processing

Graph clustering, as an unsupervised learning technique, captures the implicit structure of the data. This lecture will introduce the problem of graph clustering in two kinds: non-overlapping clustering, aka hard clustering or partitioning, and overlapping clustering, aka soft or fuzzy clustering. We will present linguistic grounding for such an unsupervised task and demonstrate case studies for such real-world tasks involving graph clustering as word sense induction, synset and semantic frame induction, and estimation of sense-aware word embeddings. We will describe and elaborate several efficient clustering algorithms widely used in NLP, including Chinese Whispers, Markov Clustering, and Louvain for hard clustering and MaxMax and Watset for soft clustering. We will show their strengths and weaknesses as well as essential implementation details and theoretical properties. Finally, we discuss important datasets obtained via graph clustering and popular datasets used in studying graph clustering.

Graph Embeddings for Natural Language Processing

Graphs hold information on the relationships between objects, which is helpful in various machine learning pipelines. This lecture will be devoted to efficient methods for learning latent numerical representations of graphs. We will focus on corpus-based graph embedding methods, DeepWalk and node2vec, that use random walks and Word2Vec to learn the node embeddings; we will pay special attention to explaining Word2Vec computation steps. We will describe other unsupervised techniques, including Laplacian Eigenmaps and StarSpace, and such semi-supervised and supervised approaches as Graph Convolutional Networks and Graph Attention Networks, that encode both local graph structure and features of nodes. Finally, we will discuss the available implementations of the described techniques and their performance in downstream NLP tasks, such as word similarity, text classification, and relation extraction.

Knowledge Graphs and Natural Language Processing

Knowledge graphs are powerful representations of such linguistic relations as hypernymy (is-a) and meronymy (has-a) that form a hierarchy. This lecture will review the approaches for representing and using knowledge of the real world in NLP applications. First, we will discuss how the knowledge can be gathered from text corpora using lexico-syntactic patterns, converted into a fully-fledged taxonomy, and embedded in a way that preserves the hierarchy of entities. Then, we will discuss the Semantic Web and Linked Data initiatives, showing examples of important open datasets and cross-referencing between them. We will proceed with learning embeddings of these multi-relational graphs using TransE (aka DistAdd) as an example. Finally, we will present several applications of knowledge graphs for word sense disambiguation and entity linking, retrofitting word embeddings, and question answering.

Evaluation in Natural Language Processing

This lecture will provide a broad overview of robust approaches for the quality assessment of NLP methods. First, we will discuss the need for ground truth data for evaluation and ways for fulfilling this need using expert assessment, gold and silver standard datasets, and crowdsourcing. Then, we will demonstrate the connection between the concepts from decision support systems and the IR evaluation methodology. We will present and provide grounding for accuracy, precision, recall, F-score, Matthews correlation coefficient evaluation scores, and other quality criteria for supervised classification problems. We will proceed to unsupervised clustering evaluation in which we will focus on the practical external quality criteria, such as pairwise scores and purity. We will discuss their variations and modifications. We will address the problem of ranked evaluation, for which we will cover the corresponding scores, including average precision, normalized discounted cumulative gain, expected reciprocal rank aka pFound, etc. Having demonstrated all these evaluation criteria, we will finally describe traditional and randomized statistical tests for them, and the inter-rater agreement evaluation approach with Krippendorff’s α.

Crowdsourcing for Language Resources and Evaluation

Not a part of the ESSLLI 2022 course.

Crowdsourcing is an efficient approach for knowledge acquisition and data annotation that enables gathering and evaluating large-scale linguistic datasets. This lecture will focus on the practical use of human-assisted computation for language resource construction and evaluation. We will analyze three established approaches for crowdsourcing in NLP. First, we will consider case studies of Wikipedia and Wiktionary that facilitate the community effort using automatic quality control via content assessment and edit patrolling. Second, we will dive deep into microtask-based crowdsourcing using IR evaluation, reCAPTCHA, and Mechanical Turk as examples. We will discuss task design and decomposition issues and then thoroughly describe popular techniques for computational quality control: Majority Vote, Dawid-Skene, and Bradley-Terry. Third, we will study the case of various games with a purpose, including ESP Game, Infection Game for BabelNet, and OpenCorpora gamification. Finally, we will provide recommendations for ensuring the high quality of the crowdsourced annotation and show relevant datasets for further studies.