All the downloadable datasets included in this notebook are subject to the following licence: Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0). If you reuse any of them, please cite the dataset in connection with this notebook.
Computational linguistics offers promising tools for tracking language change in diachronic corpora. These tools exploit distributional semantic models, both old and new. DSMs tend to perform well at the level of lexical semantics but are more difficult to fine-tune when it comes to capturing grammatical meaning.
I present ways in which the above can be improved. I start from well-trodden methodological paths implemented in diachronic construction grammar: changes in the collocational patterns of a linguistic unit reflect changes in meaning/function; distributional word representations can be supplemented with frequency-based methods. I move on to show that when meaning is apprehended with predictive models (e.g. word2vec), one can trace semantic shifts with greater explanatory power than with count models. Although this idea may sound outdated from the perspective of NLP, it goes a long way from the viewpoint of theory-informed corpus linguistics.
I illustrate the above with several case studies, one of which involves complex locative prepositions in the Corpus of Historical American English. I conclude my talk by defending the idea that NLP, with its focus on computational efficiency, and corpus linguistics, with its focus on tools that maximize data inspection, have much to gain from getting closer.
Because of the topics covered by this seminar, I have decided to focus my talk on methodological issues and share some thoughts on my practice as a corpus linguist with an NLP leaning. This is why I have decided to illustrate my talk with a notebook rather than slides.
Distributional semantic models (henceforth DSMs) are computational implementations of the distributional hypothesis: words that occur in similar contexts tend to have similar meanings (Harris 1954; Firth 1957; Miller and Charles 1991).
Initially developed in the field of cognitive psychology to model memory acquisition (Landauer and Dumais 1997; Lund and Burgess 1996), DSMs have been used extensively in NLP in the wake of Turney and Pantel (2010).
DSM is a cover term for a great number of methodologically related yet distinct approaches. Three features shape the kind of distributional semantic modeling that you do:
word2vec) vs. token (BERT)
In any case, DSMs embrace the ‘Bag-of-Words’ approach. Semantic modeling works well at the lexical level but not so much at more complex levels. We shall see why briefly, and I will propose a workaround based on previous work. The approach I propose taps into the computational force of NLP and the methodological intuitions of corpus linguistics. It combines the assets of collocational analysis and DSMs.
Another related issue has to do with diachronic linguistics. In quite general terms, diachronic linguistics is the study of language change. The methodological implications are not that simple because doing diachronic linguistics depends on the linguist's theory of language and on what the linguist considers to be a relevant linguistic unit for the study of change (morphemes? lexemes? syntactic patterns? etc.).
I will address these issues in a hybrid manner, combining theoretical reflections with practice (in the form of R code).
After introducing the foundations of DSMs, I will move on to a review of their applications in diachrony, comparing one NLP approach to corpus-linguistic approaches. I will present two case studies: the split infinitive and the internal-location construction.
I argue for a double requirement: maximizing the quality of the vector representation, and respecting the nature of the linguistic unit.
DSMs are used to produce semantic representations of words from co-occurrence matrices, i.e. tables of co-occurring words, with target words as rows, and their neighbors as columns.
Originally, a co-occurrence matrix is populated with frequency counts (how many times the target word and its neighbors co-occur) and each row is an array of such frequencies, also known as a vector. The semantic representation produced by DSMs is therefore numeric.
Semantic similarities are apprehended in terms of proximities and distances between word vectors.
source: “Word embeddings: the (very) basics,” in Around the Word, 25/04/2018
Suppose we have a mini corpus with 7 words: bee, eagle, goose, helicopter, drone, rocket, and jet.
These words are found in 3 contexts: wings, engine, and sky.
Each word is characterized by 3 coordinates which correspond to the number of times the word is found in each context. For example, helicopter is not found in the wings context and it occurs twice and four times in the contexts engine and sky, respectively. Its coordinates are therefore (0,2,4).
It is customary to collect all coordinates in a matrix such as the one below.
> m <- matrix(c(3,0,2,3,0,3,2,0,4,0,2,4,0,3,3,0,4,2,1,1,1), nrow=7, ncol=3, byrow=T)
> rownames(m) <- c("bee", "eagle", "goose", "helicopter", "drone", "rocket", "jet")
> colnames(m) <- c("wings", "engine", "sky")
> m
wings engine sky
bee 3 0 2
eagle 3 0 3
goose 2 0 4
helicopter 0 2 4
drone 0 3 3
rocket 0 4 2
jet 1 1 1
Each row is a vector. The vectors contained in the matrix are said to be explicit because each dimension corresponds to a well-identified context.
Most of the time, matrices of explicit vectors contain many “empty” cells, i.e. cells whose value is zero. These are known as sparse matrices.
The toy matrix is deliberately simple as each vector is three-dimensional. In the real world, the matrix can easily reach several thousand rows and columns, depending on the size of the corpus.
Each word occupies a specific position in the vector space, as represented in Fig. 2.1.
Figure 2.1: A vector space of 7 words in 3 contexts
The word vector is the arrow from the point where all three axes intersect to the end point defined by the coordinates.
The presupposition underlying word embeddings is that semantic similarities are indexed on contextual affinities. For example, helicopter and drone are close because they occur in similar contexts, have similar vector profiles, and are therefore close in the vector space.
Although this results in a simplistic view of meaning, a nice consequence is that vector coordinates can be used to calculate the proximities between words. This is done with cosine similarity (\(cos~\theta\)), i.e. the cosine of the angle between two word vectors (Fig. 2.2).
Figure 2.2: Cosine similarities between ‘helicopter’ and ‘drone,’ and between ‘drone’ and ‘rocket’
Let us see briefly how cosine similarity is measured. Let \(\vec{a}\) and \(\vec{b}\) denote two vectors. Cosine similarity between \(\vec{a}\) and \(\vec{b}\) is calculated as follows:
\[ cos~\theta = \frac{\vec{a}\cdot\vec{b}}{\|\vec{a}\|\|\vec{b}\|} \]
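The formula can be checked by hand on two of the toy vectors. The sketch below uses base R only; `cosine_sim()` is a helper name introduced here for illustration, not a function from any package.

```r
# Coordinates of two words, copied from the toy matrix
helicopter <- c(wings = 0, engine = 2, sky = 4)
drone      <- c(wings = 0, engine = 3, sky = 3)

# cosine_sim() is a hypothetical helper: dot product divided by
# the product of the two vector lengths (Euclidean norms)
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# 18 / (sqrt(20) * sqrt(18)), which rounds to 0.95
cosine_sim(helicopter, drone)
```

The dot product is 0×0 + 2×3 + 4×3 = 18, and the two norms are √20 and √18, which gives a similarity of about 0.95.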
#install.packages("lsa")
library(lsa)
Loading required package: SnowballC
cos <- round(cosine(t(m)), 2)
cos
bee eagle goose helicopter drone rocket jet
bee 1.00 0.98 0.87 0.50 0.39 0.25 0.80
eagle 0.98 1.00 0.95 0.63 0.50 0.32 0.82
goose 0.87 0.95 1.00 0.80 0.63 0.40 0.77
helicopter 0.50 0.63 0.80 1.00 0.95 0.80 0.77
drone 0.39 0.50 0.63 0.95 1.00 0.95 0.82
rocket 0.25 0.32 0.40 0.80 0.95 1.00 0.77
jet 0.80 0.82 0.77 0.77 0.82 0.77 1.00
Theoretically, similarity scores range from \(-1\) (complete opposition) to \(1\) (identity). A score of \(0\) indicates orthogonality (decorrelation). Values in between indicate intermediate degrees of similarity (between \(0\) and \(1\)) or dissimilarity (between \(0\) and \(-1\)). Here, the cosine similarities range from \(0\) to \(1\), since the word frequencies are not negative. The angle between two word-vectors is not greater than \(90°\).
Because the matrix is symmetric, it is divided into two parts (two triangles) on either side of the diagonal of exact similarity (i.e. \(cos~\theta = 1\)) between the same words.
The largest dissimilarity is observed between bee and rocket (\(cos~\theta = 0.25\)). The largest similarity is observed between bee and eagle (\(cos~\theta = 0.98\)).
There are other (dis)similarity metrics. One of them is Euclidean distance:
dist.object <- dist(m, method="euclidean", diag=T, upper=T)
dist.matrix <- as.matrix(dist.object)
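To make the metric concrete, here is a minimal sketch that computes one Euclidean distance by hand and compares it with the value returned by `dist()`. The helper name `euclid()` is an assumption made for this illustration; the matrix is the toy matrix from above, rebuilt so that the block runs on its own.

```r
# Rebuild the toy matrix so that the example is self-contained
m <- matrix(c(3,0,2, 3,0,3, 2,0,4, 0,2,4, 0,3,3, 0,4,2, 1,1,1),
            nrow = 7, ncol = 3, byrow = TRUE,
            dimnames = list(c("bee", "eagle", "goose", "helicopter",
                              "drone", "rocket", "jet"),
                            c("wings", "engine", "sky")))

# euclid() is a hypothetical helper: square root of the summed
# squared differences between coordinates
euclid <- function(a, b) sqrt(sum((a - b)^2))

by_hand   <- euclid(m["helicopter", ], m["drone", ])  # sqrt(0 + 1 + 1)
from_dist <- as.matrix(dist(m, method = "euclidean"))["helicopter", "drone"]

all.equal(by_hand, from_dist)
```

Unlike cosine similarity, Euclidean distance is sensitive to vector length: two words with the same profile but different overall frequencies have a cosine of 1 and a non-zero distance.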
We can represent the above graphically with a method known as Multidimensional Scaling (MDS). MDS is very popular because it is relatively old, versatile, and easy to understand and implement. It is a multivariate data analysis approach that is used to visualize distances in multidimensional maps (in general: two-dimensional plots).
mds <- cmdscale(dist.matrix,eig=TRUE, k=2)
x <- mds$points[,1]
y <- mds$points[,2]
plot(x, y, xlab="Dim.1", ylab="Dim.2", type="n")
text(x, y, labels = row.names(m), cex=.7)
Of course, in their natural environment, word meanings do not let themselves be captured so easily:
We must apply some kind of weighting to enhance the contribution of the most revealing collocates. A weight is added to a collocate when its association with the target word is statistically significant.
Common weighting measures:
More info on this: an online tutorial by Andreas Niekler & Gregor Wiedemann
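As an illustration of what weighting does to raw counts, the sketch below applies positive pointwise mutual information (PPMI), a measure that comes up again later in this notebook, to the toy matrix. It is a minimal base-R version written for this example, assuming PPMI as the weighting scheme. PPMI is an association score rather than a significance test.

```r
# Toy co-occurrence matrix, rebuilt so that the block is self-contained
m <- matrix(c(3,0,2, 3,0,3, 2,0,4, 0,2,4, 0,3,3, 0,4,2, 1,1,1),
            nrow = 7, ncol = 3, byrow = TRUE,
            dimnames = list(c("bee", "eagle", "goose", "helicopter",
                              "drone", "rocket", "jet"),
                            c("wings", "engine", "sky")))

total    <- sum(m)
p_joint  <- m / total                                   # observed P(word, context)
p_expect <- outer(rowSums(m) / total, colSums(m) / total)  # P(word) * P(context)

# PMI: log ratio of observed to expected co-occurrence probability
pmi <- log2(p_joint / p_expect)

# PPMI: negative scores (and -Inf for zero counts) are set to 0
ppmi <- pmi
ppmi[ppmi < 0] <- 0

round(ppmi, 2)
```

Cells where a word occurs with a context more often than chance would predict keep a positive weight; all other cells are set to zero.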
A matrix generates as many dimensions as it has columns. To summarize a matrix, we need a method to reduce the number of dimensions to a few. The retained dimensions are meaningful and can be mapped onto a Euclidean space for easy visual inspection.
Several methods exist:
See Chap. 10 of Desagulier (2017).
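One such method, singular value decomposition (SVD), comes up again later in this notebook. The sketch below uses base R's `svd()` on the toy matrix and keeps the first two dimensions. The choice of `k = 2` and the absence of any prior weighting or centering are simplifying assumptions made for the example.

```r
# Toy co-occurrence matrix, rebuilt so that the block is self-contained
m <- matrix(c(3,0,2, 3,0,3, 2,0,4, 0,2,4, 0,3,3, 0,4,2, 1,1,1),
            nrow = 7, ncol = 3, byrow = TRUE,
            dimnames = list(c("bee", "eagle", "goose", "helicopter",
                              "drone", "rocket", "jet"),
                            c("wings", "engine", "sky")))

s <- svd(m)   # m = U D V', singular values in decreasing order
k <- 2        # number of dimensions to keep (an assumption for this example)

# Word coordinates in the reduced space: first k columns of U, scaled by D
reduced <- s$u[, 1:k] %*% diag(s$d[1:k])
rownames(reduced) <- rownames(m)

# Share of the total sum of squares carried by each dimension
round(s$d^2 / sum(s$d^2), 2)
reduced
```

The rows of `reduced` are short, dense vectors: their dimensions no longer correspond to identifiable contexts, which is what distinguishes them from the explicit vectors above.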
DSMs have been applied to the study of diachrony in NLP (Sagi, Kaufmann, and Clark 2009, 2011) and, more recently, corpus-based cognitive semantics and construction grammar.
Below, I illustrate two kinds of models (count vs. predictive) based on a couple of inspiring papers in NLP and corpus-based construction grammar.
DSMs that rely on count models, i.e. models whose vectors are generated from co-occurrence counts, are common in corpus linguistics. Such vectors are:
Hilpert observes that the modal may maintains a complex network of associations with the lexical verbs that it governs and that it has shifted away from the expression of deontic modal meanings towards epistemic meanings and a higher degree of informativeness (Fig. 3.1).
Figure 3.1: A reference SVS of the 250 most frequent verbal collocates of may (left) and density changes in the SVS of may
Perek finds that the three senses of the way-construction (path-creation, manner, and incidental-action) have gained in semantic diversity. More precisely, the schematicity of the verb slot or the motion component contributed by the construction has increased, alongside its productivity.
Predictive models are inspired by neural language models (Collobert et al. 2011). Instead of counting how often a collocate \(c\) occurs near a target word \(w\), predictive models estimate the probability of finding \(c\) near \(w\). The resulting vectors are:
word2vec (Mikolov, Yih, and Zweig 2013)
CBOW (Continuous Bag-of-Words) predicts a word given its context. It has been shown to outperform count models on a variety of Natural Language Processing tasks such as semantic relatedness, synonymy detection, selectional preferences, and analogy (Baroni, Dinu, and Kruszewski 2014). Levy, Goldberg, and Dagan (2015) warn that Baroni et al.’s comparison is unfair and observe that PPMI and SVD perform equally well if fine-tuned with an ad hoc combination of hyperparameters (context window, subsampling, deletion of rare words, negative sampling, context distribution smoothing, etc.).
Hamilton, Leskovec, and Jurafsky (2016) show that Skip-Gram with Negative Sampling (SGNS) (Mikolov, Yih, and Zweig 2013), the alternative model of the word2vec toolkit, outperforms PPMI and SVD in the discovery of new shifts and the visualization of changes.
SGNS predicts a word’s context given the word itself, a task that is complementary to the one addressed by CBOW.
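The difference between the two prediction tasks can be made concrete by listing the training examples each one draws from the same text. The sketch below only builds those examples; it does not train any model. The sentence, the window size, and the object names (`sg_pairs`, `cbow_inputs`) are all made up for illustration.

```r
# Hypothetical toy sentence, not taken from any corpus
tokens <- c("the", "drone", "flew", "over", "the", "field")
window <- 2  # assumed symmetric context window

# Positions of the context words around position i
context_idx <- function(i) {
  setdiff(max(1, i - window):min(length(tokens), i + window), i)
}

# Skip-gram view: each target word is paired with every word in its window,
# and the model learns to predict the context word from the target
sg_pairs <- do.call(rbind, lapply(seq_along(tokens), function(i) {
  data.frame(target = tokens[i], context = tokens[context_idx(i)],
             stringsAsFactors = FALSE)
}))
head(sg_pairs)

# CBOW view: the same windows, but the context words jointly predict the target
cbow_inputs <- lapply(seq_along(tokens), function(i) {
  list(context = tokens[context_idx(i)], target = tokens[i])
})
cbow_inputs[[3]]
```

For the third token, CBOW receives the four surrounding words as input and predicts flew, whereas skip-gram receives flew and is trained on four separate (target, context) pairs.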
One feature of SGNS that is of particular interest to usage-based linguists is that each word \(W_i\) is represented by two short, dense vectors: a word vector \(w_i\) and a context vector \(c_i\). The final vector of a given word can either be the word vector (\(W_i = w_i\)) or the sum of the two (\(W_i = w_i + c_i\)).
HistWords is a collection of tools and datasets for Python. Its goal is to quantify semantic change by evaluating word embeddings (PPMI, SVD, word2vec).
Hamilton, Leskovec, and Jurafsky (2016) use the word vectors made with HistWords to study the semantic evolution of more than 30,000 words across 4 languages. They claim their results illustrate two statistical laws that govern the evolution of word meaning:
Law of conformity: words that are used more frequently change less and have meanings that are more stable over time.
Law of innovation: words that are polysemous change at faster rates.
This seems to work well when applied to conveniently selected lexemes (Fig. 3.2).
Figure 3.2: 2D visualizations of semantic change in the COHA using SGNS vectors
How is such a clear visualization obtained?
To compare word vectors from different periods, the vectors must be aligned to the same coordinate axes.
Explicit vectors (such as those obtained with PPMI) are naturally aligned, as each column simply corresponds to a context word.
Implicit vectors (such as those obtained with SVD or SGNS) are not naturally aligned. For example, SGNS vectors are obtained stochastically: each run of SGNS results in an arbitrary orthogonal transformation of the vector space. Although this does not affect pairwise cosine similarities within years, you cannot directly compare the vectors of the same word across time.
To solve this, Hamilton, Leskovec, and Jurafsky (2016) use orthogonal Procrustes to align the embeddings. Procrustes is a mythological bandit from Ancient Greece who attacked people by stretching them or cutting off their legs, so as to force them to fit the size of an iron bed (Fig. 3.3).
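The alignment step can be sketched in a few lines of base R. This is not the HistWords code (which is written in Python) but a minimal illustration under strong assumptions: `A` and `B` stand in for the embeddings of the same seven words in two periods, and `B` is simply `A` seen through a random orthogonal transformation, so a perfect alignment exists. The names `A`, `B`, `Q`, and `cos_mat()` are introduced for this example only.

```r
set.seed(1)

# Stand-in for period-1 embeddings (7 words, 3 dimensions)
A <- matrix(c(3,0,2, 3,0,3, 2,0,4, 0,2,4, 0,3,3, 0,4,2, 1,1,1),
            nrow = 7, ncol = 3, byrow = TRUE)

# Stand-in for period-2 embeddings: same geometry, arbitrary axes
rot <- qr.Q(qr(matrix(rnorm(9), 3, 3)))  # random orthogonal matrix
B   <- A %*% rot

# Pairwise cosine similarities between the rows of a matrix
cos_mat <- function(X) {
  Xn <- X / sqrt(rowSums(X^2))
  Xn %*% t(Xn)
}

# Within-period cosines are unaffected by the transformation...
all.equal(cos_mat(A), cos_mat(B))
# ...but the raw coordinates of the same word differ across periods
max(abs(A - B))

# Orthogonal Procrustes: find the orthogonal Q that minimises ||B Q - A||
s <- svd(t(B) %*% A)
Q <- s$u %*% t(s$v)

B_aligned <- B %*% Q
max(abs(B_aligned - A))  # close to zero after alignment
```

With real data the two periods differ by more than a rotation, so the residual after alignment is not zero. That residual is what carries information about semantic change.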