All the downloadable datasets included in this notebook are subject to the following licence: Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0). If you reuse any of them, please cite the dataset in connection with this notebook.

Computational linguistics offers promising tools for tracking language change in diachronic corpora. These tools exploit distributional semantic models, both old and new. DSMs tend to perform well at the level of lexical semantics but are more difficult to fine-tune when it comes to capturing grammatical meaning.

I present ways in which the above can be improved. I start from well-trodden methodological paths implemented in diachronic construction grammar: changes in the collocational patterns of a linguistic unit reflect changes in meaning/function; distributional word representations can be supplemented with frequency-based methods. I move on to show that when meaning is apprehended with predictive models (e.g. word2vec), one can trace semantic shifts with greater explanatory power than with count models. Although this idea may sound outdated from the perspective of NLP, it actually goes a long way from the viewpoint of theory-informed corpus linguistics.

I illustrate the above with several case studies, one of which involves complex locative prepositions in the Corpus of Historical American English. I conclude my talk by defending the idea that NLP, with its focus on computational efficiency, and corpus linguistics, with its focus on tools that maximize data inspection, have much to gain from getting closer.

1 Introduction

Because of the topics covered by this seminar, I have decided to focus my talk on methodological issues and share some thoughts on my practice as a corpus linguist with an NLP leaning. This is why I have decided to illustrate my talk with a notebook rather than slides.

Distributional semantic models (henceforth DSMs) are computational implementations of the distributional hypothesis: words that occur in similar contexts tend to have similar meanings (Harris 1954; Firth 1957; Miller and Charles 1991).

Initially developed in the field of cognitive psychology to model memory acquisition (Landauer and Dumais 1997; Lund and Burgess 1996), DSMs have been used extensively in NLP in the wake of Turney and Pantel (2010).

DSM is a cover term for a great number of methodologically related yet distinct approaches. Three features shape the kind of distributional semantic modeling that you do:

context types: document-based (e.g. LSA) vs. word-based
level of analysis: type (e.g. word2vec) vs. token (BERT)
computational representation: explicit vectors (e.g. count-based models) vs. implicit vectors (e.g. predictive models).

In any case, DSMs embrace the ‘Bag-of-Words’ approach. As a consequence, semantic modeling works well at the lexical level but less well at more complex levels. We shall briefly see why, and I will propose a workaround based on previous works. The approach I propose taps into the computational force of NLP and the methodological intuitions of corpus linguistics. It combines the assets of collocational analysis and DSMs.

Another related issue has to do with diachronic linguistics. In quite general terms, diachronic linguistics is the study of language change. The methodological implications are not that simple because doing diachronic linguistics depends on your theory of language and on what you consider to be a relevant linguistic unit for the study of change (morphemes? lexemes? syntactic patterns? etc.).

I will address these issues in a hybrid manner, i.e. via a combination of theoretical reflections and practice (with R code).

After introducing the foundations of DSMs, I will move on to a review of their applications in diachrony, comparing one NLP approach to corpus-linguistic approaches. I will present two case studies: the split infinitive and the internal-location construction.

I argue for a double requirement: maximizing the quality of the vector representation, and respecting the nature of the linguistic unit.

2 DSMs 101

DSMs are used to produce semantic representations of words from co-occurrence matrices, i.e. tables of co-occurring words, with target words as rows, and their neighbors as columns.

Originally, a co-occurrence matrix is populated with frequency counts (how many times the target word and its neighbors co-occur) and each row is an array of such frequencies, also known as a vector. The semantic representation produced by DSMs is therefore numeric.
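To make this concrete, here is a minimal sketch of how such a matrix can be populated from a tokenized text. The toy token sequence, the $\pm 2$ window, and the object names (`tokens`, `window`, `cooc`) are illustrative assumptions of mine; a real pipeline would also filter function words and run on a much larger corpus.

```r
# A toy, already tokenized text (hypothetical example)
tokens <- c("the", "drone", "flew", "in", "the", "sky",
            "the", "eagle", "flew", "in", "the", "sky")
window <- 2  # number of neighbors considered on each side of the target

# Square matrix: target words as rows, neighbors as columns
vocab <- sort(unique(tokens))
cooc <- matrix(0, nrow = length(vocab), ncol = length(vocab),
               dimnames = list(vocab, vocab))

for (i in seq_along(tokens)) {
  # positions inside the window, excluding the target position itself
  neighbors <- setdiff(max(1, i - window):min(length(tokens), i + window), i)
  for (j in neighbors) {
    cooc[tokens[i], tokens[j]] <- cooc[tokens[i], tokens[j]] + 1
  }
}

cooc["drone", ]  # the row vector of 'drone'
```

Each row of `cooc` is the frequency vector of one target word. Because the window is symmetric, the matrix is symmetric as well.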

Semantic similarities are apprehended in terms of proximities and distances between word vectors.

2.1 A (too) simple example

source: “Word embeddings: the (very) basics,” in Around the Word, 25/04/2018

Suppose we have a mini corpus with 7 words:

bee,
eagle,
goose,
helicopter,
drone,
rocket, and
jet.

These words are found in 3 contexts:

wings,
engine, and
sky.

Each word is characterized by 3 coordinates which correspond to the number of times the word is found in each context. For example, helicopter is not found in the wings context and it occurs twice and four times in the contexts engine and sky, respectively. Its coordinates are therefore (0,2,4).

It is customary to collect all coordinates in a matrix such as the one below.

> m <- matrix(c(3,0,2,3,0,3,2,0,4,0,2,4,0,3,3,0,4,2,1,1,1), nrow=7, ncol=3, byrow=T)
> rownames(m) <- c("bee", "eagle", "goose", "helicopter", "drone", "rocket", "jet")
> colnames(m) <- c("wings", "engine", "sky")
> m

           wings engine sky
bee            3      0   2
eagle          3      0   3
goose          2      0   4
helicopter     0      2   4
drone          0      3   3
rocket         0      4   2
jet            1      1   1

Each line is a vector. The vectors contained in the matrix are said to be explicit because each dimension corresponds to a well-identified context.

Most of the time, matrices of explicit vectors contain many “empty” cells, i.e. cells whose value is zero. These are known as sparse matrices.

The toy matrix is deliberately simple as each vector is three-dimensional. In the real world, the matrix can easily reach several thousand rows and columns, depending on the size of the corpus.

Each word occupies a specific position in the vector space, as represented in Fig. 2.1.

Figure 2.1: A vector space of 7 words in 3 contexts

The word vector is the arrow from the point where all three axes intersect to the end point defined by the coordinates.

The presupposition underlying word embeddings is that semantic similarities are indexed on contextual affinities. For example, helicopter and drone are close because they occur in similar contexts, have similar vector profiles, and are therefore close in the vector space.

2.1.1 Similarity

Although this results in a simplistic view of meaning, a nice consequence is that vector coordinates can be used to calculate the proximities between words. This is done with cosine similarity ($cos~\theta$), i.e. the cosine of the angle between two word vectors (Fig. 2.2).

Figure 2.2: Cosine similarities between ‘helicopter’ and ‘drone,’ and between ‘drone’ and ‘rocket’

Let us see briefly how cosine similarity is measured. Let $\vec{a}$ and $\vec{b}$ denote two vectors. Cosine similarity between $\vec{a}$ and $\vec{b}$ is calculated as follows:

\[ cos~\theta = \frac{\vec{a}\cdot\vec{b}}{\|\vec{a}\|\|\vec{b}\|} \]
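Before turning to a dedicated package, the formula can be checked by hand. The following base-R sketch applies it to the coordinates of helicopter and drone from the toy matrix; the helper name `cosine_sim` is my own and not part of any package.

```r
# Coordinates taken from the toy matrix (wings, engine, sky)
helicopter <- c(0, 2, 4)
drone      <- c(0, 3, 3)

# Dot product divided by the product of the two vector lengths
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

round(cosine_sim(helicopter, drone), 2)
```

The dot product is $0 \times 0 + 2 \times 3 + 4 \times 3 = 18$ and the product of the norms is $\sqrt{20} \times \sqrt{18} \approx 18.97$, which gives a cosine of about $0.95$.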

#install.packages("lsa")
library(lsa)

Loading required package: SnowballC

cos <- round(cosine(t(m)), 2)
cos

            bee eagle goose helicopter drone rocket  jet
bee        1.00  0.98  0.87       0.50  0.39   0.25 0.80
eagle      0.98  1.00  0.95       0.63  0.50   0.32 0.82
goose      0.87  0.95  1.00       0.80  0.63   0.40 0.77
helicopter 0.50  0.63  0.80       1.00  0.95   0.80 0.77
drone      0.39  0.50  0.63       0.95  1.00   0.95 0.82
rocket     0.25  0.32  0.40       0.80  0.95   1.00 0.77
jet        0.80  0.82  0.77       0.77  0.82   0.77 1.00

Theoretically, similarity scores range from $-1$ (complete opposition) to $1$ (identity). A score of $0$ indicates orthogonality (decorrelation). Values in between indicate intermediate degrees of similarity (between $0$ and $1$) or dissimilarity (between $0$ and $-1$). Here, the cosine similarities range from $0$ to $1$, since the word frequencies are not negative. The angle between two word-vectors is not greater than $90°$.

Because the matrix is symmetric, it is divided into two parts (two triangles) on either side of the diagonal of exact similarity (i.e. $cos~\theta = 1$) between the same words.

The largest dissimilarity is observed between bee and rocket ($cos~\theta = 0.25$). The largest similarity is observed between bee and eagle ($cos~\theta = 0.98$).

There are other (dis)similarity metrics. One of them is Euclidean distance:

dist.object <- dist(m, method="euclidean", diag=T, upper=T)
dist.matrix <- as.matrix(dist.object)

We can represent the above graphically with a method known as Multidimensional Scaling (MDS). MDS is very popular because it is relatively old, versatile, and easy to understand and implement. It is a multivariate data analysis approach that is used to visualize distances in multidimensional maps (in general: two-dimensional plots).

mds <- cmdscale(dist.matrix,eig=TRUE, k=2)
x <- mds$points[,1]
y <- mds$points[,2]
plot(x, y, xlab="Dim.1", ylab="Dim.2", type="n")
text(x, y, labels = row.names(m), cex=.7)

2.2 Weighting

Of course, in their natural environment, word meanings do not let themselves be captured so easily:

the most frequent words in a corpus do not bring much information at all (esp. grammatical words, which are often filtered out)
the most frequent collocates do not always bring much information about a given target word.

We must apply some kind of weighting to enhance the contribution of the most revealing collocates. A weight is added to a collocate when its association with the target word is statistically significant.

Common weighting measures:

Dice coefficient
Log-likelihood
Mutual Information
Pointwise mutual information
Positive pointwise mutual information.

More info on this: an online tutorial by Andreas Niekler & Gregor Wiedemann
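As an illustration, here is a minimal sketch of positive pointwise mutual information (PPMI) applied to the toy matrix from Section 2.1. The function name `ppmi` is hypothetical, and the choice of a base-2 logarithm is an assumption; dedicated packages may differ in such details.

```r
# Toy co-occurrence matrix from Section 2.1
m <- matrix(c(3,0,2, 3,0,3, 2,0,4, 0,2,4, 0,3,3, 0,4,2, 1,1,1),
            nrow = 7, ncol = 3, byrow = TRUE,
            dimnames = list(c("bee", "eagle", "goose", "helicopter",
                              "drone", "rocket", "jet"),
                            c("wings", "engine", "sky")))

ppmi <- function(counts) {
  total <- sum(counts)
  p_wc <- counts / total           # joint probability of word and context
  p_w  <- rowSums(counts) / total  # marginal probability of each word
  p_c  <- colSums(counts) / total  # marginal probability of each context
  pmi  <- log2(p_wc / outer(p_w, p_c))
  pmi[!is.finite(pmi)] <- 0        # unseen pairs yield -Inf: set them to 0
  pmi[pmi < 0] <- 0                # keep positive associations only
  pmi
}

m.ppmi <- ppmi(m)
round(m.ppmi, 2)
```

Pairs that co-occur more often than expected from the row and column totals receive a positive weight; all other cells are set to zero.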

2.3 Dimensionality reduction

A matrix generates as many dimensions as it has columns. To summarize a matrix, we need a method to reduce the number of dimensions to a few. These are meaningful and can be mapped onto a Euclidean space for easy visual inspection.

Several methods exist:

Multidimensional Scaling (Kruskal and Wish 1978; Venables and Ripley 2013)
Principal Component Analysis (Pearson 1901)
$t$-SNE (Maaten and Hinton 2008)
Singular Value Decomposition
…

See Chap. 10 of Desagulier (2017).
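By way of illustration, the sketch below runs a Principal Component Analysis on the toy matrix from Section 2.1 with base R's `prcomp()`. With only three columns the reduction is of course trivial; the object names (`pca`, `coords`) are mine, and centering without scaling is an assumption rather than a recommendation.

```r
# Toy co-occurrence matrix from Section 2.1
m <- matrix(c(3,0,2, 3,0,3, 2,0,4, 0,2,4, 0,3,3, 0,4,2, 1,1,1),
            nrow = 7, ncol = 3, byrow = TRUE,
            dimnames = list(c("bee", "eagle", "goose", "helicopter",
                              "drone", "rocket", "jet"),
                            c("wings", "engine", "sky")))

# PCA on the centered (but unscaled) frequencies
pca <- prcomp(m, center = TRUE, scale. = FALSE)

# Proportion of variance captured by each component
summary(pca)$importance["Proportion of Variance", ]

# Coordinates of the 7 words on the first two components
coords <- pca$x[, 1:2]
plot(coords, type = "n", xlab = "PC1", ylab = "PC2")
text(coords, labels = rownames(coords), cex = .7)
```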

3 DSMs and diachrony

DSMs have been applied to the study of diachrony in NLP (Sagi, Kaufmann, and Clark 2009, 2011) and, more recently, corpus-based cognitive semantics and construction grammar.

Below, I illustrate two kinds of models (count vs. predictive) based on a couple of inspiring papers in NLP and corpus-based construction grammar.

3.1 Count models

DSMs that rely on count models, i.e. models whose vectors are generated from co-occurrence counts, are common in corpus linguistics. Such vectors are:

long (with as many dimensions as there are collocates)
sparse (most of their cells are zeros).

3.1.1 Hilpert (2016)

goal: justify a constructional reading of English modals
how: builds a semantic vector space with the collocates of the most frequent verbs that occur with may in a 50-M word sample of the COCA (Davies 2008).
- the data are filtered and then arranged in a matrix in which the verb types are in the columns and their collocates in the rows
- the co-occurrence frequencies are weighted with PPMI
- the matrix is converted into a cosine distance matrix
- which is then transformed into a 2D semantic vector space with multidimensional scaling (MDS)
diachronic frequency information from the COHA is then projected onto the reference semantic vector space in the form of contour plots at regular intervals (1800s–1860s, 1870s–1920s, 1930s–1990s).

Hilpert observes that may entertains a complex network of associations with the lexical verbs that it governs and that it has shifted away from the expression of deontic modal meanings towards epistemic meanings and a higher degree of informativeness (Fig. 3.1).

Figure 3.1: A reference SVS of the 250 most frequent verbal collocates of may (left) and density changes in the SVS of may

3.1.2 Perek (2016)

goal: show that the distributional semantic approach to semantic similarity can be applied successfully to syntactic productivity in diachrony
how: assess the shifting structure of the semantic domain of the hell-construction (V the hell out of NP) at four points in time in the COHA (1930s–1940s, 1950s–1960s, 1970s–1980s, and 1990s–2000s).
- builds a reference vector space from the COCA
- the co-occurrence matrix is based on the 92 verbs that occur in the construction (in the rows) and their nominal, verbal, adjectival, and adverbial collocates (in the columns)
- the matrix is weighted with PMI and then submitted to hierarchical clustering, which highlights four consistent verb meanings (feelings and emotions, abstract actions, physical forceful actions, other physical actions).
- with MDS, the dimensionality of the matrix is reduced and the verbs are plotted onto the vector space based on how often they appear in the COHA for each period and colored according to their meaning
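The clustering step can be sketched on the toy matrix from Section 2.1. This is not Perek's code: the use of cosine distance ($1 - cos~\theta$), average linkage, and a two-cluster solution are assumptions made for the sake of a small, self-contained example, and the weighting step is skipped.

```r
# Toy co-occurrence matrix from Section 2.1
m <- matrix(c(3,0,2, 3,0,3, 2,0,4, 0,2,4, 0,3,3, 0,4,2, 1,1,1),
            nrow = 7, ncol = 3, byrow = TRUE,
            dimnames = list(c("bee", "eagle", "goose", "helicopter",
                              "drone", "rocket", "jet"),
                            c("wings", "engine", "sky")))

# Cosine similarity between all pairs of rows, in base R
norms   <- sqrt(rowSums(m^2))
cos.sim <- (m %*% t(m)) / outer(norms, norms)

# Hierarchical clustering on cosine distances
cos.dist <- as.dist(1 - cos.sim)
hc <- hclust(cos.dist, method = "average")
plot(hc, main = "", xlab = "", sub = "")

# Cut the dendrogram into two groups
clusters <- cutree(hc, k = 2)
clusters
```

On these toy data, the three animals end up in one group and the three engine-powered aircraft in the other.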

3.1.3 Perek (2018)

goal: investigate change in the way-construction between the 1830s and the 2000s.
how: the reference vector space is built with data from the COHA (Davies 2010).
- the co-occurrence matrix is populated with content words taken from a $\pm 2$ context window and weighted with PPMI.
- the dimensionality of the matrix is reduced with $t$-distributed Stochastic Neighbor Embedding, a.k.a. $t$-SNE (Maaten and Hinton 2008).

Perek finds that the three senses of the way-construction (path-creation, manner, and incidental-action) have gained in semantic diversity. More precisely, the schematicity of the verb slot or the motion component contributed by the construction has increased, alongside its productivity.

3.2 Predictive models

Predictive models are inspired by neural language models (Collobert et al. 2011). Instead of counting how often a collocate $c$ occurs near a target word $w$, predictive models estimate the probability of finding $c$ near $w$. The resulting vectors are

relatively low-dimensional (from 50 to 1000 dimensions, generally around 300)
dense (no cells with zeros).

3.2.1 `word2vec`

Mikolov, Yih, and Zweig (2013)

3.2.1.1 CBOW

CBOW predicts a word given its context. It has been shown to outperform count models on a variety of Natural Language Processing tasks such as semantic relatedness, synonymy detection, selectional preferences, and analogy (Baroni, Dinu, and Kruszewski 2014). Levy, Goldberg, and Dagan (2015) warn that Baroni et al.’s comparison is unfair and observe that PPMI and SVD perform equally well if fine-tuned with an ad-hoc combination of hyperparameters (context window, subsampling, deletion of rare words, negative sampling, context distribution smoothing, etc.).

3.2.1.2 SGNS

Hamilton, Leskovec, and Jurafsky (2016) show that Skip-Gram with Negative Sampling (SGNS) (Mikolov, Yih, and Zweig 2013), the alternative model of the word2vec toolkit, outperforms PPMI and SVD in the discovery of new shifts and the visualization of changes.

SGNS predicts a word’s context given the word itself, a task that is complementary to the one addressed by CBOW.

One feature of SGNS that is of particular interest to usage-based linguists is that each word $W_i$ is represented by two short, dense vectors: a word vector $w_i$ and a context vector $c_i$. The final vector of a given word can either be the word vector ($W_i = w_i$) or the sum of the two ($W_i = w_i + c_i$).

SGNS treats each instance of $(w,c)$, i.e. a target word $w$ and a neighboring context word $c$, as a positive example.
it obtains negative samples by randomly sampling other words in the vocabulary, thus corrupting $(w,c)$.
it uses logistic regression to train a classifier on a binary prediction task (“is $c$ likely to occur near $w$?”), which amounts to deciding which of the positive example and the negative example is more likely. It does so by:

maximizing the similarity of the $(w,c)$ pairs drawn from the positive data within a set context window, and
minimizing the similarity of the pairs drawn from the negative data.

The weights learned during the classification task are kept and used as the word vectors.
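The logic of the classifier can be sketched in a few lines of R. This is a didactic toy, not the word2vec implementation: the five-dimensional random vectors, the three negative samples, the learning rate, and all object names are illustrative assumptions. The loss is the standard negative-sampling objective for one positive pair, $-\log\sigma(w \cdot c) - \sum_{n}\log\sigma(-w \cdot n)$, where $n$ ranges over the sampled negative context vectors.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# Loss for one positive pair (w, c.pos) and a set of negative context vectors
sgns_loss <- function(w, c.pos, C.neg) {
  -log(sigmoid(sum(w * c.pos))) - sum(log(sigmoid(-(C.neg %*% w))))
}

set.seed(1)
d     <- 5                                        # toy dimensionality
w     <- rnorm(d, sd = 0.1)                       # word vector
c.pos <- rnorm(d, sd = 0.1)                       # observed context vector
C.neg <- matrix(rnorm(3 * d, sd = 0.1), nrow = 3) # 3 sampled negative contexts

loss.before <- sgns_loss(w, c.pos, C.neg)

# One gradient step on the word vector: pull w towards the positive context
# and push it away from the negative ones
grad.w <- -(1 - sigmoid(sum(w * c.pos))) * c.pos +
  as.vector(t(C.neg) %*% sigmoid(C.neg %*% w))
w.new <- w - 0.1 * grad.w

loss.after <- sgns_loss(w.new, c.pos, C.neg)
c(before = loss.before, after = loss.after)
```

Repeating such updates over all positive pairs in the corpus (and over the context vectors too) is what gradually shapes the word vectors.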

3.2.2 Hamilton, Leskovec, and Jurafsky (2016)

HistWords is a collection of tools and datasets for Python. Its goal is to quantify semantic change by evaluating word embeddings (PPMI, SVD, word2vec).

Hamilton, Leskovec, and Jurafsky (2016) use the word vectors made with HistWords to study the semantic evolution of more than 30,000 words across 4 languages. They claim their results illustrate two statistical laws that govern the evolution of word meaning:

Law of conformity: words that are used more frequently change less and have meanings that are more stable over time.

Law of innovation: words that are polysemous change at faster rates.

This seems to work well when applied to conveniently selected lexemes (Fig. 3.2).

Figure 3.2: 2D visualizations of semantic change in the COHA using SGNS vectors

How is such a neat visualization obtained?

To compare word vectors from different periods, the vectors must be aligned to the same coordinate axes.

Explicit vectors (such as those obtained with PPMI) are naturally aligned, as each column simply corresponds to a context word.

Implicit vectors (such as those obtained with SVD or SGNS) are not naturally aligned. For example, SGNS vectors are obtained stochastically: each run of SGNS results in an arbitrary orthogonal transformation of the vector space. Although this does not affect pairwise cosine similarities within a period, it means that you cannot directly compare the vectors of the same word across time.

To solve this, Hamilton, Leskovec, and Jurafsky (2016) use orthogonal Procrustes to align the embeddings. Procrustes is a mythological bandit from Ancient Greece who attacked people by stretching them or cutting off their legs, so as to force them to fit the size of an iron bed (Fig. 3.3).

Figure 3.3: The real Procrustes
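The alignment step can be illustrated with a few lines of base R. This is a sketch of orthogonal Procrustes in general, not the HistWords code: the function name `procrustes_align`, the random toy matrices, and their sizes are assumptions. Given two embedding matrices `X` and `Y` whose rows correspond to the same words, the rotation $R$ that minimizes $\|XR - Y\|_F$ is obtained from the singular value decomposition of $X^\top Y$.

```r
# Rotate X onto Y with the best orthogonal transformation
procrustes_align <- function(X, Y) {
  s <- svd(t(X) %*% Y)      # X'Y = U D V'
  R <- s$u %*% t(s$v)       # optimal rotation R = U V'
  X %*% R
}

set.seed(42)
# Toy 'later period' embeddings: 10 words, 4 dimensions
Y <- matrix(rnorm(40), nrow = 10, ncol = 4)

# Toy 'earlier period' embeddings: same geometry, arbitrarily rotated
Q <- qr.Q(qr(matrix(rnorm(16), nrow = 4)))  # a random orthogonal matrix
X <- Y %*% Q

X.aligned <- procrustes_align(X, Y)
max(abs(X.aligned - Y))  # close to zero: the rotation has been undone
```

Because the transformation is orthogonal, the cosine similarities between words within a period are left untouched; only the orientation of the axes changes.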

The neat visualization is obtained as follows:

the union of the target word’s $k$ nearest neighbors is found over all the relevant time-points
the $t$-SNE embedding of the neighboring words is computed on the most recent time-point
for each of the previous time-points, all embeddings are held fixed, except for the target word’s embedding
a new t-SNE embedding is optimized only for the target word.

Does this work for all linguistic units?

Figure 3.4: Identifying constructional shifts

DSMs are great when it comes to handling word meanings. They do not perform so well when dealing with grammatical phenomena beyond the word. We cannot blame Hamilton, Leskovec, and Jurafsky (2016). This is rather an effect of the ‘Bag-of-Words’ approach, which ignores syntactic relations.

Why not combine SGNS vectors and the workaround adopted by Hilpert (2016)?

4 Captain Kirk’s infinitive

Space: The final frontier

These are the voyages of the Starship, Enterprise

Its 5-year mission

To explore strange new worlds

To seek out new life and new civilizations

To boldly go where no man has gone before.

— Captain Kirk

The insertion of the adverb boldly between the infinitive marker to and the verb go caused quite a stir among prescriptivists at the time (ironically, the sexism of “no man” caused no such uproar). This usage is still branded as incorrect.

Arguing against splitting an infinitive makes sense in the context of Latin languages. For instance, no one would ever think of splitting the one-word infinitive verb aller ‘go’ in French into al and ler. But in a Germanic language like English, to go is not a one-word verbal unit but the pairing of a former spatial preposition and the base form of a verb separated by a space. When there is a space there is a way, and speakers might be tempted to fill in the gap (after all, language, like nature, abhors a vacuum). In fact, no descriptive grammar of English states that the adjacency between to and the base verb is obligatory.

Modern grammar textbooks such as Huddleston and Pullum’s A Student’s Introduction to English Grammar (2005) point out that “[p]hrases like to really succeed have been in use for hundreds of years.” They also claim that “in some cases placing the adjunct between to and the verb is stylistically preferable to other orderings.”

4.1 Data

The data come from the COHA, which consists of about 475 million word tokens and 115,000 texts. The corpus is balanced by genre across twenty decades from the 1810s to the 2000s. It is ideal for seeing how American English has changed over the past two centuries (see this video).

split <- read.table("https://tinyurl.com/splitunsplit", header=T, sep="\t")
head(split)

##   file_id genre decade year                        matches construction
## 1    7631   fic  1810s 1818 to_to properly_rr complete_vvi        split
## 2    7168   fic  1820s 1827 to_to publicly_rr denounce_vvi        split
## 3    7168   fic  1820s 1827    to_to maturely_rr weigh_vvi        split
## 4    7168   fic  1820s 1827 to_to absolutely_rr demand_vvi        split
## 5    7168   fic  1820s 1827      to_to madly_rr breast_vvi        split
## 6    7169   fic  1820s 1827  to_to wholly_rr intercept_vvi        split
##       adverb      verb
## 1   properly  complete
## 2   publicly  denounce
## 3   maturely     weigh
## 4 absolutely    demand
## 5      madly    breast
## 6     wholly intercept

If you want to know how to make a data frame from an annotated corpus, please read Desagulier (2017, sec. 5.3) ;-)

split$construction <- as.factor(split$construction)
summary(split$construction)

##   split unsplit 
##    7196   61771

split$verb <- as.factor(split$verb)
summary(split$verb)

##         be      speak       have        act       look         go      think 
##      17925       1452       1019        877        769        731        709 
##         do        say       move       live       deal       know       talk 
##        685        681        668        657        571        559        543 
##     become        see        get       come       walk       make       feel 
##        538        498        460        396        381        361        344 
##       take understand     remain        pay      stand       keep       grow 
##        343        298        280        239        230        225        222 
##   consider       rise        sit      write     listen        use     answer 
##        217        217        216        215        208        203        200 
##     appear       fall       meet       give       pass      laugh    operate 
##        200        200        199        196        191        189        182 
##    respond  determine     follow      leave    proceed     return   increase 
##        182        180        179        178        178        178        172 
##        run        eat       read      state       work      smile       play 
##        171        167        167        164        162        161        159 
##    breathe       turn       rely     report       wait    explain   function 
##        157        156        154        152        148        147        146 
##       show     depend    examine        cry     change      judge       find 
##        145        144        144        141        139        137        136 
##     decide        die       rest    produce      stare     accept      enter 
##        134        134        130        128        128        126        126 
##    perform    develop      start      study       stay    compete      learn 
##        126        125        124        124        123        122        122 
##    mention        ask       bear      spend       tell      sleep    express 
##        118        116        116        114        113        111        110 
##      fight        buy appreciate   describe     behave     travel      apply 
##        108        106        104        103        102        100         98 
##      dress    (Other) 
##         98      25640

4.2 Distinctive verbs

I used a tweaked version of Stefan Gries’s Coll.analysis 3.5 script to select the most distinctive verbs of the split and unsplit constructions. The output is given below.

output.DCA <- read.table("https://tinyurl.com/outputDCA", header=T, sep="\t")
head(output.DCA)

##        words freq.w.split freq.with.unsplit pref.occur coll.strength
## 1 understand          116               182      split        167.21
## 2     reduce           49                33      split        118.48
## 3 appreciate           52                52      split        102.60
## 4     change           53                86      split         73.97
## 5      enjoy           40                48      split         70.27
## 6    protect           24                13      split         63.44

We filter the most distinctive verb collocates and save them.

# install.packages("tidyverse")
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

df.top.coll <- output.DCA %>%
  filter(coll.strength >= 15)
nrow(df.top.coll)

## [1] 148

words <- unique(df.top.coll$words)

4.3 Obtaining the vectors

We load the SGNS vectors made from the COHA (1 GB!). I used a version of SGNS that is part of Levy, Goldberg, and Dagan (2015)’s hyperwords script collection for Python.¹ This version allows the user to tune hyperparameters.

To train SGNS on the whole COHA (1810s–2000s), I set the hyperparameters as follows:

the original contexts’ distribution is smoothed by raising all context counts to the power of 0.75;²
the context window is set to $\pm5$;³
negative samples are set to 15;⁴
each vector has 300 dimensions;
all words whose frequency is less than 5 in the COHA are discarded.

# load all vectors (long)
all.vectors <- read.csv("https://tinyurl.com/cohasgns", header=F, sep=" ") # careful, this may bottleneck your cpu
colnames(all.vectors) <- c("words", paste("V", seq(1, 300,1), sep=""))

4.4 The reference SVS

We subset the vectors of the most distinctive verbs.

# put the distinctive verbs in a data frame whose column name matches all.vectors
verbs.df <- data.frame(words = words)

# left join
#install.packages("tidyverse")
library(tidyverse)
input.tsne <- left_join(verbs.df, all.vectors, by = "words")

We obtain the following:

input.tsne <- read.table("https://tinyurl.com/inputtsne", header=T, sep="\t")
input.tsne[1:5,1:5]

##        verbs           V1           V2           V3          V4
## 1 understand -0.015784338 -0.007878140 -0.040333690  0.03372044
## 2     reduce -0.067971590 -0.038886357  0.105956880 -0.03643693
## 3 appreciate  0.009629898 -0.043391414 -0.007136557  0.06818054
## 4     change  0.107165165  0.041376587 -0.119515960 -0.01054170
## 5      enjoy  0.030036286 -0.009865477 -0.035851136  0.14671306

We apply dimensionality reduction with $t$-SNE.

# install.packages("Rtsne")
library(Rtsne)
num_dims <- 300
set.seed(7115)
rtsne_out <- Rtsne(input.tsne[,2:301], initial_dims=num_dims, max_iter=5000, perplexity=36)

We save the coordinates for the 2D plot from the output for later.

df.tsne <- as.data.frame(rtsne_out$Y)
colnames(df.tsne) <- c("dim1", "dim2")
df.tsne$verb <- input.tsne$verbs
head(df.tsne)

##          dim1      dim2       verb
## 1  0.59654580  3.084163 understand
## 2 -0.51275635 -3.740681     reduce
## 3  0.72215620  2.885243 appreciate
## 4  0.92457363 -5.032027     change
## 5 -0.06972965  2.450023      enjoy
## 6  0.57469809 -1.371691    protect

We plot the reference SVS.

mar.default <- c(5,2,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0)) 
plot(jitter(rtsne_out$Y), t='n', main="", xlab="dimension 1", ylab="dimension 2", cex.lab=1.8, cex.axis=1.8, cex.main=1.8, cex.sub=1.8)
text(jitter(rtsne_out$Y), labels=input.tsne$verbs, cex=1.8)

Figure 4.1: The reference semantic vector space

4.5 Contour plots

We group decades in the original data frame into four periods:

period 1: 1810s–1860s
period 2: 1870s–1920s
period 3: 1930s–1960s
period 4: 1970s–2000s

The relevant periods can be determined empirically, as shown in this post: A data-driven approach to identifying development stages in diachronic corpus linguistics.

split$decade <- as.factor(gsub("1810s|1820s|1830s|1840s|1850s|1860s", "period_1", split$decade))
split$decade <- as.factor(gsub("1870s|1880s|1890s|1900s|1910s|1920s", "period_2", split$decade))
split$decade <- as.factor(gsub("1930s|1940s|1950s|1960s", "period_3", split$decade))
split$decade <- as.factor(gsub("1970s|1980s|1990s|2000s", "period_4", split$decade))
levels(split$decade)

## [1] "period_1" "period_2" "period_3" "period_4"

For each period, we calculate the co-occurrence frequencies between the constructions and the verbs and filter out hapax legomena.

split.periodized <- split %>%
  group_by(decade) %>%
  count(construction, verb) %>%
  filter(n > 1)
head(split.periodized)

## # A tibble: 6 × 4
## # Groups:   decade [1]
##   decade   construction verb           n
##   <fct>    <fct>        <fct>      <int>
## 1 period_1 split        abandon        2
## 2 period_1 split        affect         3
## 3 period_1 split        aid            2
## 4 period_1 split        appreciate     2
## 5 period_1 split        attend         2
## 6 period_1 split        avoid          2

There is a lot of variation among the frequency values. We apply a binary-logarithm transformation and store the result in a new column.

split.periodized$density <- log2(split.periodized$n)
head(split.periodized)

## # A tibble: 6 × 5
## # Groups:   decade [1]
##   decade   construction verb           n density
##   <fct>    <fct>        <fct>      <int>   <dbl>
## 1 period_1 split        abandon        2    1   
## 2 period_1 split        affect         3    1.58
## 3 period_1 split        aid            2    1   
## 4 period_1 split        appreciate     2    1   
## 5 period_1 split        attend         2    1   
## 6 period_1 split        avoid          2    1

Finally, we add the $t$-SNE output and drop NAs.

df.periodized <- left_join(split.periodized, df.tsne, by="verb")
nrow(df.periodized)

## [1] 4618

df.periodized <- df.periodized %>% drop_na()
head(df.periodized)

## # A tibble: 6 × 7
## # Groups:   decade [1]
##   decade   construction verb           n density   dim1   dim2
##   <fct>    <fct>        <chr>      <int>   <dbl>  <dbl>  <dbl>
## 1 period_1 split        abandon        2    1    -0.327  0.284
## 2 period_1 split        affect         3    1.58  0.958 -4.54 
## 3 period_1 split        appreciate     2    1     0.722  2.89 
## 4 period_1 split        become         3    1.58  3.29   3.86 
## 5 period_1 split        believe        2    1    -0.177  3.43 
## 6 period_1 split        break          2    1    -4.63  -0.306

4.5.1 Period 1

We select the split data from the first period.

df.period.1 <- df.periodized %>%
  filter(decade == "period_1") %>%
  filter(construction == "split")
df.period.1 <- df.period.1[,c(6,7,5)]
head(df.period.1)

## # A tibble: 6 × 3
##     dim1   dim2 density
##    <dbl>  <dbl>   <dbl>
## 1 -0.327  0.284    1   
## 2  0.958 -4.54     1.58
## 3  0.722  2.89     1   
## 4  3.29   3.86     1.58
## 5 -0.177  3.43     1   
## 6 -4.63  -0.306    1

We re-plot the reference SVS and overlay contour plots indexed on the per-period logged frequencies, using the kde2d() function from the MASS package.

# Open an EPS device (replace the path with your own folder)
postscript("~/path.to.your.folder/density.plot.split.period.1.eps", horizontal = FALSE, onefile = FALSE, paper = "special", height=18, width=24)

# Widen the left margin so that the enlarged axis label fits
mar.default <- c(5,2,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
# Draw an empty frame from the reference SVS coordinates, then add the verb labels
plot(jitter(rtsne_out$Y), type="n", main="", xlab="dimension 1", ylab="dimension 2", cex.lab=1.8, cex.axis=1.8, cex.main=1.8, cex.sub=1.8)
text(jitter(rtsne_out$Y), labels=input.tsne$verbs, cex=1.8)

# install.packages("MASS")
library(MASS)
df.period.1 <- as.data.frame(df.period.1)

# Estimate a 2D kernel density on a 500 x 500 grid spanning the whole SVS, then overlay its contours
bivn.kde.split.period.1 <- kde2d(df.period.1[,1], df.period.1[,2], h=df.period.1[,3], n = 500, lims = c(range(df.periodized$dim1), range(df.periodized$dim2)))
contour(bivn.kde.split.period.1, add = TRUE, col="dodgerblue", lwd=2)

# Close the device so that the file is written
dev.off()

We obtain the following plot.

Figure 4.2: SVS with contour plots (period 1)

We repeat the above for the remaining three periods.
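
Since the same subsetting step recurs for each period, it could be wrapped in a small helper. The sketch below is an assumption about how one might do this, not the code used for the figures: `get_period()` is a hypothetical name, and it selects columns by name rather than by position, which keeps working if the column order changes. It is demonstrated on invented data with the same column names as df.periodized.

```r
# Hypothetical helper: coordinates and logged frequency for one period
get_period <- function(df, period, cx = "split") {
  keep <- df$decade == period & df$construction == cx
  # Select by name instead of df[, c(6, 7, 5)]
  as.data.frame(df[keep, c("dim1", "dim2", "density")])
}

# Invented data with the same column names as df.periodized
toy <- data.frame(
  decade       = c("period_1", "period_1", "period_2"),
  construction = c("split", "split", "split"),
  verb         = c("abandon", "affect", "abandon"),
  n            = c(2, 3, 3),
  density      = log2(c(2, 3, 3)),
  dim1         = c(-0.327, 0.958, -0.327),
  dim2         = c(0.284, -4.54, 0.284)
)

p1 <- get_period(toy, "period_1")
print(p1)
```

The returned data frame has the column order expected by the plotting code: dim1, dim2, density.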

4.5.2 Period 2

df.period.2 <- df.periodized %>%
  filter(decade == "period_2")
df.period.2 <- df.period.2[,c(6,7,5)]
head(df.period.2)

## # A tibble: 6 × 3
##     dim1   dim2 density
##    <dbl>  <dbl>   <dbl>
## 1 -0.327  0.284    1.58
## 2  0.958 -4.54     3   
## 3  0.985 -4.81     1   
## 4  0.722  2.89     4.52
## 5 -3.31   3.57     3.32
## 6  3.29   3.86     1

postscript("~/path.to.your.folder/density.plot.split.period.2.eps", horizontal = FALSE, onefile = FALSE, paper = "special", height=18, width=24)

mar.default <- c(5,2,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0)) 
plot(jitter(rtsne_out$Y), t='n', main="", xlab="dimension 1", ylab="dimension 2", cex.lab=1.8, cex.axis=1.8, cex.main=1.8, cex.sub=1.8)
text(jitter(rtsne_out$Y), labels=input.tsne$verbs, cex=1.8)

library(MASS)
df.period.2 <- as.data.frame(df.period.2)

bivn.kde.split.period.2 <- kde2d(df.period.2[,1], df.period.2[,2], h=df.period.2[,3], n = 500, lims = c(range(df.periodized$dim1), range(df.periodized$dim2)))
contour(bivn.kde.split.period.2, add = TRUE, col="dodgerblue", lwd=2)

dev.off()

Figure 4.3: SVS with contour plots (period 2)

4.5.3 Period 3

df.period.3 <- df.periodized %>%
  filter(decade == "period_3")
df.period.3 <- df.period.3[,c(6,7,5)]
head(df.period.3)

## # A tibble: 6 × 3
##     dim1   dim2 density
##    <dbl>  <dbl>   <dbl>
## 1  1.68   0.624    1   
## 2  0.958 -4.54     2   
## 3  0.985 -4.81     1   
## 4  0.722  2.89     2   
## 5  2.60   2.34     1.58
## 6 -3.31   3.57     2.58

postscript("~/path.to.your.folder/density.plot.split.period.3.eps", horizontal = FALSE, onefile = FALSE, paper = "special", height=18, width=24)

mar.default <- c(5,2,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0)) 
plot(jitter(rtsne_out$Y), t='n', main="", xlab="dimension 1", ylab="dimension 2", cex.lab=1.8, cex.axis=1.8, cex.main=1.8, cex.sub=1.8)
text(jitter(rtsne_out$Y), labels=input.tsne$verbs, cex=1.8)

library(MASS)
df.period.3 <- as.data.frame(df.period.3)

bivn.kde.split.period.3 <- kde2d(df.period.3[,1], df.period.3[,2], h=df.period.3[,3], n = 500, lims = c(range(df.periodized$dim1), range(df.periodized$dim2)))
contour(bivn.kde.split.period.3, add = TRUE, col="dodgerblue", lwd=2)

dev.off()

Figure 4.4: SVS with contour plots (period 3)

4.5.4 Period 4

df.period.4 <- df.periodized %>%
  filter(decade == "period_4")
df.period.4 <- df.period.4[,c(6,7,5)]
head(df.period.4)

## # A tibble: 6 × 3
##     dim1   dim2 density
##    <dbl>  <dbl>   <dbl>
## 1 -0.327  0.284    2   
## 2  1.82  -3.80     2.81
## 3  1.68   0.624    1.58
## 4  4.88   0.876    1   
## 5 -3.66   5.81     4.52
## 6  0.958 -4.54     3.58

postscript("~/path.to.your.folder/density.plot.split.period.4.eps", horizontal = FALSE, onefile = FALSE, paper = "special", height=18, width=24)

mar.default <- c(5,2,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0)) 
plot(jitter(rtsne_out$Y), t='n', main="", xlab="dimension 1", ylab="dimension 2", cex.lab=1.8, cex.axis=1.8, cex.main=1.8, cex.sub=1.8)
text(jitter(rtsne_out$Y), labels=input.tsne$verbs, cex=1.8)

library(MASS)
df.period.4 <- as.data.frame(df.period.4)

bivn.kde.split.period.4 <- kde2d(df.period.4[,1], df.period.4[,2], h=df.period.4[,3], n = 500, lims = c(range(df.periodized$dim1), range(df.periodized$dim2)))
contour(bivn.kde.split.period.4, add = TRUE, col="dodgerblue", lwd=2)

dev.off()

Figure 4.5: SVS with contour plots (period 4)

4.5.5 Summary

Figure 4.6: SVS with contour plots (all 4 periods)

Figure 4.7: Distribution of the split (blue) and unsplit (orange) infinitives in the COHA
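
A note on the role of `h` in the kde2d() calls of this section. MASS documents `h` as a vector of bandwidths for the x and y directions. As far as I can tell, a longer vector such as the logged frequencies is therefore reduced to its first two values, and does not act as a set of per-verb weights. If frequency-weighted contours are wanted, one option is a weighted kernel density estimate. The sketch below is only an illustration under that assumption: `weighted_kde2d()` is a hypothetical function (it is not part of MASS), it uses Gaussian kernels with the normal reference bandwidth returned by bw.nrd() from the stats package, and it runs on simulated coordinates rather than on the COHA data.

```r
# Hypothetical weighted 2D KDE with Gaussian kernels (not part of MASS)
weighted_kde2d <- function(x, y, w, n = 50,
                           lims = c(range(x), range(y)),
                           h = c(bw.nrd(x), bw.nrd(y))) {
  stopifnot(length(x) == length(y), length(x) == length(w),
            all(w >= 0), sum(w) > 0)
  w  <- w / sum(w)                      # rescale the weights to sum to 1
  gx <- seq(lims[1], lims[2], length.out = n)
  gy <- seq(lims[3], lims[4], length.out = n)
  # Kernel values: rows = grid points, columns = observations
  kx <- dnorm(outer(gx, x, "-") / h[1]) / h[1]
  ky <- dnorm(outer(gy, y, "-") / h[2]) / h[2]
  # Each observation contributes in proportion to its weight
  z  <- kx %*% (w * t(ky))
  list(x = gx, y = gy, z = z)
}

# Simulated coordinates and frequencies (invented, for illustration only)
set.seed(1)
x <- rnorm(40)
y <- rnorm(40)
freq <- sample(2:30, 40, replace = TRUE)

dens <- weighted_kde2d(x, y, w = log2(freq), lims = c(-6, 6, -6, 6))
# On an existing plot: contour(dens, add = TRUE, col = "dodgerblue", lwd = 2)
```

The result has the same structure as the output of kde2d() (a list with x, y and z), so it can be passed to contour() in the same way.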

5 The IL construction

The IL construction subsumes four prepositional patterns that denote a relation of internal location between a located entity and a reference entity: in/at the middle of NP, in/at the center of NP, in/at the heart of NP, and in the midst of NP.

He stops suddenly in the middle of the stage and seems to consider. (1815-FIC-FalseShame)
Marvin walked to the chalk mark in the center of the ring. (1934-FIC-Captain Caution)
(…) we were in the heart of Norwalk. (1827-FIC-Novels)
We see St Eustace praying in the midst of the river. (1980-FIC-RiddleyWalker)

Figure 5.1: The IL construction network

5.1 Data

Figure 5.2: The IL construction network

period 1: 1810s–1860s
period 2: 1870s–1910s
period 3: 1920s–1970s
period 4: 1980s–2000s
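
The grouping of decades into periods can be expressed with cut(). This is a minimal sketch: `decade_to_period()` is a hypothetical helper, and it assumes that each decade is coded as its first year (e.g. 1810 for the 1810s), which may differ from how the COHA files are actually labeled.

```r
# Hypothetical helper: map a decade (coded as its first year) to a period
decade_to_period <- function(decade) {
  cut(decade,
      breaks = c(1810, 1870, 1920, 1980, 2010),  # left-closed intervals
      labels = paste0("period_", 1:4),
      right  = FALSE)
}

decades <- c(1810, 1860, 1870, 1910, 1920, 1970, 1980, 2000)
periods <- decade_to_period(decades)
print(data.frame(decade = decades, period = periods))
```

Decades outside the 1810s–2000s range come out as NA, which makes coding errors easy to spot.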

5.2 Research questions

What is the semantic profile of each pattern?
How has this semantic profile shifted over time?
What does this tell us about the internal dynamics of the IL construction?

5.3 The reference SVS

Figure 5.3: The reference SVS of the IL construction in the COHA

5.4 Contour plots

Figure 5.4: Diachronic distributional semantic plots of Prep the midst of NP and Prep the heart of NP

Figure 5.5: Diachronic distributional semantic plots of Prep the middle of NP and Prep the center of NP

5.5 Summary

Two kinds of horizontal relations emerge:

increasing competition (midst & heart)
co-existence + division of labor (center & middle)

6 Conclusion

The statistically computed semantic vector spaces that DSMs generate seem to provide psychologically relevant representations of word meaning.⁵

It could be argued that word2vec suffers from the ‘one vector per word’ issue (Desagulier 2019) because it produces type-based distributional representations, i.e. an aggregation of all the contexts encountered by a single word form.

In the case of homonymy, this is clearly a problem because the resulting vector conflates contexts that may have nothing to do with each other: e.g. bat as the stick used in baseball vs. the winged mammal. Very few such cases were found in our dataset.
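
To make the ‘one vector per word’ issue concrete, here is a toy sketch with raw co-occurrence counts. The four sentences are invented, and the count vectors only stand in for what word2vec or BERT would learn. The point is simply that a type-based vector sums over all occurrences, whereas a token-based view keeps one vector per occurrence.

```r
# Invented sentences: two with the baseball sense of "bat", two with the animal
sentences <- c(
  "the player swung the bat at the ball",
  "he dropped his bat after the home run",
  "a bat flew out of the cave",
  "the bat hung upside down in the cave"
)
tokens <- strsplit(sentences, " ")

# Context words of each occurrence of "bat" (here: the rest of the sentence)
contexts <- lapply(tokens, function(s) s[s != "bat"])
vocab <- sort(unique(unlist(contexts)))

# Token-based view: one count vector per occurrence
token_vectors <- t(vapply(
  contexts,
  function(ctx) as.integer(table(factor(ctx, levels = vocab))),
  integer(length(vocab))
))
colnames(token_vectors) <- vocab

# Type-based view: a single vector that aggregates all occurrences
type_vector <- colSums(token_vectors)
print(type_vector[c("ball", "cave")])
```

In the type vector, ball and cave end up side by side, although no single occurrence of bat has both in its context.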

In the case of polysemy, some linguists might be wary of the quality of the resulting vectors and turn to token-based distributional representations instead (Fonteyn 2021).

In theory, token-based DSMs such as BERT are optimal: they provide as many representations as there are contexts and thus reflect semantic complexity. In practice, however, this comes at a huge computing cost. For this reason, most BERT models can only be trained on relatively small corpora, unless one has access to a supercomputer.⁶

In the context of my research, and given the trade-off between computational cost and the need for semantic generalization, I consider word2vec a reasonable alternative, since due attention was paid to the hyperparameter setup.

Another aspect that makes word2vec a method still worth considering, despite the undeniable progress of token-based DSMs, is that it is not as much of a ‘black box’ as state-of-the-art token-based DSMs. It is possible to keep track of what the algorithm does at any stage. By contrast, most BERT distributions only offer access to the last two hidden layers. When access is granted, there is still discussion as to which layer is relevant to the semanticist (Mun 2021).

Finally, one may consider that the aggregated approach to meaning representation offered by type-based DSMs captures the core meaning of any given word type (Erk and Padó 2010).

All in all, the consistency of the above vector spaces suggests that ‘old-school’ predictive models have not said their last word.

7 References

Baroni, Marco, Georgiana Dinu, and Germán Kruszewski. 2014. “Don’t Count, Predict! A Systematic Comparison of Context-Counting Vs. Context-Predicting Semantic Vectors.” In ACL (1), 238–47.

Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. “A Neural Probabilistic Language Model.” Journal of Machine Learning Research 3: 1137–55.

Bullinaria, John A, and Joseph P Levy. 2007. “Extracting Semantic Representations from Word Co-Occurrence Statistics: A Computational Study.” Behavior Research Methods 39 (3): 510–26.

Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. “Natural Language Processing (Almost) from Scratch.” Journal of Machine Learning Research 12: 2493–2537.

Davies, Mark. 2008. “The Corpus of Contemporary American English (COCA).” 2008. https://www.english-corpora.org/coca/.

———. 2010. “The Corpus of Historical American English (COHA).” 2010. https://www.english-corpora.org/coha/.

Deerwester, Scott, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41 (6): 391–407.

Desagulier, Guillaume. 2017. Corpus Linguistics and Statistics with R: Introduction to Quantitative Methods in Linguistics. New York: Springer.

———. 2019. “Can Word Vectors Help Corpus Linguists?” Studia Neophilologica 91 (2): 219–40.

Erk, Katrin, and Sebastian Padó. 2010. “Exemplar-Based Models for Word Meaning in Context.” In Proceedings of the ACL 2010 Conference Short Papers, 92–97.

Firth, J. R. 1957. “A Synopsis of Linguistic Theory 1930-55.” In Studies in Linguistic Analysis (special volume of the Philological Society), 1952-59:1–32. Oxford: The Philological Society.

Fonteyn, Lauren. 2021. “Varying Abstractions: A Conceptual Vs. Distributional View on Prepositional Polysemy.” Glossa: A Journal of General Linguistics 6 (1).

French, Robert M, and Christophe Labiouse. 2002. “Four Problems with Extracting Human Semantics from Large Text Corpora.” In Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society, 316–21. Routledge.

Glenberg, Arthur M, and David A Robertson. 2000. “Symbol Grounding and Meaning: A Comparison of High-Dimensional and Embodied Theories of Meaning.” Journal of Memory and Language 43 (3): 379–401.

Hamilton, William L, Jure Leskovec, and Dan Jurafsky. 2016. “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change.” In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1489–1501.

Harris, Zellig S. 1954. “Distributional Structure.” Word 10 (2-3): 146–62.

Hilpert, Martin. 2016. “Change in Modal Meanings.” Constructions and Frames 8 (1): 66–85.

Kruskal, Joseph B., and Myron Wish. 1978. “Multidimensional Scaling.” In Sage University Paper Series on Quantitative Applications in the Social Sciences. 07–011.

Landauer, Thomas K, and Susan T Dumais. 1997. “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.” Psychological Review 104 (2): 211.

Levy, Omer, Yoav Goldberg, and Ido Dagan. 2015. “Improving Distributional Similarity with Lessons Learned from Word Embeddings.” Transactions of the Association for Computational Linguistics 3: 211–25.

Lund, Kevin, and Curt Burgess. 1996. “Producing High-Dimensional Semantic Spaces from Lexical Co-Occurrence.” Behavior Research Methods, Instruments, & Computers 28 (2): 203–8.

Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing Data Using t-SNE.” Journal of Machine Learning Research 9: 2579–2605.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” CoRR abs/1301.3781. http://arxiv.org/abs/1301.3781.

Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. 2013. “Linguistic Regularities in Continuous Space Word Representations.” In Proceedings of NAACL-HLT, 746–51. http://www.aclweb.org/anthology/N/N13/N13-1090.pdf.

Miller, George A., and Walter G. Charles. 1991. “Contextual Correlates of Semantic Similarity.” Language and Cognitive Processes 6 (1): 1–28.

Mun, Seongmin. 2021. “Polysemy Resolution with Word Embedding Models and Data Visualization: The Case of Adverbial Postpositions -Ey, -Eyse, and -(u)lo in Korean.” PhD thesis, Paris Nanterre University: École Doctorale Connaissance, Langage et Modélisation.

Pearson, Karl. 1901. “LIII. On Lines and Planes of Closest Fit to Systems of Points in Space.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11): 559–72.

Perek, Florent. 2016. “Using Distributional Semantics to Study Syntactic Productivity in Diachrony: A Case Study.” Linguistics 54 (1): 149–88.

———. 2018. “Recent Change in the Productivity and Schematicity of the Way-Construction: A Distributional Semantic Analysis.” Corpus Linguistics and Linguistic Theory 14 (1): 65–97.

Sagi, Eyal, Stefan Kaufmann, and Brady Clark. 2009. “Semantic Density Analysis: Comparing Word Meaning Across Time and Phonetic Space.” In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, 104–11.

———. 2011. “Tracing Semantic Change with Latent Semantic Analysis.” Current Methods in Historical Semantics, 161–83.

Turney, Peter D, and Patrick Pantel. 2010. “From Frequency to Meaning: Vector Space Models of Semantics.” Journal of Artificial Intelligence Research 37: 141–88.

Venables, William N, and Brian D Ripley. 2013. Modern Applied Statistics with S-PLUS. Springer Science & Business Media.

¹ https://github.com/williamleif/histwords/tree/master/sgns/hyperwords
² This increases the probability of rare noise words slightly and improves the quality of the vector representations.
³ show that window size influences the vector representations significantly. Shorter context windows (e.g. $\pm 2$) tend to produce vector representations that foreground syntactic similarities, i.e. words that belong to the same parts of speech. Longer context windows ($\geq \pm 5$), on the other hand, tend to produce vector representations that foreground topical relations.
⁴ SGNS uses more negative examples than positive examples and therefore produces better representations when negative samples are greater than one. By setting negative samples to 15, we have 15 negative examples in the negative training set for each positive example.
⁵ This is at least the claimed contribution of two landmark DSMs: Hyperspace Approximation to Language (Lund and Burgess 1996) and Latent Semantic Analysis (Deerwester et al. 1990; Landauer and Dumais 1997). These frameworks do instantiate how co-occurrence patterns can simulate psychological tasks at the lexical semantic level, once post-processed with rather simple statistics. Bullinaria and Levy (2007) list studies that challenge this approach (Glenberg and Robertson 2000; see also French and Labiouse 2002). Their main argument is that theories based on word co-occurrence are symbolic and cannot hope to make contact with ‘real-world’ semantics.
⁶ For example, Fonteyn (2021) trains the base BERT model on a small portion of the COHA to create contextualized embeddings for over.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Doing diachronic linguistics with distributional semantic models in

Guillaume Desagulier

LADAL Opening Webinar Series 2021 – Sept. 30, 2021

1 Introduction

2 DSMs 101

2.1 A (too) simple example

2.1.1 Similarity

2.2 Weighting

2.3 Dimensionality reduction

3 DSMs and diachrony

3.1 Count models

3.1.1 Hilpert (2016)

3.1.2 Perek (2016)

3.1.3 Perek (2018)

3.2 Predictive models

3.2.1 `word2vec`

3.2.1.1 CBOW

3.2.1.2 SGNS

3.2.2 Hamilton, Leskovec, and Jurafsky (2016)

4 Captain Kirk’s infinitive

4.1 Data

4.2 Distinctive verbs

4.3 Obtaining the vectors

4.4 The reference SVS

4.5 Contour plots

4.5.1 Period 1

4.5.2 Period 2

4.5.3 Period 3

4.5.4 Period 4

4.5.5 Summary

5 The IL construction

5.1 Data

5.2 Research questions

5.3 The reference SVS

5.4 Contour plots

5.5 Summary

6 Conclusion

7 References
