All the downloadable datasets included in this notebook are subject to the following licence: Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0). If you reuse any of them, please cite the dataset in connection with this notebook.
Computational linguistics offers promising tools for tracking language change in diachronic corpora. These tools exploit distributional semantic models, both old and new. DSMs tend to perform well at the level of lexical semantics but are more difficult to fine-tune when it comes to capturing grammatical meaning.
I present ways in which the above can be improved. I start from well-trodden methodological paths implemented in diachronic construction grammar: changes in the collocational patterns of a linguistic unit reflect changes in meaning/function, and distributional word representations can be supplemented with frequency-based methods. I move on to show that when meaning is apprehended with predictive models (e.g. word2vec), one can trace semantic shifts with greater explanatory power than with count models. Although this idea may sound outdated from the perspective of NLP, it actually goes a long way from the viewpoint of theory-informed corpus linguistics.
I illustrate the above with several case studies, one of which involves complex locative prepositions in the Corpus of Historical American English. I conclude my talk by defending the idea that NLP, with its focus on computational efficiency, and corpus linguistics, with its focus on tools that maximize data inspection, have much to gain from moving closer together.
Because of the topics covered by this seminar, I have decided to focus my talk on methodological issues and to share some thoughts on my practice as a corpus linguist with an NLP leaning. This is why I have chosen to illustrate my talk with a notebook rather than slides.
Distributional semantic models (henceforth DSMs) are computational implementations of the distributional hypothesis: words that occur in similar contexts tend to have similar meanings (Harris 1954; Firth 1957; Miller and Charles 1991).
Initially developed in the field of cognitive psychology to model memory acquisition (Landauer and Dumais 1997; Lund and Burgess 1996), DSMs have been used extensively in NLP in the wake of Turney and Pantel (2010).
DSM is a cover term for a great number of methodologically related yet distinct approaches. Three features shape the kind of distributional semantic modeling that you do:
type (e.g. word2vec) vs. token (e.g. BERT) representations

In any case, DSMs embrace the ‘Bag-of-Words’ approach. As a result, semantic modeling works well at the lexical level but not so much at more complex levels. We shall see why shortly, and I will propose a workaround based on previous work. The approach I propose taps into the computational power of NLP and the methodological intuitions of corpus linguistics. It combines the assets of collocational analysis and DSMs.
Another related issue has to do with diachronic linguistics. In quite general terms, diachronic linguistics is the study of language change. The methodological implications are not that simple, because doing diachronic linguistics depends on your theory of language and on what the linguist considers a relevant linguistic unit for the study of change (morphemes? lexemes? syntactic patterns? etc.).
I will address these issues in a hybrid manner, i.e. via a combination of theoretical reflections and practice (i.e. with R code).
After introducing the foundations of DSMs, I will move on to a review of their applications in diachrony, comparing one NLP approach to corpus-linguistic approaches. I will present two case studies: the split infinitive and the internal-location construction.
I argue for a double requirement: maximizing the quality of the vector representation, and respecting the nature of the linguistic unit.
DSMs are used to produce semantic representations of words from co-occurrence matrices, i.e. tables of co-occurring words, with target words as rows, and their neighbors as columns.
Originally, a co-occurrence matrix is populated with frequency counts (how many times the target word and its neighbors co-occur) and each row is an array of such frequencies, also known as a vector. The semantic representation produced by DSMs is therefore numeric.
Semantic similarities are apprehended in terms of proximities and distances between word vectors.
source: “Word embeddings: the (very) basics,” in Around the Word, 25/04/2018
Suppose we have a mini corpus with 7 words:
These words are found in 3 contexts:
Each word is characterized by 3 coordinates which correspond to the number of times the word is found in each context. For example, helicopter is not found in the wings context and it occurs twice and four times in the contexts engine and sky, respectively. Its coordinates are therefore (0,2,4).
It is customary to collect all coordinates in a matrix such as the one below.
> m <- matrix(c(3,0,2,3,0,3,2,0,4,0,2,4,0,3,3,0,4,2,1,1,1), nrow=7, ncol=3, byrow=T)
> rownames(m) <- c("bee", "eagle", "goose", "helicopter", "drone", "rocket", "jet")
> colnames(m) <- c("wings", "engine", "sky")
> m
wings engine sky
bee 3 0 2
eagle 3 0 3
goose 2 0 4
helicopter 0 2 4
drone 0 3 3
rocket 0 4 2
jet 1 1 1
Each line is a vector. The vectors contained in the matrix are said to be explicit because each dimension corresponds to a well-identified context.
Most of the time, matrices of explicit vectors contain many “empty” cells, i.e. cells whose value is zero. Such matrices are known as sparse matrices.
The toy matrix is deliberately simple, as each vector is three-dimensional. In the real world, the matrix can easily reach several thousand rows and columns, depending on the size of the corpus.
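As a quick sketch of what sparsity means here, we can compute the proportion of zero cells in the toy matrix m:
# proportion of empty (zero) cells in the toy matrix
sum(m == 0) / length(m) # 6 zero cells out of 21, i.e. about 29%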
Each word occupies a specific position in the vector space, as represented in Fig. 2.1.
The word vector is the arrow from the point where all three axes intersect to the end point defined by the coordinates.
The presupposition underlying word embeddings is that semantic similarities are indexed on contextual affinities. For example, helicopter and drone are close because they occur in similar contexts, have similar vector profiles, and are therefore close in the vector space.
Although this results in a simplistic view of meaning, a nice consequence is that vector coordinates can be used to calculate the proximities between words. This is done with cosine similarity (\(cos~\theta\)), i.e. the cosine of the angle between two word vectors (Fig. 2.2).
Let us see briefly how cosine similarity is measured. Let \(\vec{a}\) and \(\vec{b}\) denote two vectors. Cosine similarity between \(\vec{a}\) and \(\vec{b}\) is calculated as follows:
\[ cos~\theta = \frac{\vec{a}\cdot\vec{b}}{\|\vec{a}\|\|\vec{b}\|} \]
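Before using a dedicated function, we can apply the formula by hand to helicopter and drone, as a minimal sanity check on the toy matrix m:
# cosine similarity between 'helicopter' and 'drone', straight from the formula
a <- m["helicopter", ] # (0, 2, 4)
b <- m["drone", ] # (0, 3, 3)
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))) # dot product divided by the product of the norms
The result (approximately 0.95) matches the value returned by the lsa package below.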
#install.packages("lsa")
library(lsa)
Loading required package: SnowballC
cos <- round(cosine(t(m)), 2) # lsa's cosine() compares columns, hence the transpose t(m)
cos
bee eagle goose helicopter drone rocket jet
bee 1.00 0.98 0.87 0.50 0.39 0.25 0.80
eagle 0.98 1.00 0.95 0.63 0.50 0.32 0.82
goose 0.87 0.95 1.00 0.80 0.63 0.40 0.77
helicopter 0.50 0.63 0.80 1.00 0.95 0.80 0.77
drone 0.39 0.50 0.63 0.95 1.00 0.95 0.82
rocket 0.25 0.32 0.40 0.80 0.95 1.00 0.77
jet 0.80 0.82 0.77 0.77 0.82 0.77 1.00
Theoretically, similarity scores range from \(-1\) (complete opposition) to \(1\) (identity). A score of \(0\) indicates orthogonality (no correlation). Values in between indicate intermediate degrees of similarity (between \(0\) and \(1\)) or dissimilarity (between \(-1\) and \(0\)). Here, the cosine similarities range from \(0\) to \(1\): since the word frequencies are not negative, the angle between two word vectors is never greater than \(90°\).
Because the matrix is symmetric, it is divided into two identical triangles on either side of the diagonal, which records the exact similarity (\(cos~\theta = 1\)) of each word with itself.
The largest dissimilarity is observed between bee and rocket (\(cos~\theta = 0.25\)). The largest similarity is observed between bee and eagle (\(cos~\theta = 0.98\)).
There are other (dis)similarity metrics. One of them is Euclidean distance:
dist.object <- dist(m, method="euclidean", diag=T, upper=T) # pairwise Euclidean distances between the rows of m
dist.matrix <- as.matrix(dist.object)
We can represent the above graphically with a method known as Multidimensional Scaling (MDS). MDS is very popular because it is relatively old, versatile, and easy to understand and implement. It is a multivariate data analysis approach used to visualize distances in low-dimensional maps (generally two-dimensional plots).
mds <- cmdscale(dist.matrix, eig=TRUE, k=2) # classical MDS, keep 2 dimensions
x <- mds$points[,1]
y <- mds$points[,2]
plot(x, y, xlab="Dim.1", ylab="Dim.2", type="n") # set up an empty plot
text(x, y, labels = row.names(m), cex=.7) # add the word labels
Of course, in their natural environment, word meanings do not let themselves be captured so easily:
We must apply some kind of weighting to enhance the contribution of the most revealing collocates. A weight is added to a collocate when its association with the target word is statistically significant.
Common weighting measures:
More info on this: an online tutorial by Andreas Niekler & Gregor Wiedemann
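As a minimal sketch of how such a weighting works, here is Positive Pointwise Mutual Information (PPMI) applied to the toy matrix m (only one of several possible weighting schemes):
# PPMI weighting of the toy co-occurrence matrix m
total <- sum(m)
p.wc <- m / total # joint probabilities p(w, c)
p.w <- rowSums(m) / total # marginal probabilities of the target words
p.c <- colSums(m) / total # marginal probabilities of the contexts
pmi <- log2(p.wc / outer(p.w, p.c)) # pointwise mutual information
ppmi <- pmax(pmi, 0) # negative (and -Inf) values are floored at 0
round(ppmi, 2)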
A matrix generates as many dimensions as it has columns. To summarize a matrix, we need a method to reduce the number of dimensions to a few meaningful ones, which can then be mapped onto a Euclidean space for easy visual inspection.
Several methods exist:
See Chap. 10 of Desagulier (2017).
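As a minimal illustration of the idea (not the exact procedure used later in this notebook), here is a truncated singular value decomposition of the toy matrix m down to two dimensions:
# truncated SVD: keep the first two latent dimensions of the toy matrix
svd.m <- svd(m)
m.reduced <- svd.m$u[, 1:2] %*% diag(svd.m$d[1:2]) # 2-dimensional word coordinates
rownames(m.reduced) <- rownames(m)
round(m.reduced, 2)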
DSMs have been applied to the study of diachrony in NLP (Sagi, Kaufmann, and Clark 2009, 2011) and, more recently, corpus-based cognitive semantics and construction grammar.
Below, I illustrate two kinds of models (count vs. predictive) based on a couple of inspiring papers in NLP and corpus-based construction grammar.
DSMs that rely on count models, i.e. models whose vectors are generated from co-occurrence counts, are common in corpus linguistics. Such vectors are:
Hilpert observes that may entertains a complex network of associations with the lexical verbs that it governs and that it has shifted away from the expression of deontic modal meanings towards epistemic meanings and a higher degree of informativeness (Fig. 3.1).
Perek finds that the three senses of the way-construction (path-creation, manner, and incidental-action) have gained in semantic diversity. More precisely, the schematicity of the verb slot or the motion component contributed by the construction has increased, alongside its productivity.
Predictive models are inspired by neural language models (Collobert et al. 2011). Instead of counting how often a collocate \(c\) occurs near a target word \(w\), predictive models estimate the probability of finding \(c\) near \(w\). The resulting vectors are:
word2vec (Mikolov, Yih, and Zweig 2013)
CBOW predicts a word given its context. It has been shown to outperform count models on a variety of Natural Language Processing tasks such as semantic relatedness, synonymy detection, selectional preferences, and analogy (Baroni, Dinu, and Kruszewski 2014). Levy, Goldberg, and Dagan (2015) warn that Baroni et al.'s comparison is unfair and observe that PPMI and SVD perform equally well if fine-tuned with an ad hoc combination of hyperparameters (context window, subsampling, deletion of rare words, negative sampling, context distribution smoothing, etc.).
Hamilton, Leskovec, and Jurafsky (2016) show that Skip-Gram with Negative Sampling (SGNS) (Mikolov, Yih, and Zweig 2013), the alternative model of the word2vec toolkit, outperforms PPMI and SVD in the discovery of new shifts and the visualization of changes.
SGNS predicts a word’s context given the word itself, a task that is complementary to the one addressed by CBOW.
One feature of SGNS that is of particular interest to usage-based linguists is that each word \(W_i\) is represented by two short, dense vectors: a word vector \(w_i\) and a context vector \(c_i\). The final vector of a given word can be either the word vector (\(W_i = w_i\)) or the sum of the two (\(W_i = w_i + c_i\)).
HistWords is a collection of tools and datasets for Python. Its goal is to quantify semantic change by evaluating word embeddings (PPMI, SVD, word2vec).
Hamilton, Leskovec, and Jurafsky (2016) use the word vectors made with HistWords to study the semantic evolution of more than 30,000 words across 4 languages. They claim their results illustrate two statistical laws that govern the evolution of word meaning:
Law of conformity: words that are used more frequently change less and have meanings that are more stable over time.
Law of innovation: words that are polysemous change at faster rates.
This seems to work well when applied to conveniently selected lexemes (Fig. 3.2).
How is such a neat visualization obtained?
To compare word vectors from different periods, the vectors must be aligned to the same coordinate axes.
Explicit vectors (such as those obtained with PPMI) are naturally aligned, as each column simply corresponds to a context word.
Implicit vectors (such as those obtained with SVD or SGNS) are not naturally aligned. SGNS vectors, for example, are obtained stochastically: each run of SGNS results in an arbitrary orthogonal transformation of the embedding space. Although this does not affect pairwise cosine similarities within a given period, it means you cannot compare the same word across time.
To solve this, Hamilton, Leskovec, and Jurafsky (2016) use orthogonal Procrustes to align the embeddings. Procrustes is a mythological bandit from Ancient Greece who attacked people by stretching them or cutting off their legs, so as to force them to fit the size of an iron bed (Fig. 3.3).
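In practice, the alignment boils down to a singular value decomposition. Here is a minimal sketch in R, where emb.t1 and emb.t2 are hypothetical placeholders for two embedding matrices with the same words in the same row order (e.g. one matrix per period):
# orthogonal Procrustes: rotate emb.t1 onto emb.t2
align.procrustes <- function(emb.t1, emb.t2) {
  s <- svd(t(emb.t1) %*% emb.t2) # SVD of the cross-covariance matrix
  Q <- s$u %*% t(s$v) # optimal orthogonal rotation
  emb.t1 %*% Q # emb.t1 aligned with emb.t2
}
Because Q is orthogonal, the rotation preserves all pairwise cosine similarities within each period; it only makes the coordinate axes comparable across periods.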
The neat visualization is obtained as follows:
Does this work for all linguistic units?
DSMs are great when it comes to handling word meanings. They do not perform so well when dealing with grammatical phenomena beyond the word. We cannot blame Hamilton, Leskovec, and Jurafsky (2016). This is rather an effect of the ‘Bag-of-Words’ approach, which ignores syntactic relations.
Why not combine SGNS vectors and the workaround adopted by Hilpert (2016)?
Space: The final frontier
These are the voyages of the Starship, Enterprise
Its 5-year mission
To explore strange new worlds
To seek out new life and new civilizations
To boldly go where no man has gone before.
sources:
The insertion of the adverb boldly between the infinitive marker to and the verb go caused quite a stir among prescriptivists at the time (ironically, the sexism of “no man” caused no such uproar). This usage is still branded as incorrect.
Arguing against splitting an infinitive makes sense in the context of Romance languages. For instance, no one would ever think of splitting the one-word French infinitive aller ‘go’ into al and ler. But in a Germanic language like English, to go is not a one-word verbal unit but the pairing of a former spatial preposition and the base form of a verb, separated by a space. Where there is a space there is a way, and speakers may be tempted to fill the gap (after all, language, like nature, abhors a vacuum). In fact, no descriptive grammar of English states that the adjacency between to and the base verb is obligatory.
Modern grammar textbooks such as Huddleston and Pullum’s A Student’s Introduction to English Grammar (2005) point out that “[p]hrases like to really succeed have been in use for hundreds of years.” They also claim that “in some cases placing the adjunct between to and the verb is stylistically preferable to other orderings.”
The data come from the Corpus of Historical American English (COHA), which consists of about 475 million word tokens and 115,000 texts. The corpus is balanced by genre across twenty decades, from the 1810s to the 2000s. It is perfect for seeing how American English has changed over the past two centuries (see this video).
split <- read.table("https://tinyurl.com/splitunsplit", header=T, sep="\t")
head(split)
## file_id genre decade year matches construction
## 1 7631 fic 1810s 1818 to_to properly_rr complete_vvi split
## 2 7168 fic 1820s 1827 to_to publicly_rr denounce_vvi split
## 3 7168 fic 1820s 1827 to_to maturely_rr weigh_vvi split
## 4 7168 fic 1820s 1827 to_to absolutely_rr demand_vvi split
## 5 7168 fic 1820s 1827 to_to madly_rr breast_vvi split
## 6 7169 fic 1820s 1827 to_to wholly_rr intercept_vvi split
## adverb verb
## 1 properly complete
## 2 publicly denounce
## 3 maturely weigh
## 4 absolutely demand
## 5 madly breast
## 6 wholly intercept
If you want to know how to make a data frame from an annotated corpus, please read Desagulier (2017, sec. 5.3) ;-)
split$construction <- as.factor(split$construction)
summary(split$construction)
## split unsplit
## 7196 61771
split$verb <- as.factor(split$verb)
summary(split$verb)
## be speak have act look go think
## 17925 1452 1019 877 769 731 709
## do say move live deal know talk
## 685 681 668 657 571 559 543
## become see get come walk make feel
## 538 498 460 396 381 361 344
## take understand remain pay stand keep grow
## 343 298 280 239 230 225 222
## consider rise sit write listen use answer
## 217 217 216 215 208 203 200
## appear fall meet give pass laugh operate
## 200 200 199 196 191 189 182
## respond determine follow leave proceed return increase
## 182 180 179 178 178 178 172
## run eat read state work smile play
## 171 167 167 164 162 161 159
## breathe turn rely report wait explain function
## 157 156 154 152 148 147 146
## show depend examine cry change judge find
## 145 144 144 141 139 137 136
## decide die rest produce stare accept enter
## 134 134 130 128 128 126 126
## perform develop start study stay compete learn
## 126 125 124 124 123 122 122
## mention ask bear spend tell sleep express
## 118 116 116 114 113 111 110
## fight buy appreciate describe behave travel apply
## 108 106 104 103 102 100 98
## dress (Other)
## 98 25640
I used a tweaked version of Stefan Gries’s Coll.analysis 3.5 script to select the most distinctive verbs of the split and unsplit constructions. The output is given below.
output.DCA <- read.table("https://tinyurl.com/outputDCA", header=T, sep="\t")
head(output.DCA)
## words freq.w.split freq.with.unsplit pref.occur coll.strength
## 1 understand 116 182 split 167.21
## 2 reduce 49 33 split 118.48
## 3 appreciate 52 52 split 102.60
## 4 change 53 86 split 73.97
## 5 enjoy 40 48 split 70.27
## 6 protect 24 13 split 63.44
We filter the most distinctive verb collocates and save them.
# install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
df.top.coll <- output.DCA %>%
filter(coll.strength >= 15)
nrow(df.top.coll)
## [1] 148
words <- unique(df.top.coll$words)
We load the SGNS vectors made from the COHA (1 GB!). I used a version of SGNS that is part of Levy, Goldberg, and Dagan (2015)’s hyperwords script collection for Python.1 This version allows the user to tune the hyperparameters.
To train SGNS on the whole COHA (1810s–2000s), I set the hyperparameters as follows:
# load all vectors (long)
all.vectors <- read.csv("https://tinyurl.com/cohasgns", header=F, sep=" ") # careful, this may bottleneck your cpu
colnames(all.vectors) <- c("words", paste("V", seq(1, 300,1), sep=""))
We subset the vectors of the most distinctive verbs.
# put the most distinctive verbs (the words vector created above) in a one-column data frame
verbs.df <- as.data.frame(words) # the single column is named "words"
# left join: keep only the vectors of the distinctive verbs (tidyverse is already loaded)
input.tsne <- left_join(verbs.df, all.vectors, by = "words")
We obtain the following:
input.tsne <- read.table("https://tinyurl.com/inputtsne", header=T, sep="\t")
input.tsne[1:5,1:5]
## verbs V1 V2 V3 V4
## 1 understand -0.015784338 -0.007878140 -0.040333690 0.03372044
## 2 reduce -0.067971590 -0.038886357 0.105956880 -0.03643693
## 3 appreciate 0.009629898 -0.043391414 -0.007136557 0.06818054
## 4 change 0.107165165 0.041376587 -0.119515960 -0.01054170
## 5 enjoy 0.030036286 -0.009865477 -0.035851136 0.14671306
We apply dimensionality reduction with \(t\)-SNE.
# install.packages("Rtsne")
library(Rtsne)
num_dims <- 300 # retain all 300 input dimensions in the initial PCA step
set.seed(7115) # t-SNE is stochastic; set a seed for reproducibility
rtsne_out <- Rtsne(input.tsne[,2:301], initial_dims=num_dims, max_iter=5000, perplexity=36)
We save the coordinates for the 2D plot from the output for later.
df.tsne <- as.data.frame(rtsne_out$Y)
colnames(df.tsne) <- c("dim1", "dim2")
df.tsne$verb <- input.tsne$verbs # add the verb labels
head(df.tsne)
## dim1 dim2 verb
## 1 0.59654580 3.084163 understand
## 2 -0.51275635 -3.740681 reduce
## 3 0.72215620 2.885243 appreciate
## 4 0.92457363 -5.032027 change
## 5 -0.06972965 2.450023 enjoy
## 6 0.57469809 -1.371691 protect
We plot the reference semantic vector space (SVS).
mar.default <- c(5,2,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
plot(jitter(rtsne_out$Y), t='n', main="", xlab="dimension 1", ylab="dimension 2", cex.lab=1.8, cex.axis=1.8, cex.main=1.8, cex.sub=1.8)
text(jitter(rtsne_out$Y), labels=input.tsne$verbs, cex=1.8)
We group decades in the original data frame into four periods:
The relevant periods can be determined empirically, as shown in this post: A data-driven approach to identifying development stages in diachronic corpus linguistics.
split$decade <- as.factor(gsub("1810s|1820s|1830s|1840s|1850s|1860s", "period_1", split$decade))
split$decade <- as.factor(gsub("1870s|1880s|1890s|1900s|1910s|1920s", "period_2", split$decade))
split$decade <- as.factor(gsub("1930s|1930s|1940s|1950s|1960s", "period_3", split$decade))
split$decade <- as.factor(gsub("1970s|1980s|1990s|2000s", "period_4", split$decade))
levels(split$decade)
## [1] "period_1" "period_2" "period_3" "period_4"
For each period, we calculate the co-occurrence frequencies between the constructions and the verbs and filter out hapax legomena.
split.periodized <- split %>%
group_by(decade) %>%
count(construction, verb) %>%
filter(n > 1)
head(split.periodized)
## # A tibble: 6 × 4
## # Groups: decade [1]
## decade construction verb n
## <fct> <fct> <fct> <int>
## 1 period_1 split abandon 2
## 2 period_1 split affect 3
## 3 period_1 split aid 2
## 4 period_1 split appreciate 2
## 5 period_1 split attend 2
## 6 period_1 split avoid 2
There is a lot of variation among the frequency values. We apply a binary-logarithm transformation and store the result in a new column.
split.periodized$density <- log2(split.periodized$n)
head(split.periodized)
## # A tibble: 6 × 5
## # Groups: decade [1]
## decade construction verb n density
## <fct> <fct> <fct> <int> <dbl>
## 1 period_1 split abandon 2 1
## 2 period_1 split affect 3 1.58
## 3 period_1 split aid 2 1
## 4 period_1 split appreciate 2 1
## 5 period_1 split attend 2 1
## 6 period_1 split avoid 2 1
Finally, we add the \(t\)-SNE output and drop NAs.
df.periodized <- left_join(split.periodized, df.tsne, by="verb")
nrow(df.periodized)
## [1] 4618
df.periodized <- df.periodized %>% drop_na()
head(df.periodized)
## # A tibble: 6 × 7
## # Groups: decade [1]
## decade construction verb n density dim1 dim2
## <fct> <fct> <chr> <int> <dbl> <dbl> <dbl>
## 1 period_1 split abandon 2 1 -0.327 0.284
## 2 period_1 split affect 3 1.58 0.958 -4.54
## 3 period_1 split appreciate 2 1 0.722 2.89
## 4 period_1 split become 3 1.58 3.29 3.86
## 5 period_1 split believe 2 1 -0.177 3.43
## 6 period_1 split break 2 1 -4.63 -0.306
We select the split data from the first period.
df.period.1 <- df.periodized %>%
filter(decade == "period_1") %>%
filter(construction == "split")
df.period.1 <- df.period.1[,c(6,7,5)]
head(df.period.1)
## # A tibble: 6 × 3
## dim1 dim2 density
## <dbl> <dbl> <dbl>
## 1 -0.327 0.284 1
## 2 0.958 -4.54 1.58
## 3 0.722 2.89 1
## 4 3.29 3.86 1.58
## 5 -0.177 3.43 1
## 6 -4.63 -0.306 1
We re-plot the reference SVS and make the contour plots indexed on the per-period logged frequencies with the kde2d() function from the MASS package.
postscript("~/path.to.your.folder/density.plot.split.period.1.eps", horizontal = FALSE, onefile = FALSE, paper = "special", height=18, width=24)
mar.default <- c(5,2,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
plot(jitter(rtsne_out$Y), t='n', main="", xlab="dimension 1", ylab="dimension 2", cex.lab=1.8, cex.axis=1.8, cex.main=1.8, cex.sub=1.8)
text(jitter(rtsne_out$Y), labels=input.tsne$verbs, cex=1.8)
# install.packages("MASS")
library(MASS)
df.period.1 <- as.data.frame(df.period.1)
bivn.kde.split.period.1 <- kde2d(df.period.1[,1], df.period.1[,2], h=df.period.1[,3], n = 500, lims = c(range(df.periodized$dim1), range(df.periodized$dim2)))
contour(bivn.kde.split.period.1, add = TRUE, col="dodgerblue", lwd=2)
dev.off()
We obtain the following plot.
We repeat the above for the remaining three periods.
df.period.2 <- df.periodized %>%
filter(decade == "period_2")
df.period.2 <- df.period.2[,c(6,7,5)]
head(df.period.2)
## # A tibble: 6 × 3
## dim1 dim2 density
## <dbl> <dbl> <dbl>
## 1 -0.327 0.284 1.58
## 2 0.958 -4.54 3
## 3 0.985 -4.81 1
## 4 0.722 2.89 4.52
## 5 -3.31 3.57 3.32
## 6 3.29 3.86 1
postscript("~/path.to.your.folder/density.plot.split.period.2.eps", horizontal = FALSE, onefile = FALSE, paper = "special", height=18, width=24)
mar.default <- c(5,2,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
plot(jitter(rtsne_out$Y), t='n', main="", xlab="dimension 1", ylab="dimension 2", cex.lab=1.8, cex.axis=1.8, cex.main=1.8, cex.sub=1.8)
text(jitter(rtsne_out$Y), labels=input.tsne$verbs, cex=1.8)
library(MASS)
df.period.2 <- as.data.frame(df.period.2)
bivn.kde.split.period.2 <- kde2d(df.period.2[,1], df.period.2[,2], h=df.period.2[,3], n = 500, lims = c(range(df.periodized$dim1), range(df.periodized$dim2)))
contour(bivn.kde.split.period.2, add = TRUE, col="dodgerblue", lwd=2)
dev.off()
df.period.3 <- df.periodized %>%
filter(decade == "period_3")
df.period.3 <- df.period.3[,c(6,7,5)]
head(df.period.3)
## # A tibble: 6 × 3
## dim1 dim2 density
## <dbl> <dbl> <dbl>
## 1 1.68 0.624 1
## 2 0.958 -4.54 2
## 3 0.985 -4.81 1
## 4 0.722 2.89 2
## 5 2.60 2.34 1.58
## 6 -3.31 3.57 2.58
postscript("~/path.to.your.folder/density.plot.split.period.3.eps", horizontal = FALSE, onefile = FALSE, paper = "special", height=18, width=24)
mar.default <- c(5,2,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
plot(jitter(rtsne_out$Y), t='n', main="", xlab="dimension 1", ylab="dimension 2", cex.lab=1.8, cex.axis=1.8, cex.main=1.8, cex.sub=1.8)
text(jitter(rtsne_out$Y), labels=input.tsne$verbs, cex=1.8)
library(MASS)
df.period.3 <- as.data.frame(df.period.3)
bivn.kde.split.period.3 <- kde2d(df.period.3[,1], df.period.3[,2], h=df.period.3[,3], n = 500, lims = c(range(df.periodized$dim1), range(df.periodized$dim2)))
contour(bivn.kde.split.period.3, add = TRUE, col="dodgerblue", lwd=2)
dev.off()
df.period.4 <- df.periodized %>%
filter(decade == "period_4")
df.period.4 <- df.period.4[,c(6,7,5)]
head(df.period.4)
## # A tibble: 6 × 3
## dim1 dim2 density
## <dbl> <dbl> <dbl>
## 1 -0.327 0.284 2
## 2 1.82 -3.80 2.81
## 3 1.68 0.624 1.58
## 4 4.88 0.876 1
## 5 -3.66 5.81 4.52
## 6 0.958 -4.54 3.58
postscript("~/path.to.your.folder/density.plot.split.period.4.eps", horizontal = FALSE, onefile = FALSE, paper = "special", height=18, width=24)
mar.default <- c(5,2,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
plot(jitter(rtsne_out$Y), t='n', main="", xlab="dimension 1", ylab="dimension 2", cex.lab=1.8, cex.axis=1.8, cex.main=1.8, cex.sub=1.8)
text(jitter(rtsne_out$Y), labels=input.tsne$verbs, cex=1.8)
library(MASS)
df.period.4 <- as.data.frame(df.period.4)
bivn.kde.split.period.4 <- kde2d(df.period.4[,1], df.period.4[,2], h=df.period.4[,3], n = 500, lims = c(range(df.periodized$dim1), range(df.periodized$dim2)))
contour(bivn.kde.split.period.4, add = TRUE, col="dodgerblue", lwd=2)
dev.off()
The IL construction subsumes four prepositional patterns that denote a relation of internal location between a located entity and a reference entity: in/at the middle of NP, in/at the center of NP, in/at the heart of NP, and in the midst of NP.
Two kinds of horizontal relations emerge:
The statistically computed semantic vector spaces provided by DSMs seem to offer psychologically relevant representations of word meaning.5
It could be argued that word2vec suffers from the ‘one vector per word’ issue (Desagulier 2019) because it produces type-based distributional representations, i.e. an aggregation of all the contexts encountered by a single word form.
In the case of homonymy, this is clearly a problem because the resulting vector conflates contexts that may have nothing to do with each other: e.g. bat as the stick used in baseball vs. the winged mammal. Very few such cases were found in our dataset.
In the case of polysemy, some linguists might be wary of the quality of the resulting vectors and turn to token-based distributional representations instead (Fonteyn 2021).
In theory, token-based DSMs such as BERT are optimal: token-based models do provide as many representations as there are contexts and do reflect semantic complexity. In practice, however, this comes at a huge computing cost. For this reason, most BERT models can only be trained on relatively small-sized corpora, unless one has access to a supercomputer.6
In the context of my research, and given the trade-off between computational cost and the requirement of semantic generalization, word2vec is a reasonable alternative, provided that due attention is paid to the hyperparameter setup.
Another aspect that makes word2vec a method still worth considering, despite the undeniable progress of token-based DSMs, is that it is not as much of a ‘black box’ as state-of-the-art token-based DSMs: it is possible to keep track of what the algorithm does at any stage. Conversely, most BERT distributions only offer access to the last two hidden layers. Even when access is granted, there is still discussion as to which layer is relevant to the semanticist (Mun 2021).
Finally, one may consider that the aggregated approach to meaning representation offered by type-based DSMs captures the core meaning of any given word type (Erk and Padó 2010).
All in all, the consistency of the above vector spaces suggests that ‘old-school’ predictive models have not said their last word.
https://github.com/williamleif/histwords/tree/master/sgns/hyperwords↩︎
This increases the probability of rare noise words slightly and improves the quality of the vector representations.↩︎
show that window size influences the vector representations significantly. Shorter context windows (e.g. \(\pm 2\)) tend to produce vector representations that foreground syntactic similarities, i.e. words that belong to the same parts of speech. Longer context windows (\(\geq \pm 5\)), on the other hand, tend to produce vector representations that foreground topical relations.↩︎
SGNS uses more negative examples than positive examples and therefore produces better representations when negative samples are greater than one. By setting negative samples to 15, we have 15 negative examples in the negative training set for each positive example.↩︎
This is at least the claimed contribution of two landmark DSMs: Hyperspace Analogue to Language (Lund and Burgess 1996) and Latent Semantic Analysis (Deerwester et al. 1990; Landauer and Dumais 1997). These frameworks do show how co-occurrence patterns can simulate psychological tasks at the lexical-semantic level, once post-processed with rather simple statistics. Bullinaria and Levy (2007) list studies that challenge this approach (Glenberg and Robertson 2000; see also French and Labiouse 2002). Their main argument is that theories based on word co-occurrence are symbolic and cannot hope to make contact with ‘real-world’ semantics.↩︎
For example, Fonteyn (2021) trains the base BERT model on a small portion of the COHA to create contextualized embeddings for over.↩︎