This course is designed to get you familiar with the R environment. First, it explains how to download and install R and R packages. It moves on to teach how to enter simple commands, use ready-made functions, and write user-defined functions. Finally, it introduces basic R objects: the vector, the list, the matrix, and the data frame. Although meant for R beginners, this chapter can be read as a refresher course by those readers who have some experience in R.

1 Downloads & setup

1.1 Downloads

Right now, you need:

Save the R code and R environment files in a folder that is easy to find! This means that the path to the folder should not be too long and should not contain spaces.

I am going to be using RStudio. I suggest you do the same.

In order to run RStudio, you need to have already installed R 2.11.1 or higher (preferably higher). You can download the most recent version of R for your environment from CRAN.

The setup is quick and easy.

1.2 Setup

Before proceeding further, you must set the working directory. The working directory is a folder which you want R to read data from and store output into. Because you are probably using this book as a companion in your first foray into quantitative linguistics with R, I recommend that you create a folder for this course close enough to the root of your OS. Use this folder as your working directory.

You will now enter your first R command. Commands are typed directly in the R console, right after the prompt >. To know your default working directory, enter the following:

getwd()
## [1] "/data/user/d/gdesagulier/cours/M2 TAL/1.R.fundamentals"

Set the working directory by entering the path to the desired directory/folder:

setwd("C:\\ling_outil") # Windows
setwd("/ling_outil") # Mac

Finally, make sure the working directory has been set correctly by typing getwd() again.1
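The steps above can be sketched end to end as follows. This is only an illustration under one assumption of mine: the practice folder is created inside R's temporary directory (with tempdir()) rather than near the root of your system, so that running the code leaves nothing behind. The folder name ling_outil is reused from the example above.

```r
# Remember the current working directory so that it can be restored afterwards
old_wd <- getwd()

# Build the path to a practice folder inside R's temporary directory
# (assumption: a temporary location is used only to keep the example harmless)
practice_dir <- file.path(tempdir(), "ling_outil")

# Create the folder if it does not exist yet
if (!dir.exists(practice_dir)) dir.create(practice_dir)

# Set the working directory, then check that the change took effect
setwd(practice_dir)
current_folder <- basename(getwd()) ; current_folder

# Go back to where we started
setwd(old_wd)
```

In your own sessions, replace practice_dir with the path to the folder you created for this course.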

2 R scripts

2.1 script files

Simple operations can be typed and entered directly into the R console. However, corpus linguistics generally involves a series of distinct operations whose retyping is tedious. To save you the time and trouble of retyping the same lines of code and to separate commands from results, R users type commands in a script and save the script in a file with a special extension (.r or .R). Thanks to this extension, your system knows that the file must be opened in R.

2.2 creating a script file

To create a script file, there are several options:

  • via the drop-down menu: File > New File > R script;
  • via the R GUI: click on the blank page;
  • via a text editor: create a new text file and save it using the .r or .R extension.2

R users store their scripts in a personal library so that they can later reuse whole scripts, or bits of them, for new tasks. I cannot but encourage you to do the same.

3 R packages

R comes equipped with pre-installed packages. They are part of base R. Most packages are external.

3.1 packages (what they are)

External packages are community-developed extensions in the form of libraries. They add functionalities not included in the base installation such as extra statistical techniques, extended graphical possibilities, data sets, etc.

As of November 2020, the CRAN package repository displays 16,458 available packages.3

Like R releases, package versions are regularly updated. Over time, some packages may become deprecated, and R will let you know with a warning message.

3.2 downloading packages

Packages can be downloaded in several ways. One obvious way is to use the drop-down menu from the R GUI: Packages & Data > Package Installer. Another way is to enter the following:

install.packages()

When you download a package for the first time, R prompts you to select a mirror, i.e. a CRAN-certified server, by choosing from a list of countries. Select the country that is the closest to you. R then displays the list of all available packages. Click on the one(s) you need, e.g. ggplot2. Alternatively, if you know the name of the package you need, just enter:

install.packages("ggplot2") # make sure you do not omit the quotes!

Finally, if you want to install several packages at the same time, type the name of each package between quotes, separate the names with a comma, and combine them with c():

install.packages(c("Hmisc", "FactoMineR"))

Note that R will sometimes install additional packages so that your desired package runs properly. These helping packages are known as dependencies. You do not need to install your packages again when you open a new R session.

In RStudio, the above works, but you may also want to use the drop-down menu: Tools > Install Packages. In the pop-up window, type the package name under Packages and check the box Install dependencies.

3.3 loading packages

Downloading a package means telling R to store it in its default library. The package is still inactive. You can check where a package is stored with find.package():

find.package("ggplot2")
## [1] "/opt/R/4.1.2/lib64/R/library/ggplot2"

You cannot use a package unless you load it into R. This is how you do it:

library(ggplot2) # this time, you do not need the quotes!

You will need to load the desired package(s) every time you start a new R session. Although this may seem tedious at first, it is in fact a good thing. After spending some time using R, you will have accumulated a large collection of packages, whose size would make R much slower if all were always active.
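A common habit is to check whether a package is already installed before installing it, so that a script can be rerun without downloading anything twice. The sketch below shows one possible way of doing this. The helper name is_installed is hypothetical, and I chose the packages stats and utils only because they ship with R, so running the code should not trigger any download.

```r
# Hypothetical helper: TRUE if the package can be found in your library
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Packages needed for the session (both are part of the standard R distribution)
needed <- c("stats", "utils")

# Keep only the names of the packages that are not installed yet
to_install <- needed[!vapply(needed, is_installed, logical(1))]

# Install what is missing, if anything
if (length(to_install) > 0) install.packages(to_install)

# Load a package for the current session
library(stats)
```

With your own list of packages, replace the content of needed, e.g. with "ggplot2".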

4 Simple commands

The simplest task that R can do is probably a simple arithmetic calculation such as an addition,

3+2
## [1] 5

a subtraction,

3-2
## [1] 1

a multiplication,

3*2 # note the use of the asterisk *
## [1] 6

a division,

3/2
## [1] 1.5

or an exponentiation.

3^2 # 3 to the power of 2, note the use of the caret ^
## [1] 9

In each example, the line that starts with ## is R’s answer. The comment sign # tells R not to interpret what follows as code. It is very convenient for commenting your code, as I have done twice above. The number between square brackets before the result is R’s way of indexing the output. In the above examples, the output consists of only one element. When the result is long, indexing becomes more useful:

1:30 # a sequence of numbers from 1 to 30 by increment of 1.
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30

Here, [1] indicates that 1 is the first element of the output, [26] that 26 is the twenty-sixth element.

Like any pocket calculator, R gives priority to multiplications and divisions over additions and subtractions. Compare:

10*3-2
## [1] 28

and

10*(3-2)
## [1] 10
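The same logic extends to exponentiation, which R carries out before multiplication and before the minus sign. Here is a short sketch of a few cases worth trying for yourself (the names r1 to r5 are mine and serve only to store the results):

```r
# Multiplication is carried out before addition
r1 <- 2 + 3 * 4 ; r1      # 14
# Brackets force the addition to be carried out first
r2 <- (2 + 3) * 4 ; r2    # 20
# Exponentiation is carried out before multiplication: 2 * 9
r3 <- 2 * 3^2 ; r3        # 18
# The caret is applied before the minus sign: -(2^2)
r4 <- -2^2 ; r4           # -4
# Brackets make the minus sign part of the base
r5 <- (-2)^2 ; r5         # 4
```

When in doubt, add brackets: they cost nothing and make your intention explicit.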

Interestingly, R has some built-in values and mathematical functions, many of which you will use often:

abs(-13) # absolute value
## [1] 13
pi # the ratio of the circumference of a circle to its diameter
## [1] 3.141593
sqrt(8) # square root
## [1] 2.828427
round(7.23125)
## [1] 7

You can nest these built-in functions:

round(sqrt(pi))
## [1] 2

As will soon appear, R is not just a powerful calculator. It is equally good at many other things such as processing text, data structures, generating elaborate graphs, doing statistics, etc.

5 Variables and assignment

Without a way of storing intermediate results, no programming language could work. This is why all programming languages use variables. A variable is a named data structure stored in the computer’s working memory. It consists of alphanumeric code to which some programming data is assigned. For instance, you may store someone’s age (e.g. 40) in a variable named age.

age <- 40

From now on, each time you type age, the language interprets the variable as standing for its value, i.e. 40.

age
## [1] 40

Like all programming languages, R stores programming data in variables via assignment. In the example below, R assigns the numeric value 3 to the variable a and the numeric value 2 to the variable b thanks to <-. The assignment operator <- is a combination of the “less than” symbol followed by the hyphen with no space between them. The result is a left-facing arrow.4 Note that you can separate two or more commands with a semicolon ;.

a <- 3 ; a
## [1] 3
b <- 2 ; b
## [1] 2

From now on, each variable stands for and behaves like its value.

a+1
## [1] 4
a/b
## [1] 1.5

Variable names must follow a specific syntax. You are free to choose the variable name, provided there is no space in it. For example, example_sum is ok, but example sum is not.

Also, you had better keep the variable names short, to save time in case you need to retype them later.

R is case-sensitive, which means the variable a is different from the variable A.

Regarding the assignment operator, it does not matter if you place a space before and after <-. However, < and - should never be separated with a space.
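To see why these last two points matter, here is a small sketch. The values and the name is_less are arbitrary choices of mine.

```r
# R is case-sensitive: a and A are two distinct variables
a <- 3
A <- 30
a + A              # 33

# Spaces around the operator do not matter...
x<-5
x <- 3             # x is now 3

# ...but a space inside the operator changes the meaning:
# R reads the line below as "is x less than -3?" and assigns nothing
is_less <- x < - 3 ; is_less   # FALSE
x                  # still 3
```

The last command does not raise an error, which makes this kind of slip easy to miss.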

6 Functions and arguments

In R, you can use ready-made functions or user-defined functions. Let us start with ready-made functions.

6.1 Ready-made functions

Functions are preset instructions to R. You recognize a function by its name followed by a pair of brackets, e.g. somefunctionname(). A function takes arguments, i.e. elements to which the instruction will be applied and specifications as to how to apply these instructions. Arguments appear between the brackets of the function. How many and what kinds of arguments a function takes depends on the function. To access that information, just type help() and the name of the function in the brackets or just a question mark ? followed by the function name with no space and no brackets (e.g. ?mean).5 All functions validated by CRAN have a help page.

In the example below, mean() is a function for the arithmetic mean. When you type ?mean, R opens a help window for the function. You see that its default method takes three arguments expressed in the form mean(x, trim = 0, na.rm = FALSE, ...). The first argument (x) is an R object such as a numeric vector of \(n\) observations. The second argument (trim) is “the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.” The third argument (na.rm) is “a logical value indicating whether NA values should be stripped before the computation proceeds.” In the example below, we ask R to compute the arithmetic mean of a sequence of numbers from 1 to 20 incremented by one, with no trimming and no removal of NA values (since there aren’t any). Arguments are specified explicitly:

mean(1:20, trim = 0, na.rm = FALSE) 
## [1] 10.5

When the argument names are unspecified, R processes the arguments in the default order it expects them to appear:

mean(1:20, 0, FALSE)
## [1] 10.5

Here, R understands that the first argument is numeric, the second argument is the value of trim, and the third argument is the logical value of na.rm. As seen above, the default value of trim is 0 and the default value of na.rm is FALSE. Since we do not change the default settings of these last two arguments, we do not even need to include them:

mean(1:20)
## [1] 10.5
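To see what trim and na.rm actually do, you need values for which the default settings make a difference. The sketch below uses two made-up vectors (scores and with_gap are hypothetical names and values). It also shows that named arguments can be given in any order.

```r
# A made-up vector with one extreme value
scores <- c(1, 2, 3, 100)
m_plain <- mean(scores) ; m_plain                   # 26.5

# trim = 0.25 drops one observation from each end before averaging 2 and 3
m_trim <- mean(scores, trim = 0.25) ; m_trim        # 2.5

# When arguments are named, their order no longer matters
m_named <- mean(trim = 0.25, x = scores) ; m_named  # 2.5

# A made-up vector with a missing value
with_gap <- c(1, NA, 3)
m_gap <- mean(with_gap) ; m_gap                     # NA
m_clean <- mean(with_gap, na.rm = TRUE) ; m_clean   # 2
```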

6.2 User-defined functions

In R, you can create your own user-defined function. Working with such a function involves three steps:

  • defining the function
  • loading the function
  • running the function

You may add a fourth step: fine-tuning the function.

6.2.1 defining the function

All functions follow this format:

function_name <- function(arg1, arg2, ... ){
  statements # some code
  return(object) # the value returned by the function, i.e. the result
}

Here, function_name is the name of your function, which you are free to choose as long as it is not the name of a preset function (e.g. sum, function, mean, plot, etc.).

The function function() takes arguments, here arg1, arg2. Use as many arguments as your function needs to run. You can specify a default value for a given argument. For instance, arg3=pi would stipulate that the third argument takes the default value \(3.141593\) unless specified otherwise. If you do not specify a default value, R expects you to specify one when you run the function.

The dots ... allow additional arguments to be passed to your function and, from there, on to other functions or methods. For example, they are a convenient way of embedding other functions in your own function.

The curly brackets { } delimit the code that the body of your function consists of.

Finally, the last line of code is the return value, that is to say the result of the function.
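Here is a minimal sketch of the template with one default value. The function circle_area and its argument names are hypothetical: they are mine and only serve to illustrate the format.

```r
# Hypothetical function: the area of a circle of a given radius.
# The second argument has a default value (pi); the first one has none.
circle_area <- function(radius, constant = pi) {
  area <- constant * radius^2   # the body of the function
  return(area)                  # the value returned by the function
}

# The default value of constant is used when the argument is omitted
area_default <- circle_area(1) ; area_default        # 3.141593

# The default value can be overridden
area_rough <- circle_area(2, constant = 3) ; area_rough   # 12
```

Because radius has no default value, calling circle_area() with no argument at all makes R stop with an error.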

Suppose you want to create a very simple function that determines what percentage a given value x is of another value y. As you may have guessed, this function takes two arguments: the value whose percentage you want to determine (x, the numerator) and the value used as a reference point (y, the denominator). Let us call this function percentage(). This is how you write it:

percentage <- function(x, y, ...){
  result <- (x*100)/y
  result
}

If you want to reuse the function, copy and paste it in a text file and save the file using the following name and extension: function_name.r. Right now, I advise you to store the function percentage() in a file named mypercentage.r in your working directory.

6.2.2 loading the function

Now that your function is defined, it is time to load it. You have two options: you can either copy and paste the code of your function into the R console or you can source your function from a file using source(). If you adopt the second option, you have two more options. The first option is to prompt R to open a window so as to select the file interactively.

source(file=file.choose())

The second option is to enter the path to your function file as an argument of source() with quotes:

source("function_name.r")

These two options are available whenever a function has a file argument.

6.2.3 running the function

Now that your function is loaded into R, you can run it. Suppose you want to know what percent 24 is of 256. Given how you assigned arguments in your function, \(24\) should be the first argument (x) and \(256\) should be the second argument (y). These arguments should follow this order between the function’s brackets:

percentage(24, 256)
## [1] 9.375
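The order only matters when the arguments are left unnamed. As a quick sketch (the function is repeated so that the code runs on its own, and p1 to p3 are names of my own choosing):

```r
# Same definition as above
percentage <- function(x, y, ...){
  result <- (x*100)/y
  result
}

# Unnamed arguments: R relies on their position
p1 <- percentage(24, 256) ; p1           # 9.375

# Named arguments: the order is free
p2 <- percentage(y = 256, x = 24) ; p2   # 9.375

# Swapping unnamed arguments asks a different question:
# what percentage is 256 of 24?
p3 <- percentage(256, 24) ; p3
```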

6.2.4 fine-tuning the function

Your function works fine, but now that it is stored safely on your computer, I suggest we fine-tune it so that its output looks nicer. First, let us reduce the number of decimals to two with round(). Minimally, this function takes two arguments. The first argument is the numeric value you wish to round, and the second argument is the number of decimal places. All we need to do is take the result, which is already saved in the named data structure result, place it in first-argument position in round(), and set the second argument to 2 to tell R we only want two decimals:

rounded_res <-round(result, 2) # reduce the number of decimals to 2

Next, let us embed the result in some explanatory text, such as “x is X.XX percent of y.” To do it, we can use the function cat(), which prints multiple objects, one after the other. As the argument sep=" " indicates, each object is separated by a space.

cat(x, "is", rounded_res,"percent of", y, sep=" ")

Because x, rounded_res, and y are named data structures, cat() will print their respective values. On the other hand, "is" and "percent of" are character strings, as signaled by the quotes. The values and the text elements are separated by a space, as specified in the sep argument (i.e. there is a space between the quotes). Now, the function looks like this:

percentage <- function(x, y, ...){
  result <- (x*100)/y                # compute the percentage
  rounded_res <- round(result, 2)    # keep two decimals
  cat(x, "is", rounded_res, "percent of", y, sep=" ")   # print the result inside a sentence
}

Let us run it again to see the changes:

percentage(24,256)
## 24 is 9.38 percent of 256

The output is definitely more user friendly.
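One thing to keep in mind is that cat() only prints: a function that ends with cat() does not hand back a number that you can store and reuse. If you need both the sentence and the value, one possible variant (a sketch of mine, named percentage2 to keep it apart from the function above) prints the sentence and then returns the rounded value invisibly with invisible():

```r
# Hypothetical variant of percentage(): prints a sentence AND returns the value
percentage2 <- function(x, y, ...){
  result <- (x*100)/y
  rounded_res <- round(result, 2)
  cat(x, "is", rounded_res, "percent of", y, "\n")   # "\n" ends the line
  invisible(rounded_res)   # returned, but not printed automatically
}

# The sentence is printed, and the value is stored in p
p <- percentage2(24, 256)
p         # 9.38
p * 2     # 18.76
```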

User-defined functions are interesting to corpus linguists because they can save a lot of time when it comes to repeating identical operations.

7 R objects

This section introduces four main kinds of R objects: vectors, lists, matrices, and data frames. I will present each of them in turn. I will also introduce a fifth object (factors) when I discuss data frames.

7.1 Vectors

The vector is the most basic R object. It is an ordered sequence of data elements. The three types of vectors discussed here are numeric vectors, logical vectors, and character vectors.

7.1.1 character vectors

Character vectors are strings of characters. They are delimited by single or double quotes. I use double quotes:

char_vec <- "once upon a time"
char_vec # print
## [1] "once upon a time"

Outside linguistics and text analysis, character vectors are mostly used for labels (e.g. in plots). Linguists exploit character vectors more systematically in corpus compiling and corpus exploration. An interesting property of character vectors that contain alphabetic characters is that you can easily convert the strings between lower case and upper case. This is done with two functions: toupper() and tolower().

char_vec <- toupper(char_vec) ; char_vec # converting char_vec from lower case to upper case
## [1] "ONCE UPON A TIME"
char_vec <- tolower(char_vec) ; char_vec # converting char_vec back to lower case
## [1] "once upon a time"

7.1.2 numeric vectors

Obviously, numeric vectors contain numbers, without quotes.

num_vec <- 10
num_vec
## [1] 10

7.1.3 logical vectors

Logical vectors contain Boolean values, namely the values TRUE and FALSE, without quotes.

logi_vec <- FALSE
logi_vec
## [1] FALSE

When there are no quotes, R recognizes FALSE as a logical value. This value can be abbreviated to F:

logi_vec <- F
logi_vec
## [1] FALSE

7.1.4 switching between vector modes

One interesting feature regarding vectors (and parts of other R objects) is mode conversion thanks to three functions: as.numeric(), as.character(), and as.logical(). A character vector can be converted into a numeric vector if it contains elements that can be treated as numeric values. Below, three vectors are concatenated into the vector v by means of the function c().

v <- c("3", "2", "1")
v
## [1] "3" "2" "1"

You can verify that v has indeed been recognized as a character vector by displaying its structure with either mode(),

mode(v)
## [1] "character"

class(),

class(v)
## [1] "character"

or, better, str() (for “structure”).

str(v) # display the vector mode of v + its internal structure
##  chr [1:3] "3" "2" "1"

You may also ask R if v is a character vector with is.character():

is.character(v)
## [1] TRUE

As you may have guessed from its syntax, this question is actually a function. R answers TRUE, which is its way of saying “yes” (and FALSE means “no”). You may now proceed to the conversion and store it in R’s memory by assigning as.numeric(v) to the vector v:

v <- as.numeric(v) ; v
## [1] 3 2 1

The functions str(v) and is.numeric(v) confirm that the conversion has been successful.

str(v)
##  num [1:3] 3 2 1
is.numeric(v)
## [1] TRUE

If the vector contains a mix of numeric and non-numeric values, the conversion will only apply to those elements that can be converted into numeric values. For the other elements of the vector, NA values will be generated (NA stands for “Not Available”), and R will let you know with a warning:

v2 <- c("3","2","d", "f")
v2 <- as.numeric(v2) ; v2
## Warning: NAs introduced by coercion
## [1]  3  2 NA NA
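In practice, you will often want to know which elements could not be converted. Here is one possible way of finding out with is.na(), which returns TRUE for each NA value. The names raw, converted, and failed are mine. I wrap the conversion in suppressWarnings() only because the warning is expected here.

```r
raw <- c("3", "2", "d", "f")

# Convert, silencing the warning that we already know about
converted <- suppressWarnings(as.numeric(raw))

# TRUE wherever the conversion produced NA
failed <- is.na(converted) ; failed   # FALSE FALSE TRUE TRUE

# Use the logical vector to retrieve the problematic elements
raw[failed]                           # "d" "f"

# NA values can be left out of a computation with na.rm = TRUE
total <- sum(converted, na.rm = TRUE) ; total   # 5
```

Note that this check cannot tell a failed conversion from a value that was already missing in the original vector.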

A character vector can also be converted into a logical vector if it contains the following character strings: “TRUE,” “FALSE,” “T,” “F,” “true,” or “false.” R recognizes these character strings and converts them into their Boolean equivalents. Other strings, such as “NA” or “na,” are converted into NA.

v3 <- c("TRUE", "FALSE", "T", "F", "true", "false", "NA", "na")
v3 <- as.logical(v3) ; v3
## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE    NA    NA

To check that the vector is now logical, use the is.logical() function.

is.logical(v3)
## [1] TRUE

A numeric vector can be converted into a character vector:

v4 <- c(3, 2, 1)
str(v4)
##  num [1:3] 3 2 1
v4 <- as.character(v4)
str(v4)
##  chr [1:3] "3" "2" "1"

The numbers now appear between quotes, which means they have been transformed into characters. A numeric vector can also be transformed into a logical vector. In this case, 0 will be interpreted as FALSE and all the other numbers as TRUE:

v5 <- c(0,1,2,3,10,100,1000)
str(v5)
##  num [1:7] 0 1 2 3 10 100 1000
v5 <- as.logical(v5)
str(v5)
##  logi [1:7] FALSE TRUE TRUE TRUE TRUE TRUE ...

When a logical vector is converted into a character vector, the Boolean values TRUE and FALSE are transformed into character strings, as evidenced by the presence of quotes. NA remains unchanged because R thinks you want to signal a missing value:

v6 <- c(TRUE, TRUE, T, F, FALSE, NA)
str(v6)
##  logi [1:6] TRUE TRUE TRUE FALSE FALSE NA
v6 <- as.character(v6)
str(v6)
##  chr [1:6] "TRUE" "TRUE" "TRUE" "FALSE" "FALSE" NA

When a logical vector is converted into a numeric vector, TRUE becomes 1 and FALSE becomes 0. NA remains unchanged for the same reason as above:

v7 <- c(TRUE, TRUE, T, F, FALSE, NA)
str(v7)
##  logi [1:6] TRUE TRUE TRUE FALSE FALSE NA
v7 <- as.numeric(v7)
str(v7)
##  num [1:6] 1 1 1 0 0 NA
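This conversion is more useful than it looks, because R applies it automatically inside arithmetic functions. As a sketch, suppose a made-up logical vector records whether a search word was found in each of four texts (the name found and the values are hypothetical):

```r
# Was the search word found in text 1, 2, 3, 4? (made-up values)
found <- c(TRUE, FALSE, TRUE, TRUE)

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
n_found <- sum(found) ; n_found        # 3

# ...and mean() gives the proportion of TRUE values
prop_found <- mean(found) ; prop_found # 0.75
```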

Before running a concatenation, the function c() coerces its arguments so that they belong to the same mode. When there is a competition between several modes, the character mode is always given the highest priority,

v8 <- c(0, 1, "sixteen", TRUE, FALSE) # numeric, character, and logical values
str(v8)
##  chr [1:5] "0" "1" "sixteen" "TRUE" "FALSE"

and the numeric mode has priority over the logical mode.

v9 <- c(0,1,16,TRUE, FALSE) # numeric and logical values
str(v9)
##  num [1:5] 0 1 16 1 0

7.1.5 vector length

7.1.5.1 length()

Vectors have a length, which you can measure with the function length(). It takes as argument a vector of any mode and length and returns a numeric vector of length 1.6 The vectors char_vec, num_vec, and logi_vec created above are all of length 1 because each consists of one element.

length(char_vec)
## [1] 1
length(num_vec)
## [1] 1
length(logi_vec)
## [1] 1

If you are new to R, you may find the result surprising for char_vec because the vector consists of four words separated by spaces. You may expect a length of 4. Yet, length() tells us that the vector consists of one element. This is because R does not see four words, but a single character string. Because R does not make an a priori distinction between spaces and word characters (or letters), we are going to have to tell R what we consider a word is at some point.7

7.1.5.2 nchar()

R has a function to count the number of characters in a string: nchar(), which minimally takes the vector as an argument. If nchar() treated only word characters (letters) as characters, nchar(char_vec) would return 13 (4 + 4 + 1 + 4). Since nchar() also treats the three spaces as characters, nchar(char_vec) returns 16.

nchar(char_vec)
## [1] 16
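If you want R to count words rather than characters, you have to tell it where words begin and end. One simple possibility, sketched below, is to split the string at each space with strsplit(). This is only a first approximation of what a word is, and the name words is mine.

```r
char_vec <- "once upon a time"

# Split the string at each space; strsplit() returns a list,
# and [[1]] extracts the result for the first (and only) string
words <- strsplit(char_vec, " ")[[1]] ; words

length(words)        # 4: one element per word
nchar(words)         # 4 4 1 4: the number of characters in each word
sum(nchar(words))    # 13: word characters only
nchar(char_vec)      # 16: word characters + 3 spaces
```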

7.1.5.3 c()

To sum up, char_vec is a vector of length 1 that consists of 16 characters. For char_vec to be a vector of length 4, it would have to be the combination of four vectors of length 1.

To combine n vectors of length 1 to one vector of length n, use the function c() (for “combine”). Let us create these vectors:

vector1 <- "a"
vector2 <- "character"
vector3 <- "vector"
vector4 <- c(vector1, vector2, vector3)
vector4 
## [1] "a"         "character" "vector"
# or, faster:
vector4 <- c("a","character","vector")
vector4
## [1] "a"         "character" "vector"

This time, the length of vector4 is 3:

length(vector4)
## [1] 3

c() also works with numeric and logical vectors:

vector5 <- c(1,2,3,4,5,10,100,1000) # a numeric vector of length 8
length(vector5)
## [1] 8
vector6 <- c(TRUE, TRUE, FALSE, FALSE, NA) # a logical vector of length 5
length(vector6)
## [1] 5

7.1.5.4 paste()

The function paste() is specific to character vectors. Its first argument is one or more R objects to be converted to character vectors. Its second argument is sep, the character string used to separate the vector elements (the space is the default).

To merge several character vectors of length 1 into a single character vector of length 1, sep must have its default setting, i.e. a space:

paste("now", "is", "the", "winter", "of", "our", "discontent", sep=" ")
## [1] "now is the winter of our discontent"

The third, optional, argument of paste() is collapse, the character string used to separate the elements when you want to convert a vector of length n into a vector of length 1. In this case, you must set collapse to a space because the default is NULL.

richard3 <-  c("now", "is", "the", "winter", "of", "our", "discontent")
paste(richard3, collapse=" ")
## [1] "now is the winter of our discontent"
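The two arguments sep and collapse are easier to tell apart when they are set to something other than a space. Here is a short sketch with made-up strings (the names p1 to p3 are mine):

```r
# sep is inserted between the arguments of paste()
p1 <- paste("corpus", "linguistics", sep = "_") ; p1
## "corpus_linguistics"

# paste0() is a shortcut for sep = ""; shorter arguments are recycled
p2 <- paste0("file", 1:3, ".txt") ; p2
## "file1.txt" "file2.txt" "file3.txt"

# sep joins the vectors element by element, then collapse merges
# the resulting elements into a single string
p3 <- paste(c("a", "b"), c("x", "y"), sep = "-", collapse = " + ") ; p3
## "a-x + b-y"
```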

7.1.6 manipulating vectors

Creating a vector manually is simple, and you already know the drill:

  • for a vector of length 1, enter a character/numeric/logical value and assign it to a named vector with <-
  • for a vector of length n>1, concatenate character/numeric/logical values with c() and assign them to a named vector with <-

Let me show you a few more functions that come in handy when you create or manipulate vectors.

7.1.6.1 seq()

seq() allows you to create a numeric vector that contains a regular sequence of numbers. Minimally, it takes two arguments. The first argument (from) is the starting point of the sequence. The second argument (to) is the endpoint of your sequence. By default, the increment is 1.

# generate a sequence of ten integers comprised between 1 and 10:
seq(1,10) 
##  [1]  1  2  3  4  5  6  7  8  9 10

Here is an equivalent:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

You can change the increment by adding a third argument (by). Let us set it to 2:

seq(1,10,2)
## [1] 1 3 5 7 9
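Two more variations may come in handy: a negative increment for a descending sequence, and the argument length.out when you know how many values you want rather than the size of the increment. A quick sketch (s1 and s2 are names of my own choosing):

```r
# A descending sequence: the increment must be negative
s1 <- seq(10, 1, by = -3) ; s1
## 10 7 4 1

# Five evenly spaced values between 0 and 1
s2 <- seq(0, 1, length.out = 5) ; s2
## 0.00 0.25 0.50 0.75 1.00
```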

7.1.6.2 rep()

The function rep() replicates the value(s) in its first argument, which can be a vector. The second argument specifies how many times the first argument is replicated.

rep("here we go", 3)
## [1] "here we go" "here we go" "here we go"

Above, the character vector of length 1 "here we go" is replicated three times in a vector of length 3. Do you see what happens when a sequence is embedded into a replication?

rep(1:3, 3)
## [1] 1 2 3 1 2 3 1 2 3
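The whole sequence is repeated three times. If you would rather repeat each element in turn, rep() has an argument each. The argument times also accepts a vector, in which case each element gets its own number of replications. A brief sketch (r_each and r_times are my own names):

```r
# Each element is repeated twice before moving on to the next one
r_each <- rep(1:3, each = 2) ; r_each
## 1 1 2 2 3 3

# "a" is repeated twice, "b" three times
r_times <- rep(c("a", "b"), times = c(2, 3)) ; r_times
## "a" "a" "b" "b" "b"
```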

7.1.6.3 sort()

Vectors can be sorted with sort().8 Suppose you have a vector containing the names of some major linguists.

linguists <- c("Langacker", "Chomsky", "Lakoff", "Pinker", "Jackendoff", "Goldberg")

To sort the names in ascending alphabetical order, just place linguists in argument position.

sort(linguists)
## [1] "Chomsky"    "Goldberg"   "Jackendoff" "Lakoff"     "Langacker" 
## [6] "Pinker"

If you want to sort the names in descending alphabetical order, add decreasing=TRUE.

sort(linguists, decreasing=TRUE)
## [1] "Pinker"     "Langacker"  "Lakoff"     "Jackendoff" "Goldberg"  
## [6] "Chomsky"

Suppose now that you have a vector with the years of birth of the participants in a psycholinguistic experiment.

years <- c(1990, 1995, 1988, 1961, 1937, 1992, 1976, 1977)

Again, use sort() to sort the years in ascending or descending order.

sort(years) # ascending order
## [1] 1937 1961 1976 1977 1988 1990 1992 1995
sort(years, decreasing = TRUE) # descending order
## [1] 1995 1992 1990 1988 1977 1976 1961 1937

When a vector is of length n>1, you can easily extract one or several elements from the vector. Extraction is made possible thanks to how R indexes the internal structure of vectors. It is done thanks to square brackets []. Let us first create a character vector of length 6:

v10 <- c("I", "bet", "you", "already", "like", "R")

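Each element of the vector is indexed by its position, starting from 1: "I" is in position 1, "bet" in position 2, "you" in position 3, "already" in position 4, "like" in position 5, and "R" in position 6.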

To extract the fourth element, which we expect to be "already", all we need to do is specify the index:

v10[4]
## [1] "already"

To extract several parts of the vector, embed c() and make your selection:

v10[c(1, 5, 6)]
## [1] "I"    "like" "R"
v10[c(3, 5, 6)]
## [1] "you"  "like" "R"

You can also rearrange a selection of vector elements:

v10[c(1, 5, 3)]
## [1] "I"    "like" "you"

The same technique also allows for the substitution of vector elements:

v10[c(2,5)] <- c("know", "love"); v10
## [1] "I"       "know"    "you"     "already" "love"    "R"
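Square brackets accept other kinds of selections too. The sketch below recreates the vector with its original values and shows three of them: a range of positions, a negative index (which excludes elements instead of selecting them), and the use of length() to reach the last element whatever the length of the vector. The names middle, no_first, and last are mine.

```r
# The vector as it was first created
v10 <- c("I", "bet", "you", "already", "like", "R")

# A range of positions
middle <- v10[2:4] ; middle
## "bet" "you" "already"

# A negative index drops the corresponding elements
no_first <- v10[-c(1, 2)] ; no_first
## "you" "already" "like" "R"

# The last element
last <- v10[length(v10)] ; last
## "R"
```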

7.1.6.4 names()

An interesting property of vector elements is that you can name them with the function names(). This function does not appear to the right of <- but to its left. Its argument is the vector whose elements you want to name.

v11 <- c("thank you", "merci", "danke", "grazie", "gracias")
names(v11) <- c("English", "French", "German", "Italian", "Spanish")
v11
##     English      French      German     Italian     Spanish 
## "thank you"     "merci"     "danke"    "grazie"   "gracias"

You can use the names to extract vector elements.

v11[c("French", "Spanish")]
##    French   Spanish 
##   "merci" "gracias"

7.1.6.5 recycling

When two vectors of different lengths are involved in an operation, R recycles the values of the shorter vector until the length of the longer vector is matched. This is called recycling. Do you understand what happens below?

c(1:5)*2
## [1]  2  4  6  8 10

The shorter vector 2 is recycled to multiply each element of the sequence vector by 2. Here is another example where the length of the shorter vector is greater than 1. Again, do you understand what R does?

c(1:5)*c(2,3)
## Warning in c(1:5) * c(2, 3): longer object length is not a multiple of shorter
## object length
## [1]  2  6  6 12 10

R multiplies each element of the sequence by either 2 or 3, in alternation: \(1\times 2\), \(2\times 3\), \(3\times 2\), \(4\times 3\), and \(5\times 2\).
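You can make the recycling visible by stretching the shorter vector yourself with rep_len(), which repeats a vector until a given length is reached. This is only a sketch meant to show what R does behind the scenes; the names short and stretched are mine.

```r
short <- c(2, 3)

# What R does implicitly: repeat the shorter vector up to length 5
stretched <- rep_len(short, 5) ; stretched
## 2 3 2 3 2

# Same result as c(1:5)*c(2,3), without the warning
res5 <- 1:5 * stretched ; res5
## 2 6 6 12 10

# No warning either when the longer length is a multiple of the shorter one
res6 <- 1:6 * c(2, 3) ; res6
## 2 6 6 12 10 18
```

The warning is therefore a useful safeguard: it appears when the shorter vector cannot be recycled a whole number of times, which often signals a mistake.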

7.1.7 logical operators

Thanks to logical operators, you can tell R to decide which vector elements satisfy particular conditions.9

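Here are the most common logical operators:

  • < (less than) and > (greater than);
  • <= (less than or equal to) and >= (greater than or equal to);
  • == (equal to) and != (not equal to);
  • ! (not), & (and), and | (or).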

Often, logical operators are used as arguments of the function which(), which outputs the positions of the elements in the vector that satisfy a given condition.

v12<-seq(0,20,2) ; v12 # a sequence from 0 to 20 with increment 2
##  [1]  0  2  4  6  8 10 12 14 16 18 20
which(v12>=6) # which element is greater than or equal to 6?
## [1]  4  5  6  7  8  9 10 11

Bear in mind that which() outputs the positions of the elements that satisfy the condition “greater than or equal to 6,” not the elements themselves.

which(v12 < 4 | v12 > 14) # which elements of v12 are less than 4 OR greater than 14?
## [1]  1  2  9 10 11
which(v12 > 2 & v12 < 10) # which elements of v12 are greater than 2 AND less than 10?
## [1] 3 4 5
which(v12!=0) # which elements of v12 are not 0?
##  [1]  2  3  4  5  6  7  8  9 10 11

Logical operators also work with character vectors.

which(v11!="grazie") # which elements of v11 are not "grazie"
## English  French  German Spanish 
##       1       2       3       5
which(v11=="danke"|v11=="grazie") # which elements of v11 are "danke" or "grazie"
##  German Italian 
##       3       4

Each time, the names are preserved. They appear on top of the relevant vector positions.
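If it is the elements themselves that you are after, rather than their positions, you can place the condition directly between square brackets. The sketch below recreates v12 and compares the two approaches (the names is_big, positions, elements, n_big, and small are mine):

```r
v12 <- seq(0, 20, 2)

# The condition on its own returns a logical vector, one value per element
is_big <- v12 >= 6

# which() turns the TRUE values into positions...
positions <- which(is_big) ; positions
## 4 5 6 7 8 9 10 11

# ...whereas square brackets return the elements themselves
elements <- v12[is_big] ; elements
## 6 8 10 12 14 16 18 20

# sum() counts how many elements satisfy the condition
n_big <- sum(is_big) ; n_big
## 8

# ! negates a condition: elements that are NOT greater than 2
small <- v12[!(v12 > 2)] ; small
## 0 2
```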

7.1.8 loading vectors

If the vector is saved in an existing file, you can load it using scan(). In the example below, we assume the vector to be loaded is a character vector.

load_v <- scan(file=file.choose(), what="char", sep="\n")

The function file.choose() opens a window to select the file interactively.

The argument file can also be a path. A pathname (also known as a path) is the location where a computer file or any other object is located. Suppose that the text file <myfile.txt> is located in a folder named mysubfolder, which is itself located inside another folder named myfolder, which is itself located at the root of your hard drive (on Mac OS, the root folder is called Macintosh HD and is simply named /, whereas on Windows, it is C:). To access the text file, you provide its full path.

On a Mac, the full path is likely to be the following: /myfolder/mysubfolder/myfile.txt. On a PC running on Windows, the path is likely to be the following: C:/myfolder/mysubfolder/myfile.txt.

The code below opens an example character vector stored on a server in the file example_character_vector.txt:

load_v <- scan(file="https://bit.ly/3mlXPoY", what="char", sep="\n")
load_v
## [1] "<html>"                                    
## [2] "<head>"                                    
## [3] "\t<meta charset=\"UTF-8\">"                
## [4] "\t<title></title>"                         
## [5] "</head>"                                   
## [6] "<body>"                                    
## [7] "Le fichier souhaité n'est plus disponible!"
## [8] "</body></html>"

If the file were stored locally (on your computer), this is an example of what the path would look like on macOS:

load_v <- scan(file="/CLSR/chap2/example_character_vector.txt", what="char", sep="\n")

A note on entering pathnames on Windows. One difficulty that I had to address during the writing stage is the difference between PCs running on Windows and Macs running on macOS with respect to pathnames.

If you are a Windows user, you may have to convert the single slash to a pair of backslashes whenever you have to enter a file path, like so: C:\\myfolder\\mysubfolder\\myfile.txt

If you think that entering pathnames is tedious, you can still use file.choose(), which will prompt R to open an interactive window from which you can select the desired file. You can also set the working directory to where your data are. By doing so, you will only need to write the name of the file and its extension.
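One way to sidestep the slash issue is to let R assemble the path. The sketch below uses file.path(), which joins path components with forward slashes (R accepts these on Windows as well). The folder and file names are the hypothetical ones used above.

```r
# Build a path from its components; file.path() always inserts forward slashes
p <- file.path("C:", "myfolder", "mysubfolder", "myfile.txt")
p

basename(p)  # the file name with its extension
dirname(p)   # the folder that contains the file
```

The resulting string can be passed to the file argument of scan() like any path typed by hand.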

The argument what is set to "char", which stands for “character strings.” Finally, sep is set to "\n" for “new line.” This means that R uses line breaks to delimit vector elements. In other words, scan() expects to read new-line delimited vector elements. Given this specification, load_v has one element per line read. In the output above, the server returned an eight-line HTML page (stating that the requested file is no longer available) instead of the original file, so load_v is of length 8:

length(load_v)
## [1] 8

Other common separators are: " " (a space) and "\t" (a tab stop).

If you set sep to a space…

load_v2 <- scan(file="https://bit.ly/3mlXPoY", what="char", sep=" ")
## Warning in scan(file = "https://bit.ly/3mlXPoY", what = "char", sep = " "): EOF
## within quoted string

…the length of the vector changes accordingly.

length(load_v2)
## [1] 11
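The effect of sep can be checked without an internet connection. The following sketch writes three short lines (hypothetical content) to a temporary file and reads them back with two different separators.

```r
# Write three lines to a temporary file (hypothetical content)
tmp <- tempfile(fileext = ".txt")
writeLines(c("the cat", "sat on", "the mat"), tmp)

by_line  <- scan(file = tmp, what = "char", sep = "\n", quiet = TRUE)
by_space <- scan(file = tmp, what = "char", sep = " ",  quiet = TRUE)

length(by_line)   # one element per line
length(by_space)  # one element per space-delimited word

unlink(tmp)  # delete the temporary file
```

The same file yields 3 elements when split on line breaks and 6 when split on spaces, because line ends still delimit elements when sep is a space.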

7.1.9 saving vectors

To save a vector, use the cat() function.

cat(... , file, sep = " ", append = FALSE)
  • ...: one or more R objects (here the vector you want to output);
  • file: the file to print to (if you do not provide it, the vector is printed onto the R console);
  • sep: the separator to append after each element (because generally you want each vector element to have its own line, it is best to set sep to "\n");
  • append: whether you want to append the output to pre-existing data in a file.

Let us save v11 in a plain text file named my.saved.vector.txt in your working directory. The file does not exist yet, but R will create it for you at the path that you provide.

cat(v11, file="C:\\yourworkingdirectory\\my.saved.vector.txt", sep="\n") # Windows
cat(v11, file="/yourworkingdirectory/my.saved.vector.txt", sep="\n") # Mac

Open your working directory to check if the vector has been saved properly.
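The check can also be done from within R. The sketch below saves a hypothetical named vector (standing in for v11) to a temporary file and reloads it with scan().

```r
# Hypothetical named character vector standing in for v11
greetings <- c(English = "thank you", French = "merci", German = "danke")

tmp <- tempfile(fileext = ".txt")
cat(greetings, file = tmp, sep = "\n")   # one element per line

reloaded <- scan(file = tmp, what = "char", sep = "\n", quiet = TRUE)
reloaded

identical(reloaded, unname(greetings))   # the values survive, the names do not

unlink(tmp)  # delete the temporary file
```

Note that cat() writes the values only: the names of the vector are not saved.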

7.2 Lists

Many R functions used in corpus linguistics and statistics return values that are lists. A list is a data structure that can contain R objects of varying modes and lengths. Suppose you have collected information regarding six corpora.10 For each corpus, you have the name, the size in million words, and some dialectal information as to whether the texts are in American English. You also have the total size of all the corpora. Each item is stored in a separate vector.

corpora <- c("COCA", "COHA", "TIME", "American Soap", "BNC", "Strathy")
size <- c(450, 400, 100, 100, 100, 50)
us_english <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
total_size <- 1200

All these vectors can be stored in a list using the list() function.

yourlist <- list(corpora, size, us_english, total_size)

List components can be named as you create the list.

yourlist <- list(corpora=corpora, 
                 size_in_M_words=size, 
                 is_dialect_us=us_english, 
                 total_size=total_size)
yourlist
## $corpora
## [1] "COCA"          "COHA"          "TIME"          "American Soap"
## [5] "BNC"           "Strathy"      
## 
## $size_in_M_words
## [1] 450 400 100 100 100  50
## 
## $is_dialect_us
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
## 
## $total_size
## [1] 1200

Components can be accessed by name using $, or by position using double square brackets [[]].

yourlist$size_in_M_words # the second element accessed by name
## [1] 450 400 100 100 100  50
yourlist[[2]] # the second element accessed by position
## [1] 450 400 100 100 100  50

Sometimes, you will want to extract individual components from the list so as to manipulate them. This is done with unlist().

unlist(yourlist[[2]])
## [1] 450 400 100 100 100  50

The output is a vector containing the values of the second list component.
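The difference between single brackets, double brackets, and unlist() is easy to overlook. The sketch below uses a cut-down, hypothetical list (corpus_info) to contrast the three.

```r
# Hypothetical two-component list (for illustration only)
corpus_info <- list(corpora = c("COCA", "BNC"),
                    size_in_M_words = c(450, 100))

single <- corpus_info[2]       # single brackets: a list of length 1
double <- corpus_info[[2]]     # double brackets: the numeric vector itself
flat   <- unlist(corpus_info)  # the whole list flattened into one vector

class(single)
class(double)
flat  # values are coerced to character because the modes differ
```

When applied to a whole list, unlist() coerces all values to a common mode, so it is safest on lists whose components share the same mode.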

7.3 Matrices

A matrix is a two-dimensional table. Tab. 7.1 is a good example of what can be loaded into R in the form of a matrix.

Table 7.1: An example matrix (<each + N>, <every + N>, and <each and every + N> in three corpora of English)
                      BNC    COCA  GloWbE
each + N            30708  141012  535316
every + N           28481  181127  857726
each and every + N    140     907   14529

Its entries are the frequency counts of three patterns in three corpora of English.

To create a matrix, use the matrix() function.

matrix(data = NA, 
       nrow = 1, 
       ncol = 1, 
       byrow = FALSE,
       dimnames = NULL)
  • data: a vector of values
  • nrow: the number of rows
  • ncol: the number of columns
  • byrow: whether the matrix is filled by rows (TRUE) or by columns (FALSE, the default)
  • dimnames: a list containing vectors of names for rows and columns

To make a matrix from Tab. 7.1, start with the vector of values. By default, the matrix is filled by columns (from the leftmost to the rightmost column, and from the top row to the bottom row).

values <- c(30708, 28481, 140, 141012, 181127, 907, 535316, 857726, 14529)

Embed the vector in the matrix() function.

mat <- matrix(values, 3, 3)
mat # inspect
##       [,1]   [,2]   [,3]
## [1,] 30708 141012 535316
## [2,] 28481 181127 857726
## [3,]   140    907  14529
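If you find it more natural to type the values in reading order (row after row), byrow=TRUE gives the same matrix. The sketch below compares both filling orders; the object names are chosen for illustration.

```r
# Values typed row by row, as they appear in Tab. 7.1
values_by_row <- c(30708, 141012, 535316,
                   28481, 181127, 857726,
                   140,   907,    14529)
mat_by_row <- matrix(values_by_row, nrow = 3, ncol = 3, byrow = TRUE)

# Same values typed column by column (the default filling order)
values_by_col <- c(30708, 28481, 140, 141012, 181127, 907, 535316, 857726, 14529)
mat_by_col <- matrix(values_by_col, nrow = 3, ncol = 3)

identical(mat_by_row, mat_by_col)  # both matrices are the same
```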

The data is indexed in a [row, column] fashion. To access the value in the second row and the third column (i.e. 857726), enter:

mat[2,3]
## [1] 857726

This allows you to perform calculations. You may calculate row sums and column sums with rowSums() and colSums() respectively.

rowSums(mat)
## [1]  707036 1067334   15576
colSums(mat)
## [1]   59329  323046 1407571

You can sum the whole matrix with sum():

sum(mat)
## [1] 1789946

The result is the sum of all row totals or column totals.

Optionally, you can add the row names and the column names with the argument dimnames. These names should be stored in two vectors beforehand.

row_names <- c("each + N", "every + N", "each and every + N")
col_names <- c("BNC", "COCA", "GloWbE")

Bear in mind that dimnames must be a list of length 2.11 The vectors row_names and col_names must therefore be placed in a list.

mat <- matrix(values, 3, 3, dimnames=list(row_names, col_names))
mat
##                      BNC   COCA GloWbE
## each + N           30708 141012 535316
## every + N          28481 181127 857726
## each and every + N   140    907  14529
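Once a matrix has names, cells can be accessed by name as well as by position. The sketch below rebuilds the matrix under a hypothetical name (freq) and also computes column percentages with prop.table(), a base R function that the text has not introduced.

```r
freq <- matrix(c(30708, 28481, 140, 141012, 181127, 907, 535316, 857726, 14529),
               nrow = 3, ncol = 3,
               dimnames = list(c("each + N", "every + N", "each and every + N"),
                               c("BNC", "COCA", "GloWbE")))

freq["every + N", "GloWbE"]   # same cell as freq[2, 3]
freq[, "BNC"]                 # a whole column, as a named vector

# Share of each pattern within each corpus, in percent
col_pct <- round(prop.table(freq, margin = 2) * 100, 2)
col_pct
```

Setting margin to 1 instead of 2 would give the share of each corpus within each pattern.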

7.4 Data frames (and factors)

In quantitative corpus linguistics, analyses involve a set of related observations which are grouped into a single object called a data set. For example, you might collect information about two linguistic units so as to compare them in the light of descriptive variables (qualitative and/or quantitative).

7.4.1 Ready-made data frames in plain text format

For illustrative purposes, let me anticipate a case study that we will come back to: quite and rather in the British National Corpus. Typically, you extract observations containing the units and you annotate each observation for a number of variables. With regard to quite and rather, these variables can be:

  • the name of the corpus file and some information about it,
  • the intensifier in context (quite or rather),
  • the intensifier out of context,
  • the syntax of the intensifier (preadjectival or pre-determiner),
  • the intensified adjective co-occurring with the intensifier and the NP modified by the adjective,
  • the syllable count of the adjective and the NP.

The above translates into a data set, sampled in Tab. 7.2. If you store the data set in a matrix, with observations in the rows and variables in the columns, you will benefit from the rigorous indexing system to access the data, but a matrix cannot accommodate different modes. It will for instance coerce numeric variables such as syllable_count_adj and syllable_count_NP into character variables, which we do not necessarily want (don’t forget, you decide and R executes).

Table 7.2: A sample data frame (quite and rather in the BNC)
corpus_file  corpus_file_info          match                         intensifier  construction   adjective   syllable_count_adj  NP        syllable_count_NP
KBF.xml      S conv                    a quite ferocious mess        quite        preadjectival  ferocious   3                   mess      1
AT1.xml      W biography               quite a flirty person         quite        predeterminer  flirty      2                   person    2
A7F.xml      W misc                    a rather anonymous name       rather       preadjectival  anonymous   4                   name      1
ECD.xml      W commerce                a rather precarious foothold  rather       preadjectival  precarious  4                   foothold  2
B2E.xml      W biography               quite a restless night        quite        predeterminer  restless    2                   night     1
AM4.xml      W misc                    a rather different turn       rather       preadjectival  different   3                   turn      1
F85.xml      S unclassified            a rather younger age          rather       preadjectival  younger     2                   age       1
J3X.xml      S unclassified            quite a long time             quite        predeterminer  long        1                   time      1
KBK.xml      S conv                    quite a leading light         quite        predeterminer  leading     2                   light     1
EC8.xml      W nonAc: humanities arts  a rather different effect     rather       preadjectival  different   3                   effect    2

This is where the data frame steps in: it combines the ease of indexing of a matrix with the accommodation of different modes, provided that each column contains values of a single mode.

The sample data set is available here. As its extension indicates (sample.df.txt), it is a text file.12

To load a data frame, you may use read.table(). Because it has a large array of argument options, it is very flexible (type ?read.table to see them), but for the same reason, it is sometimes tricky to use. The essential arguments of read.table() are the following:

read.table(file, header = TRUE, sep = "\t", row.names=NULL)
  • file: the path of the file from which the data set is to be read;
  • header=TRUE: your dataset contains column headers in the first row; R sets it to TRUE automatically if the first row contains one fewer field than the number of columns;
  • sep="\t": the field delimiter (here, a tab stop);
  • row.names=NULL: by default, R numbers each row.

The tricky part concerns the compulsory attribution of row names. If you specify row.names, you must provide the row names yourself. They can be in a character vector whose length corresponds to the number of rows in your data set. The specification can also be the number of the column that contains the row names. For example, row.names=1 tells R that the row names of the data frame are in the first column. When you do this, you must be aware that R does not accept duplicate row names or missing values. Imagine you collect 1000 utterances from 4 speakers, each speaker contributing 250 utterances. You cannot use the speakers’ names as row names because each will be repeated 250 times, and the data frame will not load. Because sample.df.txt does not have a column that contains row names, we stick to the default setting of row.names. To load it, we just enter the following:

df <- read.table("https://tinyurl.com/sampledftxt", header=TRUE, sep="\t")
df
##    corpus_file         corpus_file_info                        match
## 1      KBF.xml                   S conv      a quite ferocious mess 
## 2      AT1.xml              W biography        quite a flirty person
## 3      A7F.xml                   W misc     a rather anonymous name 
## 4      ECD.xml               W commerce a rather precarious foothold
## 5      B2E.xml              W biography       quite a restless night
## 6      AM4.xml                   W misc      a rather different turn
## 7      F85.xml           S unclassified        a rather younger age 
## 8      J3X.xml           S unclassified           quite a long time 
## 9      KBK.xml                   S conv        quite a leading light
## 10     EC8.xml W nonAc: humanities arts   a rather different effect 
##    intensifier  construction  adjective syllable_count_adj       NP
## 1        quite preadjectival  ferocious                  3     mess
## 2        quite predeterminer     flirty                  2   person
## 3       rather preadjectival  anonymous                  4     name
## 4       rather preadjectival precarious                  4 foothold
## 5        quite predeterminer   restless                  2    night
## 6       rather preadjectival  different                  3     turn
## 7       rather preadjectival    younger                  2      age
## 8        quite predeterminer       long                  1     time
## 9        quite predeterminer    leading                  2    light
## 10      rather preadjectival  different                  3   effect
##    syllable_count_NP
## 1                  1
## 2                  2
## 3                  1
## 4                  2
## 5                  1
## 6                  1
## 7                  1
## 8                  1
## 9                  1
## 10                 2
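The restriction on duplicate row names can be reproduced on a small scale. The sketch below feeds read.table() a hypothetical three-row data set through its text argument, so no file is needed; the speaker names and utterances are invented.

```r
# Hypothetical tab-delimited data with a repeated speaker name
txt <- "speaker\tutterance\nAnn\thello\nBob\thi\nAnn\tbye"

# Default row names: R numbers the rows and the data frame loads
ok <- read.table(text = txt, header = TRUE, sep = "\t")
ok

# Using the first column as row names fails because "Ann" occurs twice
attempt <- tryCatch(
  read.table(text = txt, header = TRUE, sep = "\t", row.names = 1),
  error = function(e) "loading failed"
)
attempt
```

Here tryCatch() intercepts the error so that the script keeps running; without it, R would stop and report the duplicate row names.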

Specialized variants of read.table(), such as read.csv() or read.delim(), are more convenient because their default settings suit common file formats.

read.delim(file, 
           header = TRUE, 
           sep = "\t", 
           quote = "\"",
           dec = ".", 
           fill = TRUE, 
           comment.char = "", 
           ...)

Loading the data frame with read.delim() works with the default argument settings:

df <- read.delim("https://tinyurl.com/sampledftxt")

You may now inspect the structure of the data frame by entering str(df).

str(df)
## 'data.frame':    10 obs. of  9 variables:
##  $ corpus_file       : chr  "KBF.xml" "AT1.xml" "A7F.xml" "ECD.xml" ...
##  $ corpus_file_info  : chr  "S conv" "W biography" "W misc" "W commerce" ...
##  $ match             : chr  "a quite ferocious mess " "quite a flirty person" "a rather anonymous name " "a rather precarious foothold" ...
##  $ intensifier       : chr  "quite" "quite" "rather" "rather" ...
##  $ construction      : chr  "preadjectival" "predeterminer" "preadjectival" "preadjectival" ...
##  $ adjective         : chr  "ferocious" "flirty" "anonymous" "precarious" ...
##  $ syllable_count_adj: int  3 2 4 4 2 3 2 1 2 3
##  $ NP                : chr  "mess" "person" "name" "foothold" ...
##  $ syllable_count_NP : int  1 2 1 2 1 1 1 1 1 2

The function outputs:

  • the number of observations (rows),
  • the number of variables (columns),
  • the type of each variable.

The data set contains two integer-type variables13 and seven character-type variables.

Categorical variables of this kind are often stored as factors. A factor variable contains either nominal or ordinal values (the character variables in this data set contain only nominal values). Nominal values are unordered categories whereas ordinal values are ordered categories with no quantification between them.14

Before version 4.0.0, R converted all character variables into factors by default when loading a data frame. Since version 4.0.0, you must request the conversion, either with the argument stringsAsFactors=TRUE or by applying factor() to a column. Factors have levels, i.e. unique values. For example, once converted into a factor, corpus_file_info has six unique values, four of which appear twice (S conv, W biography, W misc, and S unclassified).
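A character vector can be converted into a factor by hand with factor(), which makes it easy to inspect the levels. The sketch below retypes the values of corpus_file_info into a vector (named info here for illustration) so that it runs without loading the file.

```r
# The corpus_file_info column of the sample data set, typed by hand
info <- c("S conv", "W biography", "W misc", "W commerce", "W biography",
          "W misc", "S unclassified", "S unclassified", "S conv",
          "W nonAc: humanities arts")

info_f <- factor(info)   # convert the character vector into a factor
levels(info_f)           # the unique values, sorted alphabetically
nlevels(info_f)          # how many levels there are
table(info_f)            # how often each level occurs
```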

Because a data frame is indexed in a [row, column] fashion (like a matrix), extracting data points is easy as pie. For instance, to extract the adjective restless, which is in the fifth row and the sixth column, enter:

df[5,6]
## [1] "restless"

or use the variable name between quotes.

df[5,"adjective"]
## [1] "restless"

To extract a variable from the data frame, use the dollar symbol $.15

df$adjective
##  [1] "ferocious"  "flirty"     "anonymous"  "precarious" "restless"  
##  [6] "different"  "younger"    "long"       "leading"    "different"

To extract the adjective restless, just add the row number between square brackets.

df$adjective[5]
## [1] "restless"
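Indexing also accepts logical conditions, which lets you extract all the rows that satisfy a criterion. The sketch below works on a cut-down, hand-typed version of the sample data set (named mini for illustration).

```r
# A cut-down, hand-typed version of the sample data set
mini <- data.frame(intensifier = c("quite", "quite", "rather", "rather"),
                   adjective = c("ferocious", "flirty", "anonymous", "precarious"),
                   syllable_count_adj = c(3, 2, 4, 4),
                   stringsAsFactors = FALSE)

mini[mini$intensifier == "rather", ]            # all rows with rather
mini$adjective[mini$syllable_count_adj > 2]     # adjectives longer than two syllables
mini[mini$intensifier == "quite", "adjective"]  # adjectives intensified by quite
```

Leaving the column slot empty after the comma keeps all columns; naming a column there restricts the output to that variable.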

7.4.2 Generating a data frame manually

When your data frame is small, you can enter the data manually. Suppose you want to enter Tab. 7.3 into R. Four corpora of English are described by three variables:

  • the size in million words,
  • the variety of English,
  • the period covered by the corpus.

Table 7.3: Another example data frame
corpus   size  variety  period
BNC       100  GB       1980s-1993
COCA      450  US       1990-2012
Hansard  1600  GB       1803-2005
Strathy    50  CA       1970s-2000s

First, generate four vectors, one for each column of the data frame.

corpus <- c("BNC", "COCA", "Hansard", "Strathy")
size <- c(100, 450, 1600, 50)
variety <- c("GB", "US", "GB", "CA")
period <- c("1980s-1993", "1990-2012", "1803-2005", "1970s-2000s")

You may now combine these four vectors into a data frame with the data.frame() function. If you want to use the values in the first column as row names, add the argument row.names to the function.

df.manual <- data.frame(size, variety, period, row.names = corpus)
df.manual
##         size variety      period
## BNC      100      GB  1980s-1993
## COCA     450      US   1990-2012
## Hansard 1600      GB   1803-2005
## Strathy   50      CA 1970s-2000s

If you want the values in the first column to be treated as data points, not as row names, do not provide row.names.

df.manual.2 <- data.frame(corpus, size, variety, period)
df.manual.2 # R numbers the rows
##    corpus size variety      period
## 1     BNC  100      GB  1980s-1993
## 2    COCA  450      US   1990-2012
## 3 Hansard 1600      GB   1803-2005
## 4 Strathy   50      CA 1970s-2000s

To export your data frame, use the write.table() function.16 Just like read.table(), write.table() is elaborate and flexible (enter ?write.table to see all the arguments that the function can take). For the task at hand, the following arguments will do:

write.table(x, file, quote=FALSE, sep="\t", row.names=FALSE)
  • x: the R object to be exported (here, a data frame);
  • file: the file to print to (if you do not provide it, the data frame is printed onto the R console);
  • quote: whether you want character or factor columns to be surrounded by double quotes (TRUE) or not (FALSE);
  • sep: the separator to append after each row element (here, a tab stop);
  • row.names: if your first column contains data points (not row names), set row.names to FALSE.

To export df.manual, provide col.names=NA for a proper column alignment because corpus is used for row names.

# Windows
write.table(df.manual, 
            file="C:\\pathtoyour\\workingdirectory\\my.saved.df.txt", 
            quote=FALSE,
            sep="\t", 
            col.names=NA) 
# Mac
write.table(df.manual, 
            file="/pathtoyour/workingdirectory/my.saved.df.txt", 
            quote=FALSE,
            sep="\t", 
            col.names=NA) 

To export df.manual.2, you have two options:

  • if you want to preserve row numbering, provide col.names=NA. Row numbers have a separate column.
  • if you do not want row numbering, provide row.names=FALSE.

Your data frames are now saved in tab-delimited text format in your working directory. Open these files with your spreadsheet software for inspection.
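The export can also be verified from R itself. The sketch below writes a small hypothetical data frame (corpora_df) to a temporary file without row numbering, inspects the raw lines, and loads the file back with read.delim().

```r
corpora_df <- data.frame(corpus = c("BNC", "COCA", "Hansard", "Strathy"),
                         size = c(100, 450, 1600, 50),
                         stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".txt")
write.table(corpora_df, file = tmp, quote = FALSE, sep = "\t", row.names = FALSE)

raw_lines <- readLines(tmp)   # the raw tab-delimited lines
raw_lines

reloaded <- read.delim(tmp)   # load the file back into R
all(reloaded$size == corpora_df$size)

unlink(tmp)  # delete the temporary file
```

The first raw line holds the column names, and each following line holds one row with its fields separated by tab stops.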

7.4.3 Loading and saving a data frame as an R data file

If you do not need to process or edit your data frame with spreadsheet software, it is faster to load and save the data frame as an R data file with the .rds extension. To save df.manual.2 as an R data file, use saveRDS(). The first argument is the R object to save (here, the data frame). The second argument is the path to the file in which the data frame is saved.

# Windows
saveRDS(df.manual.2, file="C:\\pathtoyour\\workingdirectory\\df.manual.2.rds")
# Mac
saveRDS(df.manual.2, file="/pathtoyour/workingdirectory/df.manual.2.rds") 

The R data file is now saved in your working directory. To load the file, use the readRDS() function. Its argument is the path of the file from which to read the data frame.

readRDS("C:\\pathtoyour\\workingdirectory\\df.manual.2.rds") # Windows
readRDS("/pathtoyour/workingdirectory/df.manual.2.rds") # Mac
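As entered above, readRDS() only prints the data frame; to work with it, assign the result to a name. The sketch below shows a full round trip with a temporary file and a small hypothetical data frame (corpora_df).

```r
corpora_df <- data.frame(corpus = c("BNC", "COCA"), size = c(100, 450))

tmp <- tempfile(fileext = ".rds")
saveRDS(corpora_df, file = tmp)

restored <- readRDS(tmp)          # assign the result, otherwise it is only printed
identical(restored, corpora_df)   # the object comes back unchanged

unlink(tmp)  # delete the temporary file
```

Unlike a tab-delimited text file, an .rds file preserves the types of the columns exactly.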

7.4.4 Converting an R object into a data frame

Some R objects can easily be converted into a data frame thanks to the as.data.frame() function. The only requirement is that the R object be compatible with the structure of a data frame. Let us convert the matrix mat into a data frame.

mat.as.df <- as.data.frame(mat)
str(mat.as.df)
## 'data.frame':    3 obs. of  3 variables:
##  $ BNC   : num  30708 28481 140
##  $ COCA  : num  141012 181127 907
##  $ GloWbE: num  535316 857726 14529

8 for-loops

A loop iterates over an object n times to execute instructions. It is useful when the R object contains a very large number of elements. The general structure of a for-loop depends on whether you have one instruction (one line of code) or more than one. With one instruction, the structure is the following.

for (i in sequence) instruction

When there are several instructions, you need curly brackets to delimit the code over which the loop will work.

for (i in sequence) {
  instruction 1 # first line of code
  instruction 2 # second line of code
  ... # etc.
}

The loop consists of the keyword for followed by a pair of brackets. These brackets contain an identifier, here i (any name is fine). The identifier is followed by in and a vector of values to loop over. The vector of values is a sequence whose length is the number of times you want to repeat the instructions. The identifier i takes on each and every value in this vector. The instructions are between curly braces. The closing brace marks the exit from the loop.

Because the above paragraph is quite a handful, let us use a minimal example. Take the set of the first five letters of the alphabet.

letters[1:5]
## [1] "a" "b" "c" "d" "e"

The for loop in the code below

for (i in letters[1:5]) print(i)
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"

is equivalent to the following:

i <- "a"
print(i)
## [1] "a"
i <- "b"
print(i)
## [1] "b"
i <- "c"
print(i)
## [1] "c"
i <- "d"
print(i)
## [1] "d"
i <- "e"
print(i)
## [1] "e"

Obviously, the for-loop is more economical as i takes on the value of “a,” “b,” “c,” “d,” and “e” successively. With an additional line of code, the curly brackets are required.

for (i in letters[1:5]) {
  i <- toupper(i)
  print(i)
}
## [1] "A"
## [1] "B"
## [1] "C"
## [1] "D"
## [1] "E"

Loops are commonly used to cycle over rows of matrices or data frames (especially if there are many rows). They also come in handy when you need to process a large number of files. In that case, the identifier takes on each and every value in a vector that contains the names of the corpus files. As with almost anything in R, for-loops can be nested.

One known issue with loops is that they get slower as the number of iterations increases. If the object to loop over is very large, make sure that you keep as many instructions outside the loop as you can. Because of this, loops have a bad reputation among R users. Indeed, you will sometimes have to look for alternatives to for loops: e.g. the iterators package, or the functions apply(), lapply(), tapply(), or sapply().
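To make the comparison concrete, the sketch below computes the number of characters of four words in three ways: with a for loop whose result vector is created before the loop starts, with sapply(), and with a single vectorized call. The vector words is a hypothetical example.

```r
words <- c("quite", "rather", "ferocious", "flirty")

# for-loop with a result vector created before the loop starts
lengths_loop <- integer(length(words))
for (i in seq_along(words)) {
  lengths_loop[i] <- nchar(words[i])
}

# Same computation with sapply(), which applies a function to each element
lengths_sapply <- sapply(words, nchar, USE.NAMES = FALSE)

# Same computation with a vectorized call, no explicit iteration at all
lengths_vec <- nchar(words)

lengths_vec
```

All three approaches return the same integer vector. When a vectorized function exists, as is the case with nchar(), it is usually the shortest option.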

There are two other kinds of loops: while loops and repeat loops. To know more about them, enter ?Control in R.

9 if and if...else

If you want to express a condition in R, use if statements. There are two kinds of if statements: simple if statements and if...else statements. Of course, if statements can be nested.

9.1 if statements

The general structure of an if statement depends on the number of statements. If there is a unique statement, the structure is the following.

if (condition) statement

If there are several statements, you need to use curly brackets.

if (condition) {
  statement 1 # first statement
  statement 2 # second statement
  ...
}

The vector integer contains negative and positive integers.

integer <- c(-7, -4, -1, 0, 3, 12, 14)

Suppose you want to tag each negative integer in the vector with the character string “-> negative.” Using a for-loop, this is how you can do it.

for (i in 1:length(integer)) {
  if (integer[i] < 0) print(paste(integer[i], "-> negative"))
}
## [1] "-7 -> negative"
## [1] "-4 -> negative"
## [1] "-1 -> negative"

The instructions in the loop are repeated as many times as there are elements in the vector integer, that is to say length(integer) times. The identifier i takes on each position in this vector, from 1 to 7 (7 being the length of the vector). The if statement is placed between the curly braces of the for loop. The condition is delimited with brackets after if. It can be paraphrased as follows: “if the integer is strictly less than 0….” The statement that follows can be paraphrased as “print the integer, followed by the character string "-> negative".” The tag is appended only if the condition is true.
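A word of caution about 1:length(x): when the vector is empty, the sequence is 1 0, and the loop runs twice instead of not at all. The sketch below contrasts this with seq_along(), a base R function that the text has not introduced; the vector empty is a hypothetical example.

```r
empty <- numeric(0)   # a vector with no elements

1:length(empty)       # 1 0: the loop body would run twice
seq_along(empty)      # integer(0): the loop body would not run at all

runs <- 0
for (i in seq_along(empty)) runs <- runs + 1
runs                  # still 0
```

With non-empty vectors, both forms yield the same sequence of positions.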

9.2 if...else statements

When you want R to do one thing if a condition is true and another thing if the condition is false (rather than do nothing, as above), use an if...else statement. Suppose you want to tag those elements of the vector integer that are not negative with the character string “-> positive.” All you need to do is add a statement preceded by else to tell R what to do if the condition is false.

for (i in 1:length(integer)) {
  if (integer[i] < 0) print(paste(integer[i], "-> negative"))
  else print(paste(integer[i], "-> positive"))
}
## [1] "-7 -> negative"
## [1] "-4 -> negative"
## [1] "-1 -> negative"
## [1] "0 -> positive"
## [1] "3 -> positive"
## [1] "12 -> positive"
## [1] "14 -> positive"

A similar result is obtained with ifelse(). The structure of this function is more compressed than an if...else statement. It is also a more efficient alternative to for loops.

ifelse(test, yes, no) # test: the condition; yes: the value if it is true; no: the value if it is false
ifelse(integer < 0, "-> negative", "-> positive")
## [1] "-> negative" "-> negative" "-> negative" "-> positive" "-> positive"
## [6] "-> positive" "-> positive"

The main difference between the for loop and the ifelse() function is the output. The former outputs as many vectors as there are iterations. The latter outputs one vector whose length corresponds to the number of iterations.

Because zero is neither negative nor positive, you should add a specific condition to make sure that 0 gets its own tag, e.g. “-> zero.” That second if statement will have to be nested in the first if statement and appear after else.

for (i in 1:length(integer)) {
  if (integer[i] < 0) print(paste(integer[i], "-> negative")) # first if statement
  else if (integer[i] == 0) print(paste(integer[i], "-> zero")) # second (nested) if statement
  else print(paste(integer[i], "-> positive"))
}
## [1] "-7 -> negative"
## [1] "-4 -> negative"
## [1] "-1 -> negative"
## [1] "0 -> zero"
## [1] "3 -> positive"
## [1] "12 -> positive"
## [1] "14 -> positive"

Once nested, ifelse() allows you to do the same operation (although with a different vector output) in one line of code.

ifelse(integer < 0, "-> negative", ifelse(integer == 0, "-> zero", "-> positive"))
## [1] "-> negative" "-> negative" "-> negative" "-> zero"     "-> positive"
## [6] "-> positive" "-> positive"

The nested ifelse statement is in fact what R does if the condition of the first ifelse statement is false.
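To obtain the same strings as the for loop (the integer followed by its tag) in a single vector, the output of ifelse() can be combined with paste(). The sketch below uses a vector named numbers, a hypothetical stand-in for integer.

```r
numbers <- c(-7, -4, -1, 0, 3, 12, 14)   # same values as the vector integer

tags <- ifelse(numbers < 0, "-> negative",
               ifelse(numbers == 0, "-> zero", "-> positive"))
tagged <- paste(numbers, tags)   # paste() is vectorized: one string per element
tagged
```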

10 Exercises

10.1 Vectors

  1. Without using the R console, say what z contains after the following assignments:
x <- c(3, 8)
y <- 2
z <- c(x,y)
x <- c(3, 7, 9)
y <- x[3]-x[2]
z <- y + x[1]
  2. Explain the error message that you obtain when you enter the following command:
x <- c("3", "7", "9")
x[1]+x[2]
  3. Without using the R console, say what R outputs after the following two lines of code.
y <- c(1,3,5,5)
y[c(1,3)]
  4. Without using the R console, say what g contains.
i <- rep(1,5)
j <- rep(6,7)
k <- rep(i, 3)
g <- c(i,j,k)
  5. Without using the R console, give the type and length of each vector.
ww <- 1+3
xx <- c(ww,4)
yy <- c(xx,8)

What does z contain?

z <- rep(yy,3)
  6. Break down the complex vector i using smaller intermediate vectors.
i <- rep(c(seq(1,3), mean(c(2,5,9))), 3)
  7. Assign the names “a,” “b,” “c,” “d” and “e” to a sequence of integers from 1 to 5 with increment 1.

  8. Create the following vectors.

    a. "a" "b" "c" "a" "b" "c"
    b. TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
  9. Without using the nchar() function, say how many characters the following vector contains.
character_vector <- "R is great!"

10.2 Matrices

Tab. 10.1 is a matrix that displays the frequency distribution of sure and hell in A as NP in the British National Corpus (Desagulier 2016).

Table 10.1: Co-occurrence table for sure and hell in the BNC
                  ‘hell’  other NPs
‘sure’                33         17
other adjectives      65       3638

  1. Enter this table into R in a matrix format. Include row names and column names.
  2. Calculate the row sums and the column sums.
  3. Calculate what percentage the frequency of sure as hell is of the total number of A as NP constructions.

10.3 Data frames

Table 10.2 displays the top five covarying collexemes of A as NP in the BNC according to the log-likelihood ratio score (also known as \(G^2\)). For example, the strongest attraction between the adjective and the NP in the construction is between good and gold in good as gold (Desagulier 2016).

Table 10.2: The top five pairs of covarying collexemes of A as NP in the BNC
rank  A      NP      G2
1     good   gold    288.82
2     quick  flash   189.29
3     right  rain    175.98
4     large  life    164.55
5     safe   houses  148.32

  1. Enter this data set into R in a data frame format. Make sure the values in the first column are treated as data points, not as row names.
  2. Calculate the mean \(G^2\) score of all five pairs.
  3. Calculate the mean \(G^2\) score of the top three pairs.
  4. Sort the adjectives alphabetically in ascending order.
  5. Sort the NPs alphabetically in descending order.

References

Desagulier, Guillaume. 2016. “A Lesson from Associative Learning: Asymmetry and Productivity in Multiple-Slot Constructions.” Corpus Linguistics and Linguistic Theory 12 (1).
Haspelmath, Martin. 2011. “The Indeterminacy of Word Segmentation and the Nature of Morphology and Syntax.” Folia Linguistica 45 (1): 31–80.

  1. Note that if you have a PC running on Windows, you might be denied access to the C: drive. The default behavior of that drive can be overridden.

  2. Free text editors abound. I recommend Notepad++ or Tinn-R for Windows users, and BBEdit for Mac users.

  3. See this repository for an updated count.

  4. Even though I do not use it in this book to avoid confusion, the equal sign = can also be used for assignment instead of <-.

  5. When you type the function name and the first bracket, R recognizes the function and displays the list of all possible arguments at the bottom of the console.

  6. Its argument can also be any R object, under conditions. See ?length.

  7. The question of what counts as a word is not trivial. Haspelmath (2011) argues that a word can only be “defined as a language-specific concept.” Corpus linguists must therefore provide the computer with an ad hoc definition for each language that they work on.

  8. In linguistics, you generally sort character and numeric vectors.

  9. Logical operators are not specific to R. They are found in most programming languages.

  10. All of them can be accessed here: https://www.english-corpora.org/.

  11. If dimnames is of length 1, it is assumed that the list contains only row names.

  12. Besides .txt, .csv (for “comma-separated values”) is a common extension for data frames stored in plain text format.

  13. Integer is a subtype of numeric.

  14. If you grade people’s language proficiency on a scale from 0 to 5, you have an ordinal variable containing ordered values (5 is higher than 4, which is higher than 3, etc.), but the difference between these values is not proportional. For example, someone whose proficiency level is graded as 3 is not necessarily three times as proficient as someone whose proficiency level is graded as 1.

  15. $ is also used to access list elements. This is because a data frame is similar to a list (enter mode(df) to see that the data frame is recognized as a list with regard to its mode).

  16. Although write.csv() exists, there is no write.delim() equivalent.



This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.