Data programming and intro to R

Prepare, clean, transform and enrich data in R

Master in Computational Social Sciences • Javier Álvarez Liébana

Welcome to R!

Put your spreadsheets, SAS and SPSS aside

Hi!

Mail: .

  • Javier Álvarez Liébana from Carabanchel (Madrid).

  • Degree in Mathematics (UCM). PhD in Statistics (UGR).

  • In charge of data visualization and analysis for the Principality of Asturias (2021-2022) during the COVID pandemic

  • Member of the Spanish Society of Statistics and OR and the Spanish Royal Mathematical Society.

Currently, Assistant Professor at the Faculty of Statistics of the UCM. Disseminating via Twitter e Instagram

Goals

  • Take away the fear of programming → learn to program by programming

  • Understanding basic R concepts from scratch → learning to abstract ideas and algorithms

  • Utility of programming → reproducible, transparent and maintainable workflows.

  • Introduction to analysis and preprocessing of data{tidyverse}.

  • Handling advanced data types{forcats}, {lubridate} and {purrr} packages

Evaluation: intro to R

  • Attendance and individual participation (30%)
  • Final exam on 10/09/2024, 15:00-16:10 (70%).
  • Max grade: to get a grade greater than 9/10 you should get at least 9/10 in the final exam.

Evaluation: data programming

  • Attendance and individual participation (10%)
  • 2 individual tasks done during the course (20%-25%)
  • 1 group task between 4 and 6 people (20%). Deadline: 23/12/2024.
  • Final exam with all materials and internet (25%). You are exempt if you get more than 7.5/10 in the previous tasks.
  • Max grade: to get a grade greater than 9/10 you should get at least 9/10 in the individual tasks.

Planning

  • Final exam (intro R 70%): 10/09/2024 (15:00 - 16:10).

 

  • Individual task I (20%): deadline 03/11/2024

  • Individual task II (25%): deadline 30/11/2024

  • Group task (20%): deadline 23/12/2024.

 

  • Final exam (25% if required): TBA

Planning: intro R

LESSON WEEK DATES TOPIC EX. WORKBOOK TASK
0-1 S1 2 sep First steps: R base programming 💻 💻
0-2 S1 4 sep First data: concatenate values and databases 💻 💻 💻 🐣 🐣
0-3 S1 6 sep Quarto and if-else 💻 💻 💻 🐣 🐣 🐣
0-4 S2 10 sep Final task 🎯 (70%)

Planning: data programming

LESSON WEEK DATES TOPIC EX. WORKBOOK TASK
1 S2 12 sep Intro to tidy data 💻 💻 🐣
2 S3 19 sep Tidyverse: by rows 💻 💻 🐣
3 S4 26 sep Tidyverse: by columns 💻 💻 🐣 🐣 🐣
4 S5 3 oct Tidyverse: summary 💻 🐣 🐣
5 S6 10 oct Loops 💻 🐣 🐣
6 S7 17 oct Functions and joins 💻 💻 🐣 🐣
7 S8 24 oct Import/export files. R base vs tidyverse
8 S10 6 nov Purrr and forcats packages
9 S11 13 nov Github and SQL connection

Materials

 

Datasets

 

  • airquality from the {datasets} package (already installed by default): daily measurements (153 observations) of air quality in New York, from May to September 1973. Measured 6 variables: ozone, solar radiation, wind, temperature, month and day.

  • surveys: we have all poll surveys for Spain from 1982 to 2019 collected from Wikipedia.

  • discursos: compiles the Christmas speeches of the heads of state in Spain (under dictatorship and democracy) from 1946 to 2021.

Datasets

 

  • relig_income from the {tidyr} package: compiles annual income data by religion (original source https://www.pewresearch.org/religion/religious-landscape-study/)

  • who2 from the {tidyr} package: WHO data on tuberculosis infections, disaggregated by disease type, sex and age.

  • billboard from the {tidyr} package: top 100 ranking of songs according to Billboard (something like The 40) for the year 2000.

  • world_bank_pop from the {tidyr} package: data from the World Bank about population per country from 2000 to 2018.

L1: first steps

Introduction to R and RStudio. Working with projects. First uses of functions and packages. Basic data types

Requirements

For the course, the only requirements will be:

  1. Internet connection (to download some data and packages).
  1. Install R: it will be our language. We will download it (for free) from https://cran.r-project.org/

R vs RStudio

We will program as we write

  • We will need a grammar, a language (R)
  • And an environment, such as Word (RStudio) to write it

Installing R

The R language will be our grammar and spelling (our rules of the game)

  • Step 1: go to https://cran.r-project.org/ and select your operating system.

  • Step 2: for Mac, simply click on the .pkg file, and open it once downloaded. For Windows systems, we need to click on install R for the first time and then on Download R for Windows. Once downloaded, open it like any installation file.

  • Step 3: open the installation executable.

Cuidado

Whenever you need to download something from CRAN (either R itself or a package), make sure you have an internet connection.

First operation

To check the installation, after opening R, you should see the R GUI (Graphical User Interface) with a white screen similar to this (console).

First code: we will assign the value 1 to a variable called a (we will write the code in the console and press “enter”). Then we will do the sum a + b.

a <- 1

First operation

To check the installation, after opening R, you should see the R GUI (Graphical User Interface) with a white screen similar to this (console).

First code: we will assign the value 1 to a variable called a (we will write the code in the console and press “enter”). Then we will do the sum a + b.

a <- 1
b <- 2

First operation

To check the installation, after opening R, you should see the R GUI (Graphical User Interface) with a white screen similar to this (console).

First code: we will assign the value 1 to a variable called a (we will write the code in the console and press “enter”). Then we will do the sum a + b.

a <- 1
b <- 2
a + b
[1] 3

Note that…

In the console, a number [1] appears: it’s simply an element counter (like counting rows in Word)

Installing R Studio

RStudio will be the Word we will use to write (what is known as an IDE: Integrated Development Environment).

  • Step 1: go to the official RStudio website (now called Posit) and select the free download.

  • Step 2: select the executable that appears according to your operating system.

  • Step 3: after downloading the executable, open it like any other and let the installation finish.

RStudio Organization

When you open RStudio you will likely have three windows:

  • Console: is the name for the large window that takes up most of your screen. Try writing the same code as before (the sum of the variables) in it. The console is where we will execute commands and display results.

RStudio Organization

When you open RStudio you will likely have three windows:

  • Environment: the small screen (you can adjust the margins with the mouse to your liking) that we have in the top right corner. It will show us the variables we have defined.

RStudio Organization

When you open RStudio you will likely have three windows:

  • Multi-purpose panel: the window at the bottom right will be used to look for function help, as well as to visualize plots.

What is R? Why R?

What is R? Why R?

R is the evolution of the work of Bell Laboratories with the S language, which was brought into the open-source world by Ross Ihaka and Robert Gentleman in the 1990s. The version R 1.0.0 was released on February 29, 2000.

What is R? Why R?

R is the statistical language par excellence, created by and for statisticians, with 6 fundamental advantages over Excel, SAS, Stata, or SPSS:

  • Programming language: the obvious → replicable analysis
  • Free: the philosophy of the R community is to share code under copyleftethical use of spending and algorithms
  • Open-source software: not only is it free, but it also allows free access to others’ code, even to the source code itselfflexibility and transparency (Free and Open Source Software FOSS)

What is R? Why R?

R is the statistical language par excellence, created by and for statisticians, with 6 fundamental advantages over Excel, SAS, Stata, or SPSS:

  • Modular language: we have installed the minimum, but there are codes from other people that we can reuse (almost 20,000 packages) → time saving and immediate innovation
  • High-level language: facilitates programming (like Python) → lower learning curve
  • Community and employability: along with Python, it is the most used language in the field of statistics and data science in research, teaching, companies (Línea Directa, Mapfre, Telefónica, Orange, Apple, Spotify, Netflix, El País, Civio, HP, etc.) and public organizations (ISCIII, CNIC, CNIO, INE, IGN, CIS, CEO, DGT, AEMET, RTVE, etc.)

Why programming?

  • Automate → it will allow you to automate recurring tasks.

  • Replicability → you will be able to replicate your analysis in the same way every time.

  • Flexibility → you will be able to adapt the software to your needs.

  • Transparency → to be audited by the community.

Fundamental Idea: Packages

One of the key ideas of R is the use of packages: codes that other people have implemented to solve a problem

  • Installation: we download the codes from the web (we need internet) → buy a book, only once (per computer)
install.packages("ggplot2")
  • Loading: with the package downloaded, we indicate which packages we want to use each time we open RStudiotake the book off the shelf
library(ggplot2)

Fundamental Idea: Packages

Once installed, there are two ways to use a package (take it off the shelf)

  • Whole package: with library(), using the package name without quotes, we load the whole book into the session
library(ggplot2)
  • Specific functions using `package::function+ we indicate that we only want a specific page of that book
ggplot2::geom_point()

You will be wrong

During your learning, it will be very common for things not to work out on the first try → you will be wrong. It will not only be important to accept it but also to read the error messages to learn from them.

  • Error messages: preceded by “Error in…” and will be those failures that prevent execution
"a" + 1 
Error in "a" + 1: non-numeric argument to binary operator
  • Warning messages: preceded by «Warning in…» they are the (possible) more delicate errors as they are inconsistencies that do not prevent execution
# Ejecuta la orden pero el resultado es NaN, **Not A Number**, un valor que no existe
sqrt(-1)
Warning in sqrt(-1): NaNs produced
[1] NaN

Scripts (.R files)

A script will be the document in which we program, our .doc file (here with a .R extension) where we will write the commands. To open our first script, click on the menu in File < New File < R Script.

Be careful

It’s important not to overuse the console: everything you don’t write in a script, when you close, will be lost.

Be careful

R is case-sensitive: it is sensitive to uppercase and lowercase, so x and X represent different variables.

Running the first script

Now we have a fourth window: the window where we will write our codes. How do we run it?

  1. Write the code to be executed.
  1. Save the .R file by clicking on Save current document.
  1. The code does not execute unless we indicate it. We have three options to run a script:
  • Copy and paste into the console.
  • Select lines and press Ctrl+Enter
  • Enable Source on Save next to save: not only saves but also executes the entire code.

Organizing: projects

Just as we usually work organized by folders on the computer, in RStudio we can do the same to work efficiently by creating projects.

A project will be a “folder” within RStudio, so our root directory will automatically be the project folder itself (allowing us to switch from one project to another using the top right menu).

We can create one in a new folder or in an existing folder.”

💻 It’s your turn

📝 Create in your computer a folder of the subject and create inside it the RStudio project: it is there where you are going to save everything that we will do along this course, after creating the project you will have an R Project file. Then create in this folder two subfolders: data (this is where you will save the different datasets that we will use) and scripts (this is where you will save the .R files of each class).

📝 Inside the project create a script Exercises-class1.R (inside the scripts folder). Once created, define in it a variable named a and whose value is -1. Execute the code in the (three) ways explained before.

Código
a <- -1

📝 Add below another line to define a variable b with the value 5. Then save the multiplication of both variables. Execute the code as you want.

Código
b <- 5
a * b # without saving it
mult <- a * b # save it

📝 Modify the code below to define two variables c and d, with values 3 and -1. Then divide the variables and save the result.

c <- # you should assign 3
d <- # you should assign -1
Código
c <- 3
d <- -1
c / d
div <- c / d

📝 Assign to x a positive value and then compute its square root; assign to y a negative number and compute its absolute value using abs().

Código
x <- 5
sqrt(x)

y <- -2
abs(y)

Note that…

Commands like sqrt(), abs() or max() are what we call functions: lines of code that we have “encapsulated” under a name, and given some input arguments, execute the commands (a sort of shortcut). In the functions the arguments will ALWAYS be enclosed in parentheses

📝 Using the variable x already defined, complete/modify the code below to store in a new variable z the result stored in x minus 5.

z <- ? - ? # complete the code
z
Código
z <- x - 5
z

📝 Define an x variable and assign it the value -1. Define another y and assign it the value 0. Then perform the operations a) x by y; b) square root of x. What do you get?

Código
x <- -1
y <- 0

x / y
sqrt(x)

📝 Write the code below in your script. Why do you think it doesn’t work?

x <- -1
y <- 0

X + y
Error in eval(expr, envir, enclos): object 'X' not found

From CELL to TABLE

What data type can we have in each cell of a table?

  • Cell: an individual piece of data of a specific type.
  • Variable: concatenation of values of the same type (vectors in R).
  • Matrix: concatenation of variables of the same type and length.
  • Table: concatenation of variables of different types but the same length
  • List: concatenation of variables of different types and different lengths

But first…best practices

Before we continue, it’s important to know something as soon as possible: starting with programming can be frustrating

Just like when learning a new language, the first obstacle is not so much what to say but how to say it correctly. The same goes for R, so let’s standardize our programming style as much as possible to avoid future errors.

  • Tip 1: assignment, evaluation, and comparison are not the same. If you’ve noticed in R, we use <- to assign values to variables. We use = to evaluate function arguments and == to check if two elements are equal.
x <- 1 # asign
x = 1 # evaluation
x == 1 # comparison

But first…best practices

  • Tip 2: program like you write. Just like when writing in Spanish, get used to incorporating spaces and line breaks to avoid making your code hard to read (it’s a good practice, not a requirement, because R does not process spaces).
x <- 1 # optimal
x<-1 # meh
x<- 1 # worst (make up your mind)
  • Tip 3: don’t be chaotic, standardize names. Always get used to naming variables consistently. The only requirement is that they must always start with a letter (and without accents). The most recommended form is snake_case.
variable_in_snake_case
anotherHarderToReadFormat
there.are.people.who.use.this
Even_People_Here.Confusing_That_Do_Not_Deserve_Our_ATTENTION

But first…best practices

  • Tip 4: make reading and writing easier, set limits. In Tools < Global Options, you can customize some options in RStudio. In Code < Display, you can set Show margin to display an “imaginary” margin (not interacting with the code) to “force” you to make line breaks.

But first…best practices

  • Tip 5: the tab key is your best friend. In RStudio, there’s a wonderful tool: if you type part of a variable or function name and press tab, RStudio will autocomplete it for you.

But first…best practices

  • Tip 6: no single parentheses. Whenever you open a parenthesis, you must close it. To make this task easier, go to Tools < Global Options < Code < Display and enable the Rainbow parentheses option.

But first…best practices

  • Tip 7: pay attention to the left side. You will not only see the line of code you are on but also, in case of a syntax error, RStudio will notify you.
  • Tip 8: try to always work by projects (for this class, create a script class1.R in the project we created before)

 

See more tips at https://r4ds.had.co.nz/workflow-basics.html#whats-in-a-name

Cells: data types

Are there variables beyond numbers in data science? For example, think about the data you might store about a person:

  • Age or weight will be a number.
age <- 33
  • Their name will be a string of text (known as string or char).
name <- "javi"
  • The answer to the question “Are you enrolled in the Faculty?” will be what we call a logical variable (TRUE if enrolled or FALSE otherwise).
enrolled <- TRUE
  • Their date of birth will be precisely that, a date.

Numerical variables

The simplest data (which we’ve already used) will be numeric variables. To find out the data class in R of a variable, we use the class() function.

a <- 5

Numerical variables

The simplest data type (we have already used it) will be the numeric variables. To know the data class in R of a variable we have the function class().

a <- 5
class(a)

To know its typology (format) variable we have typeof().

typeof(1) # 1 value but stored as a real number (double precision)
[1] "double"
typeof(as.integer(1)) # 1 value but stored as a floor number
[1] "integer"

Note that…

In R we have a collection of functions starting with as.x() that serve as conversion functions: a data that was of one type, we convert it to type x.

Numerical variables

In addition to the “common” numbers we will have the plus/minus infinity coded as Inf or -Inf.

1/0
[1] Inf
-1/0
[1] -Inf

And values that are not real numbers not a number (indeterminacies, complexes numbers, etc) encoded as NaN.

0/0
[1] NaN
sqrt(-2)
[1] NaN

Numerical variables

With numeric variables we can perform the arithmetic operations of a calculator: adding (+)…

a + b
[1] 7

…square root (sqrt())…

sqrt(a)
[1] 2.236068

… power (^2, ^3)…

a^2
[1] 25

…absolute value (abs()), etc.

abs(a)
[1] 5

String variables

Let us imagine that, in addition to the age of a person we want to store his/her name: now the variable will be of type character.

name <- "Javi"
class(name)
[1] "character"

The text strings are a type with which we obviously cannot perform arithmetic operations (other operations such as pasting or locating patterns can be performed).

name + 1 # error when we try to sum 1 to a text
Error in name + 1: non-numeric argument to binary operator

Reminder

Text variables (character or string) are ** ALWAYS in quotes**: TRUE (logical, binary value) is not the same as "TRUE" (text).

First function: paste

As we have commented R we will call function a piece of encapsulated code under a name, and which depends on some input arguments. Our first function will be paste(): given two strings, it allows us to paste them together.

paste("Javi", "Álvarez")
[1] "Javi Álvarez"

Note that default pastes strings with a space, but we can add an optional argument to tell it the separator (in sep = ...).

paste("Javi", "Álvarez", sep = "*")
[1] "Javi*Álvarez"

First function: paste

How do I know what arguments does a function need?

By typing ? paste in the console, you will get a help in the multipurpose panel, where you can see in its header what arguments the function already has default arguments assigned to it.

There is a similar function called paste0() that pastes by default with sep = “” (without anything).

paste0("Javi", "Álvarez")
[1] "JaviÁlvarez"

First function: paste

The arguments (and their detail) can also be consulted by tabulating (after a comma).

Functions: default arguments

It is very important to understand the concept of default argument of a function in R: it is a value that the function uses but sometimes we may not see because already has a value assigned.

# Same
paste("Javi", "Álvarez")
[1] "Javi Álvarez"
paste("Javi", "Álvarez", sep = " ")
[1] "Javi Álvarez"

Fíjate que...

The = operator is reserved for assigning arguments within functions. For all other assignments, we will use <-.

First package: glue

A more intuitive way to work with text is to use the {glue} package: the first thing to do is to “buy the book” (if we have never done it before). After that load the package

install.packages("glue") # just the first time
library(glue)

With the glue() function of that package we can use variables inside strings. For example, “age is … years old”, where the age is stored in a variable.

age <- 34
glue("I am {age} old")
I am 34 old

Within the keys we can also execute operations

units <- "days"
glue("I am {age * 365} {units} old")
I am 12410 days old

Logical variables

Another fundamental type will be the logical or binary variables (two values):

  • TRUE: true stored internally as a 1.

  • FALSE: false stored internally as a 0.

single <- FALSE # Single? --> NO
class(single)
[1] "logical"

Since they are stored internally as binary variables, we can perform arithmetic operations on them

2 * TRUE
[1] 2
FALSE - 1
[1] -1

Logical variables

As we will see shortly, logical variables can actually take a third value: NA or missing data, representing not available, and it will be very common to find it within a database.

missing <- NA
missing + 1
[1] NA

Important

Logical variables NOT text variables: "TRUE" is a text, TRUE is a logical value.

TRUE + 1
[1] 2
"TRUE" + 1
Error in "TRUE" + 1: non-numeric argument to binary operator

Logical conditions

Logical values are usually the result of evaluate logical conditions. For example, imagine that we want to check whether a person is named Javi.

name <- "María"

With the logical operator == we ask if what we have stored on the left is same as what we have on the right: we ASK

name == "Javi"
[1] FALSE

With its opposite != we ask if different.

name != "Javi"
[1] TRUE

Note that…

It is not the same <- (assignment) as == (we are asking, it is a logical comparison).

Logical conditions

In addition to “equal to” versus “different” comparisons, also order comparisons such as less than <, greater than >, <= or >=. Is the person less than 32 years old?

age <- 34
age < 32 # less than 32 years old?
[1] FALSE

Age is greater than or equal to 38 years?

age >= 38
[1] FALSE

Is the saved name equal to Javi?

name <- "Javi"
name == "Javi"
[1] TRUE

Date variables

A very special data type: the date type data.

date_char <- "2021-04-21"

It looks like a simple text string but should represent an instant in time. What should happen if we add a 1 to a date?

date_char + 1
Error in date_char + 1: non-numeric argument to binary operator

Dates cannot be string/text: we must convert the text string to date.

 

To work with dates we will use the {lubridate} package, which we must install before we can use it.

install.packages("lubridate")

Date variables

Once installed, of all the packages (books) that we have, we will indicate it to load this one concretely.

library(lubridate) 

To convert to date type we will use the as_date() function of the {lubridate} package (default in yyyy-mm-dd format).

 

# it's not a date, it's a text!
date_char + 1
Error in date_char + 1: non-numeric argument to binary operator
class(date_char)
[1] "character"
date <- as_date("2023-03-28")
date + 1
[1] "2023-03-29"
class(date)
[1] "Date"

Date variables

In as_date() the default date format is yyyy-mm-dd so if the string is not entered correctly…

as_date("28-08-2024")
[1] NA

For any other format we must specify it in the optional argument format = ... such that %d represents days, %m months, %Y in 4-year format and %y in 2-year format.

as_date("28-03-2023", format = "%d-%m-%Y")
[1] "2023-03-28"
as_date("28-03-23", format = "%d-%m-%y")
[1] "2023-03-28"
as_date("03-28-2023", format = "%m-%d-%Y")
[1] "2023-03-28"
as_date("28/03/2023", format = "%d/%m/%Y")
[1] "2023-03-28"

Date variables

In this package we have very useful functions for date management:

  • With today() we can directly obtain the current date.
today()
[1] "2024-10-20"
  • With now() we can obtain current date and time
now()
[1] "2024-10-20 18:04:37 CEST"
  • With year(), month() or day() we can extract year, month and day
date_today <- today()
year(date_today)
[1] 2024
month(date_today)
[1] 10

Cheatsheets

More information

You have a pdf summary of the most important packages in the corresponding folder on campus

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Define a variable that stores your age (called age) and another with your name (called name).

Código
age <- 34
name <- "Javi"

📝 Check with this variable age if it is NOT 60 years old or if it is called "Ornitorrinco" (you must obtain logical variables as a result).

Código
age != 60 # different to
name == "Ornitorrinco" # equal to

📝 Why does the lower code give an error?

age + name
Error in age + name: non-numeric argument to binary operator

📝 Define another variable called siblings that answers the question “do you have siblings?” and another variable that stores your date of birth (called birth_date).

Código
siblings <- TRUE

library(lubridate) # if not before
birth_date <- as_date("1989-09-10")

📝 Define another variable with your last name (called surname) and use glue() to have, in a single variable called full_name, your first and last name separated by a comma.

Código
surname <- "Álvarez Liébana"
full_name <- glue("{name}, {surname}")
full_name

📝 From birth_date extract the month.

Código
month(birth_date)

📝 Calculate the days that have passed since your birth date until today (with the birth date defined in Exercise 4).

Código
today() - birth_date

L2 : databases

Concatenating cells: vectors. First databases

Vectors: concatenation

When working with data, we often have columns that represent variables: we will refer to these as vectors, which are a concatenation of cells (values) of the same type (similar to a column in a table).

The simplest way to create a vector is with the c() function (c stands for concatenate), and you just need to input the elements within parentheses, separated by commas.

ages <- c(32, 27, 60, 61)
ages
[1] 32 27 60 61

Consejo

An individual number x <- 1 (or x <- c(1)) is actually a vector of length one –> everything we know how to do with a number, we can do with a vector of numbers.

Vectors: concatenation

As you can see now in the environment, we have a collection of elements stored.

ages # ages = edades in spanish
[1] 32 27 60 61

The length of a vector can be calculated with length().

length(ages)
[1] 4

We can also concatenate vectors together (it repeats them one after another).

c(ages, ages, 8)
[1] 32 27 60 61 32 27 60 61  8

Numeric sequences

The most common type of vector is numeric, specifically, the well-known numeric sequences (e.g., the days of the month), used among other things, to index loops.

The seq(start, end) function allows us to create a [**numeric sequence]**{.hl-yellow} from a starting element to an ending one, advancing one by one.

seq(1, 31)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31

Note that if we try this with characters, it won’t work since there is no predefined order among text strings.

"a":"z"
Error in "a":"z": NA/NaN argument

Numeric sequences

A shortcut is the 1:n command, which returns the same as seq(1, n).

1:7
[1] 1 2 3 4 5 6 7

If the starting element is greater than the ending one, it understands that the sequence is in descending order.

7:-3
 [1]  7  6  5  4  3  2  1  0 -1 -2 -3

We can also define a different step between consecutive elements with the by = ... argument.

seq(1, 7, by = 0.5) # seq from 1 to 7, with a step of 0.5
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0

Numeric sequences

Sometimes we may want to define a sequence with a specific length.

seq(1, 50, l = 7) # seq from 1 to 50 with length equal to 7
[1]  1.000000  9.166667 17.333333 25.500000 33.666667 41.833333 50.000000

We might also want to generate a vector of n repeated elements.

rep(0, 7) # vector of 7 0's
[1] 0 0 0 0 0 0 0

Since they are internally stored as numbers, we can also do this with dates.

seq(as_date("2023-09-01"), as_date("2023-09-10"), by = 1)
 [1] "2023-09-01" "2023-09-02" "2023-09-03" "2023-09-04" "2023-09-05"
 [6] "2023-09-06" "2023-09-07" "2023-09-08" "2023-09-09" "2023-09-10"

String vectors

A vector is a concatenation of elements of the same type, but they don’t necessarily have to be numbers. Let’s create a sample sentence.

sentence <- "My name is Javi"
sentence
[1] "My name is Javi"
length(sentence)
[1] 1

In the previous case, it wasn’t a vector, it was a single text element. To create a vector, we need to use c() again and separate elements with commas.

sentence <- c("My", "name", "is", "Javi")
sentence
[1] "My"   "name" "is"   "Javi"
length(sentence)
[1] 4

String vectors

What will happen if we concatenate elements of different types?

c(1, 2, "javi", "3", TRUE)
[1] "1"    "2"    "javi" "3"    "TRUE"

Note that since all elements must be of the same type, what R does is convert everything to text, violating the data integrity.

c(3, 4, TRUE, FALSE)
[1] 3 4 1 0

It’s important to understand that logical values are actually internally stored as 0/1.

Operations with vectors

With numeric vectors, we can perform the same arithmetic operations as with numbers → a number is a vector (of length one).

What will happen if we add or subtract a value to a vector?

x <- c(1, 3, 5, 7)
x + 1
[1] 2 4 6 8
x * 2
[1]  2  6 10 14

Warning

Unless otherwise specified, in R, vector operations are always element by element.

Adding vectors

Vectors can also interact with each other, so we can define, for example, vector sums (element by element).

x <- c(2, 4, 6)
y <- c(1, 3, 5)
x + y
[1]  3  7 11

Since the operation (e.g., a sum) is performed element by element, what will happen if we add two vectors of different lengths?

z <- c(1, 3, 5, 7)
x + z
[1]  3  7 11  9

What it does is recycle elements: if we have a vector of 4 elements and we add another with 3 elements, it will recycle the elements from the shorter vector.

Comparing vectors

A very common operation is to ask questions of the data using logical conditions. For example, if we define a vector of temperatures…

Which days were below 22 degrees?

x <- c(15, 20, 31, 27, 15, 29)
x < 22
[1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

This will return a logical vector, depending on whether each element meets the given condition (of the same length as the vector being queried).

If we had a missing value (due to a sensor error that day), the evaluated condition would also be NA.

y <- c(15, 20, NA, 31, 27, 7, 29, 10)
y < 22
[1]  TRUE  TRUE    NA FALSE FALSE  TRUE FALSE  TRUE

Comparing vectors

Logical conditions can be combined in two ways:

  • Intersection: all concatenated conditions must be met (AND conjunction with &) to return TRUE.
x < 30 & x > 15
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
  • Union: it is enough for at least one condition to be met (OR conjunction with |).
x < 30 | x > 15
[1] TRUE TRUE TRUE TRUE TRUE TRUE

With any() and all(), we can check if all elements satisfy the condition.

any(x < 30)
[1] TRUE
all(x < 30)
[1] FALSE

Getting elements

Another common operation is accessing or getting elements. The simplest way is to use the [i] operator (access the i-th element).

ages <- c(20, 30, 33, NA, 61) 
ages[3] # get the age's third person
[1] 33

Since a number is just a vector of length one, this operation can also be applied using a vector of indices to select.

y <- c("hi", "how", "are", "you", "?")
y[c(1:2, 4)] # first, second and fourth element
[1] "hi"  "how" "you"

Consejo

To access the last element without worrying about its position, you can pass the vector’s length as the index x[length(x)].

Removing elements

Sometimes, instead of selecting, we may want to remove elements. This is done with the same operation but using negative indexing: the opetator [-i] «un-select» the i-th element

y
[1] "hi"  "how" "are" "you" "?"  
y[-2] # everything except the second element
[1] "hi"  "are" "you" "?"  

In many cases, we want to select or remove elements based on logical conditions, depending on the values, so we will pass the condition itself as the index (remember, x < 2 returns a logical vector).

ages <- c(15, 21, 30, 17, 45)
names <- c("javi", "maría", "sandra", "carla", "luis")
names[ages < 18] # names of people under 18
[1] "javi"  "carla"

Stats operations

We can also make use of statistical operations, such as sum(), which, given a vector, returns the sum of all its elements.

x <- c(1, -2, 3, -1)
sum(x)
[1] 1

What happens when a data point is missing?

x <- c(1, -2, 3, NA, -1)
sum(x)
[1] NA

By default, if we have a missing data point, the operation will also result in a missing value. To ignore that missing data, we use the optional argument na.rm = TRUE.

sum(x, na.rm = TRUE)
[1] 1

Stats operations

As we’ve mentioned, logical values are internally stored as 0 and 1, so we can use them in arithmetic operations.

For example, if we want to find out the number of elements that meet a condition (e.g., less than 3), those that do will be assigned a 1 (TRUE), and those that don’t will get a 0 (FALSE). Therefore, summing the logical vector will give us the number of elements that meet the condition.

x <- c(2, 4, 6)
sum(x < 3)
[1] 1

Stats operations

Another common operation that can be useful is the cumulative sum with cumsum(), which, given a vector, returns a vector where each element is the sum of the first, the first plus the second, the first plus the second plus the third, and so on.

x <- c(1, 5, 2, -1, 8)
cumsum(x)
[1]  1  6  8  7 15

What happens when a data point is missing?

x <- c(1, -2, 3, NA, -1)
cumsum(x)
[1]  1 -1  2 NA NA

In the case of the cumulative sum, what happens is that from that point onward, all subsequent accumulated values will be missing.

Stats operations

Another common operation that can be useful is the difference (with delay) with diff() which, given a vector, returns a vector with the second minus the first, the third minus the second, the fourth minus the third…and so on.

x <- c(1, 8, 5, 3, 9, 0, -1, 5)
diff(x)
[1]  7 -3 -2  6 -9 -1  6

Using the argument lag = we can indicate the delay of this difference (e.g. lag = 3 implies that the fourth minus the first, the fifth minus the second, etc.).

x <- c(1, 8, 5, 3, 9, 0, -1, 5)
diff(x, lag = 3)
[1]  2  1 -5 -4 -4

Stats operations

Other common operations are mean, median, percentiles, etc.

  • mean: centrality measure that consists of adding all the elements and dividing by the number of elements added. The best known but the least robust: given a set, if outliers (very large or very small values) are introduced, the mean is very easily perturbed.
x <- c(165, 170, 181, 191, 150, 155, 167, NA, 173, 177)
mean(x, na.rm = TRUE)
[1] 169.8889

Stats operations

Other common operations are mean, median, percentiles, etc.

  • Median: measure of centrality that consists of ordering the elements and keeping the one that occupies the middle.
x <- c(165, 170, 181, 191, 150, 155, 167, 173, 177)
median(x)
[1] 170
  • Quantiles: position measurements (they divide the data into equal parts).
quantile(x) # by default quantiles/percentiles 0-25-50-75-100
  0%  25%  50%  75% 100% 
 150  165  170  177  191 
quantile(x, probs = c(0.1, 0.4, 0.9))
  10%   40%   90% 
154.0 167.6 183.0 

Sorting vectors

Finally, a common action is to know sort values:

  • sort(): returns the sorted vector. By default from smallest to largest but with decreasing = TRUE we can change it.
ages <- c(81, 7, 25, 41, 65, 20, 33, 23, 77)
sort(ages)
[1]  7 20 23 25 33 41 65 77 81
sort(ages, decreasing = TRUE)
[1] 81 77 65 41 33 25 23 20  7
  • order(): returns the index vector that we would have to use to have the vector ordered
order(ages)
[1] 2 6 8 3 7 4 5 9 1
ages[order(ages)]
[1]  7 20 23 25 33 41 65 77 81

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Define the vector x as the concatenation of the first 5 odd numbers. Calculate the length of the vector

Código
# Two ways
x <- c(1, 3, 5, 7, 9)
x <- seq(1, 9, by = 2)

length(x)

📝 Access the third element of x. Access the last element (regardless of length, a code that can always be executed). Delete the first element.

Código
x[3]
x[length(x)]
x[-1]

📝 Get the elements of x greater than 4. Calculate the vector 1/x and store it in a variable.

Código
x[x > 4]
z <- 1/x
z

📝 Create a vector representing the names of 5 people, one of whom is unknown.

Código
names <- c("Javi", "Sandra", NA, "Laura", "Carlos")
names

📝 Find from the vector x of exercises above the elements greater (strictly) than 1 and less (strictly) than 7. Find a way to find out if all the elements are positive or not.

Código
x[x > 1 & x < 7]
all(x > 0)

📝 Given the vector x <- c(1, -5, 8, NA, 10, -3, 9), why does its mean return not a number but what is shown in the code below?

x <- c(1, -5, 8, NA, 10, -3, 9)
mean(x)
[1] NA

📝 Given the vector x <- c(1, -5, 8, NA, 10, -3, 9), extract the elements occupying the locations 1, 2, 5, 6.

Código
x <- c(1, -5, 8, NA, 10, -3, 9)
x[c(1, 2, 5, 6)]
x[-2]

📝 Given the x vector of the previous exercise, which ones have a missing data? Hint: the is.something() functions check if the element is of type something (press tab).

Código
is.na(x)

📝 Define the vector x as the concatenation of the first 4 even numbers. Calculate the number of elements of x strictly less than 5.

Código
x[x < 5] 
sum(x < 5)

📝 Calculate the vector 1/x and obtain the ordered version (from smallest to largest) in the two possible ways

Código
z <- 1/x
sort(z)
z[order(z)]

📝 Calculate min and max of previous x vector

Código
min(x)
max(x)

📝 Find of the vector x the elements greater (strictly) than 1 and less (strictly) than 6. Find a way to find out if all the elements are negative or not.

Código
x[x > 1 & x < 7]
all(x > 0)

More with string variables

Although we cannot do arithmetic operations with them, some operations we can do with the text strings will be important.

For that we will use in the future the {stringr} package (within the same {lubridate} “universe of packages”, which we will talk about later).

library(stringr)

# Find a correct phone format
phone_number <- c("611093", "292039", "628810585", "600917043")
str_detect(phone_number, pattern = "[6]{1}[0-9]{8}")
[1] FALSE FALSE  TRUE  TRUE

First databases

When analyzing data we usually have several variables for each individual: we need a “table” to collect them. The most immediate option is matrices: concatenation of variables of same type and equal length.

Imagine we have heights and weights of 4 people. How to create a dataset with the two variables?

The most common option is to use cbind(): concatenate (bind) vectors in the form of columns (c)

h <- c(150, 160, 170, 180)
w <- c(63, 70, 85, 95)
data_mat <- cbind(h, w)
data_mat 
       h  w
[1,] 150 63
[2,] 160 70
[3,] 170 85
[4,] 180 95

First databases

We can also build the matrix by rows with the rbind() function (concatenate - bind - by rows - r), although it is recommended to have each variable in column and individual in row as we will see later.

rbind(h, w) # Matrix by rows
  [,1] [,2] [,3] [,4]
h  150  160  170  180
w   63   70   85   95
  • We can “view” the matrix with View(matrix).
  • We can check dimensions with dim(), nrow() and ncol(): matrices are a type of tabular data (organized in rows and columns).
dim(data_mat)
[1] 4 2
nrow(data_mat)
[1] 4
ncol(data_mat)
[1] 2

First databases

We can also “flip” (transposed matrix) with t().

t(data_mat)
  [,1] [,2] [,3] [,4]
h  150  160  170  180
w   63   70   85   95

Since we now have two dimensions in our data, to access elements with [] we must provide two comma-separated indexes: row and column indexes

data_mat[2, 1] # second row, first column
  h 
160 
data_mat[1, 2] # first row, second column
 w 
63 

First databases

In some cases we will want to get the total data for an individual (a particular row but all columns) or the values of a whole variable for all individuals (a particular column but all rows). To do so, we leave one of the indexes unfilled.

data_mat[2, ] # second individual
  h   w 
160  70 
data_mat[, 1] # first variable
[1] 150 160 170 180

Much of what we have learned with vectors we can do with matrices, so we can for example access multiple rows and/or columns using the sequences of integers 1:n

data_mat[c(1, 3), 1] # first variable for first and third individual
[1] 150 170

First databases

We can also define a matrix from a numeric vector, rearranging the values in the form of a matrix (knowing that the elements are placed by columns).

z <- matrix(1:9, ncol = 3) 
z
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

We can even define an array of constant values, e.g. of zeros (to be filled later)

matrix(0, nrow = 2, ncol = 3)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

Matrix operations

With matrices it is the same as with vectors: when we apply an arithmetic operation we do it element by element

z/5
     [,1] [,2] [,3]
[1,]  0.2  0.8  1.4
[2,]  0.4  1.0  1.6
[3,]  0.6  1.2  1.8

To perform operations in a matrix sense we must add %%%, for example, to multiply matrices it will be %*%.

z * t(z)
     [,1] [,2] [,3]
[1,]    1    8   21
[2,]    8   25   48
[3,]   21   48   81
z %*% t(z)
     [,1] [,2] [,3]
[1,]   66   78   90
[2,]   78   93  108
[3,]   90  108  126

Matrix operations

We can also perform operations by columns/rows without loops with the apply() function, and we will indicate as arguments

  • the matrix
  • the sense of the operation (MARGIN = 1 for rows, MARGIN = 2 for columns)
  • the function to apply
  • extra arguments needed by the function

For example, to apply an average to each variable, it will be mean applied with MARGIN = 2 (same function for each column).

# Mean for each column (MARGIN = 2)
apply(data_mat, MARGIN = 2, FUN = "mean")
     h      w 
165.00  78.25 

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Modify the code below to define an x matrix of ones, with 3 rows and 7 columns.

x <- matrix(0, nrow = 2, ncol = 3)
x
Código
x <- matrix(1, nrow = 3, ncol = 7)
x

📝 To the above matrix, add 1 to each number in the matrix and divide the result by 5. Then calculate its transpose

Código
new_matrix <- (x + 1)/5
t(new_matrix)

📝 Why does the code below return such a warning message?

matrix(1:15, nrow = 4)
Warning in matrix(1:15, nrow = 4): data length [15] is not a sub-multiple or
multiple of the number of rows [4]
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12    1

📝 Define the matrix x <- matrix(1:12, nrow = 4). Then get the data of the first individual, the data of the third variable, and the element (4, 1).

Código
x <- matrix(1:12, nrow = 4)
x[1, ] # first row
x[, 3] # third column
x[4, 1] # (4, 1) element

📝 Define a matrix of 2 variables and 3 individuals such that each variable captures the height and age of 3 persons, so that the age of the second person is unknown (absent). Then calculate the mean of each variable (we should get a number!).

Código
data <- cbind("age" = c(20, NA, 25), "h" = c(160, 165, 170))
apply(data, MARGIN = 2, FUN = "mean", na.rm = TRUE) # mean by columns

📝 Why does the lower code return an error? What is wrong?

mat <- cbind("age" = c(15, 20, 25), "names" = c("javi", "sandra", "carlos"))
mat
     age  names   
[1,] "15" "javi"  
[2,] "20" "sandra"
[3,] "25" "carlos"
mat + 1
Error in mat + 1: non-numeric argument to binary operator

Second attempt: data.frame

Arrays have the same problem as vectors: if we put together data of different types, it data integrity is compromised as it converts them (see the code below: the ages and the TRUE/FALSE are converted to text).

ages <- c(14, 24, NA)
single <- c(TRUE, NA, FALSE)
names <- c("javi", "laura", "lucía")
mat <- cbind(ages, single, names)
mat
     ages single  names  
[1,] "14" "TRUE"  "javi" 
[2,] "24" NA      "laura"
[3,] NA   "FALSE" "lucía"

In fact, since they are not numbers, we can no longer perform arithmetic operations.

mat + 1
Error in mat + 1: non-numeric argument to binary operator

Second attempt: data.frame

In order to work with variables of different type we have in R what is known as data.frame: concatenation of variables of equal length but which can be of different type.

table <- data.frame(ages, single, names)
class(table)
[1] "data.frame"
table
  ages single names
1   14   TRUE  javi
2   24     NA laura
3   NA  FALSE lucía

Second attempt: data.frame

Since a data.frame is already an attempt at a database the variables are not mere mathematical vectors: they have a meaning and we can (we must) give them names that describe their meaning.

library(lubridate)
table <-
  data.frame("ages" = ages, "single" = single, "names" = names,
             "birth_date" = as_date(c("1989-09-10", "1992-04-01", "1980-11-27")))
table
  ages single names birth_date
1   14   TRUE  javi 1989-09-10
2   24     NA laura 1992-04-01
3   NA  FALSE lucía 1980-11-27

Second attempt: data.frame

We have our first data set! (strictly speaking we can’t talk about a database but for the moment it looks like one). You can visualize it by typing its name in console or with View(table).

Get variables

If we want to access its elements, being again tabulated data, we can access as in the matrices (not recommended): again we have two indexes (rows and columns, leaving free the one we don’t use)

table[2, ]  # second row (all variables)
  ages single names birth_date
2   24     NA laura 1992-04-01
table[, 3]  # third column (all individuals)
[1] "javi"  "laura" "lucía"
table[2, 1] # first variable of the second individual
[1] 24

But it also has the advantages of a database : we can access the variables by name (recommended since the variables can change position and now they have a meaning), putting the name of the table followed by the symbol $ (with the tab, a menu of columns to choose from will appear).

Ask functions

  • names(): shows us the variable names
names(table)
[1] "ages"       "single"     "names"      "birth_date"
  • dim(): shows dimensions (also nrow() and ncol())
dim(table)
[1] 3 4
  • Variables can be accessed by name
table[c(1, 3), "names"]
[1] "javi"  "lucía"
table$names[c(1, 3)]
[1] "javi"  "lucía"

Add a variable

If we have one already created and we want to add a column it is as simple as using the data.frame() function we have already seen to concatenate the column. Let’s add for example a new variable, the number of siblings of each individual.

# add a new column
siblings <- c(0, 2, 3)
table <- data.frame(table, "n_sib" = siblings)
table
  ages single names birth_date n_sib
1   14   TRUE  javi 1989-09-10     0
2   24     NA laura 1992-04-01     2
3   NA  FALSE lucía 1980-11-27     3

Last attempt: tibble

Tables in data.frame format have some limitations. The main one is that does not allow recursion: imagine that we define a database with heights and weights, and we want a third variable with the BMI.

data.frame("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70),
           "BMI" = weight / (height^2))
Error in data.frame(height = c(1.7, 1.8, 1.6), weight = c(80, 75, 70), : object 'weight' not found

Hereafter we will use the tibble (enhanced data.frame) format from the {tibble} package.

library(tibble)
data_tb <- 
  tibble("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70), "BMI" = weight / (height^2))
class(data_tb)
[1] "tbl_df"     "tbl"        "data.frame"
data_tb
# A tibble: 3 × 3
  height weight   BMI
   <dbl>  <dbl> <dbl>
1    1.7     80  27.7
2    1.8     75  23.1
3    1.6     70  27.3

Last attempt: tibble

data_tb <- 
  tibble("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70), "BMI" = weight / (height^2))
class(data_tb)
[1] "tbl_df"     "tbl"        "data.frame"
data_tb
# A tibble: 3 × 3
  height weight   BMI
   <dbl>  <dbl> <dbl>
1    1.7     80  27.7
2    1.8     75  23.1
3    1.6     70  27.3

Las tablas en formato tibble nos permitirá una gestión más ágil, eficiente y coherente de los data, con 4 ventajas principales:

  • Metainformation: if you look at the header, it automatically tells us the number of rows and columns, and the type of each variable
  • Recursivity: allows you to define the variables sequentially (as we have seen)

Last attempt: tibble

  • Consistency: if you access a column that does not exist, it warns you with a warning
data_tb$invent
Warning: Unknown or uninitialised column: `invent`.
NULL
  • By rows: create by rows (copy and paste from a table) with tribble().
tribble(~colA, ~colB,
        "a",   1,
        "b",   2)
# A tibble: 2 × 2
  colA   colB
  <chr> <dbl>
1 a         1
2 b         2

Consejo

The {datapasta} package allows us to copy and paste tables from web pages and simple documents.

In summary…

  • Each cell can be of a different type: numbers, text, dates, logical values, etc. A [vector is a concatenation of cells]{. hl-yellow} (the future columns of our tables) –> In R by default operations are done element to element.
  • A matrix allows us to concatenate variables of the SAME type and SAME length –> tabular data.
  • A data.frame allows us to concatenate variables of DIFFERENT type and SAME length –> we will use tibble as an enhanced database option.

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Load from the {datasets} package the airquality dataset (New York air quality variables from May through September 1973). Is the airquality dataset of type tibble? If not, convert it to tibble (look in the package documentation at https://tibble.tidyverse.org/index.html).

Código
library(tibble)
class(datasets::airquality)
airquality_tb <- as_tibble(datasets::airquality)

📝 Once converted to tibble get the name of the variables and the dimensions of the data set. How many variables are there? How many days have been measured?

Código
names(airquality_tb)
ncol(airquality_tb)
nrow(airquality_tb)

📝 Filters only the data of the fifth observation

Código
airquality_tb[5, ]

📝 Filter only the data for the month of August. How to tell it that we want only the rows that meet a specific condition?

Código
airquality_tb[airquality_tb$Month == 8, ]

# other way
var_month <- airquality_tb$Month
airquality_tb[var_month == 8, ]

📝 Select those data that are not from July or August.

Código
airquality_tb[airquality_tb$Month != 7 & airquality_tb$Month != 8, ]
airquality_tb[!(airquality_tb$Month %in% c(7, 8)), ]

📝 Modify the following code to keep only the ozone and temperature variables (no matter what position they are).

airquality_tb[, 3]

📝 Select the temperature and wind data for August.

Código
airquality_tb[airquality_tb$Month == 8, c("Temp", "Wind")]

📝 Translate the name of the variables into your native language.

Código
names(airquality_tb) <- c("ozono", "rad_solar", "viento", "temp", "mes", "dia") 

🐣 Case study I

In the {datasets} package (already installed by default) we have several datasets and one of them is airquality. Below I have extracted 3 variables from that dataset (note that it is done with data$variable, that dollar will be important in the future).The data captures daily measurements (n = 153 observations) of air quality in New York, from May to September 1973. Six 6 variables were measured: ozone levels, solar radiation, wind, temperature, month and day.

library(datasets)
temperature <- airquality$Temp
month <- airquality$Month
day <- airquality$Day

Try to answer the questions posed in the workbook

🐣 Case study II

We will consider the surveys.RData file in which we have all poll surveys for Spain from 1982 to 2019.

load(file = "./data/surveys.RData")
survey_data
# A tibble: 139,944 × 8
   date_elec  pollster field_date_from field_date_to exit_poll  size party  
   <date>     <chr>    <date>          <date>        <lgl>     <dbl> <chr>  
 1 1982-10-28 PSOE     1982-10-28      1982-10-28    TRUE      85300 UCD    
 2 1982-10-28 PSOE     1982-10-28      1982-10-28    TRUE      85300 PSOE   
 3 1982-10-28 PSOE     1982-10-28      1982-10-28    TRUE      85300 PCE    
 4 1982-10-28 PSOE     1982-10-28      1982-10-28    TRUE      85300 AP     
 5 1982-10-28 PSOE     1982-10-28      1982-10-28    TRUE      85300 CIU    
 6 1982-10-28 PSOE     1982-10-28      1982-10-28    TRUE      85300 PA     
 7 1982-10-28 PSOE     1982-10-28      1982-10-28    TRUE      85300 EAJ-PNV
 8 1982-10-28 PSOE     1982-10-28      1982-10-28    TRUE      85300 HB     
 9 1982-10-28 PSOE     1982-10-28      1982-10-28    TRUE      85300 ERC    
10 1982-10-28 PSOE     1982-10-28      1982-10-28    TRUE      85300 EE     
# ℹ 139,934 more rows
# ℹ 1 more variable: estimation <dbl>

Try to answer the questions posed in the workbook

L3: if-else and Quarto

Flow structures: if-else and loops. Functions in R. Quarto

Flow structures

A flow or control structure consists of a series of commands oriented to decide the path that your code must follow

  • If condition A is met, what happens?

  • What if B happens?

  • How can I repeat the same expression (depending on a variable)?

If you have programmed before, you may be familiar with what are known as conditional structures such as if (bla bla) {...} else {...} or loops for/while (to be avoided whenever possible).

If

One of the most famous control structures are those known as conditional structures if.

IF a set of conditions is met (TRUE), then execute whatever is inside the curly brackets.

For example, the structure if (x == 1) { code A } what it will do is execute code A in braces but ONLY IF the condition in brackets is true (only if x is 1). In any other case, it will do nothing

For example, let’s define a vector of ages of 8 people

ages <- c(14, 17, 24, 56, 31, 20, 87, 73)
ages < 18
[1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

If

Our conditional structure will do the following: if there is a minor, it will print a message.

if (any(ages < 18)) { 
  
  print("There is a minor")
  
}
[1] "There is a minor"

If

if (any(ages < 18)) { 
  
  print("There is a minor")
  
}

In case the conditions are not true inside if() (FALSE), nothing happens.

if (all(ages >= 18)) { 
  
  print("All of them are of legal age")
  
}

We get no message because the condition all(ages >= 18) is not TRUE, so it does not execute anything.

If-else

The structure if (condition) { code A } can be combined with an else { code B }: when the condition is not checked, it will [execute the alternative code B]{. hl-yellow} inside else { }, allowing us to decide what happens when it is satisfied and when it is not

For example, if (x == 1) { code A } else { code B } will execute A if x is equal to 1 and B in any other case.

if (all(ages >= 18)) { 
  
  print("All of them are of legal age")
  
} else {
  
  print("There is a minor")
}
[1] "There is a minor"

If-else

Esta estructura if - else puede ser anidada: imagina que queremos ejecutar un código si todos son menores; si no sucede, pero todos son mayores de 16, hacer otra cosa; en cualquier otra cosa, otra acción.

if (all(ages >= 18)) { 
  
  print("All of them are of legal age")
  
} else if (all(ages >= 16)) {
  
  print("There is a minor but all of them are greater or equal to 16 years old")
  
} else { print("There are any persons under 16 years of age") }
[1] "There are any persons under 16 years of age"

Tip

You can collapse the structures by clicking on the left arrow in your script.

If-else vectorized

This conditional structure can be vectorized (in a single line) with if_else() (from the {dplyr} package), whose arguments are

  • the condition to evaluate

  • what happens when it is met and when not

  • an optional argument for when the condition to evaluate is NA

We will label without are greater/lesser and an unknown when we don’t know.

library(dplyr)
ages <- c(NA, ages)
if_else(ages >= 18, "legal age", "minor", missing = "unknown")
[1] "unknown"   "minor"     "minor"     "legal age" "legal age" "legal age"
[7] "legal age" "legal age" "legal age"

In R base there is ifelse(): it does not let you specify what to do with the absent ones but allows you to specify different types of data in TRUE and FALSE.

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 What will be the output of the following code?

if_else(sqrt(9) < 2, sqrt(9), 0)
Código
The output is 0 since sqrt(9) equals 3, and since it is not less than 2, it returns the second argument which is 0.

📝 What will be the output of the following code?

x <- c(1, NA, -1, 9)
if_else(sqrt(x) < 2, 0, 1)
Código
The output is the vector c(0, NA, NA, 1) since sqrt(1) is less than 2, sqrt(9) is not, and in the case of both sqrt(NA) (root of absent) and sqrt(-1) (returns NaN, not a number), its square root cannot be checked whether it is less than 2 or not, so the output is NA.

📝 Modify the code below so that, when the square root of a number cannot be verified to be less than 2, it returns -1.

x <- c(1, NA, -1, 9)
if_else(sqrt(x) < 2, 0, 1)
Código
x <- c(1, NA, -1, 9)
if_else(sqrt(x) < 2, 0, 1, missing = -1)

📝 What are the values of x and y of the lower code for z <- 1, z <- -1 and z <- -5?

z <- -1
if (z > 0) {
  
  x <- z^3
  y <- -sqrt(z)
  
} else if (abs(z) < 2) {
  
  x <- z^4
  y <- sqrt(-z)
  
} else {
  
  x <- z/2
  y <- abs(z)
  
}
Código
In the first case x = 1 and y = -1. In the second case x = 1 and y = 1. In the third case -1 and 2.

📝 What will happen if we execute the code below?

z <- "a"
if (z > 0) {
  
  x <- z^3
  y <- -sqrt(z)
  
} else if (abs(z) < 2) {
  
  x <- z^4
  y <- sqrt(-z)
  
} else {
  
  x <- z/2
  y <- abs(z)
  
}
Código
# will give error since it is not a numeric argument
Error in z^3 : non-numeric argument to binary operator

📝 From the {lubridate} package, the hour() function returns the time of a given date, and the now() function returns the date and time of the current time. With both functions, have cat() (cat()) print “good night” only after 21:00.

Código
# loading library
library(lubridate)

# Current date-time
current_dt <- now()

# If structure
if (hour(current_dt) > 21) {
  
  cat("Good night") # print or cat (two ways of printing)
}

Communicate: rmd and Quarto

One of the main strengths of R is the easiness to generate reports, books, webs, notes and even slides (this same material for example). To do this, install before

  • the {rmarkdown} package (to generate .rmd files)
install.packages("rmarkdown")
  • install Quarto (if you already knew R, the “new” .rmd now as .qmd)

Communicate: rmd and Quarto

So far we have only programmed in scripts (.R files) within projects, but in many occasions we will not work alone and we will need to communicate the results in different formats:

  • notes (for ourselves)
  • slides
  • web
  • reports

For all this we will use Quarto (see more in https://ivelasq.quarto.pub/intro-to-quarto/)

Communicate: rmd and Quarto

The .qmd (or .rmd before) extension files will allow us to easily combine:- Markdown:

  • typed language that allows us to create simple content (wordpress type, with text, bold, cursives, etc) with a readable layout.
  • [Math (latex)]{. hl-yellow}: language for writing mathematical notation such as \(x^2\) or \(\sqrt{y}\) or \(\int_{a}^{b} f(x) dx\).
  • Code and outputs: we can not only show the final step but also the code you have been doing (in R, Python, C++, Julia, …), with code boxes called CHUNKS.
  • Images, graphs, tables, styles (css, js), etc.

Communicate: rmd and Quarto

The main advantage of making this type of material in Quarto/Rmarkdown is that, by doing it from RStudio, you can generate a report or presentation without leaving the programming environment in which you are working. This way you can analyze the data, summarize it and at the same time communicate it with the same tool.

Recently the RStudio team developed Quarto, an improved version of Rmarkdown (.qmd files), with a slightly more aesthetic and simpler format. You have all the documentation and examples at https://quarto.org/

Quarto examples

Images obtained from https://ivelasq.quarto.pub/intro-to-quarto/#/working-with-the-rstudio-visual-editor

Our first report

We are going to create the first rmarkdown file with Quarto with extension .qmd. For this we will only need to click on

File << New File << Quarto Document

Our first report

After doing so, several output format options:

  • .pdf file

  • .html file (recommendable): dynamic document, allows user interaction, like a “web page”.

  • .doc file (not recommended).

For the moment we will leave the default HTML format checked, and we will write the title of our document. After that we will have our file .qmd (it is no longer an .R script like the ones we have opened so far).

Our first report

You should have something similar to the image capture with two editing modes: Source (with code, the recommended option until you master it) and Visual (more like a blog).

To run the WHOLE document you must click Render on Save and hit save.

Quarto output format

You should have obtained an html output similar to this (and a html file has been generated on your computer).

Editor: source vs visual

As indicated, you have two ways of working: with pure code and something similar to a Notion (blog).

Image retrieved from https://ivelasq.quarto.pub/intro-to-quarto/#/working-with-the-rstudio-visual-editor

Our first report

A .qmd file is basically divided into three parts:

  • Header: the part you have at the beginning between ---.

  • [Text]{. hl-yellow}: which we can format and enhance with bold (written as bold, with double asterisk at the beginning and end), italics (cursive, with underscore at the beginning and end) or highlight function or variable names from R. You can add equations like \(x^2\) (I have written $x^2$, between dollars).

  • R code

Header

The header is in YAML format and contains the metadata of the document.

  • title and subtitle: the title/subtitle of the document
  • author: author of the document
  • format: output format (we can customize)
    • theme: if you have any style file
    • toc: if you want index or not
    • toc-location: index position
    • toc-title: index title
  • editor: if you are in visual or source mode.
---
title: "prueba"
author: "javier álvarez liébana"
format:
  html:
editor: visual
---

Header

The header is in YAML format and contains the metadata of the document.

  • title and subtitle: the title/subtitle of the document
  • author: author of the document
  • format: output format (we can customize)
    • theme: if you have any style file
    • toc: if you want index or not
    • toc-location: index position
    • toc-title: index title
  • editor: if you are in visual or source mode.
---
title: "prueba"
author: "javier álvarez liébana"
format:
  html:
    style: style.css
    toc: true
editor: visual
---

Header

The header is in YAML format and contains the metadata of the document.

  • title and subtitle: the title/subtitle of the document
  • author: author of the document
  • format: output format (we can customize)
    • theme: if you have any style file
    • toc: if you want index or not
    • toc-location: index position
    • toc-title: index title
  • editor: if you are in visual or source mode.
---
title: "prueba"
author: "javier álvarez liébana"
format:
  html:
    style: style.css
    toc: true
    toc-location: left
editor: visual
---

Header

The header is in YAML format and contains the metadata of the document.

  • title and subtitle: the title/subtitle of the document
  • author: author of the document
  • format: output format (we can customize)
    • theme: if you have any style file
    • toc: if you want index or not
    • toc-location: index position
    • toc-title: index title
  • editor: if you are in visual or source mode.
---
title: "prueba"
author: "javier álvarez liébana"
format:
  html:
    style: style.css
    toc: true
    toc-location: left
    toc-title: Índice
editor: visual
---

Text

Regarding typing there is only one important thing: unless we indicate otherwise, EVERYTHING we are going to type is (normal) text; no R code.

We are going to start by writing a section at the beginning (# Intro and behind it, for example, the sentence

This material has been designed by Professor Javier Álvarez Liébana, professor at the Complutense University of Madrid

In addition to the Running Code we will add a # pad: the outside-chunks pads will help us create epigraphs (sections) in the document.

Index

To make the index capture those sections we will modify the header of the file as shown in the image (you can change the location of the index and the title if you want to test).

Text

Let’s customize the text a bit by doing the following:

  • We will add bold to the name (putting ** at the beginning and at the end).

  • We will add cursive to the word material (putting _ at the beginning and at the end).

  • We will add a link https://www.ucm.es, associating it to the name of the University. To do this we put the title in square brackets and just behind the link in brackets [“Universidad Complutense de Madrid”](https://www.ucm.es).

Code

To include R code we must create our code boxes called chunks: high in the path in our markdown text where we can include code from almost any language (and its outputs).

 

To include one you must go header as follows you have a shortcut Command + Option + I (Mac) or Ctrl + Shift + I (Windows)

Code

Inside this box (which now has a different color in the document) write code R as we have been doing so far in the scripts.

Let’s for example define two variables and their sum in the following way, writing this code in our .qmd (inside that chunk)

# R code
x <- 1
y <- 2
x + y
[1] 3

Running chunks

Chunks can have a name or tag, so that we can reference them again to avoid repeating code.

Running chunks

In each chunk there are two buttons:

  • play button: activates the play and exit of that particular chunk (you can view it within your own RStudio)

  • rewind button: activates the play and exit of all chunks up to that one (without reaching it).

 

In addition we can include R code inside the text line (instead of displaying the text x execute the R code displaying the variable).

Customizing chunks

The chunks can be customized with options at the beginning of the chunk preceded by #|:

  • #| echo: false: execute code and show result but does not display code in the output. hl-red} on output.

  • #| include: false: executes code but does not display result and does not display code on output.

  • #| eval: false: [does not execute code]{. hl-red}, does not display result but does display code on output.

  • #| message: false: executes code but does not display output messages.

  • #| warning: false: runs code but does not display warning messages.

  • #| error: true: runs code and allows errors displaying the error message in the output.

These options can be applied chunk by chunk or set globally with knitr::opts_chunk$set() at the beginning of the document (within a chunk).

Customizing chunks

If we want to apply the option to all chunks by default we must include it at the end of the header, as run options

---
title: "¡Hola!"
format: html
editor: visual
execute:
  echo: false
---

Organizing

In addition to text and code we can enter the following:

  • Equations: you can also add equations like \(x^2\) (I have written $x^2$, the equation between dollars).

  • [Lists]{. hl-yellow}: you can itemize elements by putting * Step 1: ... *Step 2: ...

  • Cross-references: you can tag parts of the document (the tag is constructed with {#section-name}) and then call them with [Section](@section-name).

Figures and images

Finally, we can also add captions to graphics or images by adding #| fig-cap: "...".

Notice that the caption is in the margin (for example). You can change it by entering header settings (everything about figures starts with fig-, and you can see the options by tabbing). You have more information at https://quarto.org/

Styles

Finally you can add a custom theme including a style file (.scss or .css file). I have left one for you at https://github.com/dadosdelaplace/docencia-R-master-bio-2324/tree/main/material.

Important

The style file must be in the same folder as the .qmd file.

Styles

You can also do it in a simple way adding a bit of HTML to the text. For example, to customize the color of a text it goes between square brackets and right after the text, between braces, the style options

This word is [red]{style="color:red;"} ...
... and this is [green in bold]{style="color:green; font-weight: bold;"}

This word is red

… and this is green in bold

Revealjs

You can add some “animations” using what is known as Revealjs (javascript), specifying it in the header and using blocks of that language delimited by ::: at the beginning and end, and the word of the “tool” to use. For example {.incremental} transitions the elements.

format:
  revealjs

 

::: {.incremental}
- I
- am
- Javi
:::
  • I
  • am
  • Javi

Call blocks

You can also use the callout-blocks which by default are note, tip, warning, caution and important (although you can create and customize them). To do this, just use ::{.callout-type} and the type you want to use

:::{.callout-tip}

Note that there are five types of callouts, including: 
`note`, `tip`, `warning`, `caution`, and `important`.

:::

Consejo

Note that there are five types of callouts, including: note, tip, warning, caution, and important.

Peligro

Use them wisely, sometimes a lot of aesthetic resources can be dizzying.

Multiple columns layout

With :::: columns we can define a layout of multiple columns where each one is defined by ::: {.column width=“65%”} something :::, indicating next to the percentage how much you want each column to occupy (be careful not to leave spaces!).

:::: columns
::: {.column width="65%"}
This is how to define a vector
:::
::: {.column width="35%"}
x <- c(1, 2, 3)
x
:::
::::

 

This is how to define a vector

x <- c(1, 2, 3)
x
[1] 1 2 3

Non-R code

In addition {reticulate} allows us to create python chunks inside a Quarto in R (see https://quarto.org/docs/computations/python.html to create jupyter notebooks directly from Quarto).

# install.packages("reticulate")
library(reticulate)

install_python("3.9.12") # Installing python if you have not done before

# Installing Python libraries
reticulate::py_install("numpy")
reticulate::py_install("matplotlib")
import numpy as np
import matplotlib.pyplot as plt
r = np.arange(0, 2, 0.05)
theta = 2 * np.pi * r
fig, ax = plt.subplots(
  subplot_kw = {'projection': 'polar'} 
)
ax.plot(theta, r)
plt.show()

🐣 Case study I: flow structures

To practice control structures we are going to perform a simulation exercise

 

Try to answer the questions posed in the workbook

🐣 Case study II: functions

Define a function called temperature_converter that, given a temperature in Fahrenheit, Celsius or Kelvin, converts it to any of the others

 

Try to answer the questions posed in the workbook

🐣 Case study III: mock task

Let’s perform a small simulation before delivery using the starwars dataset from the {dplyr} package.

🐣 Case study III: mock task

library(dplyr)
starwars
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

We have different variables of the Star Wars characters, with characteristics of their hair, skin, height, name, etc.

🐣 Case study III: mock task

Create a .qmd document with name, title, format and index. Each subsequent Exercise will be a subsection of the document. Run the chunks you consider and comment on the outputs to answer each question

Exercise 1. How many characters are stored in the database? How many characteristics have been measured for each one?

Exercise 2. Extract in two different variables names and ages the corresponding variables from the table. What type is the variable name? And the variable birth_year?

Exercise 3. Obtain the vector of names of the characters sorted from oldest to youngest.

🐣 Case study III: mock task

Exercise 4. Get help from the unique() function. Use it to find out what modalities the qualitative variable for eye color has. How many different ones are there?

Exercise 5. Are there ANY missing values in the eye color variable?

Exercise 6. Calculate the mean and standard deviation of the height and weight variables (watch out for missing values). Define a new tibble with these two variables and incorporate a third variable called “BMI” that calculates the body mass index. Incorporate with $ $ the formula used for BMI.

Final task: foundations of R base

Final task (70%): 10/09/2024 from 15:00 to 16:10

Final task (intro R)

The day of the submission you will have uploaded a submission template in .qmd format on campus.

  1. Unzip the folder (important! if you don’t unzip, even if you can edit the .qmd, you won’t be able to generate the . html)

  2. Edit the header with your name and ID

  3. You must fill in each chunk with the code you consider (in some I have left hints) and change from #| eval: false to #| eval: true (if you remove them directly, by default it is already true)

  4. You must comment with normal text what you consider to answer the questions.

  5. It will be MANDATORY to upload the generated .html file (only that file will be corrected) so render as you fill in the document, don’t leave it to the end.

DATA PROGRAMMING

L1: tidydata

Our favorite format: tibble. Tidyverse: an universe of tidy data

💻 Task correction

Try to perform the following exercises without looking at the solutions

Previously, in Breaking Bad…

Our final database format will be the tibble type object, an enhanced data.frame.

library(tibble)
tibble("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70), "BMI" = weight / (height^2))
# A tibble: 3 × 3
  height weight   BMI
   <dbl>  <dbl> <dbl>
1    1.7     80  27.7
2    1.8     75  23.1
3    1.6     70  27.3
  • Metainformation: in the header it automatically tells us the number of rows and columns, and the type of each variable.

  • Recursivity: allows to define the variables sequentially (as we have seen).

  • Consistency: if you access a column that does not exist it warns you with a warning.

  • By rows: allows to create by rows with tribble().

Previously, in Breaking Bad…

To define a tibble() ourselves we have two options:

  • Concatenating vectors that we already have defined, making use of the tibble() function of the {tibble} package (already included in {tidyverse})
height <- c(1.7, 1.8, 1.6)
weight <- c(80, 75, 70)
BMI <- weight / (height^2)
tibble("height" = height, "weight" = weight, "BMI" = BMI)
# A tibble: 3 × 3
  height weight   BMI
   <dbl>  <dbl> <dbl>
1    1.7     80  27.7
2    1.8     75  23.1
3    1.6     70  27.3

Previously, in Breaking Bad…

  • Directly in a tibble manually providing values and variable names
tibble("height" = c(1.7, 1.8, 1.6),
       "weight" = c(80, 75, 70),
       "BMI" = weight / (height^2))
# A tibble: 3 × 3
  height weight   BMI
   <dbl>  <dbl> <dbl>
1    1.7     80  27.7
2    1.8     75  23.1
3    1.6     70  27.3

R base vs Tidyverse

So far, everything we have done in R has been done in the programming paradigm known as R base. When R was born as a language, many of those who programmed in it imitated forms and methodologies inherited from other languages, based on the use of

  • Loops for and while

  • Dollar $ to access to the variables

  • Structures if-else

And although knowing these structures can be interesting in some cases, in most cases they are obsolete and we will be able to avoid them (especially loops) since R is specially designed to work in a functional way (instead of element-by-element).

What is tidyverse?

In this context of functional programming, a decade ago {tidyverse} was born, a “universe” of packages to guarantee an efficient, coherent and lexicographically simple to understand workflow, based on the idea that our data is clean and tidy.

What is tidyverse?

  • {lubridate}: date management
  • {rvest}: web scraping
  • {tidymodels}: modeling/prediction
  • {tibble}: optimizing data.frame
  • {tidyr}: data cleaning
  • {readr}: load rectangular data (.csv), {readxl}: import .xls and .xlsx files
  • {dplyr}: grammar for debugging
  • {stringr}: text handling
  • {purrr}: list handling
  • {forcats}: qualitative handling
  • {ggplot2}: data visualization

What is tidyverse?

  • {lubridate}: date management
  • {rvest}: web scraping
  • {tidymodels}: modeling/prediction
  • {tibble}: optimizing data.frame
  • {tidyr}: data cleaning
  • {readr}: load rectangular data (.csv), {readxl}: import .xls and .xlsx files
  • {dplyr}: grammar for debugging
  • {stringr}: text handling
  • {purrr}: list handling
  • {forcats}: qualitative handling
  • {ggplot2}: data visualization

Basic idea: tidy data

Tidy datasets are all alike, but every messy dataset is messy in its own way (Hadley Wickham, Chief Scientist en RStudio)

TIDYVERSE

The universe of {tidyverse} packages is based on the idea introduced by Hadley Wickham (the God we pray to) of standardizing the format of data to

  • systematize debugging
  • make it easier simpler to manipulate
  • legible code.

Rules

The first thing will therefore be to understand what the tidydata sets are, since the whole {tidyverse} is based on the data being standardized.

  1. Each variable in a single column
  1. Each individual in a different row
  1. Each cell with a single value
  1. Each dataset in a tibble
  1. If we want to join multiple datasets we must have a common (key) column.

Pipe

In {tidyverse} the operator pipe (pipe) defined as |> (ctrl+shift+M) will be key: it will be a pipe that traverses the data and transforms it. . . .

In R base, if we want to apply three functions first(), second() and third() in order, it would be

third(second(first(data)))

In {tidyverse} we can read from left to right and separate data from the actions

data |> first() |> second() |> third()

Important

Since version 4.1.0 of R we have |>, a native pipe available outside tidyverse, replacing the old pipe %>% which depended on the {magrittr} package (quite problematic).

Pipe

The main advantage is that the code is very readable (almost literal) and you can do large operations on the data with very little code.

data |>
  tidy(...) |>
  filter(...) |>
  select(...) |>
  arrange(...) |>
  modify(...) |>
  rename(...) |>
  group(...) |>
  count(...) |>
  summarise(...) |>
  plot(...)

Messy data

But what does the non-tidy (messy) data look like? Let’s load the table4a table from the {tidyr} package (we already have it loaded from the {tidyverse} environment).

library(tidyr)
table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

What could be wrong?

Pivot longer

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

❎ Each row represents two observations (1999 and 2000) → the columns 1999 and 2000 should actually themselves be values of a variable and not column names.

We will include a new column that stores the year and another one that stores the value of the variable of interest in each of those years. And we will do it with the pivot_longer() function: pivot the table to long format:

table4a |> 
  pivot_longer(cols = c("1999", "2000"), names_to = "year", values_to = "cases")
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Pivot longer

table4a |> 
  pivot_longer(cols = c("1999", "2000"),
               names_to = "year",
               values_to = "cases")
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

  • cols: name of the variables to pivot.
  • names_to: name of the new variable to which we send the header of the table (the names).
  • values_to: name of the new variable to which we are going to send the data.

Messy data

Let’s see another example with table table2.

table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

What could be wrong?

Pivot wider

# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

❎ Each observation is divided into two rows → the records with the same year should be the same

What we will do will be the opposite: with pivot_wider() we will widen the table

table2 |>  pivot_wider(names_from = type, values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Messy data

Let’s see another example with table table3.

table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

What could be wrong?

Separate

table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

❎ Each cell contains several values

What we will do is make use of the separate() function to send separate each value to a different column.

table3 |> separate(rate, into = c("cases", "pop"))
# A tibble: 6 × 4
  country      year cases  pop       
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Separate

table3 |> separate(rate, into = c("cases", "pop"))
# A tibble: 6 × 4
  country      year cases  pop       
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Notice that the data, although it has separated them, kept them as text when in fact they should be numeric variables. For this we can add the optional argument convert = TRUE.

table3 |> separate(rate, into = c("cases", "pop"), convert = TRUE)
# A tibble: 6 × 4
  country      year  cases        pop
  <chr>       <dbl>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Messy data

Let’s see the last example with table table5.

table5
# A tibble: 6 × 4
  country     century year  rate             
  <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

What could be wrong?

Unite

table5
# A tibble: 6 × 4
  country     century year  rate             
  <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

❎ We have same values divided in two columns

We will use unite() to unite the values of century and year in the same column

table5 |> unite(col = whole_year, century, year, sep = "")
# A tibble: 6 × 3
  country     whole_year rate             
  <chr>       <chr>      <chr>            
1 Afghanistan 1999       745/19987071     
2 Afghanistan 2000       2666/20595360    
3 Brazil      1999       37737/172006362  
4 Brazil      2000       80488/174504898  
5 China       1999       212258/1272915272
6 China       2000       213766/1280428583

Nest data

We can also nest datasets inside another one: imagine we have a dataset with variables x and y, with two records, another with one and another with 3 of them.

data <-
  tibble("dataset" = c(1, 1, 2, 3, 3, 3), 
         "x" = c(0, 2, NA, -2, 6, 7),
         "y" = c(-1, NA, 5, 1.5, NA, -2))
data
# A tibble: 6 × 3
  dataset     x     y
    <dbl> <dbl> <dbl>
1       1     0  -1  
2       1     2  NA  
3       2    NA   5  
4       3    -2   1.5
5       3     6  NA  
6       3     7  -2  

In reality everything that has an equal value in dataset should form its own tibble so let’s create one inside the one we have

Nest data

For it we will use the function nest() indicating it which variables form the datasets that will be nested (in this case variables x and y). Notice that inside what it stores is a variable of type list (since each dataset has a different length).

data_nest <-
  data |>
  nest(data = c(x, y))
data_nest
# A tibble: 3 × 2
  dataset data            
    <dbl> <list>          
1       1 <tibble [2 × 2]>
2       2 <tibble [1 × 2]>
3       3 <tibble [3 × 2]>

Nest data

To unnest just use the unnest() function indicating the column to unnest.

data_nest |> unnest(cols = c(data))
# A tibble: 6 × 3
  dataset     x     y
    <dbl> <dbl> <dbl>
1       1     0  -1  
2       1     2  NA  
3       2    NA   5  
4       3    -2   1.5
5       3     6  NA  
6       3     7  -2  

Example: world bank pop

In the {tidyr} package we have the world_bank_pop dataset which contains data from the World Bank about population per country from 2000 to 2018.

library(tidyr)
world_bank_pop
# A tibble: 1,064 × 20
   country indicator      `2000`  `2001`  `2002`  `2003`  `2004`  `2005`  `2006`
   <chr>   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ABW     SP.URB.TOTL    4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
 2 ABW     SP.URB.GROW    1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
 3 ABW     SP.POP.TOTL    8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
 4 ABW     SP.POP.GROW    2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
 5 AFE     SP.URB.TOTL    1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
 6 AFE     SP.URB.GROW    3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
 7 AFE     SP.POP.TOTL    4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
 8 AFE     SP.POP.GROW    2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
 9 AFG     SP.URB.TOTL    4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
10 AFG     SP.URB.GROW    1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
# ℹ 1,054 more rows
# ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
#   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
#   `2016` <dbl>, `2017` <dbl>

What could be wrong?

Example: world bank pop

First of all, we can see that we effectively have the same variable in 18 columns: population.

What we should have is a column called pop with these values and another year indicating to which year corresponds the measurement. And for this we will do it with pivot_longer().

world_bank_pop_tidy <-
  world_bank_pop |> 
  pivot_longer(cols = -(country:indicator), names_to = "year", values_to = "value")
world_bank_pop_tidy
# A tibble: 19,152 × 4
   country indicator   year  value
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

Example: world bank pop

world_bank_pop_tidy
# A tibble: 19,152 × 4
   country indicator   year  value
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

Is everything correct?

If you notice we have two types of population measures, total ...TOTL and its growth ...GROW, but in addition we have them for each country in global SP.POP... and only in urban area SP.URB....

unique(world_bank_pop_tidy$indicator)
[1] "SP.URB.TOTL" "SP.URB.GROW" "SP.POP.TOTL" "SP.POP.GROW"

Example: world bank pop

What should be done?

We will separate this variable into 3: one for the prefix SP (which we will eliminate later), one for the area (POP/URB) and one for the value (variable), which can be total or growth.

world_bank_pop_tidy2 <-
  world_bank_pop_tidy |>
  separate(indicator, c("dummy", "area", "variable"))
world_bank_pop_tidy2$dummy <- NULL
world_bank_pop_tidy2
# A tibble: 19,152 × 5
   country area  variable year  value
   <chr>   <chr> <chr>    <chr> <dbl>
 1 ABW     URB   TOTL     2000  41625
 2 ABW     URB   TOTL     2001  42025
 3 ABW     URB   TOTL     2002  42194
 4 ABW     URB   TOTL     2003  42277
 5 ABW     URB   TOTL     2004  42317
 6 ABW     URB   TOTL     2005  42399
 7 ABW     URB   TOTL     2006  42555
 8 ABW     URB   TOTL     2007  42729
 9 ABW     URB   TOTL     2008  42906
10 ABW     URB   TOTL     2009  43079
# ℹ 19,142 more rows

Example: world bank pop

This can be done in a simpler way by indicating in the variable that we want to eliminate that it is NA inside separate().

world_bank_pop_tidy <-
  world_bank_pop_tidy |>
  separate(indicator, c(NA, "area", "variable"))
world_bank_pop_tidy
# A tibble: 19,152 × 5
   country area  variable year  value
   <chr>   <chr> <chr>    <chr> <dbl>
 1 ABW     URB   TOTL     2000  41625
 2 ABW     URB   TOTL     2001  42025
 3 ABW     URB   TOTL     2002  42194
 4 ABW     URB   TOTL     2003  42277
 5 ABW     URB   TOTL     2004  42317
 6 ABW     URB   TOTL     2005  42399
 7 ABW     URB   TOTL     2006  42555
 8 ABW     URB   TOTL     2007  42729
 9 ABW     URB   TOTL     2008  42906
10 ABW     URB   TOTL     2009  43079
# ℹ 19,142 more rows

Have we got it yet? Think carefully: does each variable have its own column?

Example: world bank pop

If you actually look at the total population and growth variables, they should be different variables, since they even have different units: one is inhabitants, the other is percentage points.

To do the reverse of the initial operation, pivot_wider() (later we will use a tremendously useful function, {janitor}’s clean_names() which unifies variable names).

world_bank_pop_tidy <-
  world_bank_pop_tidy |>
  pivot_wider(names_from = "variable", values_from = "value") |> 
  janitor::clean_names()
world_bank_pop_tidy
# A tibble: 9,576 × 5
   country area  year   totl   grow
   <chr>   <chr> <chr> <dbl>  <dbl>
 1 ABW     URB   2000  41625 1.66  
 2 ABW     URB   2001  42025 0.956 
 3 ABW     URB   2002  42194 0.401 
 4 ABW     URB   2003  42277 0.197 
 5 ABW     URB   2004  42317 0.0946
 6 ABW     URB   2005  42399 0.194 
 7 ABW     URB   2006  42555 0.367 
 8 ABW     URB   2007  42729 0.408 
 9 ABW     URB   2008  42906 0.413 
10 ABW     URB   2009  43079 0.402 
# ℹ 9,566 more rows

Example: world bank pop

The complete code would be this: short, concise, readable and self-descriptive.

world_bank_pop_tidy <-
  world_bank_pop |> 
  pivot_longer(cols = -(country:indicator), names_to = "year", values_to = "value") |> 
  separate(indicator, c(NA, "area", "variable")) |> 
  pivot_wider(names_from = "variable", values_from = "value") |> 
  janitor::clean_names()
world_bank_pop_tidy
# A tibble: 9,576 × 5
   country area  year   totl   grow
   <chr>   <chr> <chr> <dbl>  <dbl>
 1 ABW     URB   2000  41625 1.66  
 2 ABW     URB   2001  42025 0.956 
 3 ABW     URB   2002  42194 0.401 
 4 ABW     URB   2003  42277 0.197 
 5 ABW     URB   2004  42317 0.0946
 6 ABW     URB   2005  42399 0.194 
 7 ABW     URB   2006  42555 0.367 
 8 ABW     URB   2007  42729 0.408 
 9 ABW     URB   2008  42906 0.413 
10 ABW     URB   2009  43079 0.402 
# ℹ 9,566 more rows

Example: who dataset

In{tidyr} package we have who2 dataset (World Health Organization dataset)

library(tidyr)
who2
# A tibble: 7,240 × 58
   country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
   <chr>       <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 Afghanistan  1980       NA        NA        NA        NA        NA        NA
 2 Afghanistan  1981       NA        NA        NA        NA        NA        NA
 3 Afghanistan  1982       NA        NA        NA        NA        NA        NA
 4 Afghanistan  1983       NA        NA        NA        NA        NA        NA
 5 Afghanistan  1984       NA        NA        NA        NA        NA        NA
 6 Afghanistan  1985       NA        NA        NA        NA        NA        NA
 7 Afghanistan  1986       NA        NA        NA        NA        NA        NA
 8 Afghanistan  1987       NA        NA        NA        NA        NA        NA
 9 Afghanistan  1988       NA        NA        NA        NA        NA        NA
10 Afghanistan  1989       NA        NA        NA        NA        NA        NA
# ℹ 7,230 more rows
# ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
#   sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
#   sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
#   sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
#   sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
#   sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …

Is it tidy data? Why?

Example: who dataset

First step for tidy data: we must pivot the table (tip: use paper and pen to sketch how the database should look like) so that there is a column called cases (since all columns starting from year is actually the same, cases of a disease).

who_tidy <-
  who2 |> 
  pivot_longer(cols = -(country:year), names_to = "type", values_to = "cases")
who_tidy
# A tibble: 405,440 × 4
   country      year type      cases
   <chr>       <dbl> <chr>     <dbl>
 1 Afghanistan  1980 sp_m_014     NA
 2 Afghanistan  1980 sp_m_1524    NA
 3 Afghanistan  1980 sp_m_2534    NA
 4 Afghanistan  1980 sp_m_3544    NA
 5 Afghanistan  1980 sp_m_4554    NA
 6 Afghanistan  1980 sp_m_5564    NA
 7 Afghanistan  1980 sp_m_65      NA
 8 Afghanistan  1980 sp_f_014     NA
 9 Afghanistan  1980 sp_f_1524    NA
10 Afghanistan  1980 sp_f_2534    NA
# ℹ 405,430 more rows

Example: who dataset

If you notice there are a lot of rows that do not make sense to keep because we have no cases! Investigate the pivot_longer() options to see how we can directly remove them in the pivot.

who_tidy <-
  who2 |> 
  pivot_longer(cols = -(country:year), names_to = "type", values_to = "cases",
               values_drop_na = TRUE)
who_tidy
# A tibble: 76,046 × 4
   country      year type      cases
   <chr>       <dbl> <chr>     <dbl>
 1 Afghanistan  1997 sp_m_014      0
 2 Afghanistan  1997 sp_m_1524    10
 3 Afghanistan  1997 sp_m_2534     6
 4 Afghanistan  1997 sp_m_3544     3
 5 Afghanistan  1997 sp_m_4554     5
 6 Afghanistan  1997 sp_m_5564     2
 7 Afghanistan  1997 sp_m_65       0
 8 Afghanistan  1997 sp_f_014      5
 9 Afghanistan  1997 sp_f_1524    38
10 Afghanistan  1997 sp_f_2534    36
# ℹ 76,036 more rows

Example: who dataset

Now in type we have coded the information as diagnosis_sex_age. How to separate it in 3 columns? Investigate both separate() and pivot_longer() options.

# separate
who_tidy <-
  who_tidy |> 
  separate(col = "type", into = c("diagnosis", "sex", "age"))

# pivot_longer
who_tidy <-
  who2 |> 
  pivot_longer(cols = -(country:year), names_to = c("diagnosis", "sex", "age"),
               values_to = "cases", values_drop_na = TRUE,
               names_sep = "_")
who_tidy
# A tibble: 76,046 × 6
   country      year diagnosis sex   age   cases
   <chr>       <dbl> <chr>     <chr> <chr> <dbl>
 1 Afghanistan  1997 sp        m     014       0
 2 Afghanistan  1997 sp        m     1524     10
 3 Afghanistan  1997 sp        m     2534      6
 4 Afghanistan  1997 sp        m     3544      3
 5 Afghanistan  1997 sp        m     4554      5
 6 Afghanistan  1997 sp        m     5564      2
 7 Afghanistan  1997 sp        m     65        0
 8 Afghanistan  1997 sp        f     014       5
 9 Afghanistan  1997 sp        f     1524     38
10 Afghanistan  1997 sp        f     2534     36
# ℹ 76,036 more rows

Example: who dataset

Finally, separate in two (age_inf, age_sup) the age range (which are numbers). Think about how to do it since it is not always 4 numbers (if there is no upper age range defined, put Inf as an upper bound).

who_tidy <-
  who_tidy |> 
  separate(col = "age", into = c("age_inf", "age_sup"), sep = -2, convert = TRUE)
who_tidy$age_inf <- if_else(is.na(who_tidy$age_inf), 65, who_tidy$age_inf)
who_tidy$age_sup <- if_else(who_tidy$age_sup == 65, Inf, who_tidy$age_sup)
who_tidy
# A tibble: 76,046 × 7
   country      year diagnosis sex   age_inf age_sup cases
   <chr>       <dbl> <chr>     <chr>   <dbl>   <dbl> <dbl>
 1 Afghanistan  1997 sp        m           0      14     0
 2 Afghanistan  1997 sp        m          15      24    10
 3 Afghanistan  1997 sp        m          25      34     6
 4 Afghanistan  1997 sp        m          35      44     3
 5 Afghanistan  1997 sp        m          45      54     5
 6 Afghanistan  1997 sp        m          55      64     2
 7 Afghanistan  1997 sp        m          65     Inf     0
 8 Afghanistan  1997 sp        f           0      14     5
 9 Afghanistan  1997 sp        f          15      24    38
10 Afghanistan  1997 sp        f          25      34    36
# ℹ 76,036 more rows

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Use the original who2 dataset from the {tidyr} package and try to answer the question: how many cases of tuberculosis were there in Spain in 1995 among women? Do it without converting it to tidydata. After that, pivot in a simple way and compare the code to be done when we have tidydata to when we don’t. Which one is more readable if you didn’t know R? Which one has higher error probability?

Código
# messy data
sum(who2[who2$country == "Spain" & who2$year == 1995,
     names(who2)[str_detect(names(who2), "f_")]], na.rm = TRUE)

# tidy data (at this moment)
sum(who_tidy[who_tidy$country == "Spain" &
           who_tidy$year == 1995 &
           who_tidy$sex == "f", ]$cases)

# in the future
who_tidy |> 
  filter(country == "Spain" & year == 1995 & sex == "f") |> 
  summarise(sum(cases))

📝 Using who_tidy determine which sex has had more cases, men or women? Create a new variable avg_age (mean age of the interval): if the range is 25 to 34, the mean age will be \((25 + 34)/2 = 29.5\) (if Inf above, NA)

Código
# f vs m
sum(who_tidy[who_tidy$sex == "m", ]$cases)
sum(who_tidy[who_tidy$sex == "f", ]$cases)

# ave age
who_tidy$ave_age <- 
  if_else(is.infinite(who_tidy$age_sup), NA, (who_tidy$age_inf + who_tidy$age_sup)/2)

📝 If we must choose a country in which we have the lowest probability of infection, which country, between the United Kingdom (United Kingdom of Great Britain and Northern Ireland) and France (similar population), had the fewest cases in the most recent year (whichever it was, even if the table was updated)?

Código
last_cases <- who_tidy[who_tidy$year == max(who_tidy$year), ]
sum(last_cases[last_cases$country == "United Kingdom of Great Britain and Northern Ireland", "cases"])
sum(last_cases[last_cases$country == "France", "cases"])
# better France

📝 Take a look at table table4b in package {tidyr}. Is it tidydata? If not, what is wrong, how to convert it to tidy data in case it is not already?

Código
table4b |>
  pivot_longer(cols = "1999":"2000", names_to = "year",
               values_to = "cases")

📝 Take a look at the billboard table in the {tidyr} package. Is it tidydata? If not, what is wrong, how to convert it to tidy data in case it is not already?

Código
billboard |>
  pivot_longer(cols = "wk1":"wk76",
               names_to = "week",
               names_prefix = "wk",
               values_to = "position",
               values_drop_na = TRUE)

🐣 Case study

Let’s perform a case study with the relig_income table in {tidyr} package. As indicated (? relig_income), the table represents the number of people in each annual income bracket (20k = 20 000$) and in each religion.

relig_income

 

Try to answer the questions posed in the workbook

L2: tidyverse (rows)

Tidyverse: actions by rows

Preprocessing: dplyr

Within {tidyverse} we will use the {dplyr} package for the preprocessing process of the data.

data |>
  tidy(...) |>
  filter(...) |>
  select(...) |>
  arrange(...) |>
  modify(...) |> # mutate in the code
  rename(...) |>
  group(...) |>
  count(...) |>
  summarise(...) |>
  plot(...) # actually ggplot

The idea is that the code is as readable as possible, as if it were a list of instructions that when read tells us in a very obvious way what it is doing.

Assumption: tidydata

All the preprocessing process we are going to perform is on the assumption that our data is in tidydata

Remember that in {tidyverse} the pipe operator defined as |> (ctrl+shift+M) will be key: it will be a pipe that traverses the data and transforms it.

Let us practice with the starwars dataset from the {dplyr} package.

library(tidyverse)
starwars

Sampling

One of the most common operations is what is known in statistics as sampling: a selection or filtering of records (rows) (a subsample).

  • Non-random (by quota): based on logical conditions on the records (filter()).
  • Non-random (intentional/discretionary): based on a position (slice()).
  • Simple random (slice_sample()).
  • Stratified (group_by() + slice_sample()).

Filter rows: filter()

data |>
  filtro(condition)
starwars |>
  filter(condition)

The simplest action by rows is when filter records based on some logical condition: with filter() only individuals meeting certain conditions will be selected (non-random sampling by conditions).

  • ==, !=: equal or different to (|> filter(variable == "a"))
  • >, <: greater or less than (|> filter(variable < 3))
  • >=, <=: greater or equal or less or equal than (|> filter(variable >= 5))
  • %in%: values belong to a set of discrete options (|> filter(variable %in% c("blue", "green")))
  • between(variable, val1, val2): if continuous values are inside of a range (|> filter(between(variable, 160, 180)))

Filter rows: filter()

These logical conditions can be combined in different ways (and, or, or mutually exclusive).

Important

Remember that inside filter() there must always be something that returns a vector of logical values.

Filter rows: filter()

data |>
  filtro(condition)
starwars |>
  filter(condition)

How would you go about… filter the characters with brown eyes?

What type of variable is it? –> The eye_color variable is qualitative so it is represented by texts.

starwars |>
  filter(eye_color == "brown")
# A tibble: 21 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Leia Or…    150  49   brown      light      brown           19   fema… femin…
 2 Biggs D…    183  84   black      light      brown           24   male  mascu…
 3 Han Solo    180  80   brown      fair       brown           29   male  mascu…
 4 Yoda         66  17   white      green      brown          896   male  mascu…
 5 Boba Fe…    183  78.2 black      fair       brown           31.5 male  mascu…
 6 Lando C…    177  79   black      dark       brown           31   male  mascu…
 7 Arvel C…     NA  NA   brown      fair       brown           NA   male  mascu…
 8 Wicket …     88  20   brown      brown      brown            8   male  mascu…
 9 Padmé A…    185  45   brown      light      brown           46   fema… femin…
10 Quarsh …    183  NA   black      dark       brown           62   male  mascu…
# ℹ 11 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Filter rows: filter()

data |>
  filtro(condition)
starwars |>
  filter(condition)

How would you go about… filter the characters that do not have brown eyes?

starwars |>
  filter(eye_color != "brown")
# A tibble: 66 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 6 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 7 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 8 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 9 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
10 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
# ℹ 56 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Filter rows: filter()

data |>
  filtro(condition)
starwars |>
  filter(condition)

How would you go about … filter characters that have brown or blue eyes?

starwars |>
  filter(eye_color %in% c("blue", "brown"))
# A tibble: 40 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 3 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 4 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 5 Biggs D…    183    84 black      light      brown           24   male  mascu…
 6 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 7 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
 8 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 9 Han Solo    180    80 brown      fair       brown           29   male  mascu…
10 Jek Ton…    180   110 brown      fair       blue            NA   <NA>  <NA>  
# ℹ 30 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Filter rows: filter()

data |>
  filtro(condition)
starwars |>
  filter(condition)

Note that %in% is equivalent to concatenating several == with a conjunction or (|)

starwars |>
  filter(eye_color == "blue" | eye_color == "brown")
# A tibble: 40 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 3 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 4 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 5 Biggs D…    183    84 black      light      brown           24   male  mascu…
 6 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 7 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
 8 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 9 Han Solo    180    80 brown      fair       brown           29   male  mascu…
10 Jek Ton…    180   110 brown      fair       blue            NA   <NA>  <NA>  
# ℹ 30 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Filter rows: filter()

data |>
  filtro(condition)
starwars |>
  filter(condition)

How would you go about … filter the characters that are between 120 and 160 cm?

What type of variable is it? –> The variable height is a continuous quantitative variable so we must filter by ranges of values (intervals) –> we will use between().

starwars |>
  filter(between(height, 120, 160))
# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Mon Moth…    150    NA auburn     fair       blue              48 fema… femin…
3 Nien Nunb    160    68 none       grey       black             NA male  mascu…
4 Watto        137    NA black      blue, grey yellow            NA male  mascu…
5 Gasgano      122    NA none       white, bl… black             NA male  mascu…
6 Cordé        157    NA brown      light      brown             NA <NA>  <NA>  
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Filter rows: filter()

data |>
  filtro(condition)
starwars |>
  filter(condition)

How would you… filter characters that have eyes and are not human?

starwars |>
  filter(eye_color == "brown" & species != "Human")
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Yoda          66    17 white      green      brown            896 male  mascu…
2 Wicket S…     88    20 brown      brown      brown              8 male  mascu…
3 Eeth Koth    171    NA black      brown      brown             NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Filter rows: filter()

data |>
  filtro(condition)
starwars |>
  filter(condition)

How would you… filter characters that have eyes and are not human, or are over 60 years old? Think it through: the parentheses are important: \((a+b)*c\) is not the same as \(a+(b*c)\).

starwars |>
  filter((eye_color == "brown" & species != "Human") | birth_year > 60)
# A tibble: 18 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 C-3PO       167    75 <NA>       gold       yellow           112 none  mascu…
 2 Wilhuff…    180    NA auburn, g… fair       blue              64 male  mascu…
 3 Chewbac…    228   112 brown      unknown    blue             200 male  mascu…
 4 Jabba D…    175  1358 <NA>       green-tan… orange           600 herm… mascu…
 5 Yoda         66    17 white      green      brown            896 male  mascu…
 6 Palpati…    170    75 grey       pale       yellow            82 male  mascu…
 7 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 8 Qui-Gon…    193    89 brown      fair       blue              92 male  mascu…
 9 Finis V…    170    NA blond      fair       blue              91 male  mascu…
10 Quarsh …    183    NA black      dark       brown             62 male  mascu…
11 Shmi Sk…    163    NA black      fair       brown             72 fema… femin…
12 Mace Wi…    188    84 none       dark       brown             72 male  mascu…
13 Ki-Adi-…    198    82 white      pale       yellow            92 male  mascu…
14 Eeth Ko…    171    NA black      brown      brown             NA male  mascu…
15 Cliegg …    183    NA brown      fair       blue              82 male  mascu…
16 Dooku       193    80 white      fair       brown            102 male  mascu…
17 Bail Pr…    191    NA black      tan        brown             67 male  mascu…
18 Jango F…    183    79 black      tan        brown             66 male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Drop missings: drop_na()

data |>
  drop_missings(var1, var2, ...)
starwars |>
  drop_na(var1, var2, ...)

There is a special filter for one of the most common operations in debugging: remove absent. For this we can use inside a filter is.na(), which returns TRUE/FALSE depending on whether it is absent, or …

Use drop_na(): if we do not specify a variable, it removes records with missing in any variable. Later on we will see how to impute those missing

starwars |>
  drop_na(mass, height)
# A tibble: 7 × 4
  name                mass height hair_color 
  <chr>              <dbl>  <int> <chr>      
1 Luke Skywalker        77    172 blond      
2 C-3PO                 75    167 <NA>       
3 R2-D2                 32     96 <NA>       
4 Darth Vader          136    202 none       
5 Leia Organa           49    150 brown      
6 Owen Lars            120    178 brown, grey
7 Beru Whitesun Lars    75    165 brown      
starwars |>
  drop_na()
# A tibble: 7 × 4
  name                mass height hair_color   
  <chr>              <dbl>  <int> <chr>        
1 Luke Skywalker        77    172 blond        
2 Darth Vader          136    202 none         
3 Leia Organa           49    150 brown        
4 Owen Lars            120    178 brown, grey  
5 Beru Whitesun Lars    75    165 brown        
6 Biggs Darklighter     84    183 black        
7 Obi-Wan Kenobi        77    182 auburn, white

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Select from the starwars set only those characters that are androids or whose species value is unknown.

Código
starwars |>
  filter(species == "Droid" | is.na(species))

📝 Select from the starwars set only the characters whose weight is between 65 and 90 kg.

Código
starwars |> filter(between(mass, 65, 90))

📝 After clearing absent in all variables, select from the starwars set only the characters that are human and come from Tatooine.

Código
starwars |>
  drop_na() |> 
  filter(species == "Human" & homeworld == "Tatooine")

📝 Select from the original starwars set non-human characters, male in sex and measuring between 120 and 170 cm, or characters with brown or red eyes.

Código
starwars |>
  filter((species != "Human" & sex == "male" &
            between(height, 120, 170)) |
           eye_color %in% c("brown", "red"))

📝 Look for information in the str_detect() function help of the {stringr} package (loaded in {tidyverse}). Tip: test the functions you are going to use with some test vector beforehand so that you can check how they work. After you know what it does, filter out only those characters with the last name Skywalker.

Código
starwars |> filter(str_detect(name, "Skywalker"))

Slices of data: slice()

data |> slice(positions)
starwars |> slice(positions)

Sometimes we may be interested in performing a non-random discretionary sampling, or in other words, filter by position: with slice(positions) we can select specific rows by passing as argument a index vector.

# fila 1
starwars |>
  slice(1)
# A tibble: 1 × 4
  name           height  mass hair_color
  <chr>           <int> <dbl> <chr>     
1 Luke Skywalker    172    77 blond     
# from the 7th to the 9th row
starwars |>
  slice(7:9)
# A tibble: 3 × 4
  name               height  mass hair_color
  <chr>               <int> <dbl> <chr>     
1 Beru Whitesun Lars    165    75 brown     
2 R5-D4                  97    32 <NA>      
3 Biggs Darklighter     183    84 black     
# 2, 7, 10 and 31th rows
starwars |>
  slice(c(2, 7, 10, 31))
# A tibble: 4 × 8
  name             height  mass hair_color skin_color eye_color birth_year sex  
  <chr>             <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
1 C-3PO               167    75 <NA>       gold       yellow           112 none 
2 Beru Whitesun L…    165    75 brown      light      blue              47 fema…
3 Obi-Wan Kenobi      182    77 auburn, w… fair       blue-gray         57 male 
4 Qui-Gon Jinn        193    89 brown      fair       blue              92 male 

Slices of data: slice()

data |>
  slice(positions)
starwars |>
  slice(positions)

We have default options:

  • with slice_head(n = ...) and slice_tail(n = ...) we can get the header and tail of the table
starwars |> slice_head(n = 2)
# A tibble: 2 × 4
  name           height  mass hair_color
  <chr>           <int> <dbl> <chr>     
1 Luke Skywalker    172    77 blond     
2 C-3PO             167    75 <NA>      
starwars |> slice_tail(n = 2)
# A tibble: 2 × 4
  name           height  mass hair_color
  <chr>           <int> <dbl> <chr>     
1 BB8                NA    NA none      
2 Captain Phasma     NA    NA none      

Slices of data: slice()

data |>
  slice(positions)
starwars |>
  slice(positions)

We have default options:

  • with slice_max() and slice_min() we get the rows with smallest/largest value of a variable (if tie, all unless with_ties = FALSE) which we indicate in order_by = ....
starwars |> slice_min(mass, n = 2)
# A tibble: 2 × 4
  name         height  mass hair_color
  <chr>         <int> <dbl> <chr>     
1 Ratts Tyerel     79    15 none      
2 Yoda             66    17 white     
starwars |> slice_max(height, n = 2)
# A tibble: 2 × 4
  name        height  mass hair_color
  <chr>        <int> <dbl> <chr>     
1 Yarael Poof    264    NA none      
2 Tarfful        234   136 brown     

Random sampling: slice_sample()

data |>
  slice_aleatorias(positions)
starwars |>
  slice_sample(positions)

The so-called simple random sampling is based on selecting individuals randomly, so that each one has certain probabilities of being selected. With slice_sample(n = ...) we can randomly extract n (a priori equiprobable) records.

starwars |> slice_sample(n = 2)
# A tibble: 2 × 14
  name    height  mass hair_color skin_color eye_color birth_year sex   gender  
  <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>   
1 Lama Su    229    88 none       grey       black             NA male  masculi…
2 BB8         NA    NA none       none       black             NA none  masculi…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Important

“Random” does not imply equiprobable: a normal die is just as random as a trick die. There are no things “more random” than others, they simply have different underlying probability laws.

Random sampling: slice_sample()

data |>
  slice_random(positions)
starwars |>
  slice_sample(positions)

We can also indicate the proportion of data to sample (instead of the number) and if we want it to be with replacement (that can be repeated).

# 5% of random rows with replacement
starwars |> 
  slice_sample(prop = 0.05, replace = TRUE)
# A tibble: 4 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Palpatine    170    75 grey       pale       yellow            82 male  mascu…
2 Mas Amed…    196    NA none       blue       blue              NA male  mascu…
3 Kit Fisto    196    87 none       green      black             NA male  mascu…
4 Greedo       173    74 <NA>       green      black             44 male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Random sampling: slice_sample()

data |>
  slice_random(positions)
starwars |>
  slice_sample(positions)

As we said, “random” is not the same as “equiprobable”, so we can pass a probability vector. For example, let’s force that it is very improbable to draw a row other than the first two rows

starwars |>
  slice_sample(n = 2, weight_by = c(0.495, 0.495, rep(0.01/85, 85)))
# A tibble: 2 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue              19 male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow           112 none  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
starwars |>
  slice_sample(n = 2, weight_by = c(0.495, 0.495, rep(0.01/85, 85)))
# A tibble: 2 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue              19 male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow           112 none  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

sample()

The slice_sample() function is simply a {tidyverse} integration of the basic R function known as sample() that allows us to sample elements

For example, let’s sample 10 rolls of a die, telling it

  • support of our random variable (allowed values in x)
  • sample size (size)
  • replacement (if TRUE then they can come out repeated, as in the case of the die).
sample(x = 1:6, size = 10, replace = TRUE)
 [1] 5 4 3 6 5 3 5 2 4 3

sample()

The previous option generates events of a random variable equiprobable but as before, we can assign a vector of probabilities or mass function to it with the argument prob = ....

sample(x = 1:6, size = 50, replace = TRUE,
       prob = c(0.5, 0.2, 0.1, 0.1, 0.05, 0.05))
 [1] 1 1 2 2 2 1 3 3 1 1 1 3 2 1 6 2 2 1 1 5 1 1 4 1 1 3 1 2 1 1 1 1 3 1 1 2 4 2
[39] 1 2 4 5 1 4 1 5 1 1 2 1

sample()

How would you make the following statement?

 

Suppose that seasonal flu episodes have been studied in a city. Let \(X_m\) and \(X_p\) be random variables such that \(X_m=1\) if the mother has flu, \(X_m=0\) if the mother does not have flu, \(X_p=1\) if the father has flu and \(X_p=0\) if the father does not have flu. The theoretical model associated with this type of epidemics indicates that the joint distribution is given by \(P(X_m = 1, X_p=1)=0.02\), \(P(X_m = 1, X_p=0)=0.08\), \(P(X_m = 1, X_p=0)=0. 1\) and \(P(X_m = 0, X_p=0)=0.8\)

Generate a sample of size \(n = 1000\) (support "10", "01", "00" and "11") by making use of runif() and by making use of sample().

Sort by rows: arrange()

data |> sort(var1, var2, ...)
starwars |> arrange(var1, var2, ...)

We can also order by rows according to some variable with arrange().

starwars |> arrange(mass)
# A tibble: 5 × 6
  name                  height  mass hair_color skin_color  eye_color
  <chr>                  <int> <dbl> <chr>      <chr>       <chr>    
1 Ratts Tyerel              79    15 none       grey, blue  unknown  
2 Yoda                      66    17 white      green       brown    
3 Wicket Systri Warrick     88    20 brown      brown       brown    
4 R2-D2                     96    32 <NA>       white, blue red      
5 R5-D4                     97    32 <NA>       white, red  red      

By from lowest to highest but we can reverse the order with desc().

starwars |> arrange(desc(height))
# A tibble: 5 × 3
  name         height  mass
  <chr>         <int> <dbl>
1 Yarael Poof     264    NA
2 Tarfful         234   136
3 Lama Su         229    88
4 Chewbacca       228   112
5 Roos Tarpals    224    82
starwars |> arrange(mass, desc(height))
# A tibble: 5 × 3
  name                  height  mass
  <chr>                  <int> <dbl>
1 Ratts Tyerel              79    15
2 Yoda                      66    17
3 Wicket Systri Warrick     88    20
4 R5-D4                     97    32
5 R2-D2                     96    32

Remove duplicates: distinct()

data |> no_duplicates(var1, var2, ...)
starwars |> distinct(var1, var2, ...)

Many times we will need to make sure that there are no duplicates in some variable (DNI) and we can delete duplicate rows with distinct().

starwars |> distinct(sex)
# A tibble: 5 × 1
  sex           
  <chr>         
1 male          
2 none          
3 female        
4 hermaphroditic
5 <NA>          

To keep all the columns of the table we will use .keep_all = TRUE.

starwars |> distinct(sex, .keep_all = TRUE)
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue              19 male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow           112 none  mascu…
3 Leia Org…    150    49 brown      light      brown             19 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Including rows: bind_rows()

tibble1 |> include_rows(tibble2)
tibble1 |> bind_rows(tibble2)

Finally, we can bind new rows with bind_rows() with new observations in table (if columns do not match fill with absent)

data <-
  tibble("name" = c("javi", "laura"), "age" = c(33, 50))
data
# A tibble: 2 × 2
  name    age
  <chr> <dbl>
1 javi     33
2 laura    50
data |> bind_rows(tibble("name" = c("carlos", NA), "cp" = c(28045, 28019)))
# A tibble: 4 × 3
  name     age    cp
  <chr>  <dbl> <dbl>
1 javi      33    NA
2 laura     50    NA
3 carlos    NA 28045
4 <NA>      NA 28019

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Select only the characters that are human and brown-eyed, then sort them in descending height and ascending weight.

Código
starwars |>
  filter(eye_color == "brown" & species == "Human") |> 
  arrange(height, desc(mass))

📝 Randomly extracts 3 records.

Código
starwars |> slice_sample(n = 3)

📝 Extracts 10% of the records randomly.

Código
starwars |> slice_sample(prop = 0.1)

📝R andomly draws 10 characters but in such a way that the probability of each character being drawn is proportional to its weight (heavier, more likely).

Código
starwars |>
  drop_na(mass) |> 
  slice_sample(n = 10, weight_by = mass)

📝 Select the 3 oldest characters.

Código
starwars |> slice_max(birth_year, n = 3)

📝 To find out what unique values are in the hair color, remove duplicates of the hair_color variable by first removing the missing ones from the hair_color variable.

Código
starwars |>
  drop_na(hair_color) |> 
  distinct(hair_color)

📝 Of the characters that are human and taller than 160 cm, eliminate duplicates in eye color, eliminate absent in weight, select the 3 tallest, and order from tallest to shortest in weight. Return the table.

Código
starwars |>
  filter(species == "Human" & height > 160) |> 
  distinct(eye_color, .keep_all = TRUE) |> 
  drop_na(mass) |> 
  slice_max(height, n = 3) |> 
  arrange(desc(mass))

🐣 Case study

Let’s go back to a known dataset: in the {datasets} package (already installed by default) we had several datasets and one of them was airquality which we already worked with. The data captures daily measurements (n = 153 observations) of air quality in New York, from May to September 1973.

At that time we worked it from the R base perspective and extracted some variables from it. The objective now will be to work it from the {tidyverse} perspective looking at the differences from one form to the other.

library(datasets)
airquality

Try to answer the questions posed in the workbook.

L3: tidyverse (columns)

Tidyverse: actions by columns

Select columns: select()

data |> select(var1, var2, ...)
starwars |> select(var1, var2, ...)

Up to now all operations performed (even if we used column info) were by rows. In the case of columns, the simplest action is to select variables by name with select(), giving as arguments the column names without quotes.

starwars |> select(name, hair_color)
# A tibble: 87 × 2
   name               hair_color   
   <chr>              <chr>        
 1 Luke Skywalker     blond        
 2 C-3PO              <NA>         
 3 R2-D2              <NA>         
 4 Darth Vader        none         
 5 Leia Organa        brown        
 6 Owen Lars          brown, grey  
 7 Beru Whitesun Lars brown        
 8 R5-D4              <NA>         
 9 Biggs Darklighter  black        
10 Obi-Wan Kenobi     auburn, white
# ℹ 77 more rows

Select columns: select()

data |> select(var1, var2, ...)
starwars |> select(var1, var2, ...)

The select() function allows us to select several variables at once, including concatenating their names as if they were numerical indexes with :

starwars |> select(name:eye_color) 
# A tibble: 4 × 6
  name           height  mass hair_color skin_color  eye_color
  <chr>           <int> <dbl> <chr>      <chr>       <chr>    
1 Luke Skywalker    172    77 blond      fair        blue     
2 C-3PO             167    75 <NA>       gold        yellow   
3 R2-D2              96    32 <NA>       white, blue red      
4 Darth Vader       202   136 none       white       yellow   

And we can deselect columns with - in front of it

starwars |>  select(-mass, -(eye_color:starships))
# A tibble: 4 × 4
  name           height hair_color skin_color 
  <chr>           <int> <chr>      <chr>      
1 Luke Skywalker    172 blond      fair       
2 C-3PO             167 <NA>       gold       
3 R2-D2              96 <NA>       white, blue
4 Darth Vader       202 none       white      

Select columns: select()

data |> select(var1, var2, ...)
starwars |> select(var1, var2, ...)

We have also reserved words: everything() all variables….

starwars |> select(mass, homeworld, everything())
# A tibble: 4 × 14
   mass homeworld name   height hair_color skin_color eye_color birth_year sex  
  <dbl> <chr>     <chr>   <int> <chr>      <chr>      <chr>          <dbl> <chr>
1    77 Tatooine  Luke …    172 blond      fair       blue            19   male 
2    75 Tatooine  C-3PO     167 <NA>       gold       yellow         112   none 
3    32 Naboo     R2-D2      96 <NA>       white, bl… red             33   none 
4   136 Tatooine  Darth…    202 none       white      yellow          41.9 male 
# ℹ 5 more variables: gender <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

…and last_col() to refer to last column.

starwars |> select(name:mass, homeworld, last_col())
# A tibble: 4 × 5
  name           height  mass homeworld starships
  <chr>           <int> <dbl> <chr>     <list>   
1 Luke Skywalker    172    77 Tatooine  <chr [2]>
2 C-3PO             167    75 Tatooine  <chr [0]>
3 R2-D2              96    32 Naboo     <chr [0]>
4 Darth Vader       202   136 Tatooine  <chr [1]>

Select columns: select()

data |> select(var1, var2, ...)
starwars |> select(var1, var2, ...)

We can also play with patterns in the name, those that begin with a prefix (starts_with()), [end with a suffix]{. hl-purple} (ends_with()), contain text (contains()) or fulfill a regular expression (matches()).

# variables which col name finish as "color" and contains sex and gender
starwars |> select(ends_with("color"), matches("sex|gender"))
# A tibble: 87 × 5
   hair_color    skin_color  eye_color sex    gender   
   <chr>         <chr>       <chr>     <chr>  <chr>    
 1 blond         fair        blue      male   masculine
 2 <NA>          gold        yellow    none   masculine
 3 <NA>          white, blue red       none   masculine
 4 none          white       yellow    male   masculine
 5 brown         light       brown     female feminine 
 6 brown, grey   light       blue      male   masculine
 7 brown         light       blue      female feminine 
 8 <NA>          white, red  red       none   masculine
 9 black         light       brown     male   masculine
10 auburn, white fair        blue-gray male   masculine
# ℹ 77 more rows

Select columns: select()

data |> select(var1, var2, ...)
starwars |> select(var1, var2, ...)

We can even select by numeric range if we have variables with a prefix and numbers.

data <-
  tibble("wk1" = c(115, 141, 232), "wk2" = c(7, NA, 17),
         "wk3" = c(95, 162, NA), "wk4" = c(11, 19, 15),
         "wk5" = c(NA, 262, 190), "wk6" = c(21, 15, 23))

With num_range() we can select with a prefix and a numeric sequence.

data |> select(num_range("wk", 1:4))
# A tibble: 3 × 4
    wk1   wk2   wk3   wk4
  <dbl> <dbl> <dbl> <dbl>
1   115     7    95    11
2   141    NA   162    19
3   232    17    NA    15

Select columns: select()

data |> select(var1, var2, ...)
starwars |> select(var1, var2, ...)

Finally, we can select columns by datatatype using where() and inside a function that returns a logical value of datatype.

# just numeric and string columns
starwars |> select(where(is.numeric) | where(is.character))
# A tibble: 87 × 11
   height  mass birth_year name     hair_color skin_color eye_color sex   gender
    <int> <dbl>      <dbl> <chr>    <chr>      <chr>      <chr>     <chr> <chr> 
 1    172    77       19   Luke Sk… blond      fair       blue      male  mascu…
 2    167    75      112   C-3PO    <NA>       gold       yellow    none  mascu…
 3     96    32       33   R2-D2    <NA>       white, bl… red       none  mascu…
 4    202   136       41.9 Darth V… none       white      yellow    male  mascu…
 5    150    49       19   Leia Or… brown      light      brown     fema… femin…
 6    178   120       52   Owen La… brown, gr… light      blue      male  mascu…
 7    165    75       47   Beru Wh… brown      light      blue      fema… femin…
 8     97    32       NA   R5-D4    <NA>       white, red red       none  mascu…
 9    183    84       24   Biggs D… black      light      brown     male  mascu…
10    182    77       57   Obi-Wan… auburn, w… fair       blue-gray male  mascu…
# ℹ 77 more rows
# ℹ 2 more variables: homeworld <chr>, species <chr>

Move columns: relocate()

data |>
  move(var1, after = var2)
starwars |>
  relocate(var1, .after = var2)

To facilitate the relocation of variables we have a function for it, relocate(), indicating in .after or .before behind or in front of which columns we want to move them.

starwars |> relocate(species, .before = name)
# A tibble: 87 × 14
   species name    height  mass hair_color skin_color eye_color birth_year sex  
   <chr>   <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Human   Luke S…    172    77 blond      fair       blue            19   male 
 2 Droid   C-3PO      167    75 <NA>       gold       yellow         112   none 
 3 Droid   R2-D2       96    32 <NA>       white, bl… red             33   none 
 4 Human   Darth …    202   136 none       white      yellow          41.9 male 
 5 Human   Leia O…    150    49 brown      light      brown           19   fema…
 6 Human   Owen L…    178   120 brown, gr… light      blue            52   male 
 7 Human   Beru W…    165    75 brown      light      blue            47   fema…
 8 Droid   R5-D4       97    32 <NA>       white, red red             NA   none 
 9 Human   Biggs …    183    84 black      light      brown           24   male 
10 Human   Obi-Wa…    182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 77 more rows
# ℹ 5 more variables: gender <chr>, homeworld <chr>, films <list>,
#   vehicles <list>, starships <list>

Rename: rename()

data |> rename(new = old)
starwars |> rename(new = old)

Sometimes we may also want to modify the “meta-information” of the data, renaming columns. To do this we will use rename() by typing first the new name and then the old.

starwars |> rename(nombre = name, altura = height, peso = mass)
# A tibble: 87 × 14
   nombre   altura  peso hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Extract columns: pull()

data |> extract(var)
starwars |> pull(var)

If you look at the output of the select() still a tibble table, it preserves the nature of our data.

starwars |> select(name)
# A tibble: 87 × 1
   name              
   <chr>             
 1 Luke Skywalker    
 2 C-3PO             
 3 R2-D2             
 4 Darth Vader       
 5 Leia Organa       
 6 Owen Lars         
 7 Beru Whitesun Lars
 8 R5-D4             
 9 Biggs Darklighter 
10 Obi-Wan Kenobi    
# ℹ 77 more rows

Extract columns: pull()

data |> extract(var)
starwars |> pull(var)

Sometimes we will not want such a structure but literally extract the column in a VECTOR, something we can do with pull().

starwars |> pull(name)
 [1] "Luke Skywalker"        "C-3PO"                 "R2-D2"                
 [4] "Darth Vader"           "Leia Organa"           "Owen Lars"            
 [7] "Beru Whitesun Lars"    "R5-D4"                 "Biggs Darklighter"    
[10] "Obi-Wan Kenobi"        "Anakin Skywalker"      "Wilhuff Tarkin"       
[13] "Chewbacca"             "Han Solo"              "Greedo"               
[16] "Jabba Desilijic Tiure" "Wedge Antilles"        "Jek Tono Porkins"     
[19] "Yoda"                  "Palpatine"             "Boba Fett"            
[22] "IG-88"                 "Bossk"                 "Lando Calrissian"     
[25] "Lobot"                 "Ackbar"                "Mon Mothma"           
[28] "Arvel Crynyd"          "Wicket Systri Warrick" "Nien Nunb"            
[31] "Qui-Gon Jinn"          "Nute Gunray"           "Finis Valorum"        
[34] "Padmé Amidala"         "Jar Jar Binks"         "Roos Tarpals"         
[37] "Rugor Nass"            "Ric Olié"              "Watto"                
[40] "Sebulba"               "Quarsh Panaka"         "Shmi Skywalker"       
[43] "Darth Maul"            "Bib Fortuna"           "Ayla Secura"          
[46] "Ratts Tyerel"          "Dud Bolt"              "Gasgano"              
[49] "Ben Quadinaros"        "Mace Windu"            "Ki-Adi-Mundi"         
[52] "Kit Fisto"             "Eeth Koth"             "Adi Gallia"           
[55] "Saesee Tiin"           "Yarael Poof"           "Plo Koon"             
[58] "Mas Amedda"            "Gregar Typho"          "Cordé"                
[61] "Cliegg Lars"           "Poggle the Lesser"     "Luminara Unduli"      
[64] "Barriss Offee"         "Dormé"                 "Dooku"                
[67] "Bail Prestor Organa"   "Jango Fett"            "Zam Wesell"           
[70] "Dexter Jettster"       "Lama Su"               "Taun We"              
[73] "Jocasta Nu"            "R4-P17"                "Wat Tambor"           
[76] "San Hill"              "Shaak Ti"              "Grievous"             
[79] "Tarfful"               "Raymus Antilles"       "Sly Moore"            
[82] "Tion Medon"            "Finn"                  "Rey"                  
[85] "Poe Dameron"           "BB8"                   "Captain Phasma"       

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Filter the set of characters and keep only those that do not have a missing data in the height variable. With the data obtained from the previous filter, select only the variables name, height, as well as all those variables that CONTAIN the word color in their name.

Código
starwars_2 <-
  starwars |> 
  drop_na(height) |> 
  select(name, height, contains("color"))

📝 With the data obtained from the previous Exercise, translate the names of the columns into Spanish (or your motherlanguage).

Código
starwars_2 |> 
  rename(nombre = name, altura = height, color_pelo = hair_color,
         color_piel = skin_color, color_ojos = eye_color)

📝 With the data obtained from Exercise 1, place the hair color variable just after the name variable.

Con los data obtenidos del Exercise anterior, coloca la variable de color de pelo justo detrás de la variable de nombres.

Código
starwars_2 |>
  relocate(hair_color, .after = name)

📝 With the data obtained from the Exercise 1, check how many unique modalities there are in the hair color variable (without using unique()).

Código
starwars_2 |>
  distinct(hair_color)

📝 From the original data set, it removes the list type columns, and then removes duplicates in the eye_color variable. After removing duplicates it extracts that column into a vector.

Código
starwars |> 
  select(-where(is.list)) |> 
  distinct(eye_color, .keep_all = TRUE) |> 
  pull(eye_color)

📝 From the original starwars dataset, with only the characters whose height is known, extract in a vector with that variable.

Código
starwars |> 
  drop_na(height) |> 
  pull(height)

📝 After obtaining the vector from the previous Exercise, use this vector to randomly sample 50% of the data so that the probability of each character being chosen is inversely proportional to their height (shorter, more options).

Código
heights <-
  starwars |> 
  drop_na(height) |> 
  pull(height)
  
starwars |> 
  slice_sample(prop = 0.5, weight_by = 1/heights)

Modify columns: mutate()

data |> modify(new_var = funcion())
starwars |> mutate(new_var = function())

In many occasions we will want to modify or create variables with mutate().

Let’s create for example a new variable height_m with the height in meters.

starwars |> mutate(height_m = height / 100)
# A tibble: 87 × 15
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, height_m <dbl>

Modify columns: mutate()

data |> modify(new_var = funcion())
starwars |> mutate(new_var = function())

In addition with the optional arguments we can reposition the modified column

starwars |> 
  mutate(height_m = height / 100,
         BMI = mass / (height_m^2), .before = name)
# A tibble: 87 × 16
   height_m   BMI name   height  mass hair_color skin_color eye_color birth_year
      <dbl> <dbl> <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl>
 1     1.72  26.0 Luke …    172    77 blond      fair       blue            19  
 2     1.67  26.9 C-3PO     167    75 <NA>       gold       yellow         112  
 3     0.96  34.7 R2-D2      96    32 <NA>       white, bl… red             33  
 4     2.02  33.3 Darth…    202   136 none       white      yellow          41.9
 5     1.5   21.8 Leia …    150    49 brown      light      brown           19  
 6     1.78  37.9 Owen …    178   120 brown, gr… light      blue            52  
 7     1.65  27.5 Beru …    165    75 brown      light      blue            47  
 8     0.97  34.0 R5-D4      97    32 <NA>       white, red red             NA  
 9     1.83  25.1 Biggs…    183    84 black      light      brown           24  
10     1.82  23.2 Obi-W…    182    77 auburn, w… fair       blue-gray       57  
# ℹ 77 more rows
# ℹ 7 more variables: sex <chr>, gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

Modify columns: mutate()

data |> modify(new_var = funcion())
starwars |> mutate(new_var = function())

Important

When we apply mutate(), we must remember that the operations are performed vector by vector, element by element, so the function we use inside must return a vector of equal length. Otherwise, it will return a constant.

starwars |> 
  mutate(constante = mean(mass, na.rm = TRUE), .before = name)
# A tibble: 87 × 15
   constante name  height  mass hair_color skin_color eye_color birth_year sex  
       <dbl> <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1      97.3 Luke…    172    77 blond      fair       blue            19   male 
 2      97.3 C-3PO    167    75 <NA>       gold       yellow         112   none 
 3      97.3 R2-D2     96    32 <NA>       white, bl… red             33   none 
 4      97.3 Dart…    202   136 none       white      yellow          41.9 male 
 5      97.3 Leia…    150    49 brown      light      brown           19   fema…
 6      97.3 Owen…    178   120 brown, gr… light      blue            52   male 
 7      97.3 Beru…    165    75 brown      light      blue            47   fema…
 8      97.3 R5-D4     97    32 <NA>       white, red red             NA   none 
 9      97.3 Bigg…    183    84 black      light      brown           24   male 
10      97.3 Obi-…    182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

Recategorize: if_else()

We can also combine mutate() with the if_else() control expression to recategorize the variable: if a condition is met, it does one thing, otherwise another.

starwars |> 
  mutate(human = if_else(species == "Human", "Human", "Not Human"),
         .after = name) |> 
  select(name:mass)
# A tibble: 87 × 4
   name               human     height  mass
   <chr>              <chr>      <int> <dbl>
 1 Luke Skywalker     Human        172    77
 2 C-3PO              Not Human    167    75
 3 R2-D2              Not Human     96    32
 4 Darth Vader        Human        202   136
 5 Leia Organa        Human        150    49
 6 Owen Lars          Human        178   120
 7 Beru Whitesun Lars Human        165    75
 8 R5-D4              Not Human     97    32
 9 Biggs Darklighter  Human        183    84
10 Obi-Wan Kenobi     Human        182    77
# ℹ 77 more rows

Recategorize: case_when()

For more complex categorizations we have case_when(), for example, to create a category of characters based on their height.

starwars |> 
  drop_na(height) |> 
  mutate(altura = case_when(height < 120 ~ "dwarf",
                            height < 160 ~ "short",
                            height < 180 ~ "normal",
                            height < 200 ~ "tall",
                            TRUE ~ "giant"), .before = name)
# A tibble: 81 × 15
   altura name     height  mass hair_color skin_color eye_color birth_year sex  
   <chr>  <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 normal Luke Sk…    172    77 blond      fair       blue            19   male 
 2 normal C-3PO       167    75 <NA>       gold       yellow         112   none 
 3 dwarf  R2-D2        96    32 <NA>       white, bl… red             33   none 
 4 giant  Darth V…    202   136 none       white      yellow          41.9 male 
 5 short  Leia Or…    150    49 brown      light      brown           19   fema…
 6 normal Owen La…    178   120 brown, gr… light      blue            52   male 
 7 normal Beru Wh…    165    75 brown      light      blue            47   fema…
 8 dwarf  R5-D4        97    32 <NA>       white, red red             NA   none 
 9 tall   Biggs D…    183    84 black      light      brown           24   male 
10 tall   Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 71 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

nest data

We can also nest or embed datasets inside each other. Imagine that we have a dataset of x and y variables with 2 records, another one with the same variables but only one record and another one with 3 records.

data_1 <- tibble("x" = c(0, 2), "y" = c(-1, NA))
data_2 <- tibble("x" = c(NA), "y" = c(5))
data_3 <- tibble("x" = c(-2, 6, 7), "y" = c(1.5, NA, -2))

So far the only way we know how to bind the 3 datasets is by using bind_rows() (by the way, if you use the argument .id = “variable_name” we can make it add a new variable that tells us to which dataset each row belonged.

data <- bind_rows(data_1, data_2, data_3, .id = "dataset")
data
# A tibble: 6 × 3
  dataset     x     y
  <chr>   <dbl> <dbl>
1 1           0  -1  
2 1           2  NA  
3 2          NA   5  
4 3          -2   1.5
5 3           6  NA  
6 3           7  -2  

nest data

data <- bind_rows(data_1, data_2, data_3, .id = "dataset")
data
# A tibble: 6 × 3
  dataset     x     y
  <chr>   <dbl> <dbl>
1 1           0  -1  
2 1           2  NA  
3 2          NA   5  
4 3          -2   1.5
5 3           6  NA  
6 3           7  -2  

However, in many occasions we will want to have all 3 in the same object BUT each dataset on its own: an object (a list) that stores the 3 datasets separated from each other

To do this we will use the nest() function indicating which common variables form the datasets (in this case x and y).

data_nest <-
  data |>
  nest(data = c(x, y))
data_nest
# A tibble: 3 × 2
  dataset data            
  <chr>   <list>          
1 1       <tibble [2 × 2]>
2 2       <tibble [1 × 2]>
3 3       <tibble [3 × 2]>

nest data

data_nest
# A tibble: 3 × 2
  dataset data            
  <chr>   <list>          
1 1       <tibble [2 × 2]>
2 2       <tibble [1 × 2]>
3 3       <tibble [3 × 2]>

Note that now data_nest is a list as each stored dataset could have different lengths

nest data

To unnest we can use unnest() indicating the column containing the datasets

data_nest |> unnest(cols = c(data))
# A tibble: 6 × 3
  dataset     x     y
  <chr>   <dbl> <dbl>
1 1           0  -1  
2 1           2  NA  
3 2          NA   5  
4 3          -2   1.5
5 3           6  NA  
6 3           7  -2  

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Select only the variables name, height and as well as all those variables related to the color, while keeping only those that are not absent in the height.

Código
starwars |> 
  select(name, height, contains("color")) |> 
  drop_na(height)

📝 With the data obtained from the previous Exercise, translate the names of the columns into Spanish or your mother language.

Código
starwars |> 
  select(name, height, contains("color")) |> 
  drop_na(height) |> 
  rename(nombre = name, altura = height,
         color_pelo = eye_color, color_piel = skin_color,
         color_pelo = hair_color)

📝 With the data obtained from the previous Exercise, place the hair color variable just after the name variable.

Código
starwars |>
  select(name, height, contains("color")) |> 
  drop_na(height) |> 
  rename(nombre = name, altura = height,
         color_pelo = eye_color, color_piel = skin_color,
         color_pelo = hair_color) |> 
  relocate(color_pelo, .after = nombre)

📝 With the original data, check how many unique modalities there are in the hair color variable.

Código
starwars |> 
  distinct(hair_color) |> 
  nrow()

📝 From the original dataset, select only the numeric and text variables. Then define a new variable called under_18 to recategorize the age variable: TRUE if under age and FALSE if not.

Código
starwars |> 
  select(where(is.numeric) | where(is.character)) |> 
  mutate(under_18 = birth_year < 18)

📝 From the original dataset, create a new column named auburn that tells us TRUE if the hair color contains that word and FALSE otherwise (reminder str_detect()).

Código
starwars |> 
  mutate(auburn = str_detect(hair_color, "auburn"))

📝 From the original dataset, include a column that calculates BMI. After that, create a new variable that values NA if not human, underweight below 18, normal between 18 and 30, overweight above 30.

Código
starwars |> 
  mutate(IMC = mass / ((height/100)^2),
         IMC_recat = case_when(species != "Human" ~ NA,
                               IMC < 18 ~ "underweight",
                               IMC < 30 ~ "normal",
                               TRUE ~ "overweight"),
         .after = name)

🐣 Case study I: CIS feminismos

pending to submit

 

Try to answer the questions posed in the workbook.

🐣 Case study II: Taylor Swift

Let’s go back to the analysis of Taylor Swift songs we did in the previous installment.

library(taylor)
taylor_album_songs
# A tibble: 240 × 29
   album_name   ep    album_release track_number track_name     artist featuring
   <chr>        <lgl> <date>               <int> <chr>          <chr>  <chr>    
 1 Taylor Swift FALSE 2006-10-24               1 Tim McGraw     Taylo… <NA>     
 2 Taylor Swift FALSE 2006-10-24               2 Picture To Bu… Taylo… <NA>     
 3 Taylor Swift FALSE 2006-10-24               3 Teardrops On … Taylo… <NA>     
 4 Taylor Swift FALSE 2006-10-24               4 A Place In Th… Taylo… <NA>     
 5 Taylor Swift FALSE 2006-10-24               5 Cold As You    Taylo… <NA>     
 6 Taylor Swift FALSE 2006-10-24               6 The Outside    Taylo… <NA>     
 7 Taylor Swift FALSE 2006-10-24               7 Tied Together… Taylo… <NA>     
 8 Taylor Swift FALSE 2006-10-24               8 Stay Beautiful Taylo… <NA>     
 9 Taylor Swift FALSE 2006-10-24               9 Should've Sai… Taylo… <NA>     
10 Taylor Swift FALSE 2006-10-24              10 Mary's Song (… Taylo… <NA>     
# ℹ 230 more rows
# ℹ 22 more variables: bonus_track <lgl>, promotional_release <date>,
#   single_release <date>, track_release <date>, danceability <dbl>,
#   energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <int>, duration_ms <int>, explicit <lgl>,
#   key_name <chr>, mode_name <chr>, key_mode <chr>, lyrics <list>

The difference is that now we will try to do everything from a tidyverse view instead of R base (it is interesting that you try to translate from one to the other to know how to master both views)

Try to answer the questions posed in the workbook.

🐣 Case study III: The Lord of the Rings

To practice some {dplyr} functions we are going to use data from the Lord of the Rings trilogy movies. We will load the data directly from the web (Github in this case), without going through the computer before, simply indicating as path the web where the file is

Código
library(readr)
lotr_1 <-
  read_csv(file = "https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv")
lotr_2 <-
  read_csv(file = "https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv")
lotr_3 <-
  read_csv(file = "https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv")

Try to answer the questions posed in the workbook.

L4: tidyverse (summaries)

Summarise and group_by(). Count and summaries

count()

data |> count(var1, var2)
starwars |> count(var1, var2)

So far we have only transformed or queried the data but we have not generated statistics. Let’s start with the simple: how to count (frequencies)?

When used alone count() will simply return the number of records, but when used with count() variables it calculates what is known as frequencies: number of elements of each modality.

starwars |> count(sex)
# A tibble: 5 × 2
  sex                n
  <chr>          <int>
1 female            16
2 hermaphroditic     1
3 male              60
4 none               6
5 <NA>               4

count()

data |> count(var1, var2)
starwars |> count(var1, var2)

Also if we pass several variables it calculates what is known as a contiguity table. With sort = TRUE it will return the ordered count (most frequent first).

starwars |> count(sex, gender, sort = TRUE)
# A tibble: 6 × 3
  sex            gender        n
  <chr>          <chr>     <int>
1 male           masculine    60
2 female         feminine     16
3 none           masculine     5
4 <NA>           <NA>          4
5 hermaphroditic masculine     1
6 none           feminine      1

group_by()

data |>
  group(var1, var2) |> 
  some_action() |> 
  ungroup()
starwars |>
  group_by(var1, var2) |> 
  some_action() |> 
  ungroup()

One of the most powerful functions to combine with the actions seen is group_by(), which will allow us to group our records beforehand.

starwars |> 
  group_by(sex) |>
  count() |>
  ungroup()
# A tibble: 5 × 2
  sex                n
  <chr>          <int>
1 female            16
2 hermaphroditic     1
3 male              60
4 none               6
5 <NA>               4

group_by()

data |>
  group(var1, var2) |> 
  some_action() |> 
  ungroup()
starwars |>
  group_by(var1, var2) |> 
  some_action() |> 
  ungroup()

When applying group_by() it is important to understand that it DOES NOT MODIFY the data, but creates a group variable (sub-tables for each group) that will modify future actions: the operations will be applied to each sub-table separately

For example, imagine that we want to extract the highest character with slice_max().

starwars |> slice_max(height)
# A tibble: 1 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Yarael P…    264    NA none       white      yellow            NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

group_by()

data |>
  group(var1, var2) |> 
  some_action() |> 
  ungroup()
starwars |>
  group_by(var1, var2) |> 
  some_action() |> 
  ungroup()

What if we want to extract the tallest character but…of each of the sexes?

starwars |>
  group_by(sex) |> 
  slice_max(height) |> 
  ungroup()
# A tibble: 5 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Taun We      213    NA none       grey       black             NA fema… femin…
2 Jabba De…    175  1358 <NA>       green-tan… orange           600 herm… mascu…
3 Yarael P…    264    NA none       white      yellow            NA male  mascu…
4 IG-88        200   140 none       metal      red               15 none  mascu…
5 Gregar T…    185    85 black      dark       brown             NA <NA>  <NA>  
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

group_by()

data |>
  group(var1, var2) |> 
  some_action() |> 
  ungroup()
starwars |>
  group_by(var1, var2) |> 
  some_action() |> 
  ungroup()

The web https://tidydatatutor.com/ allows to visualize the operations of {tidyverse} (doing with the old pipe %>%)

group_by()

data |>
  group(var1, var2) |> 
  some_action() |> 
  ungroup()
starwars |>
  group_by(var1, var2) |> 
  some_action() |>
  ungroup()

Important

You should always remember to make ungroup to remove the created group variable.

The “new” version of {dplyr} now allows to include the group variable in the call to many functions with the argument by = ... or .by = ....

starwars |> slice_max(height, by = sex)
# A tibble: 5 × 6
  name                  height  mass hair_color skin_color       eye_color
  <chr>                  <int> <dbl> <chr>      <chr>            <chr>    
1 Yarael Poof              264    NA none       white            yellow   
2 IG-88                    200   140 none       metal            red      
3 Taun We                  213    NA none       grey             black    
4 Jabba Desilijic Tiure    175  1358 <NA>       green-tan, brown orange   
5 Gregar Typho             185    85 black      dark             brown    

Row-by-row: rowwise()

A very useful option used before an operation is also rowwise(): every operation that comes afterwards will be applied on each row separately. For example, let’s define a dummy set of grades.

grades <- tibble("maths" = c(7.5, 8, 9.1, 3),
                 "language" = c(8, 6, 6.5, 9.2))

If we apply the average directly the value will be identical since it has done the global average, but we would like to get an average per record. For that we will use rowwise().

grades |> 
  rowwise() |> 
  mutate(ave_grades = mean(c(maths, language)))
# A tibble: 4 × 3
# Rowwise: 
  maths language ave_grades
  <dbl>    <dbl>      <dbl>
1   7.5      8         7.75
2   8        6         7   
3   9.1      6.5       7.8 
4   3        9.2       6.1 

Summary: summarise()

data |> simple_summary()
starwars |> summarise()

Finally we have summarise(), which will allow us to get statistical summaries. For example, let’s calculate the average of the heights.

starwars |> 
  drop_na(height) |> 
  summarise(ave_height = mean(height))
# A tibble: 1 × 1
  ave_height
       <dbl>
1       175.

Be careful

Notice that mutate() returns as many rows as original records, while with summarise() it calculates a new summary dataset, only including what is indicated.

Summary: summarise()

data |> simple_summary()
starwars |> summarise()

If we also combine this with the grouping of group_by() or .by = ..., in a few lines of code you can get disaggregated statistics.

starwars |> 
  drop_na(sex, height, mass) |> 
  summarise(ave_height = mean(height),
            ave_mass = mean(mass),
            .by = sex)
# A tibble: 4 × 3
  sex            ave_height ave_mass
  <chr>               <dbl>    <dbl>
1 male                 178.     80.2
2 none                 140      69.8
3 female               172.     54.7
4 hermaphroditic       175    1358  

Summary: reframe()

data |> complex_summary()
starwars |> reframe()

In the new {dplyr} they have included reframe() to avoid summarise() problems when we return more than one value per variable in more complex summaries.

starwars |>
  drop_na(mass) |>
  summarise(quantile(mass))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
# A tibble: 5 × 1
  `quantile(mass)`
             <dbl>
1             15  
2             55.6
3             79  
4             84.5
5           1358  
starwars |>
  drop_na(mass) |>
  reframe(quantile(mass))
# A tibble: 5 × 1
  `quantile(mass)`
             <dbl>
1             15  
2             55.6
3             79  
4             84.5
5           1358  

across()

One trick is to make use of selectors across() and where(). The former allows us to act on several columns by name (with mutate() or summarise()).

starwars |> summarise(ave = across(height:mass, mean, na.rm = TRUE), .by = sex)
# A tibble: 5 × 2
  sex            ave$height  $mass
  <chr>               <dbl>  <dbl>
1 male                 179.   80.2
2 none                 131.   69.8
3 female               172.   54.7
4 hermaphroditic       175  1358  
5 <NA>                 175    81  

The second, where(), allows us to do the same but selecting by type.

starwars |> 
  summarise(across(where(is.numeric), mean, na.rm = TRUE), .by = c(sex, gender))
# A tibble: 6 × 5
  sex            gender    height   mass birth_year
  <chr>          <chr>      <dbl>  <dbl>      <dbl>
1 male           masculine   179.   80.2       84.8
2 none           masculine   140    69.8       53.3
3 female         feminine    172.   54.7       47.2
4 hermaphroditic masculine   175  1358        600  
5 <NA>           <NA>        175    81        NaN  
6 none           feminine     96   NaN        NaN  

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Calculate how many characters there are of each species, ordered from most to least frequent.

Código
starwars |> count(species, sort = TRUE)

📝 After eliminating missing variables for weight and height, add a new variable to calculate the BMI of each character, and determine the average BMI of our characters disaggregated by gender.

Código
starwars |>
  drop_na(mass, height) |> 
  mutate(BMI = mass / ((height/100)^2)) |> 
  summarise(ave_BMI = mean(BMI), .by = sex)

📝 Obtain the youngest character for each gender.

Código
starwars |> # reminder that birth_year is in fact the age
  slice_min(birth_year, by = sex)

📝 Get the age of the youngest and oldest character of each sex.

Código
starwars |>
  drop_na(birth_year) |>
  summarise(min(birth_year), max(birth_year), .by = sex)

📝 Determine the number of characters in each decade (take a look at round(), first without disaggregating and then disaggregated by sex.

Código
starwars |>
  count(birth_decade = round(birth_year, -1))

🐣 Case study I: billboard

We are going to do a summary of what we learned in {tidyverse} with the billboard table of the {tidyr} package. The dataset represents something similar to the top 40 (but American version and a top 100 instead of 40): for each artist and song we store the date when it entered the ranking, and the position it occupied in the ranking in each of the weeks (wk1, wk2, …).

billboard
# A tibble: 317 × 8
   artist         track               date.entered   wk1   wk2   wk3   wk4   wk5
   <chr>          <chr>               <date>       <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2 Pac          Baby Don't Cry (Ke… 2000-02-26      87    82    72    77    87
 2 2Ge+her        The Hardest Part O… 2000-09-02      91    87    92    NA    NA
 3 3 Doors Down   Kryptonite          2000-04-08      81    70    68    67    66
 4 3 Doors Down   Loser               2000-10-21      76    76    72    69    67
 5 504 Boyz       Wobble Wobble       2000-04-15      57    34    25    17    17
 6 98^0           Give Me Just One N… 2000-08-19      51    39    34    26    26
 7 A*Teens        Dancing Queen       2000-07-08      97    97    96    95   100
 8 Aaliyah        I Don't Wanna       2000-01-29      84    62    51    41    38
 9 Aaliyah        Try Again           2000-03-18      59    53    38    28    21
10 Adams, Yolanda Open My Heart       2000-08-26      76    76    74    69    68
# ℹ 307 more rows

Try to answer the questions posed in the workbook.

🐣 Case study II: soccer

Let’s continue practicing what we learned in {tidyverse} with the file futbol.csv, where we have data of the players of the 5 main men’s soccer leagues, from 2005 to 2019, compiling different statistics. The data has been extracted directly using the {worldfootballR} package, which allows us to extract data from https://www.fbref.com.

data <- read_csv(file = "./data/futbol.csv")
data
# A tibble: 40,393 × 16
   season team    league  player     country position date_birth minutes_playing
    <dbl> <chr>   <chr>   <chr>      <chr>   <chr>         <dbl>           <dbl>
 1   2005 Ajaccio Ligue 1 Djamel Ab… ALG     MF             1986              30
 2   2005 Ajaccio Ligue 1 Gaspar Az… POR     DF             1975            1224
 3   2005 Ajaccio Ligue 1 Yacine Be… ALG     MF             1981             140
 4   2005 Ajaccio Ligue 1 Nicolas B… FRA     MF             1976             892
 5   2005 Ajaccio Ligue 1 Marcelinh… BRA     MF             1971             704
 6   2005 Ajaccio Ligue 1 Cyril Cha… FRA     FW             1979            1308
 7   2005 Ajaccio Ligue 1 Xavier Co… FRA     DF             1974            2734
 8   2005 Ajaccio Ligue 1 Renaud Co… FRA     MF             1980             896
 9   2005 Ajaccio Ligue 1 Yohan Dem… FRA     DF,MF          1978            2658
10   2005 Ajaccio Ligue 1 Christoph… FRA     DF             1973              74
# ℹ 40,383 more rows
# ℹ 8 more variables: minutes_per_match_playing <dbl>, goals <dbl>,
#   assist <dbl>, goals_minus_pk <dbl>, pk <dbl>, pk_attemp <dbl>,
#   yellow_card <dbl>, red_card <dbl>

🐣 Case study II: soccer

The variables capture the following information:

  • season, team, league: season, team and league.
  • player, country, position, date_birth: name, country, position and birth of year of each player.
  • minutes_playing, matches: total minutes playing and 90’ matches played
  • goals, assist: goals and assists.
  • pk, pk_attemp, goals_minus_pk: penalties, penalties attempted and goals excluding penalties.
  • yellow_card, red_card: yellow/red cards.

🐣 Case study II: soccer

data
# A tibble: 40,393 × 16
   season team    league  player     country position date_birth minutes_playing
    <dbl> <chr>   <chr>   <chr>      <chr>   <chr>         <dbl>           <dbl>
 1   2005 Ajaccio Ligue 1 Djamel Ab… ALG     MF             1986              30
 2   2005 Ajaccio Ligue 1 Gaspar Az… POR     DF             1975            1224
 3   2005 Ajaccio Ligue 1 Yacine Be… ALG     MF             1981             140
 4   2005 Ajaccio Ligue 1 Nicolas B… FRA     MF             1976             892
 5   2005 Ajaccio Ligue 1 Marcelinh… BRA     MF             1971             704
 6   2005 Ajaccio Ligue 1 Cyril Cha… FRA     FW             1979            1308
 7   2005 Ajaccio Ligue 1 Xavier Co… FRA     DF             1974            2734
 8   2005 Ajaccio Ligue 1 Renaud Co… FRA     MF             1980             896
 9   2005 Ajaccio Ligue 1 Yohan Dem… FRA     DF,MF          1978            2658
10   2005 Ajaccio Ligue 1 Christoph… FRA     DF             1973              74
# ℹ 40,383 more rows
# ℹ 8 more variables: minutes_per_match_playing <dbl>, goals <dbl>,
#   assist <dbl>, goals_minus_pk <dbl>, pk <dbl>, pk_attemp <dbl>,
#   yellow_card <dbl>, red_card <dbl>

 

Try to answer the questions posed in the workbook.

🐣 Case study III: speechs

Although we cannot do arithmetic operations with them, some operations we can do with the text strings will be important. For that we will use the {stringr} package (within the same {lubridate} “universe of packages”).

library(stringr)

The dataset will be discursos (speeches, extracted from https://github.com/lirondos/discursos-de-navidad) where are stored the christmas speeches of the heads of state of Spain from 1946 to 2021 (in dictatorship and in democracy)

load(file = "./data/discursos.RData")
# discursos = speeches in spanish

Try to answer the questions posed in the workbook.

L5: loops

Coming back to R base: loops

Loops

Although in most occasions they can be replaced by other more efficient and readable structures, it is important to know one of the most famous control expressions: the loops.

  • for { }: allows [repeating the same code]{. hl-yellow} in a prefixed and known number of times.

  • while { }: allows repeating the same code but in an undetermined number of times (until a condition is no longer fulfilled).

For loop

A for loop is a structure that allows to repeat a set of commands a finite, prefixed and known number of times given a set of indices.

Let’s define a vector x <- c(0, -7, 1, 4) and another empty variable y. After that we will define a for loop with for () { }: inside the brackets we will indicate an index and some values to traverse, inside the braces the code to execute in each iteration (in this case, fill y as x + 1).

x <- c(0, -7, 1, 4)
y <- c()

For loop

A for loop is a structure that allows to repeat a set of commands a finite, prefixed and known number of times given a set of indices.

Let’s define a vector x <- c(0, -7, 1, 4) and another empty variable y. After that we will define a for loop with for () { }: inside the brackets we will indicate an index and some values to traverse, inside the braces the code to execute in each iteration (in this case, fill y as x + 1).

x <- c(0, -7, 1, 4)
y <- c()

for (i in 1:4) {
  
}

For loop

A for loop is a structure that allows to repeat a set of commands a finite, prefixed and known number of times given a set of indices.

Let’s define a vector x <- c(0, -7, 1, 4) and another empty variable y. After that we will define a for loop with for () { }: inside the brackets we will indicate an index and some values to traverse, inside the braces the code to execute in each iteration (in this case, fill y as x + 1).

x <- c(0, -7, 1, 4)
y <- c()

for (i in 1:4) {
  y[i] <- x[i] + 1
}

For loop

Note that because R works in a default vector manner, the loop is the same as doing x + 1 directly.

x <- c(0, -7, 1, 4)
y <- c()

for (i in 1:4) {
  y[i] <- x[i] + 1
}
y
[1]  1 -6  2  5
y2 <- x + 1
y2
[1]  1 -6  2  5

For loop

Another common option is to indicate the indexes “automatically”: from the first 1 to the last (corresponding to the length of x length(x)).

x <- c(0, -7, 1, 4)
y <- c()

for (i in 1:length(x)) {
  y[i] <- x[i] + 1
}
y
[1]  1 -6  2  5

For loop

Thus the general structure of a for-loop will always be as follows

for (index in set) { 
  código (usually depending on index)
}

In the case of for loops ALWAYS we know how many iterations we have (as many as there are elements in the set to be indexed).

Avoiding loop

As we have already learned with the {microbenchmark} package, we can check how loops are usually very inefficient (hence we should avoid them in most occasions)

library(microbenchmark)
x <- 1:1000
microbenchmark(y <- x^2, 
               for (i in 1:100) { y[i] <- x[i]^2 },
               times = 500)
Unit: nanoseconds
                                    expr    min       lq     mean   median
                                y <- x^2    984   1107.0   1221.8   1189.0
 for (i in 1:100) {     y[i] <- x[i]^2 } 814219 821414.5 867094.2 826867.5
     uq     max neval
   1230    2788   500
 841484 3653469   500

For loop

We can see another example of a combining numbers and text loop: we define a vector of ages and names, and print the i-th name and age.

names <- c("Javi", "Sandra", "Carlos", "Marcos", "Marta")
ages <- c(33, 27, 18, 43, 29)

for (i in 1:5) { 
  
  print(glue("{names[i]} are {ages[i]} old")) 
  
}
Javi are 33 old
Sandra are 27 old
Carlos are 18 old
Marcos are 43 old
Marta are 29 old

For loop

Although they are usually indexed with numeric vectors, loops can be indexed on any vector structure, regardless of the type of the set.

library(stringr)
week_days <- c("monday", "tuesday", "wednesday", "thursday",
               "friday", "saturday", "sunday")

for (days in week_days) {
  
  print(days)
}
[1] "monday"
[1] "tuesday"
[1] "wednesday"
[1] "thursday"
[1] "friday"
[1] "saturday"
[1] "sunday"

For + if-else

Let’s combine conditional structures and loops: using the swiss set of the {datasets} package, let’s assign NA if the fertility values are greater than 80.

for (i in 1:nrow(swiss)) {
  
  if (swiss$Fertility[i] > 80) { 
    
    swiss$Fertility[i] <- NA
    
  }
}

This is «the same» as a vectorized if_else().

data("swiss")
swiss$Fertility <- if_else(swiss$Fertility > 80, NA, swiss$Fertility)

While loop

Another way to create a loop is with the while { } structure, which will loop an unknown number of times, until a condition stops being met (in fact it may never end). For example, we will inialize a variable times <- 1, which we will increment at each step, and we will not exit the loop until times > 3.

times <- 1
while(times <= 3) {
  
  print(glue("Not yet, we are in the {times}-th iteration")) 
  times <- times + 1
  
}
Not yet, we are in the 1-th iteration
Not yet, we are in the 2-th iteration
Not yet, we are in the 3-th iteration
print(glue("Now! We are in the {times}-th iteration")) 
Now! We are in the 4-th iteration

While loop

A while loop will always look like this

while(condition) {
  
  code to be executed while condition is TRUE
  # usually some variable is updated here
  
}

While loop

What happens when the condition is never FALSE? Try it yourself

while (1 > 0) {
  
  print("Press ESC to exit")
  
}

 

Warning

A while { } loop can be quite “dangerous” if we do not control well how to stop it.

While loop

We have two reserved commands to abort a loop or force it forward:

  • break: allows abort a loop even if its end has not been reached
for(i in 1:10) {
  if (i == 3) {
    
    break # if i = 3, we abort
    
  }
  print(i)
}
[1] 1
[1] 2

While loop

We have two reserved commands to abort a loop or force it forward:

  • next: forces a loop to advance to the next iteration
for(i in 1:5) {
  if (i == 3) {
    
    next # if i = 3, we advance to the next iteration
    
  }
  print(i)
}
[1] 1
[1] 2
[1] 4
[1] 5

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Modify the code below to print a message on the screen if and only if all the data in airquality is for a month other than January.

library(datasets)
months <- airquality$Month

if (months == 2) {
  print("No data in January")
}
Código
library(datasets)
months <- airquality$Month

if (all(months != 1)) {
  print("No data in January")
}

📝 Modify the code below to store in a variable called temp_high a TRUE if any of the records has a temperature above 90 degrees Fahrenheit and FALSE in any other case.

temp <- airquality$Temp

if (temp == 100) {
  print("Some of the records have temperatures in excess of 90 degrees Fahrenheit")
}
Código
# Option 1
temp <- airquality$Temp
temp_high <- FALSE
if (any(temp > 90)) {
   temp_high <- TRUE
}

# Option 2
temp_high <- any(airquality$Temp > 90)

📝 Modify the code below to design a for loop of 5 iterations that only loops through the first 5 odd (and at each step of the loop prints them)

for (i in 1:5) {
  
  print(i)
}
Código
for (i in c(1, 3, 5, 7, 9)) {
  
  print(i)
}

📝 Modify the code below to design a while loop that starts with a counter count <- 1 and stops when it reaches 6

count <- 1
while (count == 2) {
  
  print(count)
}
Código
count <- 1
while (count < 6) {
  
  print(count)
  count <- count + 1
  
}

🐣 Case study I: simulations

Try to answer the questions posed in the workbook. On it you will have to design some simulation studies using conditional loops and structures.

🐣 Case study II: Monty Hall problem

Imagine you are in a TV contest where you are given a choice of 3 doors: in one there is a millionaire prize and in the other 2 there is an oreo cookie.

 

Try to answer the questions posed in the workbook. On it you will have to design some simulation studies using conditional loops and structures.

L6: functions and joins

Coming back to R base: functions. Joins

Own functions

Not only can we use default functions that come already loaded in packages, we can also create our own functions to automate tasks. How to create our own function? Let’s look at its basic scheme:

  • Name: for example name_fun (no spaces or strange characters). To the name we assign the reserved word function().

  • Define input arguments (inside function()).

  • Body of the function inside { }.

  • We end the function with the output arguments with return().

name_fun <- function() {
  
}

Own functions

Not only can we use default functions that come already loaded in packages, we can also create our own functions to automate tasks. How to create our own function? Let’s look at its basic scheme:

  • Name: for example name_fun (no spaces or strange characters). To the name we assign the reserved word function().

  • Define input arguments (inside function()).

  • Body of the function inside { }.

  • We end the function with the output arguments with return().

name_fun <- function(arg1, arg2, ...) {
  
}

Own functions

Not only can we use default functions that come already loaded in packages, we can also create our own functions to automate tasks. How to create our own function? Let’s look at its basic scheme:

  • Name: for example name_fun (no spaces or strange characters). To the name we assign the reserved word function().

  • Define input arguments (inside function()).

  • Body of the function inside { }.

  • We end the function with the output arguments with return().

name_fun <- function(arg1, arg2, ...) {
  
  code to be executed
  
}

Own functions

Not only can we use default functions that come already loaded in packages, we can also create our own functions to automate tasks. How to create our own function? Let’s look at its basic scheme:

  • Name: for example name_fun (no spaces or strange characters). To the name we assign the reserved word function().

  • Define input arguments (inside function()).

  • Body of the function inside { }.

  • We end the function with the output arguments with return().

name_fun <- function(arg1, arg2, ...) {
  
  code to be executed
  
  return(var_output)
  
}

Own functions

  • arg1, arg2, ...: will be the input arguments, the arguments that the function takes to execute the code inside.

  • code: lines of code that we want to execute the function.

  • return(var_output): the output arguments will be entered.

name_fun <- function(arg1, arg2, ...) {
  
  # Code to be executed
  code
  
  # Output
  return(var_output)
  
}

Important

All variables that we define inside the function are LOCAL variables: they will only exist inside the function unless we specify otherwise.

Own functions

Let’s look at a very simple example of a function for calculating the area of a rectangle.

Since the area of a rectangle is calculated as the product of its sides, we will need just that, its sides: those will be the input arguments and the value to return will be just its area (\(side_1 * side_2\)).

# We define the name of function and input arguments
compute_area <- function(side_1, side_2) {
  
}

Own functions

Let’s look at a very simple example of a function for calculating the area of a rectangle.

Since the area of a rectangle is calculated as the product of its sides, we will need just that, its sides: those will be the input arguments and the value to return will be just its area (\(side_1 * side_2\)).

# We define the name of function and input arguments
compute_area <- function(side_1, side_2) {
  
  area <- side_1 * side_2
  
}

Own functions

Let’s look at a very simple example of a function for calculating the area of a rectangle.

Since the area of a rectangle is calculated as the product of its sides, we will need just that, its sides: those will be the input arguments and the value to return will be just its area (\(side_1 * side_2\)).

# We define the name of function and input arguments
compute_area <- function(side_1, side_2) {
  
  area <- side_1 * side_2
  return(area)
  
}

Own functions

We can also make a direct definition of variables without storing along the way.

# We define the name of function and input arguments
compute_area <- function(side_1, side_2) {
  
  return(side_1 * side_2)
  
}

How to apply our function?

compute_area(5, 3) # area of 5 x 3 rectangle
[1] 15
compute_area(1, 5) # area of 1 x 5 rectangle
[1] 5

Own functions

Consejo

Although it is not necessary, it is recommendable to make explicit the calling of the arguments, specifying in the code what value is for each argument so that it does not depend on its order, making the code more readable.

compute_area(side_1 = 5, side_2 = 3) # area of 5 x 3 rectangle
[1] 15
compute_area(side_2 = 3, side_1 = 5) # area of 5 x 3 rectangle
[1] 15

Default arguments

Imagine now that we realize that 90% of the time we use such a function to default calculate the area of a square (i.e., we only need one side). To do this, we can define default arguments in the function: they will take that value unless we assign another one.

Why not assign side_2 = side_1 default, to save lines of code and time?

compute_area <- function(side_1, side_2 = side_1) {
  
  # Code to be executed
  area <- side_1 * side_2
  
  # Output
  return(area)
  
}

Default arguments

compute_area <- function(side_1, side_2 = side_1) {
  
  # Code to be executed
  area <- side_1 * side_2
  
  # Output
  return(area)
  
}

Now default the second side will be equal to the first (if added it will use both).

compute_area(side_1 = 5) # square
[1] 25
compute_area(side_1 = 5, side_2 = 7) # rectangle
[1] 35

Multiple outputs

Let’s complicate the function a bit and add in the output the values of each side, labeled side_1 and side_2, packing the output in a vector.

compute_area <- function(side_1, side_2 = side_1) {
  
  # Code
  area <- side_1 * side_2
  
  # Output
  return(c("area" = area, "side_1" = side_1, "side_2" = side_2))
  
}

Multiple outputs

We can complicate the output a little more by adding a fourth variable that tells us, depending on the arguments, whether rectangle or square, having to add a character (or logic) variable in the output.

compute_area <- function(side_1, side_2 = side_1) {
  
  # Code
  area <- side_1 * side_2
  
  # Output
  return(c("area" = area, "side_1" = side_1, "side_2" = side_2,
           "type" = if_else(side_1 == side_2, "square", "rectangle")))
  
}
compute_area(5, 3)
       area      side_1      side_2        type 
       "15"         "5"         "3" "rectangle" 

Problem: when trying to put numbers and text together, it converts everything to numbers. We could store it all in a tibble() as we have learned or in an object known in R as lists (we will see it later).

Order of arguments

Before we did not care about the order of the arguments, but now the order of the input arguments matters, since we include side_1 and side_2 in the output.

Tip

As mentioned, it is highly recommended to make the function call explicitly setting the arguments to improve legibility and interpretability.

# Equivalent to compute_area(5, 3)
compute_area(side_1 = 5, side_2 = 3)
       area      side_1      side_2        type 
       "15"         "5"         "3" "rectangle" 

Generating knowledge

It seems silly what we have done but we have crossed an important frontier: we have gone from consuming knowledge (code from other packages, elaborated by others), to generating knowledge, creating our own functions.

Functions are going to be key in your day-to-day work because they will allow you to automate code that you are going to repeat over and over again: by packaging that code under an alias (function name) you will be able to use it over and over again without programming it (so doing twice as much work will not imply working twice as much)

Local vs global variables

An important aspect to think about with functions: what happens if we name a variable inside a function to which we have forgotten to assign a value inside the function.

We must be cautious when using functions in R, since due to the “lexicographic rule”, if a variable is not defined inside the function, R will look for that variable in the environment of variables.

x <- 1
fun_example <- function() {
    
  print(x) # No output, just doing an action
}
fun_example()
[1] 1

Local vs global variables

If a variable is already defined outside the function (global environment), and is also used inside changing its value, the value only changes inside but not in the global environment.

x <- 1
fun_example <- function() {
    
  x <- 2
  print(x) # value inside of function
}
# value inside of function (local)
fun_example()
[1] 2
# value output of function (global)
print(x)
[1] 1

Local vs global variables

If we want it to change locally as well as globally we must use the double assignment (<<-).

x <- 1
y <- 2
fun_example <- function() {
  
  # no change in a global way, just locally
  x <- 3 
  # change in a global way
  y <<- 0 #<<
  
  print(x)
  print(y)
}

fun_example() # value inside function (local)
[1] 3
[1] 0
x # global value
[1] 1
y # global value
[1] 0

💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Modify the code below to define a function called sum_function, so that given two elements, it returns their sum.

name <- function(x, y) {
  sum_output <- # code
  return()
}
# we apply the function
sum_function(3, 7)
Código
sum_function<- function(x, y) {
  sum_output <- x + y
  return(sum_output)
}
sum_function(3, 7)

📝 Modify the code below to define a function called product_function, so that given two elements, it returns their product, but by default it calculates the square

name <- function(x, y) {
  prod_output <- # code
  return()
}
product_function(3)
product_function(3, -7)
Código
product_function <- function(x, y = x) {
  
  prod_output <- x * y
  return(prod_output)
  
}
product_function(3)
product_function(3, -7)

📝 Define a function called equal_names that, given two names, tells us if they are equal or not. Do this by considering case-sensitive, and case-insensitive. Use the {stringr} package.

Código
# Case-sensitive
equal_names <- function(person_1, person_2) {
  
  return(person_1 == person_2)
  
}
equal_names("Javi", "javi")
equal_names("Javi", "Lucía")

# Case-insensitive
library(stringr)
equal_names <- function(person_1, person_2) {
  
  return(str_equal(person_1, person_2, ignore_case = TRUE))
  
  # other option
  # return(str_to_lower(person_1) == str_to_lower(person_2))
  
  # other option
  # return(str_to_upper(person_1) == str_to_upper(person_2))
  
}
equal_names("Javi", "javi")
equal_names("Javi", "Lucía")

📝 Create a function called compute_BMI that, given two arguments (weight and height in meters) and a name, returns a list with the BMI (\(weight/(height^2)\)) and the name.

Código
compute_BMI <- function(name, weight, height) {
  
  return(list("name" = name, "BMI" = weight/(height^2)))
  
}

📝 Repeat the previous exercise but with another optional argument called units (by default, units = “meters”). Develop the function so that it does the right thing if units = “meters” and if units = “centimeters”.

Código
compute_BMI <- function(name, weight, height, units = "meters") {
  
  if (units == "meters") {
    
    return(list("name" = name, "BMI" = weight / (height^2)))
  } else if (units == "centimeters") {
    
    return(list("name" = name, "BMI" = weight / ((height/100)^2)))
  }
}

📝 Create a fictitious tibble of 7 persons, with three variables (invent name, and simulate weight, height in centimeters), and apply the defined function so that we obtain a fourth column with their BMI.

Código
data <-
  tibble("name" = c("javi", "sandra", "laura",
                       "ana", "carlos", "leo", NA),
         "weight" = rnorm(n = 7, mean = 70, sd = 1),
         "height" = rnorm(n = 7, mean = 168, sd = 5))

data |> 
  mutate(BMI = compute_BMI(name, weight, height, units = "centimeters")$BMI)

📝 Create a function called shortcut that has two numeric arguments x and y. If both are equal, you should return equal and have the function terminate automatically (think about when a function exits). WARNING: x and y could be vectors. If they are different (of equal length) calculate the proportion of different elements. If they are different (being of different length), it returns the elements that are not common.

Código
shortcut <- function(x, y) {
  
  if (all(x == y) & length(x) == length(y)) { return("equal") }
  else {
   
    if (length(x) == length(y)) {
      
      n_diff <- sum(x != y) / length(x)
      return(n_diff)
      
    } else {
      
      diff_elem <- unique(c(setdiff(x, y), setdiff(y, x)))
      return(diff_elem)
    }
    
  }
}

🐣 Case study I: converter

To practice with functions we are going to create a complete temperature converter that, given a temperature in Fahrenheit, Celsius or Kelvin, converts it to any of the others.

 

Try to answer the questions posed in the workbook

🐣 Case study II: covid data

Pending to submit

Joins

When working with data we will not always have the information in a single table, and sometimes, we will be interested in cross-referencing information from different sources.

For this we will use a classic of every language that handles data: the famous join, a tool that will allow us to cross one or several tables, making use of a identifying column of each one of them.

table_1 |>
  xxx_join(table_2, by = id)

Joins

  • inner_join(): only records with id in both tables survive.

  • full_join(): keeps all records in both tables.

  • left_join(): keeps all the records of the first table, and looks for which ones have id also in the second one (in case of not having it, it fills with NA the fields of the 2nd table).

  • right_join(): keeps all records in the second table, and searches which ones have id also in the first one.

Joins

Let’s test the various joins with a simple example

tb_1 <- tibble("key" = 1:3, "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("key" = c(1, 2, 4), "val_y" = c("y1", "y2", "y3"))
tb_1
# A tibble: 3 × 2
    key val_x
  <int> <chr>
1     1 x1   
2     2 x2   
3     3 x3   
tb_2
# A tibble: 3 × 2
    key val_y
  <dbl> <chr>
1     1 y1   
2     2 y2   
3     4 y3   

left_join()

Imagine that we want to incorporate to tb_1 the information from table_2, identifying the records by the key column (by = "key", the column it has to cross): we want to keep all the records of the first table and look for which ones have the same value in key also in the second one.

tb_1 |> 
  left_join(tb_2, by = "key")
# A tibble: 3 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 

left_join()

tb_1 |>
  left_join(tb_2, by = "key")
# A tibble: 3 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 

Notice that the records in the first one whose key was not found in the second one has given them the value of absent.

right_join()

The right_join() will perform the opposite operation: we will now incorporate to tb_2 the information from table_2, identifying the records by the key column: we want to keep all the records of the second one and look for which ones have id (same value in key) also in the first table.

tb_1 |> 
  right_join(tb_2, by = "key")
# A tibble: 3 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y3   

right_join()

tb_1 |>
  right_join(tb_2, by = "key")
# A tibble: 3 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y3   

Notice that now the records of the second one whose key was not found in the first one are the ones given the value of absent.

keys and suffixes

The key columns we will use for the crossover will not always be named the same.

tb_1 <- tibble("key_1" = 1:3, "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("key_2" = c(1, 2, 4), "val_y" = c("y1", "y2", "y3"))
  • by = c("key_2" = "key_2"): we will indicate in which column of each table are the keys that we are going to cross.
# Left
tb_1  |> 
  left_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 3 × 3
  key_1 val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
# Right
tb_1 |> 
  right_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 3 × 3
  key_1 val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y3   

keys and suffixes

We can also cross over several columns at the same time (it will interpret as equal record the one that has the same set of keys), with by = c("var1_t1" = "var1_t2", "var2_t1" = "var2_t2", ...). Let’s modify the previous example

tb_1 <- tibble("k_11" = 1:3, "k_12" = c("a", "b", "c"),  "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("k_21" = c(1, 2, 4), "k_22" = c("a", "b", "e"), "val_y" = c("y1", "y2", "y3"))
# Left
tb_1 |> 
  left_join(tb_2,
            by = c("k_11" = "k_21", "k_12" = "k_22"))
# A tibble: 3 × 4
   k_11 k_12  val_x val_y
  <dbl> <chr> <chr> <chr>
1     1 a     x1    y1   
2     2 b     x2    y2   
3     3 c     x3    <NA> 
# Right
tb_1 |> 
  right_join(tb_2,
             by = c("k_11" = "k_21", "k_12" = "k_22"))
# A tibble: 3 × 4
   k_11 k_12  val_x val_y
  <dbl> <chr> <chr> <chr>
1     1 a     x1    y1   
2     2 b     x2    y2   
3     4 e     <NA>  y3   

keys and suffixes

It could also happen that when crossing two tables, there are columns of values named the same

tb_1 <- tibble("key_1" = 1:3, "val" = c("x1", "x2", "x3"))
tb_2 <- tibble("key_2" = c(1, 2, 4), "val" = c("y1", "y2", "y3"))
# Left
tb_1 |>
  left_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 3 × 3
  key_1 val.x val.y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 

Notice that default adds the suffixes .x and .y to tell us which table they come from.

keys and suffixes

This suffix can be specified in the optional argument suffix = ..., which allows us to distinguish the variables of one table from another.

# Left
tb_1 |>
  left_join(tb_2, by = c("key_1" = "key_2"),
            suffix = c("_table1", "_table2"))
# A tibble: 3 × 3
  key_1 val_table1 val_table2
  <dbl> <chr>      <chr>     
1     1 x1         y1        
2     2 x2         y2        
3     3 x3         <NA>      

full_join()

The two previous cases form what is known as outer joins: crosses where observations are kept that appear in at least one table. The third outer join is known as full_join() which will keep observations from both tables, adding rows that do not match the other table.

tb_1  |> 
  full_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 4 × 3
  key_1 val.x val.y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
4     4 <NA>  y3   

inner_join()

Opposite the outer join is what is known as inner join, with inner_join(): a join in which only the observations that appear in both tables are kept, only those records that are patched are kept.

tb_1 |> 
  inner_join(tb_2,  by = c("key_1" = "key_2"))
# A tibble: 2 × 3
  key_1 val.x val.y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   

inner_join()

Note that in terms of records, inner_join if it is commutative, we don’t care about the order of the tables: the only thing that changes is the order of the columns it adds.

tb_1 |> 
  inner_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 2 × 3
  key_1 val.x val.y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
tb_2 |> inner_join(tb_1, by = c("key_2" = "key_1"))
# A tibble: 2 × 3
  key_2 val.x val.y
  <dbl> <chr> <chr>
1     1 y1    x1   
2     2 y2    x2   

semi/anti_join()

Finally we have two interesting tools to filter (not cross) records: semi_join() and anti_join(). The semi join leaves us in the first table the records whose key is also in the second table (like an inner join but without adding the info from the second table). And the second one, the anti join, does just the opposite (those that are not).

# semijoin
tb_1 |> 
  semi_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 2 × 2
  key_1 val  
  <int> <chr>
1     1 x1   
2     2 x2   
# antijoin
tb_1 |> 
  anti_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 1 × 2
  key_1 val  
  <int> <chr>
1     3 x3   

💻 It’s your turn

For the exercises we will use the tables available in the package {nycflights13}.

library(nycflights13)
  • airlines: name of airlines (with their abbreviation).
  • airports: airport data (names, longitude, latitude, altitude, etc).
  • flights, planes: flight and aircraft data.
  • weather: hourly weather data.

💻 It’s your turn

📝 From package {nycflights13} incorporates into the flights table the airline data in airlines. We want to maintain all flight records, adding the airline information to the airlines table.

Código
flights_airlines <-
  flights |> 
  left_join(airlines, by = "carrier")
flights_airlines

📝 To the table obtained from the crossing of the previous section, include the data of the aircraft in planes, but including only those flights for which we have information on their aircraft (and vice versa).

Código
flights_airlines_planes <- 
  flights_airlines |> 
  inner_join(planes, by = "tailnum")
flights_airlines_planes

📝 Repeat the previous exercise but keeping both year variables (in one is the year of flight, in the other is the year of construction of the aircraft), and distinguishing them from each other

Código
flights_airlines_planes <- 
  flights_airlines |>
  inner_join(planes, by = "tailnum",
             suffix = c("_flight", "_build_aircraft"))
flights_airlines_planes

📝 To the table obtained from the previous exercise includes the longitude and latitude of the airports in airports, distinguishing between the latitude/longitude of the airport at destination and at origin.

Código
flights_airlines_planes |> 
  left_join(airports |> select(faa, lat, lon),
            by = c("origin" = "faa")) |> 
  rename(lat_origin = lat, lon_origin = lon) |> 
  left_join(airports |> select(faa, lat, lon),
            by = c("dest" = "faa")) |> 
  rename(lat_dest = lat, lon_dest = lon)

📝 Filter from airports only those airports from which flights depart. Repeat the process filtering only those airports where flights arrive.

Código
airports |> 
  semi_join(flights, by = c("faa" = "origin"))
airports |> 
  semi_join(flights, by = c("faa" = "dest"))

📝 How many flights do we not have information about the aircraft? Eliminate flights that do not have an aircraft ID (other than NA) beforehand.

Código
flights |>
  drop_na(tailnum) |>
  anti_join(planes, by = "tailnum") |>
  count(tailnum, sort = TRUE) 

🐣 Case study III: pending

pending to submit about joins + import/export

L7: import/exports

Import and export data

Import/export

So far we have only used data already loaded in packages but many times we will need to import data externally. One of the main strengths of R is that we can import data very easily in different formats:

  • R native formats: .rda, .RData and .rds formats.

  • Rectangular (tabular) data: .csv and .tsv formats

  • Untabulated data: .txt format.

  • Data in excel: .xls and .xlsx formats

  • Data from SAS/Stata/SPSS: .sas7bdat, .sav and .dat formats

  • Data from Google Drive

  • Data from API’s: aemet, catastro, twitter, spotify, etc.

R native formats

The simplest files to import into R (and which usually take up less disk space) are its own native extensions: files in .RData, .rda and .rds formats. To load the former we simply need to use the native function load() by providing it the file path.

  • RData file: we are going to import the file world_bank_pop.RData, which includes the dataset world_bank_pop
load("./data/world_bank_pop.RData")
world_bank_pop
# A tibble: 1,064 × 20
   country indicator      `2000`  `2001`  `2002`  `2003`  `2004`  `2005`  `2006`
   <chr>   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ABW     SP.URB.TOTL    4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
 2 ABW     SP.URB.GROW    1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
 3 ABW     SP.POP.TOTL    8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
 4 ABW     SP.POP.GROW    2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
 5 AFE     SP.URB.TOTL    1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
 6 AFE     SP.URB.GROW    3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
 7 AFE     SP.POP.TOTL    4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
 8 AFE     SP.POP.GROW    2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
 9 AFG     SP.URB.TOTL    4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
10 AFG     SP.URB.GROW    1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
# ℹ 1,054 more rows
# ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
#   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
#   `2016` <dbl>, `2017` <dbl>

R native formats

  • .rda file: we will import the airquality dataset from airquality.rda
load("./data/airquality.rda")
airquality |> as_tibble()
# A tibble: 153 × 6
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    41     190   7.4    67     5     1
 2    36     118   8      72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    NA      NA  14.3    56     5     5
 6    28      NA  14.9    66     5     6
 7    23     299   8.6    65     5     7
 8    19      99  13.8    59     5     8
 9     8      19  20.1    61     5     9
10    NA     194   8.6    69     5    10
# ℹ 143 more rows

R native formats

Note that files loaded with load() are automatically loaded into the environment (with the originally saved name), and not only datasets can be loaded: load() allows us to load multiple objetcs (not only a tabular data)

Native .rda and .RData files are a properly way to save your environment.

load(file = "./data/multiple_objects.rda")

R native formats

  • .rds files: for this type we must use readRDS(), and we need to incorporate a argument file with the path. In this case we are going to import lung cancer data from the North Central Cancer Treatment Group. Note that now .rds import files are a unique database
lung_cancer <-
  readRDS(file = "./data/NCCTG_lung_cancer.rds") |>
  as_tibble()

Important

The paths must always be without spaces, ñ, or accents.

Tabular data: readr

The {readr} package within the {tidyverse} environment contains several useful functions for loading rectangular data (without formatting).

  • read_csv(): .csv files whose separator is comma
  • read_csv2(): semicolon
  • read_tsv(): tabulator.
  • read_table(): space.
  • read_delim(): generic function for character delimited files.

All of them need as argument the file path plus other optional (skip header or not, decimals, etc). See more at https://readr.tidyverse.org/

Tabular data (.csv, .tsv)

The main advantage of {readr} is that it automates formatting to go from a flat (unformatted) file to a tibble (in rows and columns, with formatting).

  • File .csv: with read_csv() we will load comma separated files, passing as argument the path in file = .... Let’s import the chickens.csv dataset (about cartoon chickens, why not). If you look at the output it gives us the type of variables.
library(readr)
chickens <- read_csv(file = "./data/chickens.csv")
chickens
# A tibble: 5 × 4
  chicken                 sex     eggs_laid motto                               
  <chr>                   <chr>       <dbl> <chr>                               
1 Foghorn Leghorn         rooster         0 That's a joke, ah say, that's a jok…
2 Chicken Little          hen             3 The sky is falling!                 
3 Ginger                  hen            12 Listen. We'll either die free chick…
4 Camilla the Chicken     hen             7 Bawk, buck, ba-gawk.                
5 Ernie The Giant Chicken rooster         0 Put Captain Solo in the cargo hold. 

Tabular data (.csv, .tsv)

The variable format will normally be done automatically by read_csv(), and we can query it with spec().

spec(chickens)
cols(
  chicken = col_character(),
  sex = col_character(),
  eggs_laid = col_double(),
  motto = col_character()
)

Tabular data (.csv, .tsv)

Although it usually does it well automatically we can specify the format explicitly in col_types = list() (in list format, with col_xxx() for each type of variable, for example eggs_laid will be imported as character).

chickens <-
  read_csv(file = "./data/chickens.csv",
           col_types = list(col_character(), col_character(),
                            col_character(), col_character()))
chickens
# A tibble: 5 × 4
  chicken                 sex     eggs_laid motto                               
  <chr>                   <chr>   <chr>     <chr>                               
1 Foghorn Leghorn         rooster 0         That's a joke, ah say, that's a jok…
2 Chicken Little          hen     3         The sky is falling!                 
3 Ginger                  hen     12        Listen. We'll either die free chick…
4 Camilla the Chicken     hen     7         Bawk, buck, ba-gawk.                
5 Ernie The Giant Chicken rooster 0         Put Captain Solo in the cargo hold. 

Tabular data (.csv, .tsv)

We can even indicate that variables we want to select (without occupying memory), by indicating it in col_select = ... (in list format, with col_select = ...).

chickens <-
  read_csv(file = "./data/chickens.csv",
           col_select = c(chicken, sex, eggs_laid))
chickens
# A tibble: 5 × 3
  chicken                 sex     eggs_laid
  <chr>                   <chr>       <dbl>
1 Foghorn Leghorn         rooster         0
2 Chicken Little          hen             3
3 Ginger                  hen            12
4 Camilla the Chicken     hen             7
5 Ernie The Giant Chicken rooster         0

Untabulated data (.txt)

What happens when the separator is not correct?

If we use read_csv() it expects the separator between columns to be a comma but, as you can see with the following .txt, it interprets everything as a single column: has no comma and does not know where to separate

datos_txt <- read_csv(file = "./data/massey-rating.txt")
dim(datos_txt)
[1] 10  1
as_tibble(datos_txt)
# A tibble: 10 × 1
   `UCC PAY LAZ KPK  RT   COF BIH DII ENG ACU Rank Team            Conf`
   <chr>                                                                
 1 1   1   1   1   1     1   1   1   1   1    1 Ohio St          B10    
 2 2   2   2   2   2     2   2   2   4   2    2 Oregon           P12    
 3 3   4   3   4   3     4   3   4   2   3    3 Alabama          SEC    
 4 4   3   4   3   4     3   5   3   3   4    4 TCU              B12    
 5 6   6   6   5   5     7   6   5   6  11    5 Michigan St      B10    
 6 7   7   7   6   7     6  11   8   7   8    6 Georgia          SEC    
 7 5   5   5   7   6     8   4   6   5   5    7 Florida St       ACC    
 8 8   8   9   9  10     5   7   7  10   7    8 Baylor           B12    
 9 9  11   8  13  11    11  12   9  14   9    9 Georgia Tech     ACC    
10 13  10  13  11   8     9  10  11   9  10   10 Mississippi      SEC   

Untabulated data (.txt)

To do this we have.

  • read_csv2() when the separator is semicolon, read_tsv() when the is a tab and read_table() when the is a space.

  • read_delim() in general.

datos_txt <- read_table(file = "./data/massey-rating.txt")
as_tibble(datos_txt)
# A tibble: 10 × 13
     UCC   PAY   LAZ   KPK    RT   COF   BIH   DII   ENG   ACU  Rank Team  Conf 
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
 1     1     1     1     1     1     1     1     1     1     1     1 Ohio  St   
 2     2     2     2     2     2     2     2     2     4     2     2 Oreg… P12  
 3     3     4     3     4     3     4     3     4     2     3     3 Alab… SEC  
 4     4     3     4     3     4     3     5     3     3     4     4 TCU   B12  
 5     6     6     6     5     5     7     6     5     6    11     5 Mich… St   
 6     7     7     7     6     7     6    11     8     7     8     6 Geor… SEC  
 7     5     5     5     7     6     8     4     6     5     5     7 Flor… St   
 8     8     8     9     9    10     5     7     7    10     7     8 Bayl… B12  
 9     9    11     8    13    11    11    12     9    14     9     9 Geor… Tech 
10    13    10    13    11     8     9    10    11     9    10    10 Miss… SEC  

Excel data (.xls, .xlsx)

Another key import package will be the {readxl} package for importing data from Excel. Three functions will be key:

  • read_xls() specific to .xls, read_xlsx() specific to .xlsx.
  • read_excel(): for both .xls and .xlsx.

We are going to import deaths.xlsx with celebrity death records.

library(readxl)
deaths <- read_xlsx(path = "./data/deaths.xlsx")
deaths
# A tibble: 18 × 6
   `Lots of people`             ...2                     ...3  ...4  ...5  ...6 
   <chr>                        <chr>                    <chr> <chr> <chr> <chr>
 1 simply cannot resist writing <NA>                     <NA>  <NA>  <NA>  some…
 2 at                           the                      top   <NA>  of    thei…
 3 or                           merging                  <NA>  <NA>  <NA>  cells
 4 Name                         Profession               Age   Has … Date… Date…
 5 David Bowie                  musician                 69    TRUE  17175 42379
 6 Carrie Fisher                actor                    60    TRUE  20749 42731
 7 Chuck Berry                  musician                 90    TRUE  9788  42812
 8 Bill Paxton                  actor                    61    TRUE  20226 42791
 9 Prince                       musician                 57    TRUE  21343 42481
10 Alan Rickman                 actor                    69    FALSE 16854 42383
11 Florence Henderson           actor                    82    TRUE  12464 42698
12 Harper Lee                   author                   89    FALSE 9615  42419
13 Zsa Zsa Gábor                actor                    99    TRUE  6247  42722
14 George Michael               musician                 53    FALSE 23187 42729
15 Some                         <NA>                     <NA>  <NA>  <NA>  <NA> 
16 <NA>                         also like to write stuff <NA>  <NA>  <NA>  <NA> 
17 <NA>                         <NA>                     at t… bott… <NA>  <NA> 
18 <NA>                         <NA>                     <NA>  <NA>  <NA>  too! 

Excel data (.xls, .xlsx)

deaths |> slice(1:6)
# A tibble: 6 × 6
  `Lots of people`             ...2       ...3  ...4     ...5          ...6     
  <chr>                        <chr>      <chr> <chr>    <chr>         <chr>    
1 simply cannot resist writing <NA>       <NA>  <NA>     <NA>          some not…
2 at                           the        top   <NA>     of            their sp…
3 or                           merging    <NA>  <NA>     <NA>          cells    
4 Name                         Profession Age   Has kids Date of birth Date of …
5 David Bowie                  musician   69    TRUE     17175         42379    
6 Carrie Fisher                actor      60    TRUE     20749         42731    

One thing that is very common misfortune is that there is some kind of comment or text at the beginning of the file, having to skip those rows.

Excel data (.xls, .xlsx)

We can skip these rows directly in the load with skip = ... (indicating the number of rows to skip).

deaths <- read_xlsx(path = "./data/deaths.xlsx", skip = 4)
deaths
# A tibble: 14 × 6
   Name          Profession Age   `Has kids` `Date of birth`     `Date of death`
   <chr>         <chr>      <chr> <chr>      <dttm>              <chr>          
 1 David Bowie   musician   69    TRUE       1947-01-08 00:00:00 42379          
 2 Carrie Fisher actor      60    TRUE       1956-10-21 00:00:00 42731          
 3 Chuck Berry   musician   90    TRUE       1926-10-18 00:00:00 42812          
 4 Bill Paxton   actor      61    TRUE       1955-05-17 00:00:00 42791          
 5 Prince        musician   57    TRUE       1958-06-07 00:00:00 42481          
 6 Alan Rickman  actor      69    FALSE      1946-02-21 00:00:00 42383          
 7 Florence Hen… actor      82    TRUE       1934-02-14 00:00:00 42698          
 8 Harper Lee    author     89    FALSE      1926-04-28 00:00:00 42419          
 9 Zsa Zsa Gábor actor      99    TRUE       1917-02-06 00:00:00 42722          
10 George Micha… musician   53    FALSE      1963-06-25 00:00:00 42729          
11 Some          <NA>       <NA>  <NA>       NA                  <NA>           
12 <NA>          also like… <NA>  <NA>       NA                  <NA>           
13 <NA>          <NA>       at t… bottom,    NA                  <NA>           
14 <NA>          <NA>       <NA>  <NA>       NA                  too!           

Excel data (.xls, .xlsx)

In addition with col_names = ... we can already rename the columns in the import (if provide names assumes 1st line already as a data)

deaths <-
  read_xlsx(path = "./data/deaths.xlsx", skip = 5,
            col_names = c("name", "profession", "age", "kids", "birth", "death"))
deaths
# A tibble: 14 × 6
   name               profession           age   kids  birth               death
   <chr>              <chr>                <chr> <chr> <dttm>              <chr>
 1 David Bowie        musician             69    TRUE  1947-01-08 00:00:00 42379
 2 Carrie Fisher      actor                60    TRUE  1956-10-21 00:00:00 42731
 3 Chuck Berry        musician             90    TRUE  1926-10-18 00:00:00 42812
 4 Bill Paxton        actor                61    TRUE  1955-05-17 00:00:00 42791
 5 Prince             musician             57    TRUE  1958-06-07 00:00:00 42481
 6 Alan Rickman       actor                69    FALSE 1946-02-21 00:00:00 42383
 7 Florence Henderson actor                82    TRUE  1934-02-14 00:00:00 42698
 8 Harper Lee         author               89    FALSE 1926-04-28 00:00:00 42419
 9 Zsa Zsa Gábor      actor                99    TRUE  1917-02-06 00:00:00 42722
10 George Michael     musician             53    FALSE 1963-06-25 00:00:00 42729
11 Some               <NA>                 <NA>  <NA>  NA                  <NA> 
12 <NA>               also like to write … <NA>  <NA>  NA                  <NA> 
13 <NA>               <NA>                 at t… bott… NA                  <NA> 
14 <NA>               <NA>                 <NA>  <NA>  NA                  too! 

Excel data (.xls, .xlsx)

Sometimes Excel dates are incorrectly formatted (surprise): we can use convertToDate() from the {openxlsx} package to convert it.

library(openxlsx)
deaths$death <- convertToDate(deaths$death)
deaths
# A tibble: 14 × 6
   name               profession      age   kids  birth               death     
   <chr>              <chr>           <chr> <chr> <dttm>              <date>    
 1 David Bowie        musician        69    TRUE  1947-01-08 00:00:00 2016-01-10
 2 Carrie Fisher      actor           60    TRUE  1956-10-21 00:00:00 2016-12-27
 3 Chuck Berry        musician        90    TRUE  1926-10-18 00:00:00 2017-03-18
 4 Bill Paxton        actor           61    TRUE  1955-05-17 00:00:00 2017-02-25
 5 Prince             musician        57    TRUE  1958-06-07 00:00:00 2016-04-21
 6 Alan Rickman       actor           69    FALSE 1946-02-21 00:00:00 2016-01-14
 7 Florence Henderson actor           82    TRUE  1934-02-14 00:00:00 2016-11-24
 8 Harper Lee         author          89    FALSE 1926-04-28 00:00:00 2016-02-19
 9 Zsa Zsa Gábor      actor           99    TRUE  1917-02-06 00:00:00 2016-12-18
10 George Michael     musician        53    FALSE 1963-06-25 00:00:00 2016-12-25
11 Some               <NA>            <NA>  <NA>  NA                  NA        
12 <NA>               also like to w… <NA>  <NA>  NA                  NA        
13 <NA>               <NA>            at t… bott… NA                  NA        
14 <NA>               <NA>            <NA>  <NA>  NA                  NA        

Excel data (.xls, .xlsx)

We can also load an Excel with several sheets: to indicate the sheet (either by its name or by its number) we will use the argument sheet = ....

mtcars <- read_xlsx(path = "./data/datasets.xlsx", sheet = "mtcars")
mtcars
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

Excel data (.xls, .xlsx)

We can even indicate the range of cells to load with range = ....

iris <- read_xlsx(path = "./data/datasets.xlsx", sheet = "iris", range = "C1:E4")
iris
# A tibble: 3 × 3
  Petal.Length Petal.Width Species
         <dbl>       <dbl> <chr>  
1          1.4         0.2 setosa 
2          1.4         0.2 setosa 
3          1.3         0.2 setosa 

Import from SAS/STATA/SPSS

The {haven} package within the tidyverse orbit will allow us to import files from the 3 most important payment software: SAS, SPSS and Stata.

library(haven)

# SAS
iris_sas <- read_sas(data_file = "./data/iris.sas7bdat")

# SPSS
iris_spss <- read_sav(file = "./data/iris.sav")

# Stata
iris_stata <- read_dta(file = "./data/iris.dta")

Import from SPSS

Export

In the same way that we can import we can also export

  • exported in .RData (recommended option for variables stored in R). Remember that this extension can only be used in R. To do so, just use save(object, file = path).
table <- tibble("a" = 1:4, "b" = 1:4)
save(table, file = "./data/table.RData")
rm(table) # eliminar
load("./data/table.RData")
table
# A tibble: 4 × 2
      a     b
  <int> <int>
1     1     1
2     2     2
3     3     3
4     4     4

Export

In the same way that we can import we can also export

  • exported in .RData multiple objects
table <- tibble("a" = 1:4, "b" = 1:4)
a <- 1
b <- c("javi", "sandra")
save(table, a, b, file = "./data/mult_obj.RData")
rm(list = c("a", "b", "table"))
load("./data/mult_obj.RData")
table
# A tibble: 4 × 2
      a     b
  <int> <int>
1     1     1
2     2     2
3     3     3
4     4     4

Export

In the same way that we can import we can also export

  • exported in .csv. To do this we simply use write_csv(object, file = path).
write_csv(table, file = "./data/table.csv")
read_csv(file = "./data/table.csv")
# A tibble: 4 × 2
      a     b
  <dbl> <dbl>
1     1     1
2     2     2
3     3     3
4     4     4

Import from website

One of the main advantages of R is that we can make use of all the previous functions of import but directly from a web, without the need to perform the manual download: instead of passing it the local path we will indicate the link. For example, we are going to download the covid data from ISCIII (https://cnecovid.isciii.es/covid19/#documentaci%C3%B3n-y-datos)

covid_data <-
  read_csv(file = "https://cnecovid.isciii.es/covid19/resources/casos_hosp_uci_def_sexo_edad_provres.csv")
covid_data
# A tibble: 500 × 8
   provincia_iso sexo  grupo_edad fecha      num_casos num_hosp num_uci num_def
   <chr>         <chr> <chr>      <date>         <dbl>    <dbl>   <dbl>   <dbl>
 1 A             H     0-9        2020-01-01         0        0       0       0
 2 A             H     10-19      2020-01-01         0        0       0       0
 3 A             H     20-29      2020-01-01         0        0       0       0
 4 A             H     30-39      2020-01-01         0        0       0       0
 5 A             H     40-49      2020-01-01         0        0       0       0
 6 A             H     50-59      2020-01-01         0        0       0       0
 7 A             H     60-69      2020-01-01         0        0       0       0
 8 A             H     70-79      2020-01-01         0        0       0       0
 9 A             H     80+        2020-01-01         0        0       0       0
10 A             H     NC         2020-01-01         0        0       0       0
# ℹ 490 more rows

Import from wikipedia

The {rvest} package, one of the most useful of {tidyverse} allows us to import directly from an html. For example, to export wikipedia tables just read_html() to import the html, html_element("table") to extract the table objects, and html_table() to convert the html table to tibble.

library(rvest)
wiki_jump <- 'https://en.wikipedia.org/wiki/Men%27s_long_jump_world_record_progression'
wiki_jump |> read_html() |> 
  html_element("table") |> 
  html_table()
# A tibble: 19 × 5
   Mark                       Wind   Athlete                  Place        Date 
   <chr>                      <chr>  <chr>                    <chr>        <chr>
 1 7.61 m (24 ft 11+1⁄2 in)   ""     Peter O'Connor (IRE)     Dublin, Ire… 5 Au…
 2 7.69 m (25 ft 2+3⁄4 in)    ""     Edward Gourdin (USA)     Cambridge, … 23 J…
 3 7.76 m (25 ft 5+1⁄2 in)    ""     Robert LeGendre (USA)    Paris, Fran… 7 Ju…
 4 7.89 m (25 ft 10+1⁄2 in)   ""     DeHart Hubbard (USA)     Chicago, Un… 13 J…
 5 7.90 m (25 ft 11 in)       ""     Edward Hamm (USA)        Cambridge, … 7 Ju…
 6 7.93 m (26 ft 0 in)        "0.0"  Sylvio Cator (HAI)       Paris, Fran… 9 Se…
 7 7.98 m (26 ft 2 in)        "0.5"  Chuhei Nambu (JPN)       Tokyo, Japan 27 O…
 8 8.13 m (26 ft 8 in)        "1.5"  Jesse Owens (USA)        Ann Arbor, … 25 M…
 9 8.21 m (26 ft 11 in)       "0.0"  Ralph Boston (USA)       Walnut, Uni… 12 A…
10 8.24 m (27 ft 1⁄4 in)      "1.8"  Ralph Boston (USA)       Modesto, Un… 27 M…
11 8.28 m (27 ft 1+3⁄4 in)    "1.2"  Ralph Boston (USA)       Moscow, Sov… 16 J…
12 8.31 m (27 ft 3 in) A      "−0.1" Igor Ter-Ovanesyan (URS) Yerevan, So… 10 J…
13 8.33 m (27 ft 3+3⁄4 in)[2] ""     Phil Shinnick (USA)      Modesto, Un… 25 M…
14 8.31 m (27 ft 3 in)        "0.0"  Ralph Boston (USA)       Kingston, J… 15 A…
15 8.34 m (27 ft 4+1⁄4 in)    "1.0"  Ralph Boston (USA)       Los Angeles… 12 S…
16 8.35 m (27 ft 4+1⁄2 in)[5] "0.0"  Ralph Boston (USA)       Modesto, Un… 29 M…
17 8.35 m (27 ft 4+1⁄2 in) A  "0.0"  Igor Ter-Ovanesyan (URS) Mexico City… 19 O…
18 8.90 m (29 ft 2+1⁄4 in) A  "2.0"  Bob Beamon (USA)         Mexico City… 18 O…
19 8.95 m (29 ft 4+1⁄4 in)    "0.3"  Mike Powell (USA)        Tokyo, Japan 30 A…

Import from google drive

Another option available (especially if we work with other people working) is to import from a Google Drive spreadsheet, making use of read_sheet() from the {googlesheets4} package.

The first time you will be asked for a tidyverse permission to interact with your drive

library(googlesheets4)
google_sheet <-
  read_sheet("https://docs.google.com/spreadsheets/d/1Uz38nHjl3bmftxDpcXj--DYyPo1I39NHVf-xjeg1_wI/edit?usp=sharing")
google_sheet
# A tibble: 3 × 4
  a         b     c d     
  <chr> <dbl> <dbl> <chr> 
1 d1      0.5     1 hombre
2 d2     -0.3     2 hombre
3 d3      1.1     3 mujer 

Import from API (owid)

Another interesting option is the data download from an API: an intermediary between an app or data provider and our R. For example, let’s load the {owidR} library, which allows us to download data from the web https://ourworldindata.org/. For example, the owid_covid() function loads without realizing it more than 400 000 records with more than 60 variables from 238 countries.

library(owidR)
owid_covid() |> as_tibble()
# A tibble: 7 × 67
  iso_code continent location    date       total_cases new_cases
  <chr>    <chr>     <chr>       <IDate>          <int>     <int>
1 AFG      Asia      Afghanistan 2020-01-05           0         0
2 AFG      Asia      Afghanistan 2020-01-06           0         0
3 AFG      Asia      Afghanistan 2020-01-07           0         0
4 AFG      Asia      Afghanistan 2020-01-08           0         0
5 AFG      Asia      Afghanistan 2020-01-09           0         0
6 AFG      Asia      Afghanistan 2020-01-10           0         0
7 AFG      Asia      Afghanistan 2020-01-11           0         0
# ℹ 61 more variables: new_cases_smoothed <dbl>, total_deaths <int>,
#   new_deaths <int>, new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
#   new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
#   total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
#   new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
#   icu_patients <int>, icu_patients_per_million <dbl>, hosp_patients <int>,
#   hosp_patients_per_million <dbl>, weekly_icu_admissions <int>, …

Import from API (aemet)

In many occasions to connect to the API we will first have to register and obtain a key, this is the case of the {climaemet} package to access Spanish meteorological data (https://opendata.aemet.es/centrodedescargas/inicio).

Once we have the API key we register it in our RStudio to be able to use it in the future.

library(climaemet)

# Api key
apikey <- "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJqYXZhbHYwOUB1Y20uZXMiLCJqdGkiOiI4YTU1ODUxMS01MTE3LTQ4MTYtYmM4OS1hYmVkNDhiODBkYzkiLCJpc3MiOiJBRU1FVCIsImlhdCI6MTY2NjQ2OTcxNSwidXNlcklkIjoiOGE1NTg1MTEtNTExNy00ODE2LWJjODktYWJlZDQ4YjgwZGM5Iiwicm9sZSI6IiJ9.HEMR77lZy2ASjmOxJa8ppx2J8Za1IViurMX3p1reVBU"

aemet_api_key(apikey, install = TRUE)

Import from API (aemet)

With this package we can do a search for stations to know both its postal code and its identifier code within the AEMET network

stations <- aemet_stations()
stations
# A tibble: 947 × 7
   indicativo indsinop nombre                 provincia altitud longitud latitud
   <chr>      <chr>    <chr>                  <chr>       <dbl>    <dbl>   <dbl>
 1 B013X      "08304"  ESCORCA, LLUC          ILLES BA…     490     2.89    39.8
 2 B051A      "08316"  SÓLLER, PUERTO         ILLES BA…       5     2.69    39.8
 3 B087X      ""       BANYALBUFAR            ILLES BA…      60     2.51    39.7
 4 B103B      "99103"  ANDRATX - SANT ELM     ILLES BA…      52     2.37    39.6
 5 B158X      ""       CALVIÀ, ES CAPDELLÀ    ILLES BA…      50     2.47    39.6
 6 B228       "08301"  PALMA, PUERTO          ILLES BA…       3     2.63    39.6
 7 B236C      ""       PALMA, UNIVERSIDAD     ILLES BA…      95     2.64    39.6
 8 B248       "08303"  SIERRA DE ALFABIA, BU… ILLES BA…    1030     2.71    39.7
 9 B275E      "08302"  SON BONET, AEROPUERTO  BALEARES       49     2.71    39.6
10 B278       "08306"  PALMA DE MALLORCA, AE… ILLES BA…       8     2.74    39.6
# ℹ 937 more rows

Import from API (aemet)

For example, to get data from the station of the airport of El Prat, Barcelona, the code to provide is "0076", obtaining hourly data

aemet_last_obs("0076")
# A tibble: 12 × 23
   idema   lon fint                 prec   alt  vmax    vv    dv   lat  dmax
   <chr> <dbl> <dttm>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 0076   2.07 2024-10-20 04:00:00     0     4   7.7   5.6   360  41.3   360
 2 0076   2.07 2024-10-20 05:00:00     0     4   7.7   5.8   360  41.3   360
 3 0076   2.07 2024-10-20 06:00:00     0     4   8.2   5.6   360  41.3    10
 4 0076   2.07 2024-10-20 07:00:00     0     4   7.7   4.3    10  41.3   350
 5 0076   2.07 2024-10-20 08:00:00     0     4   6.7   4.6    10  41.3   350
 6 0076   2.07 2024-10-20 09:00:00     0     4   6.7   4      10  41.3   360
 7 0076   2.07 2024-10-20 10:00:00     0     4   8.2   4      10  41.3    10
 8 0076   2.07 2024-10-20 11:00:00     0     4   6.7   1.8   110  41.3    30
 9 0076   2.07 2024-10-20 12:00:00     0     4   5.1   2.2   210  41.3   150
10 0076   2.07 2024-10-20 13:00:00     0     4   4.1   2.2   200  41.3   190
11 0076   2.07 2024-10-20 14:00:00     0     4   4.1   2.7   220  41.3   230
12 0076   2.07 2024-10-20 15:00:00     0     4   4.1   2.2   190  41.3   190
# ℹ 13 more variables: ubi <chr>, pres <dbl>, hr <dbl>, stdvv <dbl>, ts <dbl>,
#   pres_nmar <dbl>, tamin <dbl>, ta <dbl>, tamax <dbl>, tpr <dbl>, vis <dbl>,
#   stddv <dbl>, inso <dbl>

Import from API (US census)

Import from API

Other options:

💻 Your turn

Try to solve the following exercises without looking at the solutions

📝 The who dataset we have used in previous exercises, export it to a native R format in the data folder of the project

Código
library(tidyr)
save(who, file = "./data/who.RData")

📝 Loads the who dataset but from the data folder (import the file created in the previous exercise)

Código
load("./data/who.RData")

📝 Repeats the same (export and import) in 4 formats: .csv, .xlsx, .sav (spss) and .dta (stata)

Código
# csv
library(readr)
write_csv(who, file = "./data/who.csv")
who_data <- read_csv(file = "./data/who.csv")

# excel
library(openxlsx)
write.xlsx(who, file = "./data/who.xlsx")
who_data <- read_xlsx(path = "./data/who.xlsx")

# sas y stata
library(haven)
write_sav(who, path = "./data/who.sav")
who_data <- read_spss(path = "./data/who.sav")

write_dta(who, path = "./data/who.dta")
who_data <- read_dta(path = "./data/who.dta")

📝 Repeat the loading of who.csv but only select the first 4 columns already in the load.

Código
who_select <-
  read_csv(file = "./data/who.csv",
           col_select = c("country", "iso2", "iso3", "year"))

L8: factors and lists

Handling with factors and lists

Lists

We have already seen that lists are an object in R that allows us to store collections of variables of different type (as with data.frame and tibble) but also different lengths, with totally heterogeneous structures (even a list can have inside it another list).

name <- "Javi"
age <- 34
marks <- c(7, 8, 5, 3, 10, 9)
parents <- c("Paloma", "Goyo")

list_var <- list("name" = name, "age" = age, "marks" = marks, "parents" = parents)
list_var$name
[1] "Javi"
list_var$marks
[1]  7  8  5  3 10  9

Lists

We can also make lists with other lists inside, so that to access each level we must use the [[]] operator.

list_of_lists <- list("list_1" = list_var[1:2], "list_2" = list_var[3:4])
names(list_of_lists)
[1] "list_1" "list_2"
names(list_of_lists[[1]])
[1] "name" "age" 
list_of_lists[[1]][[1]]
[1] "Javi"

We are allowed to store n-dimensional data!

Lists

One of the disadvantages is that a list cannot be vectorized immediately, so any arithmetic operation applied to a list will give error.

data <- list("a" = 1:5, "b" = 10:20)
data / 2
Error in data/2: non-numeric argument to binary operator

For this purpose, one of the common (but outdated) options is to make use of the lapply() family.

lapply(data, FUN = function(x) { x / 2})
$a
[1] 0.5 1.0 1.5 2.0 2.5

$b
 [1]  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0

By default, the output of lapply()is always a list of equal length.

Lists

A more flexible and versatile option is to make use of the {purrr} package of the {tidyverse} environment.

library(purrr)

This package is intended to mimic the functional programming of other languages such as Scala or Hadoop’s map-reduce strategy (from Google).

Lists

The simplest function of the {purrr} package is the map() function, which applies a vectorized function to each of the elements of a list. Let’s see a first example applied to vectors

map() allows us to “map” each list and apply the function element by element (if applicable).

x <- list("x1" = 1:4, "x2" = 11:20)
map(x, sqrt) 
$x1
[1] 1.000000 1.414214 1.732051 2.000000

$x2
 [1] 3.316625 3.464102 3.605551 3.741657 3.872983 4.000000 4.123106 4.242641
 [9] 4.358899 4.472136

Be careful

With vectors we have a default vectorization because R performs element-by-element operations. Note that, by default, the output of map is a list.

Lists

Let’s look at another example. Define in a list two samples from 2 normal distributions, of different sample size and different mean. Compute the mean of each one.

Código
x <- list(rnorm(n = 1500, mean = 0, sd = 0.7),
          rnorm(n = 2800, mean = 2, sd = 1.5))
map(x, mean)
[[1]]
[1] 0.01416045

[[2]]
[1] 1.988931

Lists

What if we want to calculate the mean of their squared values?

Código
map(x, function(x) { mean(x^2) })
[[1]]
[1] 0.5157736

[[2]]
[1] 6.213856

Lists

In addition to being more readable and efficient, with {purrr} we can decide the output format after the operation

  • output as numeric (double) vector with map_dbl()
  • output as numeric (int) vector with map_int()
  • output as character vector with map_chr()
  • output as logical vector with map_lgl()
map_dbl(x, mean)
[1] 0.01416045 1.98893058
map_chr(x, function(x) { glue("Mean is {round(mean(x), 5)}") })
[1] "Mean is 0.01416" "Mean is 1.98893"

Lists

c(x[[1]][3], x[[2]][3])
[1] -1.1821836 -0.9706831

Also, if you pass it a number instead of a function, it will return the ith element of each list.

map_dbl(x, 3)
[1] -1.1821836 -0.9706831

Lists

We also have the option of generalizing it to be able to use functions that need two arguments in the form of a list (binary operations), with map2()

x <- list("a" = 1:3, "b" = 4:6)
y <- list("c" = c(-1, 4, 0), "b" = c(5, -4, -1))
map2(x, y, function(x, y) { x^2 + y^2})
$a
[1]  2 20  9

$b
[1] 41 41 37

Lists

We can obtain the output in the form of data.frame by adding list_rbind() or list_cbind(), which converts a list into a table.

x <- c("a", "b", "c")
y <- 1:3
map2(x, y, function(x, y) { tibble(x, y) }) |> list_rbind()
# A tibble: 3 × 2
  x         y
  <chr> <int>
1 a         1
2 b         2
3 c         3

Lists

We can generalize it further with pmap_xxx() which allows us to use multiple arguments (multiple lists).

x <- list(1, 1, 1)
y <- list(10, 20, 30)
z <- list(100, 200, 300)
pmap_dbl(list(x, y, z), sum)
[1] 111 221 331

Lists

We have other types of iterators that, although they assume inputs, do not return anything, as walk() (just one input argument), walk2() (two arguments) and pwalk() (multiple arguments), all invisibly return, just call a function for its side effects rather than its return value.

list("a" = 1:3, "b" = 4:6) |>
  map2(list("a" = 11:13, "b" = 14:16),
       function(x, y) { x + y }) |> 
  walk(print)
[1] 12 14 16
[1] 18 20 22

💻 It’s your turn

📝 Define a list of 4 elements of different types and access the second of them (I will include one that is a tibble so that you can see that in a list there is room for everything).

Código
list_example <-
  list("name" = "Javier", "cp" = 28019,
       "siblings" = TRUE,
       "marks" = tibble("maths" = c(7.5, 8, 9),
                        "lang" = c(10, 5, 6)))
list_example

📝 From the list above, access the elements that occupy places 1 and 4 of the list defined above.

Código
list_example[c(1, 4)]

list_example$name
list_example$marks

list_example[c("name", "marks")]

📝 Load the starwars dataset from the {dplyr} package and access the second movie that appears in starwars$films (for each character). Determine which ones do not appear in more than one movie.

second_film <- map(starwars$films, 2)
map_lgl(second_film, is.null)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
[13] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
[25]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE
[37]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
[49]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
[61]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
[73]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
[85]  TRUE  TRUE  TRUE

Quali: factors

In statistics, when we talk about qualitative variables, we will call levels or modalities the different values that these data can take. For example, in the case of the sex variable of the starwars set, we have 4 allowed levels: female, hermaphroditic, male and none (in addition to missing data).

starwars |> count(sex)
# A tibble: 5 × 2
  sex                n
  <chr>          <int>
1 female            16
2 hermaphroditic     1
3 male              60
4 none               6
5 <NA>               4

Quali: factors

These kinds of variables are known in R as factors, and the fundamental package to deal with them is {forcats} (from the {tidyverse} environment).

Quali: factors

This package allows us to set the levels (stored internally as levels) that a given categorical variable takes so that no mistakes, errors in data collection and generation can be generated. It also makes their analysis less computationally expensive when doing searches and comparisons, giving them a different treatment than normal text strings.

Let’s see a simple example simulating a party variable taking the values "PP", "PSOE" and "SUMAR" (of size 15)

set.seed(1234567)
party <- sample(x = c("PP", "PSOE", "SUMAR"), size = 15, replace = TRUE)
party
 [1] "PP"    "PSOE"  "SUMAR" "PP"    "PP"    "SUMAR" "PSOE"  "SUMAR" "SUMAR"
[10] "SUMAR" "PP"    "PSOE"  "PP"    "PSOE"  "SUMAR"

The party variable is currently of type text, of type chr, something we can check with class(state).

class(party)
[1] "character"

Quali: factors

From a statistical and computational point of view, for R this variable right now would be equivalent to a named variable. But statistically a variable as a string is not the same as a categorical variable that can only take those 3 levels. How to convert to factor?

By making use of the as_factor() function from the {forcats} package.

library(tidyverse)
party_fct <- tibble("id" = 1:length(party),
                    "party" = as_factor(party))
party_fct
# A tibble: 15 × 2
      id party
   <int> <fct>
 1     1 PP   
 2     2 PSOE 
 3     3 SUMAR
 4     4 PP   
 5     5 PP   
 6     6 SUMAR
 7     7 PSOE 
 8     8 SUMAR
 9     9 SUMAR
10    10 SUMAR
11    11 PP   
12    12 PSOE 
13    13 PP   
14    14 PSOE 
15    15 SUMAR

Quali: factors

Not only the class of the variable has changed, but now, below the saved value, the sentence Levels: ... appears: these are the modalities or levels of our qualitative.

party_fct |> pull(party)
 [1] PP    PSOE  SUMAR PP    PP    SUMAR PSOE  SUMAR SUMAR SUMAR PP    PSOE 
[13] PP    PSOE  SUMAR
Levels: PP PSOE SUMAR

Quali: factors

Imagine that we are defining the database of deputies of the Congress and that the deputies of the PP did not attend that day to the plenary session: although our variable does not take that value THAT DAY, the state PSOE is a allowed level in the database (so even if we eliminate it, because it is a factor, the level remains, we do not have it now but it is an allowed level).

party_fct |> 
  filter(party %in% c("PP", "SUMAR")) |> 
  pull(party)
 [1] PP    SUMAR PP    PP    SUMAR SUMAR SUMAR SUMAR PP    PP    SUMAR
Levels: PP PSOE SUMAR

Quali: factors

With factor() function we can explicitly specify the names of the modalities and using levels = ... we can explicitly tell it the “order” of the modalities

party_fct <-
  tibble(id = 1:length(party),
         party = factor(party, levels = c("SUMAR", "PP", "PSOE")))
party_fct |> pull(party)
 [1] PP    PSOE  SUMAR PP    PP    SUMAR PSOE  SUMAR SUMAR SUMAR PP    PSOE 
[13] PP    PSOE  SUMAR
Levels: SUMAR PP PSOE

Quali: factors

The previous “order” is just in the sense which it will be counted/plotted first) but we don’t have (yet) an ordinal variable

party_fct$party < "SUMAR"
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

What if we want to define a qualitative variable ORDINAL? Inside factor() we must indicate that ordered = TRUE.

marks <- c("A", "E", "F", "B", "A+", "A", "C", "C", "D", "B", "A", "C", "C", "E", "F", "D", "A+")
marks_database <-
  tibble("student" = 1:length(marks),
         "marks" =
           factor(marks, levels = c("F", "E", "D", "C", "B", "A", "A+"),
                  ordered = TRUE))
marks_database |> pull(marks)
 [1] A  E  F  B  A+ A  C  C  D  B  A  C  C  E  F  D  A+
Levels: F < E < D < C < B < A < A+

Quali: factors

What changes? If you notice now, although the variable is still qualitative, we can make comparisons and order the records because there is a hierarchy between the modalities.

marks_database |> filter(marks >= "B")
# A tibble: 7 × 2
  student marks
    <int> <ord>
1       1 A    
2       4 B    
3       5 A+   
4       6 A    
5      10 B    
6      11 A    
7      17 A+   

Quali: factors

If we want to tell it to drop an unused level at that moment (and that we want to exclude from the definition) we can do it with fct_drop().

marks_database |> 
  filter(marks %in% c("F", "E", "D", "C", "B", "A")) |> 
  pull(marks)
 [1] A E F B A C C D B A C C E F D
Levels: F < E < D < C < B < A < A+

marks_database |> 
  filter(marks %in% c("F", "E", "D", "C", "B", "A")) |> 
  mutate(marks = fct_drop(marks)) |>  
  pull(marks)
 [1] A E F B A C C D B A C C E F D
Levels: F < E < D < C < B < A

Quali: factors

Just as we can delete levels we can expand existing levels (even if there is no data for that level at that time) with fct_expand().

marks_database |> 
  mutate(marks = fct_expand(marks, c("F-", "A+", "A++"))) %>% 
  pull(marks)
 [1] A  E  F  B  A+ A  C  C  D  B  A  C  C  E  F  D  A+
Levels: F < E < D < C < B < A < A+ < F- < A++

Quali: factors

In addition with fct_explicit_na() we can assign a level to the missing values to be included in the analysis and visualizations.

fct_explicit_na(factor(c("a", "b", NA)))
[1] a         b         (Missing)
Levels: a b (Missing)

Quali: factors

Even once defined we can reorder the levels with fct_relevel().

marks_database_expand <- 
  marks_database |>  
  mutate(marks = fct_expand(marks, c("F-", "A+", "A++"))) |> 
  pull(marks)

marks_database_expand |> 
  fct_relevel(c("F-", "F", "E", "D", "C", "B", "A", "A+", "A++"))
 [1] A  E  F  B  A+ A  C  C  D  B  A  C  C  E  F  D  A+
Levels: F- < F < E < D < C < B < A < A+ < A++

Quali: factors

This way of working with qualitative variables allows us to give a theoretical definition of our database, and we can even count values that do not yet exist (but could), making use of fct_count().

marks_database |> 
  mutate(marks = fct_expand(marks, c("F-", "A+", "A++"))) |> 
  pull(marks) |> 
  fct_count()
# A tibble: 9 × 2
  f         n
  <ord> <int>
1 F         2
2 E         2
3 D         2
4 C         4
5 B         2
6 A         3
7 A+        2
8 F-        0
9 A++       0

Quali: factors

The levels can also be sorted by frequency with fct_infreq().

marks_database |> 
  mutate(marks = fct_infreq(marks)) |> 
  pull(marks)
 [1] A  E  F  B  A+ A  C  C  D  B  A  C  C  E  F  D  A+
Levels: C < A < F < E < D < B < A+
marks_database |> 
  mutate(marks = fct_infreq(marks)) |> 
  pull(marks) |> 
  fct_count()
# A tibble: 7 × 2
  f         n
  <ord> <int>
1 C         4
2 A         3
3 F         2
4 E         2
5 D         2
6 B         2
7 A+        2

Quali: factors

Sometimes we will want to group levels, for example, not allowing levels that do not happen a minimum number of times with fct_lump_min(..., min = ..) (observations that do not meet this will go to a generic level called Other, although this can be changed with the other_level argument).

marks_database |> 
  pull(marks) %>% 
  fct_lump_min(min = 3)
 [1] A     Other Other Other Other A     C     C     Other Other A     C    
[13] C     Other Other Other Other
Levels: C < A < Other
marks_database |> 
  pull(marks) |>
  fct_lump_min(min = 3,
               other_level = "Less frequent")
 [1] A             Less frequent Less frequent Less frequent Less frequent
 [6] A             C             C             Less frequent Less frequent
[11] A             C             C             Less frequent Less frequent
[16] Less frequent Less frequent
Levels: C < A < Less frequent

Quali: factors

We can do something equivalent but based on its relative frequency with fct_lump_prop().

marks_database |> 
  pull(marks) |> 
  fct_lump_prop(prop = 0.15,
                other_level = "less frequent")
 [1] A             less frequent less frequent less frequent less frequent
 [6] A             C             C             less frequent less frequent
[11] A             C             C             less frequent less frequent
[16] less frequent less frequent
Levels: C < A < less frequent

Quali: factors

We can apply this to our datasets to recategorize variables very quickly.

starwars |>  
  drop_na(species) |> 
  mutate(species =
           fct_lump_min(species, min = 3,
                        other_level = "Others")) |>  
  count(species)
# A tibble: 4 × 2
  species     n
  <fct>   <int>
1 Droid       6
2 Gungan      3
3 Human      35
4 Others     39

Quali: factors

With fct_reorder() we can also indicate that we want to order the factors according to a function applied to another variable.

starwars_factor <- 
  starwars |>  
  drop_na(height, species) |> 
  mutate(species =
           fct_lump_min(species, min = 3,
                        other_level = "Others"))
starwars_factor |>  pull(species)
 [1] Human  Droid  Droid  Human  Human  Human  Human  Droid  Human  Human 
[11] Human  Human  Others Human  Others Others Human  Others Human  Human 
[21] Droid  Others Human  Human  Others Human  Others Others Human  Others
[31] Human  Human  Gungan Gungan Gungan Human  Others Others Human  Human 
[41] Others Others Others Others Others Others Others Human  Others Others
[51] Others Others Others Others Others Others Human  Others Others Others
[61] Human  Human  Human  Human  Others Others Others Others Human  Droid 
[71] Others Others Others Others Others Human  Others
Levels: Droid Gungan Human Others
starwars_factor |> 
  mutate(species = fct_reorder(species, height, mean)) |> 
  pull(species)
 [1] Human  Droid  Droid  Human  Human  Human  Human  Droid  Human  Human 
[11] Human  Human  Others Human  Others Others Human  Others Human  Human 
[21] Droid  Others Human  Human  Others Human  Others Others Human  Others
[31] Human  Human  Gungan Gungan Gungan Human  Others Others Human  Human 
[41] Others Others Others Others Others Others Others Human  Others Others
[51] Others Others Others Others Others Others Human  Others Others Others
[61] Human  Human  Human  Human  Others Others Others Others Human  Droid 
[71] Others Others Others Others Others Human  Others
Levels: Droid Others Human Gungan

💻 It’s your turn

📝 Given the variable months defined below (defined as a character vector), convert this variable to factor (just that)

months <- c("Jan", "Feb", "Mar", "Apr")
Código
months <- c("Jan", "Feb", "Mar", "Apr")
months_fct <- as_factor(months)
months_fct

📝 Given the variable months defined below converts this variable to a factor but indicating the levels correctly.

months <- c(NA, "Apr", "Jan", "Oct", "Jul", "Jan", "Sep", NA, "Feb", "Dic",
           "Jul", "Mar", "Jan", "Mar", "Feb", "Apr", "May", "Oct", "Sep",  NA,
           "Dic", "Jul", "Nov", "Feb", "Oct", "Jun", "Sep", "Oct", "Oct", "Sep")
Código
months_fct <-
  factor(months,
         levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dic"))
months_fct

📝 Count how many values there are for each month but keep in mind that they are factors (maybe there are unused levels and you should get a 0 for them).

Código
meses_fct |>  fct_count()

📝 Since there are missing values, it indicates that the absentee is a thirteenth level labeled "missing".

Código
months_fct <- 
  months_fct |> 
  fct_explicit_na(na_level = "missing")
months_fct

📝 Removes unused levels.

Código
months_fct <- 
  months_fct %>%
  fct_drop()
months_fct

📝 Sort the levels by frequency of occurrence.

Código
months_fct |> 
  fct_infreq()

📝 Group levels so that any level that does not appear at least 7% of the time is grouped in a level called "other months".

Código
months_fct <-
  months_fct |> 
  fct_lump_prop(prop = 0.07, other_level = "other months")
months_fct 

L9: dbplyr and github

dbplyr: SQL connection

We are going to introduce the {dbplyr} package, which will allow us to combine SQL code into R to perform database queries

install.packages("dbplyr")

 

If you are using R to do data analysis, most of the data you need probably already lives in a database. We will learn how to load data in to a local database.

dbplyr: SQL connection

install.packages("RSQLite")

We will also need to install a DBI backend package (an interface that allows {dplyr} to work with many different databases using the same code). Five commonly used backends are:

  • RMariaDB connects to MySQL and MariaDB

  • RPostgres connects to Postgres and Redshift.

  • RSQLite embeds a SQLite database.

  • odbc connects to many commercial databases via the open database connectivity protocol.

  • bigrquery connects to Google’s BigQuery.

In the following, we are going to use the RSQLite backend (automatically installed when you install {dbplyr}).

dbplyr: SQL connection

If we want to work with a database adopting dplyr grammar, we must first connect to it. The connection will be done by using DBI::dbConnect(). The arguments to DBI::dbConnect() vary from database to database, but the first argument is always the database backend. SQLite only needs a path to the database.

connection <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
connection
<SQLiteConnection>
  Path: :memory:
  Extensions: TRUE

Here, the string ":memory:" is a special path that creates a temporary in-memory database.

dbplyr: SQL connection

The temporary database previously created has no yet data in it. We will copy an existing database (for example, gapminder from {gapminder} package) using copy_to(). Using tbl() we “obtain” the database (called as named in copy_to()).

The main difference is that not it is a remote source in a SQLite database.

copy_to(connection, gapminder::gapminder, overwrite = TRUE, name = "gapminder_database")
gapminder_db <- tbl(connection, "gapminder_database")
gapminder_db
# Source:   table<`gapminder_database`> [?? x 6]
# Database: sqlite 3.43.2 [:memory:]
   country     continent  year lifeExp      pop gdpPercap
   <chr>       <chr>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ more rows

dbplyr: SQL connection

The main goal of {dbplyr} is to automatically generate SQL from dplyr grammar that we are already familiar with.

gapminder_db |>
  select(country, year, lifeExp)
# Source:   SQL [?? x 3]
# Database: sqlite 3.43.2 [:memory:]
   country      year lifeExp
   <chr>       <int>   <dbl>
 1 Afghanistan  1952    28.8
 2 Afghanistan  1957    30.3
 3 Afghanistan  1962    32.0
 4 Afghanistan  1967    34.0
 5 Afghanistan  1972    36.1
 6 Afghanistan  1977    38.4
 7 Afghanistan  1982    39.9
 8 Afghanistan  1987    40.8
 9 Afghanistan  1992    41.7
10 Afghanistan  1997    41.8
# ℹ more rows

dbplyr: SQL connection

The most important difference between ordinary tibbles and remote database queries is that now our R code is translated into SQL and executed in the database on the remote server, not in R on your local machine: it never pulls data into R unless you explicitly ask for it, delaying doing any work until the last possible moment (sends it to the database in one step).

It is not until we ask for the data (writing gapminder_query in console, for example), dplyr verbs generates the SQL and requests the results from the database.

gapminder_query <- 
  gapminder_db |>
  select(country, year, lifeExp)
gapminder_query
# Source:   SQL [?? x 3]
# Database: sqlite 3.43.2 [:memory:]
   country      year lifeExp
   <chr>       <int>   <dbl>
 1 Afghanistan  1952    28.8
 2 Afghanistan  1957    30.3
 3 Afghanistan  1962    32.0
 4 Afghanistan  1967    34.0
 5 Afghanistan  1972    36.1
 6 Afghanistan  1977    38.4
 7 Afghanistan  1982    39.9
 8 Afghanistan  1987    40.8
 9 Afghanistan  1992    41.7
10 Afghanistan  1997    41.8
# ℹ more rows

dbplyr: SQL connection

What is happening behind the scenes?

{dplyr} is translating our R code into SQL verbs, that we can check (without doing) by using show_query()

gapminder_query |> 
  show_query()
<SQL>
SELECT `country`, `year`, `lifeExp`
FROM `gapminder_database`

dbplyr: SQL connection

As commented, our remote database queries is not in R on your local machine: it never pulls data into R unless you explicitly ask for it. To do it, you can iterate a few times before you figure out what data you need, and then we use collect().

gapminder_tibble <-
  gapminder_query |> 
  collect()
gapminder_tibble
# A tibble: 1,704 × 3
   country      year lifeExp
   <chr>       <int>   <dbl>
 1 Afghanistan  1952    28.8
 2 Afghanistan  1957    30.3
 3 Afghanistan  1962    32.0
 4 Afghanistan  1967    34.0
 5 Afghanistan  1972    36.1
 6 Afghanistan  1977    38.4
 7 Afghanistan  1982    39.9
 8 Afghanistan  1987    40.8
 9 Afghanistan  1992    41.7
10 Afghanistan  1997    41.8
# ℹ 1,694 more rows

dbplyr: SQL connection

You can also use translate_sql() to translate individual expressions from dplyr to SQL

dbplyr::translate_sql((x + y) / 2, con = connection)
<SQL> (`x` + `y`) / 2.0