SSPS4102Data Analytics in the Social Sciences

class: center, middle, inverse, title-slide

.title[
# SSPS4102</br>Data Analytics in the Social Sciences
]
.subtitle[
## Week 04</br>Data collection and manipulation
]
.author[
### Francesco Bailo
]
.institute[
### The University of Sydney
]
.date[
### Semester 1, 2023 (updated: 2023-03-15)
]

---

background-image: url(https://upload.wikimedia.org/wikipedia/en/6/6a/Logo_of_the_University_of_Sydney.svg)
background-size: 95%

---

## Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The  University of Sydney is located on the land of the Gadigal people  of the Eora Nation. I pay my respects to their Elders, past and present.

---
background-image: url(https://www.antiprism.com/misc/tri_jit_rd01.gif)
background-size: 100%

# Data collection and tranformation

---
class: inverse, center, middle

# Data collection

---

## Data collection (social sciences)

.pull-left[

In *research design*, **data collection** is the process of *creating* your research data. In the social sciences there are a number of methods to create interesting data sets. They can be of two types:

1. **Observational**: In an observational study, we don't influence in anyway the behaviour of the subjects of our study - who should not adapt their behavior because of or in response to our investigation.

2. **Experimental**: In an experiment we design and control the environment where our subjects will operate.

]

.pull-right[

.center[<img src = 'https://1.bp.blogspot.com/-Pt0cQYzFA1A/Xshlv9SviGI/AAAAAAAAAmY/zEMnvw3fq_spdjYmuJ6f8t4q9q20k1kbQCLcBGAsYHQ/w490/pasteur6.gif'></img>]

<mark>Try to classify a number of data collection instruments in **Menti**, either as Observational or Experimental...</mark>

]

---

## Data collection (social sciences)

Indeed we might need to add a couple of extra types...

3. **Asking questions**: It is an type of study where we record responses and reactions to questions or other input we design through instruments of our design (online questionnaire, unstructured interview, focus group). The subject is clearly aware of the research setting and the research design should limit the impact of this awareness. 
  
4. **Quasi-experimental**: A quasi-experimental study use data recorded in a non-experimental setting. Still, these data will approximate experimental data either because of

1. particular events or 
    
    2. through "statistically adjusting non-experimental data in an attempt to account for preexisting differences between those who did and did not receive the treatment"<sup>1</sup>, (for example with a technique called "matching"). 
  
.footnote[[1] Salganik, M. J. (2018). Bit by bit: Social research in the digital age. Princeton University Press.]

---

# Data collection (social sciences)

Independently of the *subject* of your data (so the WHOs your data are about), these data sets, can be

1. created directly by you (the researcher) - to which we refer to as **primary** or **first-party data**;

2. created by some other research organisation such as the ABS (**secondary** or **second-party data**); or;

3. commonly nowadays created using different second-party sources by some *data aggregating organisation* (**third-party data**).

---

## Data collection (data analytics)

In data analytics, data collection generally has a slighlty different meaning. *Data collection* refers here to the process of bringing the data into a computing environment and ready that data for the analysis.

### Data collection in research design

<div id="htmlwidget-852d26a0ba215a940c75" style="width:100%;height:100px;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-852d26a0ba215a940c75">{"x":{"diagram":"\n  digraph graph2 {\n  \n  graph [layout = dot, rankdir = LR]\n  \n  # node definitions with substituted label text\n  node [shape = oval, style = filled]\n  a [label = \"Data collection\", fillcolor = Orange]\n  b [label = \"Data\", shape = egg, fillcolor = Lightblue]\n  c [label = \"Data analysis\"]\n  d [label = \"Results\"]\n\n  \n  a -> b -> c -> d;\n  }\n  ","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

### Data collection in data analytics

<div id="htmlwidget-cf9f3879a9603ee761cf" style="width:100%;height:100px;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-cf9f3879a9603ee761cf">{"x":{"diagram":"\n  digraph graph2 {\n  \n  graph [layout = dot, rankdir = LR]\n  \n  # node definitions with substituted label text\n  node [shape = oval, style = filled]\n  a [label = \"Data\", shape = egg, fillcolor = Lightblue]\n  b [label = \"Data collection\", fillcolor = Orange]\n  c [label = \"Computer\", shape = box]\n  d [label = \"Data analysis\"]\n  e [label = \"Results\"]\n\n  \n  a -> b -> c -> d -> e;\n  }\n  ","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

---

## Data collection (data analytics)

.pull-left[

For data scientists, the problem is not creating the data. Data science is about existing data - not about creating new data through rigorous scientific methods.

This doesn't mean you should not care about how the data was created in the first place. In fact, as every rigourour research you should!

But, in a data science context, the scope of the *data collection* is different from the scope of *data collection* in a social science research project.

]

.pull.right[

.center[<img src = 'https://media.giphy.com/media/KzKFAvaM1RBoRU5dcl/giphy.gif'></img>]

]

---

## So when data scientists talk about data collection, what do they talk about?

1. **Sourcing**: Where is the data store? How big is the data? How is structured? How can I access it (e.g. direct download, database, API)?

.center[↓]

2. **Reading in**: How do I import the data I need into my computer? You can integrally read in the data you need but sometimes you can pre-filter the data at the source with a query (e.g. with a SQL query if you are requesting data from a database or a POST query if you are requesting data from an API...)

.center[↓]

3. **Cleaning/Preparing**: How do I prepare the data I imported for my analysis? This is about structuring the data, exactly how I need it? <mark>(This is what we'll discuss during the lab today...)</mark>.

---

## Sourcing and Reading in

Note: Big data for computer scientists `$\neq$` Big data for social scientists.

When data is "small" (let's say, it can fit in a spreadsheet or can be shared via email), sourcing and reading in, is generally not a problem. For example, you can simply read your file into memory by pointing your software to that file while specifying the format:

```r
dat <- read.csv(file = "my_file.csv")
```

But when data is "big" (let's say it won't fit into your memory), you need to deal with sourcing and reading issues.

Example of big data:

1. The text of all revisions of all pages of en.wikipedia.org;

2. All tweets with the hashtag "#metoo" published since October 2017.

Simply put, you can't `read(source = "http://en.wikipedia.org")`.

---

## Sourcing and reading in

A detail discussion about sourcing and reading data is out of the scope of this unit. Still, as example, here an example of a common "pipeline" to collect social media data.

<div id="htmlwidget-d0378cd5cd619eeb007d" style="width:100%;height:100px;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-d0378cd5cd619eeb007d">{"x":{"diagram":"\n  digraph graph2 {\n  \n  graph [layout = dot, rankdir = LR]\n\n  node [shape = oval, style = filled]\n  a [label = \"Query\\nPOST /twitter/api\\n{ query : metoo, since : 20171001 }\", \n  shape = box, fillcolor = Lightblue]\n  b [label = \"API\", fillcolor = Orange, shape = egg]\n  c [label = \"Data\\njson format\"]\n  \n  d [label = \"Read json\", shape = box, fillcolor = Lightblue]\n  \n  e [label = \"Computer\", shape = box]\n\n  \n  a -> b -> c; d -> c; c -> d; d -> e;\n  }\n  ","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

If instead of an API, you have a database you should expect a very similar pipeline.

</br>

.pull-left[
R offers a large number of tools and packages to deal about all your data sourcing and data reading issues. Yet, we will not cover them in this unit.

But let's just quickly, talk about APIs...
]

.pull-right[
.center[<img src = 'https://media.giphy.com/media/y69OrzhaWfE7frw6OI/giphy.gif' width = '55%'></img>]
]

---

## What's an API again?

.center[

]

---
class: inverse, center, middle

# Lab: Data transformation with dplyr

.center[<img src = 'https://media.giphy.com/media/R9zXHWAHyTjnq/giphy.gif'></img>]

---

## Prerequisites<sup>1</sup>

We are going to use these two packages:

```r
library(nycflights13)
library(tidyverse)
```

Install them with

```r
install.packages(c("nycflights13", "tidyverse"))
```

nycflights13 includes is a data packages (and includes the data we will use in the examples).

As mentioned last week, tidyverse is not really a package but a collection of packages...

.footnote[[1] Adapted from https://r4ds.had.co.nz/transform.html (Wickham & Grolemund, 2017)]

---

## R's Tidyverse

.center[<img src = 'https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-1-getting-started/img/tidyverse-packages.png' width = "95%"></img>]

Image from [Mine Çetinkaya-Rundel](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-1-getting-started/)

---

## nycflights13::flights

```r
nycflights13::flights # (or `flights` for short)
```

```
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
```

---

## nycflights13::flights

The object `nycflights13` is not of class `data.frame` but **tibble**. We have already encountered the tibble last week. For our purposes, let's consider a tibble as a data frame (which as a matter of fact is), so a two-dimensional data objects with values ordered in rows and columns.

In a data frame, each column can be variable of a different class (or type). We have already met a few of these types, here among the most important:

* `int` stands for **integers** (e.g. `c(1, 2, 3, 4)`).

* `dbl` stands for **doubles**, or real numbers:  (e.g. `c(1.5, 2.323132, 3.5445, 4.565)`).

* `chr` stands for **character** vectors, or strings.

* `lgl` stands for **logical**, vectors that contain only TRUE or FALSE.

* `fctr` stands for **factors**, which R uses to represent *categorical variables* with fixed possible values.

---

## dplyr basics

dplyr is a package for manipulating data frame like objects.

Today we will use it to learn to

> * Pick observations by their values with `filter()`.
>
> * Reorder the rows with `arrange()`.
>
> * Pick variables by their names  with `select()`.
>
> * Create new variables with functions of existing variables  with `mutate()`.
>
> * Collapse many values down to a single summary  with `summarise()` in combination with `group_by()` (which define the variables to use for the summary).

---

## dplyr basics

> All verbs (functions) work similarly:
>
> * The first argument is a data frame.
>
> * The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
>
> * The result is a new data frame.

So in R...

```r
verb(dataframe, arguments, ...)
```

or with `filter()`

```r
dplyr::filter(flights, air_time == 100)
```

---

## Filter rows with filter()

You can filter rows (i.e. records) of your data frame with the function `filter()`.

> The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with:

```r
dplyr::filter(flights, month == 1, day == 1)
```

```
## # A tibble: 842 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 832 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
```

---

## Filter rows with filter()

Remember that R won't automagically store the resulting, new data frame everywhere. If you don't specify it, your outcome will be lost, as tears in the rain.

.center[
<img src ='https://media.giphy.com/media/7aFDGPdxGLuEg/giphy.gif'></img>
]

To save your results, do

```r
jan1 <- dplyr::filter(flights, month == 1, day == 1)
```

---

## Comparisons

> To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).

> When you’re starting out with R, the easiest mistake to make is to use = instead of == when testing for equality. When this happens you’ll get an informative error.

```r
dplyr::filter(flights, month = 1)
```

---

## Logical operators

You can combine multiple logical comparison in your statement with the following operators:

* `&` is “and”,

* `|` is “or”, and

* `!` is “not”.

The code

```r
dplyr::filter(flights, month == 11 & year == 2013)
```

will filter records where month is `11` but also where year is `2013`. We will come back to logical operators next week.

---

## Select columns with select()

If you want to simplify your data frame by keeping only some of the columns (this is useful if your data frame contains many columns).

```r
dplyr::select(flights, year, month, day)
```

```
## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

## Create new variables with mutate()

The function `mutate()` will add a new column at the end of your data frame (on the right...). Of course, it is not enough to only name the new column. You also need to specify the content of the new variable.

The most common scenario is to use `mutate()` to transform an existing variable or to operate on two (or more) variables to create a new one. For example,

```r
dplyr::mutate(flights,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
```

creates two new columns `gain` and `speed`. For example, `gain` is the result of the subtraction of the value `arr_delay` from the value `dep_delay` (again, for each record). So,

.center[
`$gain_i = dep\_delay_i - arr\_delay_i$`
]

where `$i$` is one of the number of rows.

---

## Create new variables with mutate()

```r
dplyr::mutate(flights,
* gain = dep_delay - arr_delay,
* speed = distance / air_time * 60)
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> dep_delay </th>
   <th style="text-align:right;"> arr_delay </th>
   <th style="text-align:right;"> gain </th>
   <th style="text-align:right;"> distance </th>
   <th style="text-align:right;"> air_time </th>
   <th style="text-align:right;"> speed </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> 2 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 11 </td>
   <td style="text-align:right;background-color: orange !important;"> -9 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 1400 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 227 </td>
   <td style="text-align:right;background-color: orange !important;"> 370.0441 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> 4 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 20 </td>
   <td style="text-align:right;background-color: orange !important;"> -16 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 1416 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 227 </td>
   <td style="text-align:right;background-color: orange !important;"> 374.2731 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> 2 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 33 </td>
   <td style="text-align:right;background-color: orange !important;"> -31 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 1089 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 160 </td>
   <td style="text-align:right;background-color: orange !important;"> 408.3750 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;font-weight: bold;color: white !important;background-color: #D7261E !important;"> -1 </td>
   <td style="text-align:right;background-color: lightblue !important;font-weight: bold;color: white !important;background-color: #D7261E !important;"> -18 </td>
   <td style="text-align:right;background-color: orange !important;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 17 </td>
   <td style="text-align:right;background-color: lightblue !important;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1576 </td>
   <td style="text-align:right;background-color: lightblue !important;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 183 </td>
   <td style="text-align:right;background-color: orange !important;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 516.7213 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> -6 </td>
   <td style="text-align:right;background-color: lightblue !important;"> -25 </td>
   <td style="text-align:right;background-color: orange !important;"> 19 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 762 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 116 </td>
   <td style="text-align:right;background-color: orange !important;"> 394.1379 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> -4 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 12 </td>
   <td style="text-align:right;background-color: orange !important;"> -16 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 719 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 150 </td>
   <td style="text-align:right;background-color: orange !important;"> 287.6000 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> -5 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 19 </td>
   <td style="text-align:right;background-color: orange !important;"> -24 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 1065 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 158 </td>
   <td style="text-align:right;background-color: orange !important;"> 404.4304 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> -3 </td>
   <td style="text-align:right;background-color: lightblue !important;"> -14 </td>
   <td style="text-align:right;background-color: orange !important;"> 11 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 229 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 53 </td>
   <td style="text-align:right;background-color: orange !important;"> 259.2453 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> -3 </td>
   <td style="text-align:right;background-color: lightblue !important;"> -8 </td>
   <td style="text-align:right;background-color: orange !important;"> 5 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 944 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 140 </td>
   <td style="text-align:right;background-color: orange !important;"> 404.5714 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> -2 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 8 </td>
   <td style="text-align:right;background-color: orange !important;"> -10 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 733 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 138 </td>
   <td style="text-align:right;background-color: orange !important;"> 318.6957 </td>
  </tr>
</tbody>
</table>

---

## Grouped summaries with summarise()

The verb `summarise()` "collapses a data frame to a single row".

```r
dplyr::summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
```

```
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6
```

Great, it doesn't look that `dplyr::summarise()` does much than simply `mean(flights$dep_delay, na.rm = TRUE)`...

Indeed! `dplyr::summarise()` is only useful after `group_by()`!

.footnote[Note: We already introduced the function `mean()`, which as expected returned the mean of a vector of numbers. Here `mean()` takes an additional argument `na.rm = TRUE`. It simply tells the function to disregard all the missing values (`NA`) while computing the mean. By default, `mean` will do `na.rm = FALSE)` - and with this setting, the presence of a single `NA` in your vector will force `mean()` to return `NA`. ]

---

## Grouped summaries with summarise() and group_by()

> [`group_by()`] changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by `year, month and day`, we get the average delay per date:

```r
by_day <- dplyr::group_by(flights, 
*                         year, month, day)
```

And then...

.pull-left[

```r
summarise(
  by_day, 
  delay = mean(dep_delay, 
               na.rm = TRUE))
```
]

.pull-right[

```
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # … with 355 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

]

---

## Grouped summaries with summarise() and group_by()

Let's take another example:

```r
us_population_by_county <- 
  read.csv("https://raw.githubusercontent.com/fraba/SSPS4102/master/data/US-pop-by-county.csv")
nrow(us_population_by_county)
```

```
## [1] 3142
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> county </th>
   <th style="text-align:left;"> state </th>
   <th style="text-align:right;"> population </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Moore County </td>
   <td style="text-align:left;"> Texas </td>
   <td style="text-align:right;"> 21904 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Lawrence County </td>
   <td style="text-align:left;"> Ohio </td>
   <td style="text-align:right;"> 62450 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Otero County </td>
   <td style="text-align:left;"> Colorado </td>
   <td style="text-align:right;"> 18831 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Allen Parish </td>
   <td style="text-align:left;"> Louisiana </td>
   <td style="text-align:right;"> 25764 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Johnston County </td>
   <td style="text-align:left;"> North Carolina </td>
   <td style="text-align:right;"> 168878 </td>
  </tr>
</tbody>
</table>

---

## Grouped summaries with summarise() and group_by()

```r
us_population_by_state <- 
  dplyr::group_by(us_population_by_county, 
*          state)
```

and then,

```r
dplyr::summarise(us_population_by_state, 
*                sum(...))
```

What we replace `...` with to get the `sum()` of the population for each **state**?

---

## Grouped summaries of factors with count() and group_by()

Let's say that you have a non-numeric variable (character or factor). How can you summarise it? You can use `count()` on your grouped data frame.

```r
us_population_by_state <- 
  dplyr::group_by(us_population_by_county, state)

dplyr::count(us_population_by_state)
```

```
## # A tibble: 51 × 2
## # Groups:   state [51]
##    state                       n
##    <chr>                   <int>
##  1 " Alabama"                 67
##  2 " Alaska"                  29
##  3 " Arizona"                 15
##  4 " Arkansas"                75
##  5 " California"              58
##  6 " Colorado"                64
##  7 " Connecticut"              8
##  8 " Delaware"                 3
##  9 " District of Columbia"     1
## 10 " Florida"                 67
## # … with 41 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

## The pipe (%>%) operator

But now let's introduce the pipe operator...

.center[<img src = 'https://magrittr.tidyverse.org/logo.png' width = '10%'></img>]

The pipe operator, which you type with a (per cent sign) `%` + `>` + `%`, *chains* function together so that the value returned from the function *on the left* of the `%>%` flows as first argument into the function *on the right* of the `%>%`.

<div id="htmlwidget-4595a6ddf4cb3299f323" style="width:100%;height:100px;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-4595a6ddf4cb3299f323">{"x":{"diagram":"\n  digraph graph2 {\n  \n  graph [layout = dot, rankdir = LR]\n  \n  # node definitions with substituted label text\n  node [shape = box, style = filled]\n  a [label = \"data\"]\n  b [label = \"%>%\", shape = oval, fillcolor = Lightblue]\n  c [label = \"fun()\", fillcolor = Orange]\n  d [label = \"%>%\", shape = oval, fillcolor = Lightblue]\n  e [label = \"fun()\", fillcolor = Orange]\n  f [label = \"%>%\", shape = oval, fillcolor = Lightblue]\n  g [label = \"fun()\", fillcolor = Orange]\n\n  \n  a -> b -> c -> d -> e -> f -> g;\n  }\n  ","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

And in R...

```r
flights %>% filter(air_time < 100) %>% mutate(new = 1)
```

---

## The pipe (%>%) operator

The pipe operator doesn't add new functions to your toolbox. But it drastically improves and simplifies how you can write your code.

Also, with `%>%` you don't need to store your results in temporary/intermediary variables. Instead, you define your transformations and you store the results from the entire pipeline of transforming functions.

```r
my_new_dataframe <- 
* flights %>% # Source of data
* filter(air_time < 100) %>% # First transformation
* mutate(new = 1) # Second transformation
```

Note that you can improve the readability of your code using a new line for each command separated chained with a `%>%`. In this code, after three steps, the results will be stored in the object `my_new_dataframe`.

Also, note that `filter(air_time < 100)` and the other verbs do not specify a data frame (which must be the first argument of `filter()` and `mutate()`) because the data is piped in, top-down, with `%>%`.

---

## Piping all together with %>%: Example 1

```r
us_population_by_county %>%
  dplyr::group_by(state) %>%
  dplyr::summarise(tot_pop = sum(population / 1000000),
                   mean_pop = mean(population / 1000000))
```

```
## # A tibble: 51 × 3
##    state                   tot_pop mean_pop
##    <chr>                     <dbl>    <dbl>
##  1 " Alabama"                4.78    0.0713
##  2 " Alaska"                 0.710   0.0245
##  3 " Arizona"                6.39    0.426 
##  4 " Arkansas"               2.92    0.0389
##  5 " California"            37.3     0.642 
##  6 " Colorado"               5.03    0.0786
##  7 " Connecticut"            3.57    0.447 
##  8 " Delaware"               0.898   0.299 
##  9 " District of Columbia"   0.602   0.602 
## 10 " Florida"               18.8     0.281 
## # … with 41 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

What is `mean_pop`?

---

## Piping all together with %>%: Example 2

```r
us_population_by_county %>%
  dplyr::group_by(state) %>%
  dplyr::summarise(tot_pop = sum(population / 1000000),
                   mean_pop = mean(population / 1000000)) %>%
  ggplot(aes(x = state, y = tot_pop)) +
  geom_col() + 
  theme(axis.text.x = element_text(angle = 45))
```

.footnote[`theme(axis.text.x = element_text(angle = 45))` rotate the text of the axis X label by 45 degrees]

---

## Finally, the rowwise operator!

Sometimes a function will need to operate *row-wise*. In other words, you can't pass it two vectors - as it can only work with one!

For example, the function `mean()` expect a vector of numbers: `mean( c(1, 5, 10) )`. But what if I want to compute, for each record, the mean of values in 2+ columns and store it in a new column?

For example, the data frame `flights` has a variable for `dep_delay` and a variable for `arr_delay`. What if I want for each flight the mean delay? So for each row `$i$`:

`$$\frac{dep\_delay_i+arr\_delay_i}{2}$$`
---

## Finally, the rowwise operator!

```r
flights %>% 
* dplyr::rowwise() %>%
  dplyr::mutate(mean_delay = mean(c(dep_delay, arr_delay)))
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> dep_delay </th>
   <th style="text-align:right;"> arr_delay </th>
   <th style="text-align:right;"> mean_delay </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> 2 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 11 </td>
   <td style="text-align:right;background-color: orange !important;"> 6.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> 4 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 20 </td>
   <td style="text-align:right;background-color: orange !important;"> 12.0 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> 2 </td>
   <td style="text-align:right;background-color: lightblue !important;"> 33 </td>
   <td style="text-align:right;background-color: orange !important;"> 17.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> -1 </td>
   <td style="text-align:right;background-color: lightblue !important;"> -18 </td>
   <td style="text-align:right;background-color: orange !important;"> -9.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: lightblue !important;"> -6 </td>
   <td style="text-align:right;background-color: lightblue !important;"> -25 </td>
   <td style="text-align:right;background-color: orange !important;"> -15.5 </td>
  </tr>
</tbody>
</table>