class: center, middle, inverse, title-slide

.title[
# SSPS4102<br>Data Analytics in the Social Sciences
]
.subtitle[
## Week 03<br>Data visualisation
]
.author[
### Francesco Bailo
]
.institute[
### The University of Sydney
]
.date[
### Semester 1, 2023 (updated: 2023-03-14)
]

---
background-image: url(https://upload.wikimedia.org/wikipedia/en/6/6a/Logo_of_the_University_of_Sydney.svg)
background-size: 95%

---
## Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

---
background-image: url(https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9780133041187/files/graphics/fmfig02.jpg)
background-size: 100%

# Data visualisation

---
## The five qualities of great visualisations

1. It is **truthful**, as it’s based on thorough and honest research.
2. It is **functional**, as it constitutes an accurate depiction of the data, and it’s built in a way that lets people do meaningful operations based on it (seeing change in time).
3. It is **beautiful**, in the sense of being attractive, intriguing, and even aesthetically pleasing for its intended audience (scientists, in the first place, but the general public, too).
4. It is **insightful**, as it reveals evidence that we would have a hard time seeing otherwise.
5. It is **enlightening** because if we grasp and accept the evidence it depicts, it will change our minds for the better.

Cairo, A. (2016). 2. The five qualities of great visualizations. In *The Truthful Art: Data, Charts, and Maps for Communication*. Pearson Education.

---
## What Makes Bad Figures Bad

### 1. Bad taste

.center[<img src = 'https://socviz.co/assets/ch-01-chartjunk-life-expectancy.png' width = '70%'></img>]

Healy, K. (2018). *Data visualization: A practical introduction*. Princeton University Press.
---
## What Makes Bad Figures Bad

### 2. Bad data

.center[<img src = 'https://socviz.co/assets/ch-01-democracy-nyt-version.png' width = '80%'></img>]

Healy, K. (2018). *Data visualization: A practical introduction*. Princeton University Press.

---
## What Makes Bad Figures Bad

### 3. Bad perception

.center[<img src = 'https://socviz.co/dataviz-pdfl_files/figure-html4/ch-01-preception-data-1.png'></img>]

Healy, K. (2018). *Data visualization: A practical introduction*. Princeton University Press.

---
## Basic perceptual tasks for nine chart types

.center[<img src ='https://socviz.co/assets/ch-01-channels-for-cont-data-vertical.png' width = "15%"></img>]

> Channels for mapping unordered categorical data, arranged top-to-bottom from more to less effective, after Munzner (2014, 102).

Source: Healy, K. (2018). *Data visualization: A practical introduction*. Princeton University Press.

---
## Why do we visualise data?

#### 1. We visualise data because we want to understand it as part of our *analysis* process.

.pull-left[
<img src="week-03_files/figure-html/unnamed-chunk-1-1.svg" width="90%" style="display: block; margin: auto;" />
]

.pull-right[
Consider these data. These four combinations of variables return the same correlation and slope. And yet the four datasets look totally different: something you might miss if you don't visualise your data before jumping into your analysis.
]

---
## Why do we visualise data?

#### 2. We visualise data to (1) communicate to (2) some audience (3) selected insights from our data analysis exercise

.center[<img src = 'https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9780133041187/files/graphics/02fig13_newyork.jpg' width = '85%'></img>]

---
class: inverse, center, middle

# Lab

---
## The gapminder package

To replicate the code in the next slide, you need to install and load the *gapminder* package.
.pull-left[
Do it now with

```r
install.packages("gapminder")
library(gapminder)
```
]

.pull-right[
<img src = 'https://media.giphy.com/media/3o6Ygfw40tlnPhX87m/giphy.gif'></img>
]

---
## Preliminary information (1/2)

Q: **What is a tibble?**

A: Tibbles *are* data frames.<sup>1</sup>

```r
gapminder
```

```
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows
```

.footnote[
[1] If you want to know more: https://r4ds.had.co.nz/tibbles.html
]

---
## Preliminary information (2/2)

Q: **What is tidy data?**

A: Simply, *tidy* data is data that is in a single rectangular data object of class `data.frame` (or `tibble`), where

.pull-left[
1. Every **column** is a variable.
2. Every **row** is an observation.
3. Every **cell** is a single value.
]

.pull-right[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> country </th>
   <th style="text-align:left;"> continent </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:right;"> lifeExp </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:right;"> 1952 </td>
   <td style="text-align:right;"> 28.801 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:right;"> 1957 </td>
   <td style="text-align:right;"> 30.332 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:right;"> 1962 </td>
   <td style="text-align:right;"> 31.997 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:right;"> 1967 </td>
   <td style="text-align:right;"> 34.020 </td>
  </tr>
</tbody>
</table>
]

</br>

Remember that ggplot2 expects *tidy* data: if your data is not tidy, you will need to reshape it before plotting! For more information, see https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

---
## Visualisations in R

The two most popular ways to make data visualisations in R are

1. To use the **base** package.
2. To use the **ggplot2** package, which is part of the *Tidyverse*, "a collection of R packages designed for data science" (see https://www.tidyverse.org/).

As you will soon realise, the textbook doesn't use ggplot2 but instead produces all the visualisations using the base package.

---
### Visualisations in R: base vs ggplot2

With the base package ...

```r
plot(x = gapminder$gdpPercap, y = gapminder$lifeExp)
```

...
and with ggplot2

```r
library(ggplot2) # load ggplot2 once per session, before calling ggplot()
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```

---
### Visualisations in R: base vs ggplot2

With these results:

.pull-left[
<div class="figure" style="text-align: center">
<img src="week-03_files/figure-html/unnamed-chunk-7-1.svg" alt="base" width="80%" />
<p class="caption">base</p>
</div>
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="week-03_files/figure-html/unnamed-chunk-8-1.svg" alt="ggplot2" width="80%" />
<p class="caption">ggplot2</p>
</div>
]

---
class: inverse, center, middle

# ggplot2: the basics

---
## Why use ggplot2?

1. Although ggplot2 is more *expensive* in terms of lines of code (see previous slides), it is also more *effective* and *intuitive* for making *complex* visualisations.
2. By learning ggplot2, you learn a basic visualisation grammar that is used by a large number of third-party packages (ggplot2 is the *de facto* visualisation standard for R).
3. ggplot2 figures look much better even with the default settings (so without spending time fine-tuning them).
4. The documentation is great.

.center[<img src = 'https://ggplot2.tidyverse.org/logo.png' width = '30%'></img>]

---
## The basics of ggplot2

ggplot2 builds your visualisation by *mapping*

* **variables** onto
* **visual elements** or **aesthetics** (e.g. lines, dots, colours, shapes, areas, labels, etc.) with the function `aes()`

```r
p <- ggplot(data = <data>,
            mapping = aes(<aesthetic> = <variable>,
                          <aesthetic> = <variable>,
                          < ... > = < ... >))
# Don't run. Credit: Healy, 2018
```

---
## The basics of ggplot2

This creates a ggplot object, with all the instructions to map *variables* to *aesthetics*. But it won't visualise anything yet...
```r
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
class(p)
```

```
## [1] "gg"     "ggplot"
```

<img src="week-03_files/figure-html/unnamed-chunk-11-1.svg" width="30%" style="display: block; margin: auto;" />

---
## The basics of ggplot2

After we have specified `data = <data>` and `mapping = aes(...)`, we need to add at least one additional *layer* specifying the *geometry* we want, so that ggplot2 can visualise the mapping.

```r
p + geom_point()
```

<img src="week-03_files/figure-html/unnamed-chunk-12-1.svg" width="45%" style="display: block; margin: auto;" />

---
## Layers

ggplots are constructed by progressively adding new layers with new specifics about your plot. The minimum number required is two: `ggplot(...)` + `geom_<type>(...)`, with all the other required layers being set by default.

.pull-left[
```r
*ggplot(...) +
*  geom_<type>(...) +
   scale_<mapping>_<type>(...) +
   coord_<type>(...) +
   labs(...) +
   facet_grid(...)
# Don't run. Credit: Healy, 2018
```
]

.pull-right[<img src = 'https://media.giphy.com/media/XMgCFjsCSARxK/giphy.gif'></img>]

---
## Essential ggplot2 decisions

### Layer 0 (the base, invisible layer)

1. Tell the `ggplot()` function what your data is with the `data = <data>` argument.
2. Tell the `ggplot()` function the *relationships* you want to visualise with the `mapping` argument and the `aes()` function (`mapping = aes()`).

### Layer 1 (visible)

3. Specify the geometry you want to see with `geom_<type>` (e.g. `geom_point()`).

### Layer 2+ (visible)

4. Use additional functions (e.g. `scale_y_continuous()`) to specify everything else 😊

---
class: inverse, center, middle

# Aesthetic mapping

---
## Aesthetic mapping

Consider this,

* Data => variables => *values*
* Aesthetics => properties => *levels*

Values and levels need to be of the same type (continuous vs categorical), or you will get an error or a warning.

And remember,

> An aesthetic is a visual property of the objects in your plot.
> You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. (Wickham & Grolemund, 2017)

.center[<img src = 'https://media.giphy.com/media/xT9IgpL2wyBi1tCWYg/giphy-downsized-large.gif'></img>]

---
## Aesthetic mapping

```r
ggplot(data = gapminder,
       mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point()
```

<img src="week-03_files/figure-html/unnamed-chunk-14-1.svg" width="60%" style="display: block; margin: auto;" />

What is going on here? What are the aesthetic *properties* (or *arguments*) that we have set? What do they do?

---
### Aesthetic mapping

There are many different aesthetic properties you can set, and whether you need to specify them depends on the type of geometry you will use. So let's set up our layer 0 and store it in `g`...

```r
g <- ggplot(data = gapminder)
```

Among the most common aesthetic properties are **`x`** and **`y`**, which position each observation in a 2D space. These are usually the only properties that are strictly required.
```r
g + geom_point(mapping = aes(x = gdpPercap, y = lifeExp))
```

<img src="week-03_files/figure-html/unnamed-chunk-16-1.svg" width="30%" style="display: block; margin: auto;" />

---
### Aesthetic mapping

#### size

```r
g + geom_point(mapping = aes(x = gdpPercap, y = lifeExp,
*                            size = lifeExp))
```

<img src="week-03_files/figure-html/unnamed-chunk-17-1.svg" width="60%" style="display: block; margin: auto;" />

---
### Aesthetic mapping

#### shape

```r
g + geom_point(mapping = aes(x = gdpPercap, y = lifeExp,
*                            shape = continent))
```

<img src="week-03_files/figure-html/unnamed-chunk-18-1.svg" width="60%" style="display: block; margin: auto;" />

---
### Aesthetic mapping

#### colour

```r
g + geom_point(mapping = aes(x = gdpPercap, y = lifeExp,
*                            colour = continent))
```

<img src="week-03_files/figure-html/unnamed-chunk-19-1.svg" width="60%" style="display: block; margin: auto;" />

---
### Aesthetic mapping

#### label

```r
g +
*  geom_text(mapping = aes(x = gdpPercap, y = lifeExp,
*                          label = continent))
```

<img src="week-03_files/figure-html/unnamed-chunk-20-1.svg" width="60%" style="display: block; margin: auto;" />

---
### Aesthetic mapping vs aesthetic setting

#### label

```r
g + geom_text(mapping = aes(x = gdpPercap, y = lifeExp),
*             label = "My new label")
```

<img src="week-03_files/figure-html/unnamed-chunk-21-1.svg" width="50%" style="display: block; margin: auto;" />

What happened here?

---
### Aesthetic mapping vs aesthetic setting

#### alpha (transparency, in range 0 to 1)

```r
g + geom_point(mapping = aes(x = gdpPercap, y = lifeExp),
*              alpha = 0.2)
```

<img src="week-03_files/figure-html/unnamed-chunk-22-1.svg" width="60%" style="display: block; margin: auto;" />

---
### Aesthetic mapping (advanced topics!)

Aesthetics can be mapped at the level of the individual geom layer, and data can be supplied there too... This can be confusing, but it actually gives you a lot of flexibility...

```r
ggplot() + # Note, this is empty!
  geom_point(data = gapminder,
             mapping = aes(x = gdpPercap, y = lifeExp),
             alpha = .2)
```

<img src="week-03_files/figure-html/unnamed-chunk-23-1.svg" width="50%" style="display: block; margin: auto;" />

---
### Aesthetic mapping (advanced topics!)

And then...

```r
ggplot() +
  geom_point(data = gapminder,
             mapping = aes(x = gdpPercap, y = lifeExp),
             alpha = .2) +
*  geom_point(data = gapminder[gapminder$country == "Australia",],
*             mapping = aes(x = gdpPercap,
*                           y = lifeExp),
*             colour = "orange", size = 4)
```

<img src="week-03_files/figure-html/unnamed-chunk-24-1.svg" width="40%" style="display: block; margin: auto;" />

---
## How data and aesthetics are defined for layers

Data (`data = <data>`) and aesthetics (`mapping = aes(<aesthetics>)`) are defined top-down. If a layer is not given its own data or mapping, it inherits them from the `ggplot()` call.

```r
ggplot(data = gapminder,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_<type>() +
  geom_<type>()
```

Let's try this out...

---
class: inverse, center, middle

# Geometries

---
### What geometries can I use?

A lot! There are about 50 geometries you can use in ggplot2. A complete list is [here](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf).

### Which one should I use?

It depends on the variables you plan to visualise...

#### Consider these scenarios

1. ONE VARIABLE continuous
2. ONE VARIABLE discrete
3. TWO VARIABLES continuous x, continuous y
4. TWO VARIABLES discrete x, continuous y
5. THREE VARIABLES

---
## ONE VARIABLE continuous

```r
g <- ggplot(data = gapminder, aes(x = gdpPercap))
```

.pull-left[
### Histogram

```r
g + geom_histogram()
```

<img src="week-03_files/figure-html/unnamed-chunk-27-1.svg" width="80%" style="display: block; margin: auto;" />
]

.pull-right[
### Density

```r
g + geom_density()
```

<img src="week-03_files/figure-html/unnamed-chunk-28-1.svg" width="80%" style="display: block; margin: auto;" />
]

---
## ONE VARIABLE discrete

```r
g <- ggplot(data = gapminder, aes(x = continent))
```

### Bars

```r
g + geom_bar()
```

<img src="week-03_files/figure-html/unnamed-chunk-30-1.svg" width="40%" style="display: block; margin: auto;" />

---
## TWO VARIABLES continuous x, continuous y

```r
g <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))
```

.pull-left[
### Point

```r
g + geom_point()
```

<img src="week-03_files/figure-html/unnamed-chunk-32-1.svg" width="80%" style="display: block; margin: auto;" />
]

.pull-right[
### Text (or label)

```r
g + geom_text(aes(label = continent)) # try also geom_label
```

<img src="week-03_files/figure-html/unnamed-chunk-33-1.svg" width="80%" style="display: block; margin: auto;" />
]

---
## TWO VARIABLES continuous x, continuous y (cont. function)

```r
g <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))
```

### Smooth

```r
*g + geom_point() + geom_smooth(method = "lm") # Try method = 'loess'
```

<img src="week-03_files/figure-html/unnamed-chunk-35-1.svg" width="30%" style="display: block; margin: auto;" />

---
## TWO VARIABLES continuous x, continuous y

```r
g <- ggplot(data = gapminder, aes(x = year, y = lifeExp))
```

### Line

```r
*g + geom_line(aes(group = country))
```

<img src="week-03_files/figure-html/unnamed-chunk-37-1.svg" width="30%" style="display: block; margin: auto;" />

---
## What's the "group" aesthetic again?

Let's try reproducing the same `geom_line()` as before, *without* setting `group` within `aes()`.
```r
g + geom_line()
```

<img src="week-03_files/figure-html/unnamed-chunk-38-1.svg" width="30%" style="display: block; margin: auto;" />

Clearly, not good! `geom_line()` doesn't know that the observations are grouped by country, so it connects all of them, ordered by year, into one single line!

---
## What's the "group" aesthetic again?

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> country </th>
   <th style="text-align:left;"> continent </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:right;"> lifeExp </th>
   <th style="text-align:right;"> pop </th>
   <th style="text-align:right;"> gdpPercap </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:right;"> 1982 </td>
   <td style="text-align:right;"> 39.854 </td>
   <td style="text-align:right;"> 12881816 </td>
   <td style="text-align:right;"> 978.0114 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Albania </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 1982 </td>
   <td style="text-align:right;"> 70.420 </td>
   <td style="text-align:right;"> 2780097 </td>
   <td style="text-align:right;"> 3630.8807 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Algeria </td>
   <td style="text-align:left;"> Africa </td>
   <td style="text-align:right;"> 1982 </td>
   <td style="text-align:right;"> 61.368 </td>
   <td style="text-align:right;"> 20033753 </td>
   <td style="text-align:right;"> 5745.1602 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Angola </td>
   <td style="text-align:left;"> Africa </td>
   <td style="text-align:right;"> 1982 </td>
   <td style="text-align:right;"> 39.942 </td>
   <td style="text-align:right;"> 7016384 </td>
   <td style="text-align:right;"> 2756.9537 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Argentina </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 1982 </td>
   <td style="text-align:right;"> 69.942 </td>
   <td style="text-align:right;"> 29341374 </td>
   <td style="text-align:right;">
8997.8974 </td>
  </tr>
</tbody>
</table>

By setting `aes(group = country)` we specify that each country needs its own line.

---
## TWO VARIABLES discrete x, continuous y

```r
g <- ggplot(data = gapminder, aes(x = continent))
```

.pull-left[
### Columns

```r
g + geom_col(aes(y = pop))
```

<img src="week-03_files/figure-html/unnamed-chunk-41-1.svg" width="80%" style="display: block; margin: auto;" />
]

.pull-right[
### Boxplot

```r
g + geom_boxplot(aes(y = lifeExp))
```

<img src="week-03_files/figure-html/unnamed-chunk-42-1.svg" width="80%" style="display: block; margin: auto;" />
]

---
## THREE VARIABLES

```r
g <- ggplot(data = gapminder, aes(x = year, y = continent))
```

```r
g + geom_tile(aes(fill = lifeExp)) +
  scale_fill_viridis_c()
```

<img src="week-03_files/figure-html/unnamed-chunk-44-1.svg" width="90%" style="display: block; margin: auto;" />

What is going on here?

---
class: inverse, center, middle

# Position adjustment

---
## Position scales

The `scale_<mapping>_<type>()` functions allow you to adjust the scales of your plot. Basically, they let you add specifics about how your data/aesthetic mapping should behave. Common scenarios are ...
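---
## Position scales

### Formatting axis title, breaks and labels

One more scenario, offered as a minimal sketch rather than as part of the original examples: the arguments of a `scale_<mapping>_<type>()` function can set the axis title, where the breaks fall and how they are printed. The title text and break values below are arbitrary choices made for illustration, and `scales::dollar` assumes the *scales* package is available (it is normally installed together with ggplot2).

```r
library(ggplot2)
library(gapminder)

# Illustrative values only: the axis title and the breaks are assumptions,
# not taken from the original slides.
p_scaled <- ggplot(data = gapminder,
                   mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.2) +
  scale_x_log10(name = "GDP per capita (log scale)", # axis title
                breaks = c(1000, 10000, 100000),     # where to place the breaks
                labels = scales::dollar)             # how to print the break labels

p_scaled
```

The function name follows the pattern: `scale` + the aesthetic being adjusted (`x`) + the type of scale (`log10`).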
---
## Position scales

### Setting the colour palette for your mapping

```r
ggplot(data = gapminder, aes(x = continent, fill = continent)) +
  geom_col(aes(y = pop)) +
*  scale_fill_brewer(palette = "Set1")
```

<img src="week-03_files/figure-html/unnamed-chunk-45-1.svg" width="40%" style="display: block; margin: auto;" />

---
## Position scales

### Setting the limits of your scale to zoom in on a range of the y (vertical) axis

```r
ggplot(data = gapminder,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
*  scale_y_continuous(limits = c(50, 60))
```

<img src="week-03_files/figure-html/unnamed-chunk-46-1.svg" width="40%" style="display: block; margin: auto;" />

---
## Position scales

### Setting the colour palette for your mapping

```r
ggplot(data = gapminder,
       mapping = aes(x = gdpPercap, y = lifeExp, colour = lifeExp)) +
  geom_point() +
*  scale_colour_viridis_c()
```

<img src="week-03_files/figure-html/unnamed-chunk-47-1.svg" width="40%" style="display: block; margin: auto;" />

---
## Position scales

### Using a scale transformation

```r
ggplot(data = gapminder,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
*  scale_x_log10()
```

<img src="week-03_files/figure-html/unnamed-chunk-48-1.svg" width="40%" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Plot labels

---
To set the labels of your plot's elements - *title*, *axes*, *legends* - you add a `labs()` layer.
It is pretty self-explanatory:

```r
ggplot(data = gapminder,
       mapping = aes(x = gdpPercap, y = lifeExp, colour = lifeExp)) +
  geom_point() +
  scale_colour_viridis_c() +
  labs(x = "My x-axis label", y = "My y-axis label",
       colour = "My colour-legend title,",
       title = "my plot's title...",
       subtitle = "my subtitle",
       caption = "...and my caption")
```

<img src="week-03_files/figure-html/unnamed-chunk-49-1.svg" width="75%" style="display: block; margin: auto;" />

---
class: inverse, center, middle

## Facets

---
What if you want to make your visualisation easier to understand by having multiple panels mapping the same variables but for different groups?

You can do it with ggplot2's **facets**. Remember the messy plot with a lot of spaghetti lines?

```r
g <- ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country))
```

Let's make it better, with `facet_grid()`:

```r
*g + facet_grid(~continent)
```

<img src="week-03_files/figure-html/unnamed-chunk-51-1.svg" width="100%" style="display: block; margin: auto;" />

What happened? Try `facet_grid("continent")` instead...

---
### The main elements of ggplot’s grammar of graphics

.pull-left[
.center[<img src = 'https://socviz.co/assets/ch-03-ggplot-flow-vertical.png' width = '40%'></img>]]

.pull-right[Check out the website containing an almost complete version of Kieran Healy's 🔥 book 🔥.

.center[
<img src = 'https://socviz.co/assets/dv-cover-pupress.jpg' width = '65%'></img>
</br>
https://socviz.co/
]]
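
---
### Putting the elements together

As a recap, the elements introduced in these slides (data, aesthetic mapping, geometries, scales, labels and facets) can be combined in a single plot. The chunk below is only a sketch of one possible combination: the object name `p_all`, the label text, the `alpha` value and the choice of a linear smoother are illustrative assumptions, not part of the original examples.

```r
library(ggplot2)
library(gapminder)

p_all <- ggplot(data = gapminder,                # data
                mapping = aes(x = gdpPercap,     # aesthetic mapping
                              y = lifeExp,
                              colour = continent)) +
  geom_point(alpha = 0.3) +                      # geometry; alpha is set, not mapped
  geom_smooth(method = "lm", formula = y ~ x,
              colour = "black") +                # a second geometry drawn on top
  scale_x_log10() +                              # scale transformation
  labs(x = "GDP per capita (log scale)",         # labels (illustrative text)
       y = "Life expectancy",
       colour = "Continent") +
  facet_grid(~continent)                         # one panel per continent

p_all
```

Both geometries inherit `data` and `mapping` from the `ggplot()` call, while `alpha` and the black line colour are *set* for a single layer rather than mapped to a variable.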