
class: center, middle, inverse, title-slide

.title[
# SSPS4102<br>Data Analytics in the Social Sciences
]
.subtitle[
## Week 11<br>predict()
]
.author[
### Francesco Bailo
]
.institute[
### The University of Sydney
]
.date[
### Semester 1, 2023 (updated: 2023-05-10)
]

---

background-image: url(https://upload.wikimedia.org/wikipedia/en/6/6a/Logo_of_the_University_of_Sydney.svg)
background-size: 95%

---

## Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

---
class: inverse, center, middle

# predict()

---

## Example: US midterm elections<sup>1</sup>

| Variable name | Description |
| ------------- | ----------- |
| `year` | midterm election year |
| `president` | name of president |
| `party` | Democrat or Republican |
| `approval` | Gallup approval rating at midterms |
| `seat.change` | change in the number of House seats for the president's party |
| `rdi.change` | change in real disposable income over the year before |

.footnote[[1] Slides from http://www.mattblackwell.org/files/teaching/gov50/regression-ii.pdf]

---

```r
midterms <- read.csv("../data/midterms.csv")
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> year </th>
   <th style="text-align:left;"> president </th>
   <th style="text-align:left;"> party </th>
   <th style="text-align:right;"> approval </th>
   <th style="text-align:right;"> seat.change </th>
   <th style="text-align:right;"> rdi.change </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1946 </td>
   <td style="text-align:left;"> Truman </td>
   <td style="text-align:left;"> D </td>
   <td style="text-align:right;"> 33 </td>
   <td style="text-align:right;"> -55 </td>
   <td style="text-align:right;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1950 </td>
   <td style="text-align:left;"> Truman </td>
   <td style="text-align:left;"> D </td>
   <td style="text-align:right;"> 39 </td>
   <td style="text-align:right;"> -29 </td>
   <td style="text-align:right;"> 8.2 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1954 </td>
   <td style="text-align:left;"> Eisenhower </td>
   <td style="text-align:left;"> R </td>
   <td style="text-align:right;"> 61 </td>
   <td style="text-align:right;"> -4 </td>
   <td style="text-align:right;"> 1.0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1958 </td>
   <td style="text-align:left;"> Eisenhower </td>
   <td style="text-align:left;"> R </td>
   <td style="text-align:right;"> 57 </td>
   <td style="text-align:right;"> -47 </td>
   <td style="text-align:right;"> 1.1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1962 </td>
   <td style="text-align:left;"> Kennedy </td>
   <td style="text-align:left;"> D </td>
   <td style="text-align:right;"> 61 </td>
   <td style="text-align:right;"> -4 </td>
   <td style="text-align:right;"> 5.0 </td>
  </tr>
</tbody>
</table>

---

## Linear regression

```r
fit <- lm(seat.change ~ approval, data = midterms)

summary(fit)
```

```
## 
## Call:
## lm(formula = seat.change ~ approval, data = midterms)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.348 -10.913   6.091  11.473  26.867 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -96.8448    21.2569  -4.556 0.000324 ***
## approval      1.4244     0.4094   3.479 0.003096 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.41 on 16 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4307,	Adjusted R-squared:  0.3951 
## F-statistic: 12.11 on 1 and 16 DF,  p-value: 0.003096
```

---

## Using predict()

The function `predict()` takes a model fitted with `lm()` and new values of the predictor `$X$` and returns, based on the regression line, the average (expected) value of `$Y$` for each of those values.

```r
summary(midterms$approval)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   33.00   41.00   49.00   50.26   59.50   66.00
```

```r
my_new_data <- 
  data.frame(approval = c(20, 80))

predict(fit, newdata = my_new_data)
```

```
##         1         2 
## -68.35610  17.10995
```

---

## Using predict()

To obtain 95% confidence intervals around the predicted mean, we can add `interval = "confidence"`.

```r
seat.change_prediction <- 
  predict(fit, newdata = my_new_data, interval = "confidence")
seat.change_prediction
```

```
##         fit       lwr       upr
## 1 -68.35610 -96.58698 -40.12522
## 2  17.10995  -9.56615  43.78605
```

---

We can plot the results from `predict()` with `ggplot2`:

```r
library(ggplot2)
library(dplyr)  # provides the %>% pipe

# First we create a data.frame that pairs each prediction with the
# `approval` value used as input
data.frame(seat.change_prediction,
           approval = c(20, 80)) %>% 
# Then we plot  
  ggplot(aes(y = fit, 
             ymin = lwr, ymax = upr, 
             x = approval)) +
  geom_point() +
  geom_errorbar() +
  labs(y = "seat.change")
```

---


## Using predict() with two predictors

```r
fit <- lm(seat.change ~ approval + rdi.change, data = midterms)

my_new_data <- 
  data.frame(approval = c(20, 50, 80),
             rdi.change = median(midterms$rdi.change, na.rm = TRUE))
```

What is happening with `my_new_data`?

I have set three values for `approval` and kept `rdi.change` fixed at its median value across the observations.

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> approval </th>
   <th style="text-align:right;"> rdi.change </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 5.05 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 50 </td>
   <td style="text-align:right;"> 5.05 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 80 </td>
   <td style="text-align:right;"> 5.05 </td>
  </tr>
</tbody>
</table>
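
---

Two things happen when `my_new_data` is built: `median()` needs `na.rm = TRUE` because `rdi.change` contains a missing value, and `data.frame()` repeats the single median for every row. A minimal sketch, using only the five `rdi.change` values shown in the table at the start (so the median differs from the 5.05 obtained on the full data); `rdi_values` and `grid` are illustrative names:

```r
# rdi.change for the five rows displayed earlier (1946 is missing)
rdi_values <- c(NA, 8.2, 1.0, 1.1, 5.0)

# Without na.rm = TRUE the median is NA, because one value is missing
median(rdi_values)

# A single median value is recycled across the three approval scenarios
grid <- data.frame(approval   = c(20, 50, 80),
                   rdi.change = median(rdi_values, na.rm = TRUE))
grid
```

Every row of `grid` shares the same `rdi.change`, so any difference between the predictions comes from `approval` alone.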

---

Let's predict and plot...

```r
seat.change_prediction <- 
  predict(fit, newdata = my_new_data, interval = "confidence")

# First we combine the predictions with the input values
data.frame(seat.change_prediction,
           my_new_data) %>% 
# Then we plot  
  ggplot(aes(y = fit, 
             ymin = lwr, ymax = upr, 
             x = approval)) +
  geom_point() +
  geom_errorbar() +
  labs(y = "seat.change")
```

---

* The error plot, showing the estimate and its confidence interval, is among the best ways to present the results of your models (usually better than the regression table).

* The error plot lets you explore "what-if" scenarios based on your regression analysis (that is, using the line of best fit to predict values of `$Y$`).