class: center, middle, inverse, title-slide

.title[
# SSPS4102 Data Analytics in the Social Sciences
]
.subtitle[
## Week 11: predict()
]
.author[
### Francesco Bailo
]
.institute[
### The University of Sydney
]
.date[
### Semester 1, 2023 (updated: 2023-05-10)
]

---
background-image: url(https://upload.wikimedia.org/wikipedia/en/6/6a/Logo_of_the_University_of_Sydney.svg)
background-size: 95%

<style>
pre {
  overflow-x: auto;
}
pre code {
  word-wrap: normal;
  white-space: pre;
}
</style>

---
## Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

---
class: inverse, center, middle

# predict()

---
## Example: US midterm elections<sup>1</sup>

| Variable name | Description |
| ------------- | ----------- |
| `year` | midterm election year |
| `president` | name of president |
| `party` | Democrat or Republican |
| `approval` | Gallup approval rating at midterms |
| `seat.change` | change in the number of House seats for the president's party |
| `rdi.change` | change in real disposable income over the year before |

.footnote[[1] Slides from http://www.mattblackwell.org/files/teaching/gov50/regression-ii.pdf]

---

```r
midterms <- read.csv("../data/midterms.csv")
```

<table>
<thead>
<tr> <th style="text-align:right;"> year </th> <th style="text-align:left;"> president </th> <th style="text-align:left;"> party </th> <th style="text-align:right;"> approval </th> <th style="text-align:right;"> seat.change </th> <th style="text-align:right;"> rdi.change </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 1946 </td> <td style="text-align:left;"> Truman </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 33 </td> <td style="text-align:right;"> -55 </td> <td style="text-align:right;"> NA </td> </tr>
<tr> <td style="text-align:right;"> 1950 </td> <td style="text-align:left;"> Truman </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> -29 </td> <td style="text-align:right;"> 8.2 </td> </tr>
<tr> <td style="text-align:right;"> 1954 </td> <td style="text-align:left;"> Eisenhower </td> <td style="text-align:left;"> R </td> <td style="text-align:right;"> 61 </td> <td style="text-align:right;"> -4 </td> <td style="text-align:right;"> 1.0 </td> </tr>
<tr> <td style="text-align:right;"> 1958 </td> <td style="text-align:left;"> Eisenhower </td> <td style="text-align:left;"> R </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> -47 </td> <td style="text-align:right;"> 1.1 </td> </tr>
<tr> <td style="text-align:right;"> 1962 </td> <td style="text-align:left;"> Kennedy </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 61 </td> <td style="text-align:right;"> -4 </td> <td style="text-align:right;"> 5.0 </td> </tr>
</tbody>
</table>

---
## Linear regression

```r
fit <- lm(seat.change ~ approval, data = midterms)
summary(fit)
```

```
## 
## Call:
## lm(formula = seat.change ~ approval, data = midterms)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.348 -10.913   6.091  11.473  26.867 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -96.8448    21.2569  -4.556 0.000324 ***
## approval      1.4244     0.4094   3.479 0.003096 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.41 on 16 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4307, Adjusted R-squared:  0.3951 
## F-statistic: 12.11 on 1 and 16 DF,  p-value: 0.003096
```

---
## Using predict()

The function `predict()` takes the result of an `lm()` call and any values of `\(X\)` and predicts, based on the regression line, the average (expected) value of `\(Y\)` given `\(x\)`.
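
As a rough illustration of what `predict()` computes with a single predictor, the sketch below checks that its output equals the intercept plus the slope times `\(x\)`. The names `mini` and `mini_fit` are hypothetical, and the data are an assumption: only the five rows displayed in the table earlier, so the coefficients will differ from those of the full `midterms` fit.

```r
# Hypothetical mini data set: ONLY the five rows shown in the slides,
# not the full midterms.csv file (an assumption made for illustration).
mini <- data.frame(
  approval    = c(33, 39, 61, 57, 61),
  seat.change = c(-55, -29, -4, -47, -4)
)

mini_fit <- lm(seat.change ~ approval, data = mini)

# Values of X at which we want the expected Y
new_x <- c(20, 80)

# By hand: intercept + slope * x
b <- coef(mini_fit)
by_hand <- unname(b["(Intercept)"] + b["approval"] * new_x)

# With predict(): same regression line, same inputs
by_predict <- unname(predict(mini_fit, newdata = data.frame(approval = new_x)))

# TRUE when the two approaches agree
all.equal(by_hand, by_predict)
```

The column name in `newdata` has to match the predictor name used in the formula (`approval` here); that is how `predict()` knows which values to plug into the line.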

```r
summary(midterms$approval)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   33.00   41.00   49.00   50.26   59.50   66.00
```

```r
my_new_data <- data.frame(approval = c(20, 80))
predict(fit, newdata = my_new_data)
```

```
##         1         2 
## -68.35610  17.10995
```

---
## Using predict()

To obtain the 95% confidence intervals around the mean predictions, we can add `interval = "confidence"`.

```r
seat.change_prediction <-
  predict(fit, newdata = my_new_data, interval = "confidence")
seat.change_prediction
```

```
##         fit       lwr       upr
## 1 -68.35610 -96.58698 -40.12522
## 2  17.10995  -9.56615  43.78605
```

---

We can plot the results from `predict()` with

```r
library(tidyverse) # provides %>% and ggplot()

# First we need to create a data.frame adding the `approval` rate
# that we want to use as input values
data.frame(seat.change_prediction, approval = c(20, 80)) %>%
  # Then we plot
  ggplot(aes(y = fit, ymin = lwr, ymax = upr, x = approval)) +
  geom_point() +
  geom_errorbar() +
  labs(y = "seat.change")
```

<img src="week-11_files/figure-html/unnamed-chunk-7-1.svg" width="40%" style="display: block; margin: auto;" />

---
## Example: US midterm elections<sup>1</sup>

| Variable name | Description |
| ------------- | ----------- |
| `year` | midterm election year |
| `president` | name of president |
| `party` | Democrat or Republican |
| `approval` | Gallup approval rating at midterms |
| `seat.change` | change in the number of House seats for the president's party |
| `rdi.change` | change in real disposable income over the year before |

.footnote[[1] Slides from http://www.mattblackwell.org/files/teaching/gov50/regression-ii.pdf]

---
## Using predict() with two predictors

```r
fit <- lm(seat.change ~ approval + rdi.change, data = midterms)

my_new_data <-
  data.frame(approval = c(20, 50, 80),
             rdi.change = median(midterms$rdi.change, na.rm = TRUE))
```

What is happening with `my_new_data`?

--

I have added three values for `approval` and kept `rdi.change` fixed at its median value across the observations.

<table>
<thead>
<tr> <th style="text-align:right;"> approval </th> <th style="text-align:right;"> rdi.change </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 5.05 </td> </tr>
<tr> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 5.05 </td> </tr>
<tr> <td style="text-align:right;"> 80 </td> <td style="text-align:right;"> 5.05 </td> </tr>
</tbody>
</table>

---

Let's predict and plot...

```r
seat.change_prediction <-
  predict(fit, newdata = my_new_data, interval = "confidence")

data.frame(seat.change_prediction, my_new_data) %>%
  # Then we plot
  ggplot(aes(y = fit, ymin = lwr, ymax = upr, x = approval)) +
  geom_point() +
  geom_errorbar() +
  labs(y = "seat.change")
```

<img src="week-11_files/figure-html/unnamed-chunk-10-1.svg" width="40%" style="display: block; margin: auto;" />

---

* The error plot, with estimates and confidence intervals, is among the best ways to present the results of your models (usually better than the regression table).

* The error plot lets you explore "what-if" scenarios based on your regression analysis (that is, using the line of best fit to predict `\(Y\)` values).
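
---

The "what-if" idea can also be laid out as a table before plotting. The sketch below is only an illustration under stated assumptions: `mini`, `mini_fit`, `grid` and `what_if` are hypothetical names, and the data are just the five rows displayed in the earlier table rather than the full `midterms` file, so the numbers it produces are not the ones shown in these slides.

```r
# Hypothetical mini data set: ONLY the five rows shown in the slides,
# not the full midterms.csv file (an assumption made for illustration).
mini <- data.frame(
  approval    = c(33, 39, 61, 57, 61),
  seat.change = c(-55, -29, -4, -47, -4)
)

mini_fit <- lm(seat.change ~ approval, data = mini)

# One "what-if" scenario per row: approval from 20 to 80 in steps of 10
grid <- data.frame(approval = seq(20, 80, by = 10))

# predict() returns the columns fit, lwr and upr;
# we bind them to the scenarios they belong to
what_if <- data.frame(
  grid,
  predict(mini_fit, newdata = grid, interval = "confidence")
)

what_if
```

Each row of `what_if` holds one scenario with its estimate and 95% confidence interval, which is the shape of data the error plot expects. The interval is narrowest near the mean of the observed `approval` values and widens further away from it, so scenarios far outside the observed range deserve extra caution.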