class: center, middle, title-slide

.title[
# Logit
]

.subtitle[
## Big Data and Data Visualisation
]

.author[
### Marcell Granát
]

.institute[
### Central Bank of Hungary & .blue[John von Neumann University]
]

.date[
### 2023
]

---

<style type="text/css">
.red { color: red; }
.blue { color: #378C95; }
strong { color: red; }
a { color: #378C95; font-weight: bold; }
.remark-inline-code { font-weight: 900; background-color: #a7d5e7; }
.caption { color: #378C95; font-style: italic; text-align: center; }
.content-box {
  box-sizing: content-box;
  background-color: #378C95;
}
.content-box-green { background-color: #d9edc2; }
.content-box-red { background-color: #f9dbdb; }
.fullprice { text-decoration: line-through; }
</style>

## Motivation

- It has already been discussed how to incorporate categorical variables as regressors into the model
- Today we will talk about what to do if the **outcome** variable is categorical (for simplicity, we only look at cases where **two outcomes** are possible)
- An illustrative example: [English](https://en-hitelintezetiszemle.mnb.hu/letoltes/fer-22-3-st2-granat-neszveda-szabo.pdf) / [Hungarian](https://hitelintezetiszemle.mnb.hu/letoltes/hsz-22-3-t2-granat-neszveda-szabo.pdf)

---

## Introduction

> "For a number of reasons, the government bond yield curve is proving to be an accurate predictor of recessions in the US"

<p align="center"><iframe width="720" height="405" src="https://www.youtube.com/embed/oW4hfaiXKG8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>

---

## Setup - Recession data

The NBER officially publishes the start and end dates of US recessions: [https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions](https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions)

<img src="econometrics3/nber.png" width="500px" height="350px" style="display: block; margin: auto;" />

Let's take that table!

---

## Setup - Recession data

```r
nber <- rvest::read_html(
  str_c(
    "https://www.nber.org/research/data/",
    "us-business-cycle-expansions-and-contractions"
  )
) |>
  rvest::html_table(fill = TRUE) |>
  first() |>
  select(start = X1, end = X2) |>
  mutate(
    across(everything(), ~ str_extract(., pattern = "\\d{4}Q\\d")),
    across(everything(), lubridate::yq)
  ) |>
  drop_na()
```
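---

## Setup - Recession data

The key step above is pulling strings like `1980Q1` out of the table cells and turning them into dates. A minimal illustration of what `str_extract()` and `lubridate::yq()` do (the input string here is a hypothetical cell value, not taken from the actual table):

```r
x <- "Peak: 1980Q1 (January)" # hypothetical cell content

str_extract(x, pattern = "\\d{4}Q\\d") # keep only the YYYYQq part
#> [1] "1980Q1"

lubridate::yq("1980Q1") # parsed as the first day of the quarter
#> [1] "1980-01-01"
```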
---

## Setup - Recession data

```r
recession_dates <- map2(
  nber$start, nber$end,
  ~ seq.Date(from = .x, to = .y, by = "month")
) |>
  reduce(c)

recession_usa_df <- tibble(
  date = seq.Date(
    from = as.Date("1960-01-01"),
    to = as.Date("2020-01-01"),
    by = "months"
  )
) |>
  mutate(recession = date %in% recession_dates)
```

---

## Setup - Yield data

Let's take it from [DBnomics](https://db.nomics.world/)

```r
yield1_df <- rdbnomics::rdb(ids = "FED/NOMINAL_YIELD_CURVE/SVENPY01") %>%
  tibble() %>%
  transmute(
    time = lubridate::ymd(original_period),
    yield1 = value
  )

yield10_df <- rdbnomics::rdb(ids = "FED/NOMINAL_YIELD_CURVE/SVENPY10") %>%
  tibble() %>%
  transmute(
    time = lubridate::ymd(original_period),
    yield10 = value
  )
```

---

```r
yield_df <- full_join(yield1_df, yield10_df) %>%
  drop_na() %>%
  mutate(yield_spread = yield10 - yield1) %>%
  group_by(
    date = lubridate::floor_date(time, "quarter")
  ) %>%
  summarise(
    spread = prod(yield_spread / 100 + 1) ^ (1 / n()) - 1
  )

yield_df
```

```
## # A tibble: 209 × 2
##    date         spread
##    <date>        <dbl>
##  1 1971-07-01  0.00913
##  2 1971-10-01  0.0134
##  3 1972-01-01  0.0176
##  4 1972-04-01  0.0139
##  5 1972-07-01  0.0115
##  6 1972-10-01  0.00858
##  7 1973-01-01  0.00183
##  8 1973-04-01 -0.00310
##  9 1973-07-01 -0.0131
## 10 1973-10-01 -0.00606
## # ℹ 199 more rows
```

---

## Setup - Join the two sets

```r
inner_join(
  x = yield_df,
  y = recession_usa_df
)
```

```
## # A tibble: 195 × 3
##    date         spread recession
##    <date>        <dbl> <lgl>
##  1 1971-07-01  0.00913 FALSE
##  2 1971-10-01  0.0134  FALSE
##  3 1972-01-01  0.0176  FALSE
##  4 1972-04-01  0.0139  FALSE
##  5 1972-07-01  0.0115  FALSE
##  6 1972-10-01  0.00858 FALSE
##  7 1973-01-01  0.00183 FALSE
##  8 1973-04-01 -0.00310 FALSE
##  9 1973-07-01 -0.0131  FALSE
## 10 1973-10-01 -0.00606 TRUE
## # ℹ 185 more rows
```

---

## The logit

Let \(P_i\) denote the probability of being `TRUE`. The estimated function is the following:

`$$\text{log}\frac{P_i}{1-P_i}=\beta_0+\sum_{j=1}^{k}\beta_jx_{i,j}$$`

- The left side of the equation is called the **log odds ratio**
- This model is not estimated by OLS!

--

- We must introduce the likelihood function:

`$$L=\prod_{y_i=1}P_i\prod_{y_i=0}(1-P_i)$$`

.blue[How would you interpret it?]

--

> Maximization of the likelihood function for either the probit or the logit model is accomplished by .blue[iterative] nonlinear estimation methods. There are now several computer programs available for probit and logit analysis, and these programs are very inexpensive to run.

---

### The Logit in R

```r
df <- inner_join(
  x = yield_df,
  y = recession_usa_df
) %>%
  mutate(spread = lag(spread, 4))
```

```r
fit <- glm(recession ~ spread, binomial(link = "logit"), data = df)
```

--

The broom functions also work here! Extract the coefficients:

```r
broom::tidy(fit)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)   -0.627     0.254     -2.47 0.0136
## 2 spread      -156.       28.1       -5.56 0.0000000264
```
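---

### The Logit in R

To see what the .blue[iterative] estimation mentioned above actually does, here is a minimal sketch that maximizes the log-likelihood directly with `optim()`. The helper `neg_loglik` is our own illustration, not how `glm()` works internally (`glm()` uses iteratively reweighted least squares):

```r
neg_loglik <- function(beta, x, y) {
  p <- plogis(beta[1] + beta[2] * x)   # P_i from the logit formula
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12) # guard against log(0)
  -sum(log(ifelse(y, p, 1 - p)))       # minus the log of the likelihood L
}

d <- drop_na(df) # glm() silently drops the NAs created by lag()

optim(
  par = c(0, 0), fn = neg_loglik,
  x = d$spread, y = d$recession,
  method = "BFGS", control = list(maxit = 500)
)$par
```

The two estimates should land close to the `broom::tidy(fit)` values above (roughly -0.63 and -156), though a general-purpose optimizer may need rescaled parameters or more iterations to match them exactly.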
---

### The Logit in R

Predictions & errors for the observations:

```r
broom::augment(fit, type.predict = "response")
```

```
## # A tibble: 191 × 9
##    .rownames recession   spread .fitted .resid    .hat .sigma .cooksd .std.resid
##    <chr>     <lgl>        <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>      <dbl>
##  1 5         FALSE      0.00913  0.114  -0.491 0.00848  0.784 5.53e-4     -0.493
##  2 6         FALSE      0.0134   0.0620 -0.358 0.00803  0.785 2.70e-4     -0.359
##  3 7         FALSE      0.0176   0.0331 -0.259 0.00703  0.785 1.22e-4     -0.260
##  4 8         FALSE      0.0139   0.0569 -0.342 0.00793  0.785 2.43e-4     -0.344
##  5 9         FALSE      0.0115   0.0807 -0.410 0.00827  0.785 3.69e-4     -0.412
##  6 10        TRUE       0.00858  0.122   2.05  0.00853  0.771 3.11e-2      2.06
##  7 11        TRUE       0.00183  0.286   1.58  0.0118   0.777 1.51e-2      1.59
##  8 12        TRUE      -0.00310  0.465   1.24  0.0218   0.780 1.32e-2      1.25
##  9 13        TRUE      -0.0131   0.805   0.658 0.0417   0.784 5.48e-3      0.672
## 10 14        TRUE      -0.00606  0.580   1.04  0.0303   0.781 1.17e-2      1.06
## # ℹ 181 more rows
```

---

### Reproduce the table from Estrella and Mishkin [1996]

Estrella and Mishkin [1996] report a table that contains the probability of recession for certain values of the spread.

--

To reproduce this table, we use our own model to estimate the probability of recession.
.footnote[The results will differ a bit: the time series of the spread is longer in the article & the authors use a probit model (the link function is different, but this does not cause a huge difference).]

--

The `augment` function has a `newdata` argument, so we can easily generate the predictions with it. (This also works with other linear models.)

---

### Reproduce the table from Estrella and Mishkin [1996]

```r
df %>%
  filter(date < "1995-01-01") %>%
  glm(recession ~ spread, binomial(link = "probit"), data = .) %>%
  broom::augment(
    type.predict = "response",
    newdata = tibble(spread = seq(from = -2.5, to = 1.4, by = .40))
  )
```

```
## # A tibble: 10 × 2
##    spread  .fitted
##     <dbl>    <dbl>
##  1 -2.5   1   e+ 0
##  2 -2.1   1   e+ 0
##  3 -1.7   1   e+ 0
##  4 -1.3   1   e+ 0
##  5 -0.9   1   e+ 0
##  6 -0.5   1   e+ 0
##  7 -0.100 1   e+ 0
##  8  0.300 2.22e-16
##  9  0.7   2.22e-16
## 10  1.1   2.22e-16
```

(The probabilities above are degenerate because `spread` in our data is a decimal fraction, while the grid above is in percentage points; dividing the grid by 100 would put it on the scale the model was fit on.)

---

### ROC curve

#### When do we say that "the winter is coming"?

If it exceeds a certain probability... = **Cut-off value**

```r
cut_off = 0.5

broom::augment(fit, type.predict = "response") %>%
  mutate(
    estimation = .fitted > cut_off,
  ) %>%
  select(recession, spread, .fitted, estimation)
```

```
## # A tibble: 191 × 4
##    recession   spread .fitted estimation
##    <lgl>        <dbl>   <dbl> <lgl>
##  1 FALSE      0.00913  0.114  FALSE
##  2 FALSE      0.0134   0.0620 FALSE
##  3 FALSE      0.0176   0.0331 FALSE
##  4 FALSE      0.0139   0.0569 FALSE
##  5 FALSE      0.0115   0.0807 FALSE
##  6 TRUE       0.00858  0.122  FALSE
##  7 TRUE       0.00183  0.286  FALSE
##  8 TRUE      -0.00310  0.465  FALSE
##  9 TRUE      -0.0131   0.805  TRUE
## 10 TRUE      -0.00606  0.580  TRUE
## # ℹ 181 more rows
```

---

### ROC curve

#### When do we say that "the winter is coming"?

If it exceeds a certain probability... = **Cut-off value**

```r
cut_off = 0.5

broom::augment(fit, type.predict = "response") %>%
  mutate(
    estimation = .fitted > cut_off,
    correct = recession == estimation
  ) %>%
  count(correct, estimation) %>%
  mutate(estimation = ifelse(estimation == TRUE, "positive", "negative"))
```

```
## # A tibble: 4 × 3
##   correct estimation     n
##   <lgl>   <chr>      <int>
## 1 FALSE   negative      22
## 2 FALSE   positive       5
## 3 TRUE    negative     155
## 4 TRUE    positive       9
```

---

### ROC curve

#### When do we say that "the winter is coming"?

If it exceeds a certain probability... = **Cut-off value**

```r
confusion_matrix <- function(.fit, cut_off = .5) {
  broom::augment(.fit, type.predict = "response") %>%
    mutate(
      estimation = .fitted > cut_off,
      correct = recession == estimation
    ) %>%
    count(correct, estimation) %>%
    mutate(estimation = ifelse(estimation == TRUE, "positive", "negative"))
}

confusion_matrix(fit, .7)
```

```
## # A tibble: 4 × 3
##   correct estimation     n
##   <lgl>   <chr>      <int>
## 1 FALSE   negative      24
## 2 FALSE   positive       3
## 3 TRUE    negative     157
## 4 TRUE    positive       7
```

---

### ROC curve

#### When do we say that "the winter is coming"?

**If it exceeds a certain probability... = Cut-off value**

Some possible cut-off values:

- .5
- the one that returns the same number of `TRUE` predictions as `TRUE` observations
- the one that maximizes the accuracy
- the one that minimizes cost (if you know the cost function)

---

### ROC curve

Let's construct the confusion matrix for a given cut-off value.

Now, the following indicators can be derived:

`$$\text{Sensitivity}=\frac{TP}{TP+FN}$$`

`$$\text{Specificity}=\frac{TN}{TN+FP}$$`

These two values can be calculated for any cut-off value... plotting them against each other for all possible cut-offs gives the ROC curve.
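---

### ROC curve

As an illustration, here is a minimal sketch that computes these two indicators by hand on a small grid of cut-off values (the grid and the helper code are our own; the `pROC` package on the next slide does this properly for every threshold):

```r
pred <- broom::augment(fit, type.predict = "response")

tibble(cut_off = seq(.1, .9, by = .2)) %>%
  mutate(
    # Sensitivity: share of actual recessions predicted as recession, TP / (TP + FN)
    sensitivity = map_dbl(cut_off, ~ mean(pred$.fitted[pred$recession] > .x)),
    # Specificity: share of non-recessions predicted as no recession, TN / (TN + FP)
    specificity = map_dbl(cut_off, ~ mean(pred$.fitted[!pred$recession] <= .x))
  )
```

Raising the cut-off makes the model more conservative: specificity rises while sensitivity falls, and the ROC curve traces exactly this trade-off.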
---

### ROC curve in R

```r
roc_curve <- broom::augment(fit, type.predict = "response") %$%
  pROC::roc(recession, .fitted)
```

Area under the curve:

```r
roc_curve$auc
```

```
## Area under the curve: 0.8849
```

.content-box-green[
AUC is the suggested indicator to evaluate your model and compare it to another one!
]

```r
roc_curve_df <- roc_curve %$%
  tibble(thresholds, sensitivities, specificities)
```

---

### Visualization of the ROC curve

<img src="econometrics3_files/figure-html/unnamed-chunk-19-1.png" width="700px" height="450px" style="display: block; margin: auto;" />

---

# References

<p><cite>Ramanathan, R. (1992). <em>Introductory econometrics with applications</em>. Dryden Press.</cite></p>

<p><cite>Estrella, A. and F. S. Mishkin (1996). "The yield curve as a predictor of US recessions". In: <em>Current issues in economics and finance</em> 2.7.</cite></p>

---

class: center, middle

# Thank you for your attention!

Slides are available at [www.marcellgranat.com](https://www.marcellgranat.com)