class: center, middle, title-slide

.title[
# Logit
]

.subtitle[
## Big Data and Data Visualisation
]

.author[
### Marcell Granát
]

.institute[
### Central Bank of Hungary & .blue[John von Neumann University]
]

.date[
### 2023
]

---

<style type="text/css">
.red { color: red; }
.blue { color: #378C95; }
strong { color: red; }
a { color: #378C95; font-weight: bold; }
.remark-inline-code { font-weight: 900; background-color: #a7d5e7; }
.caption { color: #378C95; font-style: italic; text-align: center; }
.content-box {
  box-sizing: content-box;
  background-color: #378C95;
}
.content-box-green { background-color: #d9edc2; }
.content-box-red { background-color: #f9dbdb; }
.fullprice { text-decoration: line-through; }
</style>

## Motivation

- It has already been discussed how to incorporate categorical variables as regressors into the model
- Today we will talk about what to do if the **outcome** variable is categorical (for simplicity, we only look at cases where **two outcomes** are possible)
- An illustrative example: [English](https://en-hitelintezetiszemle.mnb.hu/letoltes/fer-22-3-st2-granat-neszveda-szabo.pdf) / [Hungarian](https://hitelintezetiszemle.mnb.hu/letoltes/hsz-22-3-t2-granat-neszveda-szabo.pdf)

---

## Introduction

> "For a number of reasons, the government bond yield curve is proving to be an accurate predictor of recessions in the US"

<p align="center"><iframe width="720" height="405" src="https://www.youtube.com/embed/oW4hfaiXKG8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>

---

## Setup - Recession data

The NBER officially publishes the start and end dates of US recessions: [https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions](https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions)

<img src="econometrics3/nber.png" width="500px" height="350px" style="display: block; margin: auto;" />

Let's take that table!

---

## Setup - Recession data

```r
nber <- rvest::read_html(
  str_c(
    "https://www.nber.org/research/data/",
    "us-business-cycle-expansions-and-contractions"
  )
) |>
  rvest::html_table(fill = TRUE) |>
  first() |>
  select(start = X1, end = X2) |>
  mutate(
    across(everything(), ~ str_extract(., pattern = "\\d{4}Q\\d")),
    across(everything(), lubridate::yq)
  ) |>
  drop_na()
```
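---

## Setup - Recession data

The key step above is pulling strings like `1980Q1` out of the table cells and turning them into dates. A minimal illustration of what `str_extract()` and `lubridate::yq()` do (the input string here is a hypothetical cell value, not taken from the actual table):

```r
x <- "Peak: 1980Q1 (January)" # hypothetical cell content

str_extract(x, pattern = "\\d{4}Q\\d") # keep only the YYYYQq part
#> [1] "1980Q1"

lubridate::yq("1980Q1") # parsed as the first day of the quarter
#> [1] "1980-01-01"
```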
---

## Setup - Recession data

```r
recession_dates <- map2(
  nber$start, nber$end,
  ~ seq.Date(from = .x, to = .y, by = "month")
) |>
  reduce(c)

recession_usa_df <- tibble(
  date = seq.Date(
    from = as.Date("1960-01-01"),
    to = as.Date("2020-01-01"),
    by = "months"
  )
) |>
  mutate(recession = date %in% recession_dates)
```

---

## Setup - Yield data

Let's take it from [DBnomics](https://db.nomics.world/)

```r
yield1_df <- rdbnomics::rdb(ids = "FED/NOMINAL_YIELD_CURVE/SVENPY01") %>%
  tibble() %>%
  transmute(
    time = lubridate::ymd(original_period),
    yield1 = value
  )

yield10_df <- rdbnomics::rdb(ids = "FED/NOMINAL_YIELD_CURVE/SVENPY10") %>%
  tibble() %>%
  transmute(
    time = lubridate::ymd(original_period),
    yield10 = value
  )
```

---

```r
yield_df <- full_join(yield1_df, yield10_df) %>%
  drop_na() %>%
  mutate(yield_spread = yield10 - yield1) %>%
  group_by(
    date = lubridate::floor_date(time, "quarter")
  ) %>%
  summarise(
    spread = prod(yield_spread / 100 + 1) ^ (1 / n()) - 1
  )

yield_df
```

```
## # A tibble: 209 × 2
##    date         spread
##    <date>        <dbl>
##  1 1971-07-01  0.00913
##  2 1971-10-01  0.0134
##  3 1972-01-01  0.0176
##  4 1972-04-01  0.0139
##  5 1972-07-01  0.0115
##  6 1972-10-01  0.00858
##  7 1973-01-01  0.00183
##  8 1973-04-01 -0.00310
##  9 1973-07-01 -0.0131
## 10 1973-10-01 -0.00606
## # ℹ 199 more rows
```

---

## Setup - Join the two sets

```r
inner_join(
  x = yield_df,
  y = recession_usa_df
)
```

```
## # A tibble: 195 × 3
##    date         spread recession
##    <date>        <dbl> <lgl>
##  1 1971-07-01  0.00913 FALSE
##  2 1971-10-01  0.0134  FALSE
##  3 1972-01-01  0.0176  FALSE
##  4 1972-04-01  0.0139  FALSE
##  5 1972-07-01  0.0115  FALSE
##  6 1972-10-01  0.00858 FALSE
##  7 1973-01-01  0.00183 FALSE
##  8 1973-04-01 -0.00310 FALSE
##  9 1973-07-01 -0.0131  FALSE
## 10 1973-10-01 -0.00606 TRUE
## # ℹ 185 more rows
```

---

## The logit

Let \(P_i\) denote the probability of being `TRUE`. The estimated function is the following:

`$$\text{log}\frac{P_i}{1-P_i}=\beta_0+\sum_{j=1}^{k}\beta_jx_{i,j}$$`

- The left side of the equation is called the **log odds ratio**
- This model is not estimated by OLS!

--

- We must introduce the likelihood function:

`$$L=\prod_{y_i=1}P_i\prod_{y_i=0}(1-P_i)$$`

.blue[How would you interpret it?]

--

> Maximization of the likelihood function for either the probit or the logit model is accomplished by .blue[iterative] nonlinear estimation methods. There are now several computer programs available for probit and logit analysis, and these programs are very inexpensive to run.

---

### The Logit in R

```r
df <- inner_join(
  x = yield_df,
  y = recession_usa_df
) %>%
  mutate(spread = lag(spread, 4))
```

```r
fit <- glm(recession ~ spread, binomial(link = "logit"), data = df)
```

--

The broom functions also work here! Extract the coefficients:

```r
broom::tidy(fit)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)   -0.627     0.254     -2.47 0.0136
## 2 spread      -156.       28.1       -5.56 0.0000000264
```
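---

### The Logit in R

To see what the .blue[iterative] estimation mentioned above actually does, here is a minimal sketch that maximizes the log-likelihood directly with `optim()`. The helper `neg_loglik` is our own illustration, not how `glm()` works internally (`glm()` uses iteratively reweighted least squares):

```r
neg_loglik <- function(beta, x, y) {
  p <- plogis(beta[1] + beta[2] * x)   # P_i from the logit formula
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12) # guard against log(0)
  -sum(log(ifelse(y, p, 1 - p)))       # minus the log of the likelihood L
}

d <- drop_na(df) # glm() silently drops the NAs created by lag()

optim(
  par = c(0, 0), fn = neg_loglik,
  x = d$spread, y = d$recession,
  method = "BFGS", control = list(maxit = 500)
)$par
```

The two estimates should land close to the `broom::tidy(fit)` values above (roughly -0.63 and -156), though a general-purpose optimizer may need rescaled parameters or more iterations to match them exactly.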
---

### The Logit in R

Predictions & errors for the observations:

```r
broom::augment(fit, type.predict = "response")
```

```
## # A tibble: 191 × 9
##    .rownames recession   spread .fitted .resid    .hat .sigma .cooksd .std.resid
##    <chr>     <lgl>        <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>      <dbl>
##  1 5         FALSE      0.00913  0.114  -0.491 0.00848  0.784 5.53e-4     -0.493
##  2 6         FALSE      0.0134   0.0620 -0.358 0.00803  0.785 2.70e-4     -0.359
##  3 7         FALSE      0.0176   0.0331 -0.259 0.00703  0.785 1.22e-4     -0.260
##  4 8         FALSE      0.0139   0.0569 -0.342 0.00793  0.785 2.43e-4     -0.344
##  5 9         FALSE      0.0115   0.0807 -0.410 0.00827  0.785 3.69e-4     -0.412
##  6 10        TRUE       0.00858  0.122   2.05  0.00853  0.771 3.11e-2      2.06
##  7 11        TRUE       0.00183  0.286   1.58  0.0118   0.777 1.51e-2      1.59
##  8 12        TRUE      -0.00310  0.465   1.24  0.0218   0.780 1.32e-2      1.25
##  9 13        TRUE      -0.0131   0.805   0.658 0.0417   0.784 5.48e-3      0.672
## 10 14        TRUE      -0.00606  0.580   1.04  0.0303   0.781 1.17e-2      1.06
## # ℹ 181 more rows
```

---

### Reproduce the table from Estrella and Mishkin [1996]

Estrella and Mishkin [1996] report a table that contains the probability of recession for certain values of the spread.

--

To reproduce this table, we use our own model to estimate the probability of recession.
.footnote[The results will differ a bit: the time series of the spread is longer in the article & the authors use a probit model (the link function is different, but this does not cause a huge difference).]

--

The `augment` function has a `newdata` argument, so we can easily generate the predictions with it. (This also works with other linear models.)

---

### Reproduce the table from Estrella and Mishkin [1996]

```r
df %>%
  filter(date < "1995-01-01") %>%
  glm(recession ~ spread, binomial(link = "probit"), data = .) %>%
  broom::augment(
    type.predict = "response",
    newdata = tibble(spread = seq(from = -2.5, to = 1.4, by = .40))
  )
```

```
## # A tibble: 10 × 2
##    spread  .fitted
##     <dbl>    <dbl>
##  1 -2.5   1   e+ 0
##  2 -2.1   1   e+ 0
##  3 -1.7   1   e+ 0
##  4 -1.3   1   e+ 0
##  5 -0.9   1   e+ 0
##  6 -0.5   1   e+ 0
##  7 -0.100 1   e+ 0
##  8  0.300 2.22e-16
##  9  0.7   2.22e-16
## 10  1.1   2.22e-16
```

(The probabilities above are degenerate because `spread` in our data is a decimal fraction, while the grid above is in percentage points; dividing the grid by 100 would put it on the scale the model was fit on.)

---

### ROC curve

#### When do we say that "the winter is coming"?

If it exceeds a certain probability... = **Cut-off value**

```r
cut_off = 0.5

broom::augment(fit, type.predict = "response") %>%
  mutate(
    estimation = .fitted > cut_off,
  ) %>%
  select(recession, spread, .fitted, estimation)
```

```
## # A tibble: 191 × 4
##    recession   spread .fitted estimation
##    <lgl>        <dbl>   <dbl> <lgl>
##  1 FALSE      0.00913  0.114  FALSE
##  2 FALSE      0.0134   0.0620 FALSE
##  3 FALSE      0.0176   0.0331 FALSE
##  4 FALSE      0.0139   0.0569 FALSE
##  5 FALSE      0.0115   0.0807 FALSE
##  6 TRUE       0.00858  0.122  FALSE
##  7 TRUE       0.00183  0.286  FALSE
##  8 TRUE      -0.00310  0.465  FALSE
##  9 TRUE      -0.0131   0.805  TRUE
## 10 TRUE      -0.00606  0.580  TRUE
## # ℹ 181 more rows
```

---

### ROC curve

#### When do we say that "the winter is coming"?

If it exceeds a certain probability... = **Cut-off value**

```r
cut_off = 0.5

broom::augment(fit, type.predict = "response") %>%
  mutate(
    estimation = .fitted > cut_off,
    correct = recession == estimation
  ) %>%
  count(correct, estimation) %>%
  mutate(estimation = ifelse(estimation == TRUE, "positive", "negative"))
```

```
## # A tibble: 4 × 3
##   correct estimation     n
##   <lgl>   <chr>      <int>
## 1 FALSE   negative      22
## 2 FALSE   positive       5
## 3 TRUE    negative     155
## 4 TRUE    positive       9
```

---

### ROC curve

#### When do we say that "the winter is coming"?

If it exceeds a certain probability... = **Cut-off value**

```r
confusion_matrix <- function(.fit, cut_off = .5) {
  broom::augment(.fit, type.predict = "response") %>%
    mutate(
      estimation = .fitted > cut_off,
      correct = recession == estimation
    ) %>%
    count(correct, estimation) %>%
    mutate(estimation = ifelse(estimation == TRUE, "positive", "negative"))
}

confusion_matrix(fit, .7)
```

```
## # A tibble: 4 × 3
##   correct estimation     n
##   <lgl>   <chr>      <int>
## 1 FALSE   negative      24
## 2 FALSE   positive       3
## 3 TRUE    negative     157
## 4 TRUE    positive       7
```

---

### ROC curve

#### When do we say that "the winter is coming"?

**If it exceeds a certain probability... = Cut-off value**

Some possible cut-off values:

- .5
- the one that returns the same number of `TRUE` predictions as `TRUE` observations
- the one that maximizes the accuracy
- the one that minimizes cost (if you know the cost function)

---

### ROC curve

Let's construct the confusion matrix for a given cut-off value.

Now, the following indicators can be derived:

`$$\text{Sensitivity}=\frac{TP}{TP+FN}$$`

`$$\text{Specificity}=\frac{TN}{TN+FP}$$`

These two values can be calculated for any cut-off value... plotting them against each other for all possible cut-offs gives the ROC curve.
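---

### ROC curve

As an illustration, here is a minimal sketch that computes these two indicators by hand on a small grid of cut-off values (the grid and the helper code are our own; the `pROC` package on the next slide does this properly for every threshold):

```r
pred <- broom::augment(fit, type.predict = "response")

tibble(cut_off = seq(.1, .9, by = .2)) %>%
  mutate(
    # Sensitivity: share of actual recessions predicted as recession, TP / (TP + FN)
    sensitivity = map_dbl(cut_off, ~ mean(pred$.fitted[pred$recession] > .x)),
    # Specificity: share of non-recessions predicted as no recession, TN / (TN + FP)
    specificity = map_dbl(cut_off, ~ mean(pred$.fitted[!pred$recession] <= .x))
  )
```

Raising the cut-off makes the model more conservative: specificity rises while sensitivity falls, and the ROC curve traces exactly this trade-off.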
---

### ROC curve in R

```r
roc_curve <- broom::augment(fit, type.predict = "response") %$%
  pROC::roc(recession, .fitted)
```

Area under the curve:

```r
roc_curve$auc
```

```
## Area under the curve: 0.8849
```

.content-box-green[
AUC is the suggested indicator to evaluate your model and compare it to another one!
]

```r
roc_curve_df <- roc_curve %$%
  tibble(thresholds, sensitivities, specificities)
```

---

### Visualization of the ROC curve

<img src="econometrics3_files/figure-html/unnamed-chunk-19-1.png" width="700px" height="450px" style="display: block; margin: auto;" />

---

# References

<p><cite>Ramanathan, R. (1992). <em>Introductory econometrics with applications</em>. Dryden Press.</cite></p>

<p><cite>Estrella, A. and F. S. Mishkin (1996). "The yield curve as a predictor of US recessions". In: <em>Current issues in economics and finance</em> 2.7.</cite></p>

---

class: center, middle

# Thank you for your attention!

Slides are available at [www.marcellgranat.com](https://www.marcellgranat.com)