class: center, middle, title-slide .title[ # Bivariate and multivariate OLS ] .subtitle[ ##
Big Data and Data Visualisation ] .author[ ### Marcell Granát ] .institute[ ### Central Bank of Hungary & .blue[John von Neumann University] ] .date[ ### 2023 ] --- <style type="text/css"> .red { color: red; } .blue { color: #378C95; } strong { color: red; } a { color: #378C95; font-weight: bold; } .remark-inline-code { font-weight: 900; background-color: #a7d5e7; } .caption { color: #378C95; font-style: italic; text-align: center; } .content-box { box-sizing: content-box; background-color: #378C95; /* Total width: 160px + (2 * 20px) + (2 * 8px) = 216px Total height: 80px + (2 * 20px) + (2 * 8px) = 136px Content box width: 160px Content box height: 80px */ } .content-box-green { background-color: #d9edc2; } .content-box-red { background-color: #f9dbdb; } .fullprice { text-decoration: line-through; } </style> --- class: inverse, middle, center # The concept --- ## What is econometrics? > "The application of **statistical and mathematical methods** to the analysis of **economic** data, with the purpose of giving **empirical content** to economic theories and verifying them or refuting them." --
-- A prominent tool: **regression** --- class: inverse, middle, center # Bivariate OLS --- ## Basics of OLS ### Amsterdam house data ```r amsterdam_house_df <- read_rds("https://raw.githubusercontent.com/MarcellGranat/big_data2022/main/econometrics_files/amsterdam_house.rds") # source: https://www.kaggle.com/datasets/thomasnibb/amsterdam-house-price-prediction ``` ``` ## # A tibble: 924 × 7 ## Address Zip Price Area Room Lon Lat ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Blasiusstraat 8 2, Amsterdam 1091 CR 685000 64 3 4.91 52.4 ## 2 Kromme Leimuidenstraat 13 H, Amsterdam 1059 EL 475000 60 3 4.85 52.3 ## 3 Zaaiersweg 11 A, Amsterdam 1097 SM 850000 109 4 4.94 52.3 ## 4 Tenerifestraat 40, Amsterdam 1060 TH 580000 128 6 4.79 52.3 ## 5 Winterjanpad 21, Amsterdam 1036 KN 720000 138 5 4.90 52.4 ## 6 De Wittenkade 134 I, Amsterdam 1051 AM 450000 53 2 4.88 52.4 ## 7 Pruimenstraat 18 B, Amsterdam 1033 KM 450000 87 3 4.90 52.4 ## 8 Da Costakade 32 II, Amsterdam 1053 WL 590000 80 2 4.87 52.4 ## 9 Postjeskade 41 2, Amsterdam 1058 DG 399000 49 3 4.85 52.4 ## 10 Van Ostadestraat 193 H, Amsterdam 1073 TM 300000 33 2 4.90 52.4 ## # ℹ 914 more rows ``` --- ## Basics of OLS .blue[What price do we expect, knowing the area of a given house?] Approach: ??? <img src="econometrics1_files/figure-html/unnamed-chunk-6-1.png" width="700px" height="450px" style="display: block; margin: auto;" /> --- ### Amsterdam house data .blue[What price do we expect, knowing the area of a given house?] Approach: Let's fit a straight line. <img src="econometrics1_files/figure-html/unnamed-chunk-7-1.png" width="700px" height="450px" style="display: block; margin: auto;" /> --- ## Basics of OLS .blue[What parameters do we have to know to draw the line?] `$$Y_i = \beta_0 + \beta_1 \times X_i,$$` .pull-left[ where `$$Y_i = \text{The price of a given house}$$` `$$\beta_0 = \text{The intercept of the line}$$` `$$\beta_1 = \text{The slope of the line}$$` `$$X_i = \text{The area of the given house}$$` ] --- ## Basics of OLS .blue[What parameters do we have to know to draw the line?] `$$Y_i = \beta_0 + \beta_1 \times X_i + \epsilon_i,$$` .pull-left[ where `$$Y_i = \text{The price of a given house}$$` `$$\beta_0 = \text{The intercept of the line}$$` `$$\beta_1 = \text{The slope of the line}$$` `$$X_i = \text{The area of the given house}$$` `$$\epsilon_i = \text{The error term}$$` ] .pull-right[ <img src="econometrics_files/meme_error.JPG" width="260px" height="260px" style="display: block; margin: auto;" /> ] -- ### How to choose the value of the intercept and the slope? -- .blue[Minimize the error?] -- The result would be a line above the points... --- ## Basics of OLS .blue[What parameters do we have to know to draw the line?] `$$Y_i = \beta_0 + \beta_1 \times X_i + \epsilon_i,$$` .pull-left[ where `$$Y_i = \text{The price of a given house}$$` `$$\beta_0 = \text{The intercept of the line}$$` `$$\beta_1 = \text{The slope of the line}$$` `$$X_i = \text{The area of the given house}$$` `$$\epsilon_i = \text{The error term}$$` ] .pull-right[ <img src="econometrics_files/meme_error.JPG" width="260px" height="260px" style="display: block; margin: auto;" /> ] ### How to choose the value of the intercept and the slope? .blue[Minimize the squared error?] -- Exactly, that is what OLS is about. 
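Formally, OLS chooses the intercept and the slope that minimize the sum of squared errors: `$$\left(\hat{\beta}_0, \hat{\beta}_1\right)=\underset{\beta_0, \beta_1}{\arg\min} \sum_{i=1}^n\left(Y_i-\beta_0-\beta_1 X_i\right)^2$$`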
OLS = Ordinary Least Squares --- ## Basics of OLS #### Let's calculate the sum of squared errors assuming two arbitrary values for the intercept and the slope ```r b0 <- 1000 b1 <- 200 amsterdam_house_df %>% transmute( Price, Area, fit = b0 + Area * b1, e = Price - fit, e2 = e^2 ) %$% sum(e2, na.rm = TRUE) ``` ``` ## [1] 5.91045e+14 ``` --- ## Basics of OLS #### Let's transform this into a function ```r sse <- function(b0 = 1000, b1 = 200) { amsterdam_house_df %>% transmute( Price, Area, fit = b0 + Area * b1, e = Price - fit, e2 = e^2 ) %$% sum(e2, na.rm = TRUE) } sse(1000, 200) ``` ``` ## [1] 5.91045e+14 ``` --- ## Basics of OLS #### Let's apply the function on a large set of values ```r sse_df <- crossing( b0 = seq(from = 0, to = 1e4, length.out = 100), b1 = seq(from = 0, to = 1e4, length.out = 100) ) %>% mutate( sse = map2_dbl(b0, b1, sse) ) %>% arrange(sse) sse_df ``` ``` ## # A tibble: 10,000 × 3 ## b0 b1 sse ## <dbl> <dbl> <dbl> *## 1 0 6869. 8.52e13 ## 2 101. 6869. 8.52e13 ## 3 202. 6869. 8.52e13 ## 4 303. 6869. 8.52e13 ## 5 404. 6869. 8.52e13 ## 6 505. 6869. 8.52e13 ## 7 606. 6869. 8.52e13 ## 8 707. 6869. 8.52e13 ## 9 808. 6869. 8.52e13 ## 10 909. 6869. 8.52e13 ## # ℹ 9,990 more rows ``` --- ## Basics of OLS #### Let's apply the function on a large set of values <img src="econometrics1_files/figure-html/unnamed-chunk-13-1.png" width="700px" height="450px" style="display: block; margin: auto;" /> --- ## Basics of OLS Now we have run the function on 10,000 combinations, but we are still not sure whether our solution is the best possible, or how far it is from that... -- Fortunately, there is a much more efficient way to determine the line that produces the least squared errors. The normal equations: `$$\sum Y_i =n \times \beta_0+\beta_1 \sum X_i$$` `$$\sum X_i Y_i =\beta_0 \sum X_i+\beta_1 \sum X_i^2$$` --- ## Basics of OLS In our case: ```r amsterdam_house_df %>% drop_na(Price, Area) %>% summarise(n = n(), y = sum(Price), x = sum(Area), xy = sum(Price * Area), x2 = sum(Area^2)) ``` ``` ## # A tibble: 1 × 5 ## n y x xy x2 ## <int> <dbl> <dbl> <dbl> <dbl> ## 1 920 572300186 87959 78232126563 11379655 ``` `$$572,300,186 =920 \times \beta_0+\beta_1 \times 87,959$$` `$$78,232,126,563 =\beta_0 \times 87,959+\beta_1 \times 11,379,655$$`
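The normal equations form a simple two-equation linear system, so we can also solve them directly. A minimal sketch with base R's `solve()`, plugging in the sums printed above:

```r
# the normal equations written as a linear system A %*% beta = b
A <- matrix(c(920,      87959,
              87959, 11379655),
            nrow = 2, byrow = TRUE)
b <- c(572300186, 78232126563)
solve(A, b) # approximately c(-134910, 7918): the intercept and the slope
```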
--- ## Basics of OLS Of course, there is a simpler solution for fitting a **L**inear **M**odel ```r lm(formula = Price ~ Area, data = amsterdam_house_df) ``` ``` ## ## Call: ## lm(formula = Price ~ Area, data = amsterdam_house_df) ## ## Coefficients: ## (Intercept) Area ## -134910 7918 ``` -- How to interpret the results? 1. A house with one additional m^2 of area is expected to cost €7,918 more 2. If the area were zero, the price would be −€134,910 (a zero area and a negative price are equally impossible; this is just a meaningless extrapolation) Now we see the estimated coefficients (beta values), but what else can we extract? --- ## Basics of OLS The old-school (and disadvantageous) method ```r fit <- lm(formula = Price ~ Area, data = amsterdam_house_df) summary(fit) ``` ``` ## ## Call: ## lm(formula = Price ~ Area, data = amsterdam_house_df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1867573 -159054 21513 126639 3220591 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -134909.9 19145.1 -7.047 3.6e-12 *** ## Area 7917.5 172.1 45.994 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 296700 on 918 degrees of freedom ## (4 observations deleted due to missingness) ## Multiple R-squared: 0.6974, Adjusted R-squared: 0.697 ## F-statistic: 2115 on 1 and 918 DF, p-value: < 2.2e-16 ``` --- ## Basics of OLS The tidy method: {broom} ```r library(broom) augment(fit) ``` ``` ## # A tibble: 920 × 9 ## .rownames Price Area .fitted .resid .hat .sigma .cooksd .std.resid ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 685000 64 371811. 313189. 0.00142 296650. 0.000795 1.06 ## 2 2 475000 60 340141. 134859. 0.00151 296797. 0.000157 0.455 ## 3 3 850000 109 728100. 121900. 0.00115 296804. 0.0000971 0.411 ## 4 4 580000 128 878533. -298533. 0.00144 296667. 0.000731 -1.01 ## 5 5 720000 138 957708. -237708. 0.00169 296727. 0.000545 -0.802 ## 6 6 450000 53 284719. 165281. 0.00170 296781. 0.000264 0.558 ## 7 7 450000 87 553914. -103914. 0.00111 296811. 0.0000684 -0.350 ## 8 8 590000 80 498492. 91508. 0.00117 296816. 0.0000557 0.309 ## 9 9 399000 49 253049. 145951. 0.00182 296792. 0.000221 0.492 ## 10 10 300000 33 126368. 173632. 0.00241 296775. 0.000414 0.586 ## # ℹ 910 more rows ``` --- ## Basics of OLS The tidy method: {broom} ```r library(broom) glance(fit) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.697 0.697 2.97e5 2115. 1.71e-240 1 -12897. 25800. 25814. ## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int> ``` --- ## Basics of OLS The tidy method: {broom} ```r library(broom) tidy(fit) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -134910. 19145. -7.05 3.60e- 12 ## 2 Area 7918. 172. 46.0 1.71e-240 ``` .content-box-green[ The estimated coefficients have a standard error! ] -- And yes, they have a **confidence interval** as well. ``` ## # A tibble: 2 × 7 ## term estimate std.error statistic p.value conf.low conf.high ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -134910. 19145. -7.05 3.60e- 12 -172483. -97337. ## 2 Area 7918. 172. 46.0 1.71e-240 7580. 8255. ``` --- ## Basics of OLS ```r coef_df <- tidy(fit, conf.int = TRUE, conf.level = .95) ``` <img src="econometrics1_files/figure-html/unnamed-chunk-23-1.png" width="700px" height="450px" style="display: block; margin: auto;" /> --- <img src="econometrics1_files/figure-html/unnamed-chunk-24-1.png" width="700px" height="450px" style="display: block; margin: auto;" /> --- <img src="econometrics1_files/figure-html/unnamed-chunk-25-1.png" width="700px" height="450px" style="display: block; margin: auto;" /> --- ### What is the standard error of the coef? Let's create a function for the data-generating process (DGP), where the coefficients can be specified. ```r dgp <- function(b0 = 100, b1 = 20, n = 100) { tibble(x = rnorm(n, sd = 3)) %>% mutate(y = b0 + b1 * x + rnorm(n, sd = 25)) } ``` Let's generate several trajectories with it. ```r tibble(data = rerun(5, dgp())) ``` ``` ## # A tibble: 5 × 1 ## data ## <list> ## 1 <tibble [100 × 2]> ## 2 <tibble [100 × 2]> ## 3 <tibble [100 × 2]> ## 4 <tibble [100 × 2]> ## 5 <tibble [100 × 2]> ```
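As a quick preview before the full simulation on the next slide, we can fit the model on each of these trajectories and collect the tidied slopes — a minimal sketch, assuming the {tidyverse} and {broom} are loaded as elsewhere in these slides:

```r
tibble(data = rerun(5, dgp())) %>%
  mutate(
    fit    = map(data, ~ lm(y ~ x, data = .x)), # one model per trajectory
    tidied = map(fit, broom::tidy)
  ) %>%
  unnest(tidied) %>%
  filter(term == "x") %>%
  select(estimate, std.error) # five slope estimates scattered around the true b1 = 20
```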
%>% # 1,000 generated trajectories tibble(data = .) %>% mutate( # the model fit = map(data, function(xx) lm(formula = y ~ x, data = xx)), tidied = map(fit, broom::tidy), # coefficients estimate = map(tidied, pull, 2), # estimates of the coefs se = map(tidied, pull, 3), # SEs of the coefs b0 = map_dbl(estimate, 1), b1 = map_dbl(estimate, 2), se_b0 = map_dbl(se, 1), se_b1 = map_dbl(se, 2), ) %>% summarise( mean(se_b0), # mean of the estimated SEs sd(b0), # sd of the point estimates mean(se_b1), # mean of the estimated SEs sd(b1), # sd of the point estimates ) ``` ``` ## # A tibble: 1 × 4 ## `mean(se_b0)` `sd(b0)` `mean(se_b1)` `sd(b1)` ## <dbl> <dbl> <dbl> <dbl> ## 1 2.51 2.50 0.845 0.851 ``` --- #### What is the standard error of the coef? <img src="econometrics1_files/figure-html/unnamed-chunk-29-.gif" width="700px" height="450px" style="display: block; margin: auto;" /> The visualization above shows data points simulated with the previously created `dgp`. Due to the random error terms, the estimated intercept and slope differ slightly from trajectory to trajectory. The standard deviations of these estimated parameters are the standard errors. --- ## Basics of OLS In the table containing the coefficients, a test statistic and a p-value are listed next to each term. What hypothesis do they belong to? -- `$$H_0: \beta_j=0$$` -- How do we make a decision based on the p-value? -- .pull-left[ <img src="econometrics_files/meme_lowp.JPG" width="250px" height="250px" style="display: block; margin: auto;" /> ] .pull-right[ **If the p-value is lower than the significance level (alpha), then we reject the null hypothesis.** ] -- .blue[And what is the probability of committing a type I error?] -- Alpha. --- ### Type I error in regression Let's run the previously created dgp function, but set b1 to 0. How often is the H0 rejected? ```r tibble(data = rerun(1e3, dgp(b1 = 0))) %>% mutate( fit = map(data, function(xx) lm(formula = y ~ x, data = xx)), tidied = map(fit, broom::tidy), pvalues = map(tidied, pull), # pull() defaults to the last column: the p-values b1_p = map_dbl(pvalues, 2) ) %>% summarise(rate_type1 = sum(b1_p < .05) / n()) ``` ``` ## # A tibble: 1 × 1 ## rate_type1 ## <dbl> ## 1 0.053 ``` --- class: inverse, middle, center # Multivariate regression --- ## Multivariate regression - Let's say that Y depends on more than one variable -- - For instance, the price can be determined based on the area and the number of rooms `$$Y_i = \beta_0 + \beta_1 \times X_{i, 1} + \beta_2 \times X_{i, 2} + \epsilon_i,$$` where `$$X_{i,2} = \text{The number of rooms}$$` - Sure, we could try to "find" the optimal coefficients as in the bivariate case, but the set of possible combinations (already infinite) has gained another dimension... - The mathematical background of solving this minimization problem requires linear algebra --- ## Visualisation
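With two regressors, the fitted model is a plane over the Area–Room space rather than a line. A minimal sketch of one possible 3D view — assuming the {plotly} package, which these slides do not otherwise use:

```r
library(plotly)

# hypothetical illustration: 3D scatter of the two regressors against price
plot_ly(
  amsterdam_house_df,
  x = ~Area, y = ~Room, z = ~Price,
  type = "scatter3d", mode = "markers",
  marker = list(size = 3)
)
```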
--- ## Multivariate regression .content-box-red[This is not covered by the course!] `$$\mathbf{y}=\left[\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right], \quad \mathbf{X}=\left[\begin{array}{ccccc} 1 & x_{11} & x_{21} & \ldots & x_{k 1} \\ 1 & x_{12} & x_{22} & \ldots & x_{k 2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1 n} & x_{2 n} & \ldots & x_{k n} \end{array}\right], \quad \boldsymbol{\beta}=\left[\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{array}\right], \:\:\:\; \boldsymbol{\varepsilon}=\left[\begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{array}\right]$$` `$$\mathbf{e}^{\mathrm{T}} \mathbf{e}=(\mathbf{y}-\mathbf{X} \hat{\boldsymbol{\beta}})^{\mathrm{T}}(\mathbf{y}-\mathbf{X} \hat{\boldsymbol{\beta}})$$` `$$\hat{\boldsymbol{\beta}}=\left(\mathbf{X}^{\mathrm{T}} \mathbf{X}\right)^{-1} \mathbf{X}^{\mathrm{T}} \mathbf{y}$$` --- ## Multivariate regression .pull-left[ Fortunately, our work did not become more difficult in R. ] .pull-right[ <img src="econometrics_files/meme_byhand.JPG" width="200px" height="200px" style="display: block; margin: auto;" /> ] ```r fit <- lm(formula = Price ~ Area + Room, data = amsterdam_house_df) fit ``` ``` ## ## Call: ## lm(formula = Price ~ Area + Room, data = amsterdam_house_df) ## ## Coefficients: ## (Intercept) Area Room ## -62040 9057 -51009 ``` --- ## Multivariate regression ``` ## ## Call: ## lm(formula = Price ~ Area + Room, data = amsterdam_house_df) ## ## Coefficients: ## (Intercept) Area Room ## -62040 9057 -51009 ``` .blue[Can this result be interpreted as saying that the more rooms a house has, the cheaper it is?] -- Of course not. <img src="econometrics_files/meme_opposite.JPG" width="315px" height="250px" style="display: block; margin: auto;" /> --- ## Multivariate regression #### Multicollinearity <img src="econometrics1_files/figure-html/unnamed-chunk-37-1.png" width="700px" height="450px" style="display: block; margin: auto;" /> --- ```r form <- c("Price ~ Area", "Price ~ Area + Room", "Price ~ Room", "Area ~ Room", "Room ~ Area" ) ``` ```r tibble(form) %>% mutate( fit = map(form, lm, data = drop_na( amsterdam_house_df, Price, Area, Room) ), tidied = map(fit, broom::tidy) ) %>% unnest(tidied) %>% select(form, term, estimate) %>% pivot_wider(names_from = term, values_from = estimate) ``` ``` ## # A tibble: 5 × 4 ## form `(Intercept)` Area Room ## <chr> <dbl> <dbl> <dbl> ## 1 Price ~ Area -134910. 7918. NA ## 2 Price ~ Area + Room -62040. 9057. -51009. ## 3 Price ~ Room -140283. NA 213895. ## 4 Area ~ Room -8.64 NA 29.2 ## 5 Room ~ Area 1.43 0.0223 NA ```
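This table contains everything needed for the decomposition on the next slide: the Room coefficient from the simple regression (the total effect) splits into a direct and an indirect part. A quick sanity check with the rounded coefficients above (the small gap versus the total effect is purely rounding):

```r
direct   <- -51009      # Room coefficient in Price ~ Area + Room
indirect <- 29.2 * 9057 # (Area ~ Room slope) * (Area coefficient in the full model)
direct + indirect       # ~ 213455, close to the 213895 total effect in Price ~ Room
```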
--- .pull-left[ ] .pull-right[ `$$\text{Total effect} = \text{direct effect} + \text{indirect effect}$$` `$$213895 \approx 29.2 \times 9057 - 51009$$` ] #### Path analysis 1. .blue[Total effect: ]If we want to buy a house with one additional room, it is expected to cost 213,895 euros more. 2. .blue[Indirect effect: ]A house with 1 additional room would be expected to be 29.2 square meters larger, and the average price per square meter is 9,057 euros. 3. .blue[Direct effect: ] If we compare houses with exactly the same area, then the one with one extra room would cost 51,009 euros less. --- ## Type I error in multivariate model Let's create a function for a new DGP with two regressors. ```r dgp_multi <- function(b0 = 100, b1 = 0, b2 = 0, n = 100) { tibble(x1 = rnorm(n, sd = .8), x2 = rnorm(n, sd = 5)) %>% mutate(y = b0 + b1 * x1 + b2 * x2 + rnorm(n)) } dgp_multi() ``` ``` ## # A tibble: 100 × 3 ## x1 x2 y ## <dbl> <dbl> <dbl> ## 1 -0.997 0.0555 99.8 ## 2 -0.0509 4.94 98.0 ## 3 0.813 0.892 100. ## 4 0.311 4.24 100. ## 5 -0.716 4.10 101. ## 6 -0.0961 4.99 100. ## 7 -1.73 3.38 101. ## 8 0.122 -2.19 102. ## 9 0.540 -17.6 99.4 ## 10 0.335 -0.272 98.4 ## # ℹ 90 more rows ``` --- ## Type I error in multivariate model ```r tibble(data = rerun(1e3, dgp_multi())) %>% mutate( fit = map(data, function(xx) lm(y ~ x1 + x2, data = xx)), tidied = map(fit, broom::tidy), pvalues = map(tidied, pull), # again, the last column: the p-values p_b1 = map_dbl(pvalues, 2), p_b2 = map_dbl(pvalues, 3), error_committed = p_b1 < .05 | p_b2 < .05 ) %>% summarise(sum(error_committed) / n()) ``` ``` ## # A tibble: 1 × 1 ## `sum(error_committed)/n()` ## <dbl> ## 1 0.096 ``` -- We run two tests in each model! In each, the probability of committing a type I error is 5%. -- So the probability of committing at least one type I error is: `$$1 - 0.95 \times 0.95 = 0.0975$$` --- ## Type I error in multivariate model As more and more variables are included in the model, the probability of making a type I error increases. -- With many variables, even if the outcome variable is related to none of them, we are very likely to find some significant parameter. .blue[Solution:] F-test --- ## F-test .pull-left[ `$$F=\frac{(\mathrm{RRSS}-\mathrm{URSS}) / r}{\mathrm{URSS} /(n-k-1)}$$` where URSS = unrestricted residual sum of squares RRSS = restricted residual sum of squares obtained by imposing the restrictions of the hypothesis r = number of restrictions imposed by the hypothesis ] .pull-right[ <img src="econometrics_files/meme_test.JPG" width="240px" height="300px" style="display: block; margin: auto;" /> ] Testing whether the explanatory variables jointly explain the variance significantly: `$$\beta_1=\beta_2=\cdots=\beta_k=0$$` --- ## F-test .pull-left[ - The F distribution takes only positive values - Always a "greater" alternative - Usually only the p-value is reported ] .pull-right[ <img src="econometrics1_files/figure-html/unnamed-chunk-44-1.png" width="300px" height="175px" style="display: block; margin: auto;" /> ] ```r broom::glance(fit) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.705 0.704 2.93e5 1096. 7.71e-244 2 -12885. 25778. 25797. ## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int> ``` --- class: center, middle # Thank you for your attention! Slides are available at [www.marcellgranat.com](https://www.marcellgranat.com)