Big Data and Data Visualisation
"The application of statistical and mathematical methods to the analysis of economic data, with the purpose of giving empirical content to economic theories and verifying them or refuting them."
"The application of statistical and mathematical methods to the analysis of economic data, with the purpose of giving empirical content to economic theories and verifying them or refuting them."
"The application of statistical and mathematical methods to the analysis of economic data, with the purpose of giving empirical content to economic theories and verifying them or refuting them."
A prominent tool: regression
library(tidyverse)  # read_rds(), the dplyr/purrr verbs and ggplot2 used throughout
library(magrittr)   # the %$% exposition pipe used below

amsterdam_house_df <- read_rds("https://raw.githubusercontent.com/MarcellGranat/big_data2022/main/econometrics_files/amsterdam_house.rds")
# source: https://www.kaggle.com/datasets/thomasnibb/amsterdam-house-price-prediction
## # A tibble: 924 × 7
##    Address                                 Zip      Price  Area  Room   Lon   Lat
##    <chr>                                   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Blasiusstraat 8 2, Amsterdam            1091 CR 685000    64     3  4.91  52.4
##  2 Kromme Leimuidenstraat 13 H, Amsterdam  1059 EL 475000    60     3  4.85  52.3
##  3 Zaaiersweg 11 A, Amsterdam              1097 SM 850000   109     4  4.94  52.3
##  4 Tenerifestraat 40, Amsterdam            1060 TH 580000   128     6  4.79  52.3
##  5 Winterjanpad 21, Amsterdam              1036 KN 720000   138     5  4.90  52.4
##  6 De Wittenkade 134 I, Amsterdam          1051 AM 450000    53     2  4.88  52.4
##  7 Pruimenstraat 18 B, Amsterdam           1033 KM 450000    87     3  4.90  52.4
##  8 Da Costakade 32 II, Amsterdam           1053 WL 590000    80     2  4.87  52.4
##  9 Postjeskade 41 2, Amsterdam             1058 DG 399000    49     3  4.85  52.4
## 10 Van Ostadestraat 193 H, Amsterdam       1073 TM 300000    33     2  4.90  52.4
## # ℹ 914 more rows
What price do we expect, knowing the area of a given house?
Approach: Let's fit a straight line.
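Before estimating anything, it helps to look at the relationship. A minimal ggplot2 sketch, assuming the data frame loaded above (all styling choices are arbitrary):

amsterdam_house_df %>% 
  ggplot(aes(x = Area, y = Price)) +
  geom_point(alpha = .3) +                 # one point per house
  geom_smooth(method = "lm", se = FALSE) + # the straight line we are about to derive
  labs(x = "Area (m2)", y = "Price (EUR)")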
What parameters do we have to know to draw the line?

Y_i = \beta_0 + \beta_1 \times X_i,

where

Y_i = \text{The price of a given house}
\beta_0 = \text{The intercept of the line}
\beta_1 = \text{The slope of the line}
X_i = \text{The area of the given house}

Since the points do not fall exactly on any straight line, each observation also carries an error term:

Y_i = \beta_0 + \beta_1 \times X_i + \epsilon_i,

where

\epsilon_i = \text{The error term}

Minimize the error?

The result would be a line above the points...

Minimize the squared error?

Exactly, that is what OLS is about. OLS = Ordinary Least Squares
b0 <- 1000
b1 <- 200
amsterdam_house_df %>% 
  transmute(
    Price, Area, 
    fit = b0 + Area * b1, 
    e = Price - fit, 
    e2 = e^2
  ) %$% 
  sum(e2, na.rm = TRUE)
## [1] 5.91045e+14
sse <- function(b0 = 1000, b1 = 200) {
  amsterdam_house_df %>% 
    transmute(
      Price, Area, 
      fit = b0 + Area * b1, 
      e = Price - fit, 
      e2 = e^2
    ) %$% 
    sum(e2, na.rm = TRUE)
}
sse(1000, 200)
## [1] 5.91045e+14
sse_df <- crossing(
  b0 = seq(from = 0, to = 1e4, length.out = 100), 
  b1 = seq(from = 0, to = 1e4, length.out = 100)
) %>% 
  mutate(
    sse = map2_dbl(b0, b1, sse)
  ) %>% 
  arrange(sse)
sse_df
## # A tibble: 10,000 × 3
##       b0    b1     sse
##    <dbl> <dbl>   <dbl>
##  1    0  6869. 8.52e13
##  2  101. 6869. 8.52e13
##  3  202. 6869. 8.52e13
##  4  303. 6869. 8.52e13
##  5  404. 6869. 8.52e13
##  6  505. 6869. 8.52e13
##  7  606. 6869. 8.52e13
##  8  707. 6869. 8.52e13
##  9  808. 6869. 8.52e13
## 10  909. 6869. 8.52e13
## # ℹ 9,990 more rows
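To see what the grid search actually explored, the SSE surface itself can be plotted. A minimal sketch built from sse_df above (the log10 fill scale is only a readability choice):

sse_df %>% 
  ggplot(aes(x = b0, y = b1, fill = sse)) +
  geom_tile() +                            # one tile per (b0, b1) combination
  scale_fill_viridis_c(trans = "log10") +  # SSE spans many orders of magnitude
  labs(fill = "Sum of squared errors")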
Now we have run the function on 10,000 combinations, but we are still not sure whether our solution is the best possible, or how far it is from that...
Fortunately, there is a much more efficient way to determine the line that produces the smallest sum of squared errors.
The normal equations:
\sum Y_i = n \times \beta_0 + \beta_1 \sum X_i

\sum X_i Y_i = \beta_0 \sum X_i + \beta_1 \sum X_i^2
In our case:
amsterdam_house_df %>% 
  drop_na(Price, Area) %>% 
  summarise(
    n = n(), 
    y = sum(Price), 
    x = sum(Area), 
    xz = sum(Price * Area), 
    x2 = sum(Area^2)
  )
## # A tibble: 1 × 5
##       n         y     x          xz       x2
##   <int>     <dbl> <dbl>       <dbl>    <dbl>
## 1   920 572300186 87959 78232126563 11379655
572,300,186 = 920 \times \beta_0 + \beta_1 \times 87,959

78,232,126,563 = \beta_0 \times 87,959 + \beta_1 \times 11,379,655
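These are two linear equations in the two unknown coefficients, so they can be solved directly. A minimal sketch with base R's solve(), plugging in the sums from the table above:

A <- matrix(c(920,      87959,
              87959, 11379655), nrow = 2, byrow = TRUE)  # coefficients of beta0 and beta1
b <- c(572300186, 78232126563)                           # right-hand sides
solve(A, b)  # approximately -134910 and 7918: the intercept and the slope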
Of course, there is a simpler solution for fitting a linear model.

lm(formula = Price ~ Area, data = amsterdam_house_df)

## 
## Call:
## lm(formula = Price ~ Area, data = amsterdam_house_df)
## 
## Coefficients:
## (Intercept)         Area  
##     -134910         7918
How to interpret the results?
A house with one additional m^2 of area is expected to cost €7,918 more.

If the area were zero, the predicted price would be −€134,910 (both parts of that statement are impossible; this is just a meaningless extrapolation of the line).
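Plugging a concrete area into the fitted line gives the expected price. A minimal sketch (the 100 m^2 value is only an illustration):

fit <- lm(Price ~ Area, data = amsterdam_house_df)
predict(fit, newdata = tibble(Area = 100))
# by hand, with the rounded coefficients: -134910 + 7918 * 100 = 656890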
Now we see the estimated coefficients (beta values), but what else can we extract?
The old-school (and less convenient) method
fit <- lm(formula = Price ~ Area, data = amsterdam_house_df)
summary(fit)
## 
## Call:
## lm(formula = Price ~ Area, data = amsterdam_house_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1867573  -159054    21513   126639  3220591 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -134909.9    19145.1  -7.047  3.6e-12 ***
## Area           7917.5      172.1  45.994  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 296700 on 918 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.6974, Adjusted R-squared:  0.697 
## F-statistic:  2115 on 1 and 918 DF,  p-value: < 2.2e-16
The tidy method: {broom}
library(broom)
augment(fit)
## # A tibble: 920 × 9
##    .rownames  Price  Area .fitted   .resid    .hat  .sigma   .cooksd .std.resid
##    <chr>      <dbl> <dbl>   <dbl>    <dbl>   <dbl>   <dbl>     <dbl>      <dbl>
##  1 1         685000    64 371811.  313189. 0.00142 296650. 0.000795       1.06 
##  2 2         475000    60 340141.  134859. 0.00151 296797. 0.000157       0.455
##  3 3         850000   109 728100.  121900. 0.00115 296804. 0.0000971      0.411
##  4 4         580000   128 878533. -298533. 0.00144 296667. 0.000731      -1.01 
##  5 5         720000   138 957708. -237708. 0.00169 296727. 0.000545      -0.802
##  6 6         450000    53 284719.  165281. 0.00170 296781. 0.000264       0.558
##  7 7         450000    87 553914. -103914. 0.00111 296811. 0.0000684     -0.350
##  8 8         590000    80 498492.   91508. 0.00117 296816. 0.0000557      0.309
##  9 9         399000    49 253049.  145951. 0.00182 296792. 0.000221       0.492
## 10 10        300000    33 126368.  173632. 0.00241 296775. 0.000414       0.586
## # ℹ 910 more rows
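The per-observation columns returned by augment() are convenient for quick diagnostics. A minimal sketch of a residuals-versus-fitted plot (styling is arbitrary):

augment(fit) %>% 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(alpha = .3) +
  geom_hline(yintercept = 0, linetype = "dashed") + # residuals should scatter around zero
  labs(x = "Fitted price", y = "Residual")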
The tidy method: {broom}
library(broom)
glance(fit)
## # A tibble: 1 × 12
##   r.squared adj.r.squared  sigma statistic   p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.697         0.697 2.97e5     2115. 1.71e-240     1 -12897. 25800. 25814.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The tidy method: {broom}
library(broom)
tidy(fit)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -134910.    19145.     -7.05 3.60e- 12
## 2 Area           7918.      172.     46.0  1.71e-240
The estimated coefficients have a standard error!
And yes, they have a confidence interval as well.
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept) -134910.    19145.     -7.05 3.60e- 12 -172483.   -97337.
## 2 Area           7918.      172.     46.0  1.71e-240    7580.     8255.
coef_df <- tidy(fit, conf.int = TRUE, conf.level = .95)
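The conf.low and conf.high columns make it straightforward to draw the point estimates together with their intervals. A minimal sketch with geom_pointrange() (purely illustrative):

coef_df %>% 
  ggplot(aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() + # point estimate with its 95% confidence interval
  coord_flip()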
Let's create a function for the data-generating process (DGP), where coefficients can be specified.
dgp <- function(b0 = 100, b1 = 20, n = 100) {
  tibble(x = rnorm(n, sd = 3)) %>% 
    mutate(y = b0 + b1 * x + rnorm(n, sd = 25))
}
Let's generate several trajectories with that.
tibble(data = rerun(5, dgp()))
## # A tibble: 5 × 1
##   data              
##   <list>            
## 1 <tibble [100 × 2]>
## 2 <tibble [100 × 2]>
## 3 <tibble [100 × 2]>
## 4 <tibble [100 × 2]>
## 5 <tibble [100 × 2]>
rerun(1e3, dgp()) %>% # 1000 generated trajectories
  tibble(data = .) %>% 
  mutate(
    # the model
    fit = map(data, function(xx) lm(formula = y ~ x, data = xx)), 
    tidied = map(fit, broom::tidy), 
    # coefficients
    estimate = map(tidied, pull, 2), # point estimates of the coefficients
    se = map(tidied, pull, 3),       # SEs of the coefficients
    b0 = map_dbl(estimate, 1), 
    b1 = map_dbl(estimate, 2), 
    se_b0 = map_dbl(se, 1), 
    se_b1 = map_dbl(se, 2), 
  ) %>% 
  summarise(
    mean(se_b0), # mean of the estimated SEs
    sd(b0),      # sd of the point estimates
    mean(se_b1), # mean of the estimated SEs
    sd(b1),      # sd of the point estimates
  )
## # A tibble: 1 × 4
##   `mean(se_b0)` `sd(b0)` `mean(se_b1)` `sd(b1)`
##           <dbl>    <dbl>         <dbl>    <dbl>
## 1          2.51     2.50         0.845    0.851
A visualization of the simulated data points from the previously created dgp function (a sketch follows below) makes this concrete: due to the random error terms, the intercept and the slope are slightly different for each trajectory. The standard deviations of the estimated parameters across the trajectories are the standard errors.
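A minimal sketch of such a figure, assuming the dgp() function defined above (five trajectories is an arbitrary choice):

rerun(5, dgp()) %>% 
  bind_rows(.id = "trajectory") %>% 
  ggplot(aes(x = x, y = y, color = trajectory)) +
  geom_point(alpha = .4) +
  geom_smooth(method = "lm", se = FALSE) # a slightly different fitted line per trajectory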
In the table containing the coefficients, a test statistic and a p-value are listed next to each term. What hypothesis do they belong to?

H_0: \beta_j = 0

How do we make a decision based on the p-value?

If the p-value is lower than the significance level (alpha), then we reject the null hypothesis.

And what is the probability of committing a type I error?

Alpha.
Let's run the previously created dgp function, but set b1 to 0. How often is the H0 rejected?
tibble(data = rerun(1e3, dgp(b1 = 0))) %>% 
  mutate(
    fit = map(data, function(xx) lm(formula = y ~ x, data = xx)), 
    tidied = map(fit, broom::tidy), 
    pvalues = map(tidied, pull), # pull() without a column returns the last column: the p-values
    b1_p = map_dbl(pvalues, 2) 
  ) %>% 
  summarise(rate_type1 = sum(b1_p < .05) / n())
## # A tibble: 1 × 1
##   rate_type1
##        <dbl>
## 1      0.053
Let's say that Y depends on more than one variable.
For instance, price can be determined based on the area size and the number of rooms
Y_i = \beta_0 + \beta_1 \times X_{i,1} + \beta_2 \times X_{i,2} + \epsilon_i,

where

X_{i,2} = \text{The number of rooms}
Sure we could try to "find" the optimal coefficients as presented at the bivariate case, but the number of possibilities (infinity) increased a lot...
The mathematical background of solving the minimization problem requires knowledge of linear algebra.
This is not covered by the course!
\mathbf{y}=\left[\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array}\right], \quad \mathbf{X}=\left[\begin{array}{ccccc} 1 & x_{11} & x_{21} & \ldots & x_{k 1} \\ 1 & x_{12} & x_{22} & \ldots & x_{k 2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1 n} & x_{2 n} & \ldots & x_{k n} \end{array}\right], \quad \boldsymbol{\beta}=\left[\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{array}\right], \:\:\:\; \boldsymbol{\varepsilon}=\left[\begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{array}\right]
\mathbf{e}^{\mathrm{T}} \mathbf{e}=(\mathbf{y}-\mathbf{X} \hat{\boldsymbol{\beta}})^{\mathrm{T}}(\mathbf{y}-\mathbf{X} \hat{\boldsymbol{\beta}})
\hat{\boldsymbol{\beta}}=\left(\mathbf{X}^{\mathrm{T}} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}
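For reference, the closed-form solution can also be computed by hand in R. A minimal sketch for the two-regressor model, where model.matrix() builds X with the leading column of ones:

d <- drop_na(amsterdam_house_df, Price, Area, Room)
X <- model.matrix(~ Area + Room, data = d) # n x 3: a column of ones, Area and Room
y <- d$Price
solve(t(X) %*% X) %*% t(X) %*% y           # same coefficients as lm(Price ~ Area + Room, ...)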
Fortunately, our work did not become more difficult in R.
fit <- lm(formula = Price ~ Area + Room, data = amsterdam_house_df)
fit
## 
## Call:
## lm(formula = Price ~ Area + Room, data = amsterdam_house_df)
## 
## Coefficients:
## (Intercept)         Area         Room  
##      -62040         9057       -51009

Can this result be interpreted as: if a house has more rooms, it is cheaper?

Of course not.
form <- c("Price ~ Area", "Price ~ Area + Room", "Price ~ Room", "Area ~ Room", "Room ~ Area")
tibble(form) %>% 
  mutate(
    fit = map(form, lm, data = drop_na(amsterdam_house_df, Price, Area, Room)), 
    tidied = map(fit, broom::tidy)
  ) %>% 
  unnest(tidied) %>% 
  select(form, term, estimate) %>% 
  pivot_wider(names_from = term, values_from = estimate)
## # A tibble: 5 × 4
##   form                `(Intercept)`      Area     Room
##   <chr>                       <dbl>     <dbl>    <dbl>
## 1 Price ~ Area            -134910.  7918.         NA  
## 2 Price ~ Area + Room      -62040.  9057.     -51009. 
## 3 Price ~ Room            -140283.    NA      213895. 
## 4 Area ~ Room                -8.64    NA          29.2
## 5 Room ~ Area                 1.43     0.0223     NA
\text{Total effect} = \text{direct effect} + \text{indirect effect}

213,895 \approx 29.2 \times 9,057 - 51,009
Total effect: If we want to buy a house with one additional room, it is expected to cost 213,895 euros more.
Indirect effect: A house with 1 additional room would be expected to be 29.2 square meters larger, and the average price per square meter is 9,057 euros.
Direct effect: If we compare houses with exactly the same area, then the one with one extra room is expected to cost 51,009 euros less.
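The decomposition can be checked numerically with the rounded values from the table above (so the identity only holds approximately):

direct   <- -51009       # Room coefficient, holding Area fixed
indirect <- 29.2 * 9057  # extra m2 per room, times price per m2
direct + indirect        # about 213455, close to the 213895 total effect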
Let's create a function for a new DGP with two regressors.
dgp_multi <- function(b0 = 100, b1 = 0, b2 = 0, n = 100) {
  tibble(x1 = rnorm(n, sd = .8), x2 = rnorm(n, sd = 5)) %>% 
    mutate(y = b0 + b1 * x1 + b2 * x2 + rnorm(n)) 
}
dgp_multi()
## # A tibble: 100 × 3
##         x1      x2     y
##      <dbl>   <dbl> <dbl>
##  1 -0.997   0.0555  99.8
##  2 -0.0509  4.94    98.0
##  3  0.813   0.892  100. 
##  4  0.311   4.24   100. 
##  5 -0.716   4.10   101. 
##  6 -0.0961  4.99   100. 
##  7 -1.73    3.38   101. 
##  8  0.122  -2.19   102. 
##  9  0.540  -17.6    99.4
## 10  0.335  -0.272   98.4
## # ℹ 90 more rows
tibble(data = rerun(1e3, dgp_multi())) %>% 
  mutate(
    fit = map(data, function(xx) lm(y ~ x1 + x2, data = xx)), 
    tidied = map(fit, broom::tidy), 
    pvalues = map(tidied, pull), 
    p_b1 = map_dbl(pvalues, 2), 
    p_b2 = map_dbl(pvalues, 3), 
    error_commited = p_b1 < .05 | p_b2 < .05 
  ) %>% 
  summarise(sum(error_commited) / n())

## # A tibble: 1 × 1
##   `sum(error_commited)/n()`
##                       <dbl>
## 1                     0.096

We run two tests in each model!

In each test, the probability of committing a type I error is 5%.

The joint probability of committing at least one type I error is:

1 - 0.95 \times 0.95 = 0.0975
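The same calculation extends to any number of tests (assuming they are independent). A quick sketch:

tibble(tests = c(1, 2, 5, 10, 20)) %>% 
  mutate(p_at_least_one_type1 = 1 - 0.95^tests)
# with 10 independent tests the chance of at least one false positive is already about 40%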
As more and more variables are included in the model, the probability of making a type I error increases.
In the case of many variables, even if the outcome variable is not related to any of them, we are still likely to find a significant parameter.
Solution: F-test
F=\frac{(\mathrm{RRSS}-\mathrm{URSS}) / r}{\mathrm{URSS} /(n-k-1)}
where
URSS = unrestricted residual sum of squares
RRSS = restricted residual sum of squares obtained by imposing the restrictions of the hypothesis

r = number of restrictions imposed by the hypothesis
Testing whether the explanatory variables jointly explain the variance significantly:
\beta_1=\beta_2=\cdots=\beta_k=0
The F distribution takes only positive values
Always "greater" alternative
Mainly report only the p-value
broom::glance(fit)
## # A tibble: 1 × 12
##   r.squared adj.r.squared  sigma statistic   p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.705         0.704 2.93e5     1096. 7.71e-244     2 -12885. 25778. 25797.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
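The same F-test can be reproduced explicitly by comparing the restricted (intercept-only) model with the unrestricted one. A minimal sketch with anova(); both models must be fitted on the same complete cases:

d <- drop_na(amsterdam_house_df, Price, Area, Room)
restricted   <- lm(Price ~ 1, data = d)           # all slopes restricted to zero
unrestricted <- lm(Price ~ Area + Room, data = d)
anova(restricted, unrestricted)                   # F statistic and p-value match glance(fit)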