Statistics in R

Readings and class materials for Tuesday, September 12, 2023

Objectives of Statistical Analysis in Economics

Statistical analysis serves as an information-compressing mechanism: it reduces many observations to a few interpretable quantities. Its primary objectives can be broadly categorized into:

Predictive Tasks

These tasks aim to forecast future outcomes based on historical data and current conditions.

Descriptive Tasks

These tasks focus on summarizing the main aspects of the data to provide an informative overview.

Exploratory Analysis

This involves identifying patterns, relationships, or anomalies in the data without a prior hypothesis.

Confirmatory Analysis

This involves testing predefined hypotheses to confirm or refute them.
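
As a rough illustration of the four objectives, the sketch below uses the mtcars data set that ships with R. It is not part of the class materials, and the choice of variables (mpg, wt, hp, am) is an assumption made only for this example.

```r
# Illustrative sketch only: mtcars ships with base R and is not the course data.
cars <- mtcars

# Descriptive: compress the fuel-economy column into a few numbers.
mpg_summary <- summary(cars$mpg)

# Exploratory: look for relationships without a prior hypothesis.
cor_matrix <- cor(cars[, c("mpg", "wt", "hp")])

# Confirmatory: test a hypothesis stated in advance
# (mean mpg differs between automatic and manual cars).
mpg_test <- t.test(mpg ~ am, data = cars)

# Predictive: fit on the observed data, then forecast for an unseen car.
fit <- lm(mpg ~ wt, data = cars)
predicted_mpg <- predict(fit, newdata = data.frame(wt = 3))

print(mpg_summary)
print(round(cor_matrix, 2))
print(mpg_test$p.value)
print(predicted_mpg)
```

The same data set serves all four tasks; what changes is the question asked and whether a hypothesis was fixed before looking at the data.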

Types of Data

Classification Based on Structure

Unstructured Data

Data that does not have a predefined format or organization.

Structured Data

Data that is organized in a specific manner, often in tabular form.

Types of Structured Data
  • Cross-sectional

  • Time-series

  • Longitudinal

  • Spatial

  • Network

The type of data dictates the statistical tools and techniques that can be employed for analysis.
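
To make the distinction concrete, here is a minimal sketch of how three of these structures might be stored in base R. All names and values (firm, revenue, gdp_index) are invented for illustration; spatial and network data usually need dedicated packages and are left out.

```r
# Hypothetical toy data; all names and values are made up for illustration.

# Cross-sectional: many units observed at one point in time.
cross_section <- data.frame(
  firm    = c("A", "B", "C"),
  revenue = c(120, 95, 210)
)

# Time series: one unit observed at regular intervals (quarterly here).
gdp_index <- ts(c(100, 101.5, 102.1, 103.4), start = c(2022, 1), frequency = 4)

# Longitudinal (panel): the same units followed over several periods,
# stored in long format with one row per unit-period.
panel <- data.frame(
  firm    = rep(c("A", "B"), each = 2),
  year    = rep(c(2021, 2022), times = 2),
  revenue = c(110, 120, 90, 95)
)

# The structure determines which tools apply: period-to-period differences
# only make sense when observations are ordered in time.
gdp_change <- diff(gdp_index)

# Within-unit change in the panel, computed separately for each firm.
panel$revenue_change <- ave(panel$revenue, panel$firm,
                            FUN = function(x) c(NA, diff(x)))

print(gdp_change)
print(panel)
```

Note that differencing the cross-section would be meaningless, because its rows have no time order.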

Attribute Types

Different types of attributes require different statistical techniques for effective analysis.

| Attribute type | Description | Examples | Operations |
|---|---|---|---|
| Nominal | Nominal values provide only enough information to distinguish one object from another. (=, ≠) | Zip codes, employee ID numbers, eye color, gender | Mode, entropy, contingency correlation, chi-squared test |
| Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | Hardness of minerals, {good, better, best}, grades, street numbers | Quantiles, rank correlation |
| Interval | Differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | Calendar dates, temperature in Celsius or Fahrenheit | Mean, standard deviation, Pearson's correlation |
| Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | Temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | Geometric mean, harmonic mean, percent variation |
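
The sketch below shows one way the four attribute types could be represented in R, with one suitable operation for each. The vectors are hypothetical toy values, and the mapping to R types (factor, ordered factor, numeric) is a common convention rather than the only option.

```r
# Hypothetical toy values, invented for illustration.
eye_color <- factor(c("brown", "blue", "brown", "green", "brown"))  # nominal
rating <- factor(c("good", "best", "better", "good", "better"),
                 levels = c("good", "better", "best"),
                 ordered = TRUE)                                    # ordinal
temp_c <- c(18.5, 21.0, 19.5, 23.0, 20.0)                           # interval
income <- c(2000, 2500, 4000, 3200, 8000)                           # ratio

# Nominal: only equality is meaningful, so report the mode.
mode_eye <- names(which.max(table(eye_color)))

# Ordinal: order is meaningful, so quantiles work. type = 1 returns an
# observed level instead of interpolating between levels.
median_rating <- quantile(rating, probs = 0.5, type = 1)

# Interval: differences are meaningful, so mean and standard deviation apply.
mean_temp <- mean(temp_c)
sd_temp <- sd(temp_c)

# Ratio: ratios are meaningful, so the geometric mean is well defined.
geo_mean_income <- exp(mean(log(income)))

print(mode_eye)
print(median_rating)
print(c(mean = mean_temp, sd = sd_temp))
print(geo_mean_income)
```

Each level inherits the operations of the levels before it: a mode can be computed for income, but a geometric mean of eye colors cannot.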

Quality of Data

Common Issues

  • Missing data

  • Outliers

  • Duplication

  • Inconsistent data
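
Each of these issues can be checked with a few lines of base R. The sketch below uses a hypothetical data frame (staff) with deliberately planted problems. The 1.5 × IQR rule for outliers is a common rule of thumb, assumed here for illustration only.

```r
# Hypothetical toy data with deliberately planted problems.
staff <- data.frame(
  id     = c(1, 2, 2, 3, 4, 5),
  gender = c("Female", "male", "male", "Male", "F", "Female"),
  income = c(3200, 2900, 2900, NA, 3100, 98000)
)

# Missing data: count NA values per column.
n_missing <- colSums(is.na(staff))

# Duplication: flag rows that repeat an earlier row exactly.
is_duplicate <- duplicated(staff)

# Outliers: flag values more than 1.5 * IQR away from the quartiles.
q <- quantile(staff$income, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
fence <- 1.5 * (q[2] - q[1])
is_outlier <- !is.na(staff$income) &
  (staff$income < q[1] - fence | staff$income > q[2] + fence)

# Inconsistent data: the same category written in several ways.
gender_labels <- table(staff$gender)

print(n_missing)
print(which(is_duplicate))
print(staff$income[is_outlier])
print(gender_labels)
```

Whether a flagged value is an error or a genuine extreme observation is a judgment call that the code cannot make.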

Some examples

Let us examine a sample dataset from the modeldata R package.

# load the data
data(attrition, package = "modeldata")

# print it as a tibble so that only the first rows and columns are shown
tibble::as_tibble(attrition)
# A tibble: 1,470 × 31
     Age Attrition BusinessTravel    DailyRate Department       DistanceFromHome
   <int> <fct>     <fct>                 <int> <fct>                       <int>
 1    41 Yes       Travel_Rarely          1102 Sales                           1
 2    49 No        Travel_Frequently       279 Research_Develo…                8
 3    37 Yes       Travel_Rarely          1373 Research_Develo…                2
 4    33 No        Travel_Frequently      1392 Research_Develo…                3
 5    27 No        Travel_Rarely           591 Research_Develo…                2
 6    32 No        Travel_Frequently      1005 Research_Develo…                2
 7    59 No        Travel_Rarely          1324 Research_Develo…                3
 8    30 No        Travel_Rarely          1358 Research_Develo…               24
 9    38 No        Travel_Frequently       216 Research_Develo…               23
10    36 No        Travel_Rarely          1299 Research_Develo…               27
# ℹ 1,460 more rows
# ℹ 25 more variables: Education <ord>, EducationField <fct>,
#   EnvironmentSatisfaction <ord>, Gender <fct>, HourlyRate <int>,
#   JobInvolvement <ord>, JobLevel <int>, JobRole <fct>, JobSatisfaction <ord>,
#   MaritalStatus <fct>, MonthlyIncome <int>, MonthlyRate <int>,
#   NumCompaniesWorked <int>, OverTime <fct>, PercentSalaryHike <int>,
#   PerformanceRating <ord>, RelationshipSatisfaction <ord>, …

After loading the data we want to work with, it is worthwhile to inspect it and compute descriptive statistics. A summary function exists in almost every language, but in R I suggest using a dedicated package for this purpose: skimr (although there are countless others available).

attrition |> 
  skimr::skim()
Data summary
Name attrition
Number of rows 1470
Number of columns 31
_______________________
Column type frequency:
factor 15
numeric 16
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 1233, Yes: 237
BusinessTravel 0 1 FALSE 3 Tra: 1043, Tra: 277, Non: 150
Department 0 1 FALSE 3 Res: 961, Sal: 446, Hum: 63
Education 0 1 TRUE 5 Bac: 572, Mas: 398, Col: 282, Bel: 170
EducationField 0 1 FALSE 6 Lif: 606, Med: 464, Mar: 159, Tec: 132
EnvironmentSatisfaction 0 1 TRUE 4 Hig: 453, Ver: 446, Med: 287, Low: 284
Gender 0 1 FALSE 2 Mal: 882, Fem: 588
JobInvolvement 0 1 TRUE 4 Hig: 868, Med: 375, Ver: 144, Low: 83
JobRole 0 1 FALSE 9 Sal: 326, Res: 292, Lab: 259, Man: 145
JobSatisfaction 0 1 TRUE 4 Ver: 459, Hig: 442, Low: 289, Med: 280
MaritalStatus 0 1 FALSE 3 Mar: 673, Sin: 470, Div: 327
OverTime 0 1 FALSE 2 No: 1054, Yes: 416
PerformanceRating 0 1 TRUE 2 Exc: 1244, Out: 226, Low: 0, Goo: 0
RelationshipSatisfaction 0 1 TRUE 4 Hig: 459, Ver: 432, Med: 303, Low: 276
WorkLifeBalance 0 1 TRUE 4 Bet: 893, Goo: 344, Bes: 153, Bad: 80

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 36.92 9.14 18 30 36.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 802.49 403.51 102 465 802.0 1157.00 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.19 8.11 1 2 7.0 14.00 29 ▇▅▂▂▂
HourlyRate 0 1 65.89 20.33 30 48 66.0 83.75 100 ▇▇▇▇▇
JobLevel 0 1 2.06 1.11 1 1 2.0 3.00 5 ▇▇▃▂▁
MonthlyIncome 0 1 6502.93 4707.96 1009 2911 4919.0 8379.00 19999 ▇▅▂▁▂
MonthlyRate 0 1 14313.10 7117.79 2094 8047 14235.5 20461.50 26999 ▇▇▇▇▇
NumCompaniesWorked 0 1 2.69 2.50 0 1 2.0 4.00 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.21 3.66 11 12 14.0 18.00 25 ▇▅▃▂▁
StockOptionLevel 0 1 0.79 0.85 0 0 1.0 1.00 3 ▇▇▁▂▁
TotalWorkingYears 0 1 11.28 7.78 0 6 10.0 15.00 40 ▇▇▂▁▁
TrainingTimesLastYear 0 1 2.80 1.29 0 2 3.0 3.00 6 ▂▇▇▂▃
YearsAtCompany 0 1 7.01 6.13 0 3 5.0 9.00 40 ▇▂▁▁▁
YearsInCurrentRole 0 1 4.23 3.62 0 2 3.0 7.00 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.19 3.22 0 0 1.0 3.00 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.12 3.57 0 2 3.0 7.00 17 ▇▂▅▁▁
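
To see what the numeric part of this summary computes, the sketch below rebuilds most of its columns in base R. numeric_summary is a hypothetical helper written for this illustration, not a skimr function, and it omits the histogram column. It is demonstrated on iris, which ships with R. Applied to attrition, it should yield columns comparable to the table above.

```r
# Sketch of a skim-like summary for numeric columns, using base R only.
# `numeric_summary` is a hypothetical helper, not part of skimr.
numeric_summary <- function(df) {
  # Keep only the numeric columns.
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]

  # Build a one-row summary per column.
  rows <- lapply(numeric_cols, function(x) {
    qs <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1),
                   na.rm = TRUE, names = FALSE)
    data.frame(
      n_missing     = sum(is.na(x)),
      complete_rate = mean(!is.na(x)),
      mean          = mean(x, na.rm = TRUE),
      sd            = sd(x, na.rm = TRUE),
      p0 = qs[1], p25 = qs[2], p50 = qs[3], p75 = qs[4], p100 = qs[5]
    )
  })

  # Stack the rows and add the variable name as the first column.
  out <- do.call(rbind, rows)
  data.frame(skim_variable = names(numeric_cols), out, row.names = NULL)
}

# Demonstrated on a data set that ships with R.
iris_summary <- numeric_summary(iris)
print(iris_summary)

# With modeldata installed, the same call can be applied to the course data:
# data(attrition, package = "modeldata")
# numeric_summary(attrition)
```

Comparing p50 with the mean gives a quick check for skewness: in the table above, MonthlyIncome has a mean of 6502.93 but a median of 4919, which points to a long right tail.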