Daily inflation for used vehicles in Hungary

Mixed clustering methods to produce the representative consumer basket

Marcell Granat, Peter Vekas

14th May, 2023



  • Modern technologies allow for better statistical analysis of economic indicators like inflation (Cavallo and Rigobon 2016)

  • Cars are a significant part of the consumption basket, accounting for 5% of it in Hungary (Hungarian Central Statistical Office 2022)

  • This research aims to provide a model framework for a daily inflation index for the car market

Where are we?

  • This is the second research project focusing on used cars

  • The first one is currently under review at the Swiss Journal of Economics and Statistics (Q1)

Challenges in Creating a Daily Inflation Index for Cars

  • Cars are heterogeneous: they have countless properties that vary (similar to housing)

Important Criteria for an Inflation Index

Based on Mark and Goldberg (1984, 31):

  1. Conceptually sound

  2. Administratively simple

  3. Reasonably stable and not overly dependent on changes in transactions or samples

We add one extra criterion for a daily index:

  4. Must be robust to new observations

Methodology for Creating the Index

  1. Use clustering procedures to create a representative basket of cars

  2. Build multiple models (OLS, Random forest, …) to estimate prices of vehicles in the basket

  3. Apply adjustments for weekly and monthly seasonality (WIP)


We downloaded the listings of cars available on the website hasznaltauto.hu every day for 10 months, starting in May 2022


We grouped vehicles by the day they were sold, assuming that a car that is no longer available on the website was sold at the last listed price

Advantages of Online-based Inflation Calculation

Bargaining may cause some inaccuracy in our index since we derive it from offer prices. But if this amount is time-invariant, then the comparison of the price level between periods (inflation) may remain accurate.
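
A short derivation of why this holds, under the assumption (not stated in the slides) that bargaining lowers every offer price by a constant share \(\delta\), so that the transaction price level is \(P^{T}_{t} = (1-\delta)\,P^{O}_{t}\), where \(P^{O}_{t}\) is the offer price level on day \(t\). All symbols are introduced here for illustration only.

\[
\frac{P^{T}_{t}}{P^{T}_{t-1}}
= \frac{(1-\delta)\,P^{O}_{t}}{(1-\delta)\,P^{O}_{t-1}}
= \frac{P^{O}_{t}}{P^{O}_{t-1}}
\]

The discount cancels, so inflation measured on offer prices equals inflation measured on transaction prices. If the discount is instead a constant absolute amount, the cancellation is only approximate.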

Cleaning Steps and Variables

#> # A tibble: 470,298 × 121
#>   id     date        price brand allapot kivitel ajtok_szama
#>   <chr>  <date>      <int> <fct> <fct>   <fct>   <fct>      
#> 1 10020… 2021-11-05 1.06e7 bmw   Normál  Városi… 5          
#> 2 10058… 2021-12-31 1.00e6 opel  Normál  Kombi   5          
#> 3 10126… 2021-07-13 1.30e6 opel  Kitűnő  Egyterű 5          
#> 4 10316… 2021-07-13 8.99e5 citr… Normál  Egyterű 5          
#> # ℹ 470,294 more rows
#> # ℹ 114 more variables: szin <fct>, karpit_szine_1 <fct>,
#> #   uzemanyag <fct>, henger_elrendezes <fct>, hajtas <fct>,
#> #   sebessegvalto_fajtaja <fct>, okmanyok_jellege <fct>,
#> #   muszaki_vizsga_ervenyes <fct>, klima_fajtaja <fct>,
#> #   karpit_szine_2 <fct>, teli_gumi_meret <fct>,
#> #   teto <fct>, hatso_nyari_gumi_meret <fct>, …

  • Levels of high-cardinality variables that do not occur every day were grouped into an “other” category (e.g. rare brands)

  • Offer prices show positive skew, so predictive models were built in a log-lin form
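
A minimal sketch of what a log-lin specification looks like in R. The data are simulated and the column names (`age_years`, `km_1000`) are hypothetical; they are not the variables of the scraped dataset.

```r
# Toy data; column names and values are illustrative, not the authors' dataset
set.seed(1)
cars <- data.frame(
  age_years = runif(200, 0, 20),
  km_1000   = runif(200, 5, 300)
)
cars$price <- exp(16 - 0.08 * cars$age_years - 0.002 * cars$km_1000 +
                    rnorm(200, sd = 0.2))

# Log-lin form: log(price) is linear in the characteristics
fit <- lm(log(price) ~ age_years + km_1000, data = cars)

# In a log-lin model, 100 * (exp(beta) - 1) is the percentage price change
# associated with a one-unit increase in a regressor
pct_effect <- 100 * (exp(coef(fit)[-1]) - 1)
print(round(pct_effect, 2))

# Predictions are on the log scale; exponentiate to get back to prices
# (this ignores the retransformation bias, which is acceptable for a sketch)
pred_price <- exp(predict(fit, newdata = data.frame(age_years = 5, km_1000 = 80)))
```

Taking logs pulls in the long right tail of the price distribution, which is the reason given above for choosing this form.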


  • Constructing a price index for heterogeneous goods is challenging
  • Numerous methodologies to control for quality changes, but no generally appropriate procedure
  • When creating a daily frequency index, the selection of a quality-adjustment methodology is crucial

Two Main Approaches for Heterogeneous Goods

Repeat sales

Uses only the fraction of items that are sold in both of the compared periods.

Hedonic regression

The model contains period dummy variables whose coefficients capture the quality-adjusted price change
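
For reference, a common textbook form of the time-dummy hedonic model is given below. The notation is an assumption made here, not taken from the slides: \(p_i\) is the price of item \(i\), \(x_{ik}\) are its characteristics, and \(D_{is} = 1\) if item \(i\) was sold in period \(s\), with period 1 as the omitted base.

\[
\ln p_{i} = \alpha + \sum_{k} \beta_k x_{ik} + \sum_{s=2}^{T} \delta_s D_{is} + \varepsilon_{i},
\qquad
I_s = \exp(\delta_s)
\]

Here \(I_s\) is the quality-adjusted price index of period \(s\) relative to the base period. Because all \(\delta_s\) are estimated jointly, adding a new period means re-estimating the whole model.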

Issues with These Approaches

Repeat sales:

  • The fraction of repeat sales is close to zero.

Hedonic regression:

  • Model needs to be re-estimated if new observations are added (the model and former estimations change)

  • Assumes utility of certain characteristics is constant over time, which is not true for the car market (Requena-Silvente and Walker 2006)

Our Approach

  1. Forming submarkets that contain homogeneous observations (clustering)

  2. Building predictive regression models using the sold cars for each day

  3. Deriving the price level from the predicted prices of the cluster centres

Criteria for Model Selection

  • Good predictive ability (average \(R^2\) on the daily data)
  • Reasonably low resulting wiggliness (sum of the squared second differences of the price level)


Number of clusters

Issue #1

Determining the ideal number of clusters in a dataset is ambiguous

Issue #2

  • Time complexity increases proportionally with number of clusters (K-proto)

  • Estimation time increased by 3 hours for each additional cluster

Hierarchical clustering

  1. Each data point is initially considered as an individual cluster

  2. Iteratively merges the closest clusters based on a chosen distance metric (Gower-distance)

Advantages:

  1. Regardless of the number of clusters, it builds on the same distance matrix

  2. Iterative merging of clusters is faster compared to other algorithms


Disadvantages:

  1. High memory demand due to the large distance matrix, so we take a sample of 5,000 observations

  2. Only 5 of the 30 indices suggested by Charrad et al. (2014) can be used for mixed type data
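
The hierarchical step can be sketched with the `cluster` package that ships with R. The data below are simulated, the column names are hypothetical, and the average linkage is an assumption, since the slides do not state which linkage was used. On the memory point: a pairwise distance object for all 470,298 listings would hold about 1.1 × 10^11 values, roughly 0.9 TB in double precision, whereas a sample of 5,000 needs about 100 MB.

```r
library(cluster)  # daisy() computes Gower dissimilarities for mixed-type data

set.seed(42)
n <- 300  # the slides use a sample of 5,000; kept small here
toy <- data.frame(
  price = exp(rnorm(n, 15, 0.5)),
  age   = runif(n, 0, 20),
  fuel  = factor(sample(c("petrol", "diesel", "hybrid"), n, replace = TRUE)),
  body  = factor(sample(c("sedan", "estate", "suv"), n, replace = TRUE))
)

# One pairwise dissimilarity object, computed once
gower <- daisy(toy, metric = "gower")

# Agglomerative clustering; the linkage method is an assumption
hc <- hclust(as.dist(gower), method = "average")

# Any number of clusters can be read off the same tree without recomputing
labels_by_k <- lapply(2:10, function(k) cutree(hc, k = k))
names(labels_by_k) <- 2:10
```

Cutting the same tree at different heights is what makes it cheap to compare many candidate values for the number of clusters.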

Hierarchical clustering

Optimal number of clusters determined by:

  1. Minimal C-index

  2. Maximal Dunn-index

  3. Maximal Frey-index, but below 1

  4. Minimal McClain index

  5. Maximal Silhouette
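
As an illustration of one of these criteria, the sketch below picks the number of clusters with the highest average silhouette width. It uses simulated data and `cluster::silhouette()`. The other four indices are not reproduced here; the package described by Charrad et al. (2014), NbClust, is one place they are implemented, and whether the authors used it is not stated in the slides.

```r
library(cluster)

set.seed(7)
n <- 200
toy <- data.frame(
  age  = c(rnorm(n / 2, 3, 1), rnorm(n / 2, 15, 1)),
  fuel = factor(rep(c("petrol", "diesel"), each = n / 2))
)

gower <- daisy(toy, metric = "gower")
hc <- hclust(as.dist(gower), method = "average")  # linkage is an assumption

# Average silhouette width for each candidate number of clusters
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), gower)
  mean(sil[, "sil_width"])
})

# "Maximal Silhouette": choose the k with the highest average width
best_k <- ks[which.max(avg_sil)]
```

When the five criteria disagree, some rule for combining them is needed; the slides do not say which one was applied.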

Cluster centres

After determining the optimal number of clusters, we apply the K-proto algorithm to the full dataset.


The cluster centres are existing observations, so it is easy to show which items make up the resulting representative basket.

Predicting the price level

We apply multiple regression models to predict the price level.

  1. OLS
  2. LASSO
  3. Decision tree
  4. MARS
  5. Random Forest

Traditional cross-validation is inappropriate in the presence of inflation.

Solution: each day's observations were split into an analysis set (3/4) and an assessment set (1/4) using sampling stratified by price, and the models were tuned to minimise the average RMSE.
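
A base-R sketch of such a per-day stratified split, on simulated data. The helper `split_one_day()` and the choice of price quartiles as strata are assumptions made for illustration. In a tidymodels setup, `rsample::initial_split()` with its `prop` and `strata` arguments offers comparable functionality.

```r
set.seed(123)
# Toy listings: 3 days, 48 sold cars per day (illustrative only)
sold <- data.frame(
  date  = rep(as.Date("2022-05-01") + 0:2, each = 48),
  price = exp(rnorm(144, mean = 15, sd = 0.6))
)

# Hypothetical helper: split one day's sales, stratified by price quartile
split_one_day <- function(day_df, prop = 3 / 4, bins = 4) {
  breaks <- quantile(day_df$price, probs = seq(0, 1, length.out = bins + 1))
  stratum <- cut(day_df$price, breaks = breaks,
                 include.lowest = TRUE, labels = FALSE)
  in_analysis <- logical(nrow(day_df))
  for (s in unique(stratum)) {
    idx <- which(stratum == s)
    # sample.int() on positions avoids the sample(x) pitfall for length-1 x
    take <- idx[sample.int(length(idx), size = floor(prop * length(idx)))]
    in_analysis[take] <- TRUE
  }
  list(analysis = day_df[in_analysis, ], assessment = day_df[!in_analysis, ])
}

# Split within each day, so prices from different days are never mixed
splits <- lapply(split(sold, sold$date), split_one_day)
```

Splitting within a day keeps both sets at the same price level, so the assessment error is not contaminated by inflation between days.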

Hyperparameter tuning

An example workflow for the estimation (tuned):

#> ══ Workflow ════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: decision_tree()
#> ── Preprocessor ────────────────────────────────────────────
#> 3 Recipe Steps
#> • step_rm()
#> • step_zv()
#> • step_normalize()
#> ── Model ───────────────────────────────────────────────────
#> Decision Tree Model Specification (regression)
#> Main Arguments:
#>   cost_complexity = 2.50100358331955e-08
#>   tree_depth = 11
#>   min_n = 31
#> Computational engine: rpart

Hyperparameter tuning

Maximum entropy design-based hyperparameter tuning for decision tree

Model performance

  # Wiggliness: sum of the squared second differences of the price level
  wiggle <- sum(diff(price_level, differences = 2)^2, na.rm = TRUE)

The price level

  1. We estimate the price of each item in the “representative basket” (cluster centres)

  2. We calculate their weighted mean (w = number of cars in the cluster)
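
The two steps above can be sketched as follows. The predicted prices and cluster sizes are made-up numbers, and the rebasing to 100 on the first day is an assumption added for readability.

```r
# Hypothetical predicted prices (HUF) of three cluster centres on two days
pred <- rbind(
  day1 = c(2.0e6, 5.0e6, 9.0e6),
  day2 = c(2.1e6, 5.0e6, 9.3e6)
)
cluster_size <- c(500, 300, 200)  # number of cars in each cluster (made up)

# Price level per day: cluster-size-weighted mean of the predicted prices
price_level <- apply(pred, 1, weighted.mean, w = cluster_size)

# Index relative to the first day (first day = 100)
index <- 100 * price_level / price_level[1]
```

Because the basket and the weights stay fixed, any change in the price level comes only from changes in the predicted prices of the same cluster centres.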

The price level

  • The wiggliness of the estimated price level is still high.

  • Possible reasons:

  1. Seasonality
  2. The number of observations is not sufficient to estimate the price level accurately


# of observations

Comparing to the official data

