Out of sample predictions from OLS regressions: a K-folds tutorial in R

Let's say we want to evaluate some OLS models by estimating them on 80% of the data and then seeing how well we do on the 20% of the data that we had set aside. We could do random sub-sampling 1,000 times, which is easy to do with dplyr::sample_n, but doing k-fold cross-validation -- so that each observation finds itself in the holdout sample exactly once -- takes a little more work in R.

Most suggestions I see online tell users to just use the caret package, but there are obviously other ways. I summarize a couple of them in this post.

To follow along, you can get some Eurobarometer data I posted on Github:

library(tidyverse)
EB <- read_csv("https://raw.githubusercontent.com/zilinskyjan/datasets/master/economic_sentiment/eurobarometer.csv")

The tibble contains the following variables for around 30 European countries between 2005 and 2016:

  • Subjective economic evaluations
  • Annual real GDP growth rate
  • Unemployment rate
  • Inflation
  • Other covariates
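As a quick aside, the random sub-sampling approach mentioned in the introduction would look something like the sketch below: draw one random 80/20 split and repeat the whole exercise as many times as you like (dplyr::sample_frac(EB, 0.8) would do the same job as the base-R sampling for the training part).

# One random 80/20 split; wrap this in a loop or map() for repeated sub-sampling
train_ids <- sample(nrow(EB), size = floor(0.8 * nrow(EB)))
train_set <- EB[train_ids, ]   # 80% used for estimation
test_set  <- EB[-train_ids, ]  # 20% held out for evaluation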

Let's say we wanted to predict the proportion of respondents who said in a given year that the state of the economy was good (variable name: eb_econgood) on the basis of two fundamentals: GDP growth and the unemployment rate.

We will first use all observations to estimate a baseline model:

# Estimate a basic regression
ols_basic <- lm(eb_econgood ~ gdpgrowth + unemployment, data=EB)

All information contained in the ols_basic object is "in-sample", so it's possible we are over-estimating how well the model is doing. (We are probably not overfitting here, since we deliberately use only two independent variables, but if we added a bunch of covariates, we would start worrying that in-sample predictions might not be a credible guide to how the model performs on new data.)

> summary(ols_basic)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   51.2438     2.4075  21.285  < 2e-16 ***
gdpgrowth      1.8420     0.3054   6.032 3.91e-09 ***
unemployment  -1.9966     0.1988 -10.043  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.62 on 373 degrees of freedom
Multiple R-squared:  0.2994,	Adjusted R-squared:  0.2957 
F-statistic: 79.71 on 2 and 373 DF,  p-value: < 2.2e-16

We didn't build a complicated model, so cross-validation arguably isn't necessary, but let's say that we believe statistics pertaining to errors (RMSE, MAE, etc.) are only really informative when they are calculated on those observations that were not used to estimate the model. Then we have a couple of options.
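For reference, the in-sample versions of those error statistics can be read straight off the residuals of the baseline model (a quick sketch; nothing here is out-of-sample yet):

# In-sample RMSE and MAE of the baseline model, for later comparison
sqrt(mean(residuals(ols_basic)^2))
mean(abs(residuals(ols_basic)))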

Approach 1: Build K folds manually

# Number of observations
N <- nrow(EB)
# Number of desired splits
folds <- 5

# Generate indices of holdout observations
holdout <- split(sample(1:N), 1:folds)

The list holdout contains five vectors of indices; since N = 376 is not divisible by 5, one fold ends up with 76 observations and the other four with 75. You then run five regressions, each time withholding roughly one fifth of the observations (those that were randomly assigned to that fold's test set).

For example, the first regression would be estimated using all observations except those numbered 203, 72, 76, ...

> str(holdout)
List of 5
 $ 1: int [1:76] 203 72 76 241 111 350 361 246 156 212 ...
 $ 2: int [1:75] 94 266 341 145 375 364 106 31 96 281 ...
 $ 3: int [1:75] 41 201 59 51 53 75 120 243 33 35 ...
 $ 4: int [1:75] 352 165 268 321 327 326 5 360 289 28 ...
 $ 5: int [1:75] 91 258 61 328 13 309 359 166 213 229 ...

Check that each observation appears exactly once in the holdout object:

# Defensive programming
> holdout %>% unlist() %>% length() == N
[1] TRUE

Excluding the observations in the first holdout set, we would run:

lm(eb_econgood ~ gdpgrowth + unemployment, data=EB[-holdout$`1`,]) %>% summary()

Fitting all five models is then possible either with purrr::map or by looping over the index vectors.
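Here is a minimal sketch of the purrr::map route: fit one model per fold, predict on the withheld fifth, and pool the absolute errors.

# Fit one model per fold and collect absolute errors on the withheld observations
oos_abs_errors <- map(holdout, function(idx) {
  fit <- lm(eb_econgood ~ gdpgrowth + unemployment, data = EB[-idx, ])
  abs(predict(fit, newdata = EB[idx, ]) - EB$eb_econgood[idx])
})

# Pooled out-of-sample mean absolute error across the five folds
mean(unlist(oos_abs_errors))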

Approach 2: Subsetting with modelr::crossv_kfold

Build a tibble containing five pairs of training and test sets:

library(modelr)
library(purrr)
library(broom)
folding <- crossv_kfold(EB, k = 5)

If we wanted to see the index numbers of the (first set of) 80% of observations in the training dataset, we would write:

folding$train$`1`$idx

And the indices of the remaining 20% of observations that will be used for cross-validation are contained in

> folding$test$`1`$idx
 [1]   4   9  12  23  29  34  39  40  43  48  52  55  58  76  78  89 102 106 108 118 119 131 138 149 151 154
[27] 155 156 157 163 166 170 171 172 173 177 178 181 182 183 190 192 207 217 224 227 228 252 276 278 279 283
[53] 285 287 291 292 293 294 296 298 305 308 316 317 318 324 325 333 335 347 349 352 354 359 369 371

We can also see details about how resampling was done by just looking at folding$test:

> folding$test
$`1`
<resample [76 x 14]> 4, 9, 12, 23, 29, 34, 39, 40, 43, 48, ...

$`2`
<resample [75 x 14]> 3, 11, 13, 24, 25, 27, 30, 38, 44, 51, ...

$`3`
<resample [75 x 14]> 7, 8, 10, 15, 18, 19, 21, 31, 33, 36, ...

$`4`
<resample [75 x 14]> 5, 14, 16, 20, 28, 35, 46, 54, 63, 64, ...

$`5`
<resample [75 x 14]> 1, 2, 6, 17, 22, 26, 32, 37, 41, 42, ...

Now estimate five OLS models, one on each of the five training (sub)sets:

# Add to each row an OLS model estimated on that row's training set
folding <- folding %>% mutate(ols = map(train, ~ lm(eb_econgood ~ gdpgrowth + unemployment, data = .)))

The coefficients of the first of the five models are easy to see, and we could store these coefficients and p-values by running folding$ols$`1` %>% broom::tidy():

> folding$ols$`1` %>% broom::tidy()
          term  estimate std.error statistic      p.value
1  (Intercept) 51.403199 2.6880496 19.122861 1.080100e-53
2    gdpgrowth  1.793451 0.3335774  5.376417 1.540175e-07
3 unemployment -1.938227 0.2190075 -8.850047 8.039264e-17
> 

We can add tidied coefficients to each row of folding, so that each model has an associated collection of coefficients:

folding %>% 
      mutate(tidy_coeffs = map(ols,tidy))

By unnesting the data frames that contain 5 sets of regression coefficients (one set for each of the models estimated on a particular training set), we can examine the ranges of coefficients across the models.

For example, it turns out that the coefficient on the GDP growth rate ranges from 1.73 to 2.01 across the five folds, meaning that a 1 percentage point increase in GDP growth is associated with roughly a 1.7 to 2 percentage point higher share of respondents who evaluate the economy positively:

> folding %>% 
+       mutate(tidy_coeffs = map(ols,tidy)) %>% 
+       unnest(tidy_coeffs) %>%
+       filter(term=="gdpgrowth")
# A tibble: 5 x 6
  .id   term      estimate std.error statistic       p.value
  <chr> <chr>        <dbl>     <dbl>     <dbl>         <dbl>
1 1     gdpgrowth     1.73     0.339      5.10 0.000000611  
2 2     gdpgrowth     2.01     0.338      5.93 0.00000000830
3 3     gdpgrowth     1.77     0.346      5.12 0.000000544  
4 4     gdpgrowth     1.89     0.349      5.41 0.000000129  
5 5     gdpgrowth     1.81     0.338      5.37 0.000000162  
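The same ranges can also be pulled out programmatically instead of being read off the printout, for instance:

# Minimum and maximum estimate for each coefficient across the five folds
folding %>% 
      mutate(tidy_coeffs = map(ols, tidy)) %>% 
      unnest(tidy_coeffs) %>%
      group_by(term) %>%
      summarise(min_estimate = min(estimate), max_estimate = max(estimate))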

Now let's move on and calculate predicted percentages of respondents with positive economic evaluations based on objective economic indicators.

We are interested in calculating predictions for those country-year observations that were deliberately withheld when a particular model was estimated.

Option A: Calculate predictions from every model on every test set

We can use map() to apply every model to every test set, but then we obtain more predictions than we really need (5 x 5 = 25 vectors of predictions):

folding <- folding %>%
      mutate(pred_ols = map(ols, function(model) map(test, ~ predict(model, newdata = .))))

Note that folding$pred_ols$`1`$`1` contains predictions from model 1 evaluated on the hold-out data that were not used to estimate model 1 (i.e. the test set #1), which is exactly what we want.

However, folding$pred_ols$`1`$`2` shows predictions from model 1 evaluated on the hold-out data in partition 2 (i.e. the test set #2), which is not what we need for cross-validation. By contrast, folding$pred_ols$`2`$`2` shows predictions from model 2 evaluated on the hold-out data that were not used to estimate model 2 (i.e. the test set #2).
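If you did want to keep only the matched pairs (model i evaluated on test set i) from this nested list, one option is to index each model's predictions by its own fold name; a sketch, assuming the list-columns keep their fold names "1" through "5" (which the $`1` indexing above suggests they do):

# Keep only model i's predictions on test set i
matched_preds <- imap(folding$pred_ols, ~ .x[[.y]])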

Option B (PREFERRED): Focus on one test set for each model

Use map2() to generate predictions from each model on its own test set:

folding <- folding %>% mutate(predictions_on_test_data = map2(ols, test, predict))

Or, even better, generate datasets that include not just predictions (.fitted) but also other information about each of the observations by using augment:

folding <- folding %>% mutate(test_set_with_predictions = map2(ols, test, ~ augment(.x, newdata = .y)))

Now combine all out-of-sample predictions into a single data frame:

holdout_datasets <- rbind(folding$test_set_with_predictions$`1`,
                          folding$test_set_with_predictions$`2`,
                          folding$test_set_with_predictions$`3`,
                          folding$test_set_with_predictions$`4`,
                          folding$test_set_with_predictions$`5`)
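Equivalently, since test_set_with_predictions is simply a list of data frames, dplyr::bind_rows stacks all five in one call:

holdout_datasets <- bind_rows(folding$test_set_with_predictions)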

Calculate out-of-sample absolute errors:

holdout_datasets$oos_error <- abs(holdout_datasets$.fitted - holdout_datasets$eb_econgood)
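To summarize how far off the out-of-sample predictions are overall, the usual error statistics follow directly. They should be roughly comparable to the numbers caret reports in Approach 3 below (caret averages per-fold statistics, so the values will not match exactly):

# Pooled out-of-sample MAE and RMSE across all five folds
mean(holdout_datasets$oos_error)
sqrt(mean((holdout_datasets$.fitted - holdout_datasets$eb_econgood)^2))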

Checking some predictions

We saw before that the first test set included observations number 4, 9, 12, 23, etc...

> folding$test[[1]]$data[c(4,9,12,23),]
# A tibble: 4 x 14
     X1 country  year iso3  eb_econgood gdpgrowth unemployment inflation   lfp lfp_men manufacturing trade
  <int> <chr>   <int> <chr>       <dbl>     <dbl>        <dbl>     <dbl> <dbl>   <dbl>         <dbl> <dbl>
1     4 Austria  2016 AUT          59.2      1.48         6.11     1.28   60.2    65.9          18.7  101.
2     9 Austria  2011 AUT          61.3      2.81         4.56     1.89   60.7    67.6          18.8  105.
3    12 Austria  2008 AUT          60        1.55         4.13     1.82   60.6    68.4          19.6  102.
4    23 Belgium  2009 BEL          22.4     -2.29         7.91     0.811  53.5    60.4          14.3  136.
# ... with 2 more variables: gini <dbl>, wage_growth <dbl>

Let's look at the 12th observation overall (Austria in 2008), which is NOT used to estimate the first model (because it is not included in the first training set). We saw above that it is the 3rd observation in the first holdout dataset. The prediction, stored in the .fitted column, is 46.2:

> folding$test_set_with_predictions$`1`[1:4,c("country", "year", "iso3", "eb_econgood", ".fitted")]
  country year iso3 eb_econgood  .fitted
1 Austria 2016  AUT    59.16179 42.21819
2 Austria 2011  AUT    61.27745 47.59314
3 Austria 2008  AUT    60.00000 46.17714
4 Belgium 2009  BEL    22.40326 31.97746
> 

Approach 3: caret package

The caret package is useful because it provides a common interface for dozens of model types, including "lm".

A 5-fold cross-validation can be performed using just a few lines of code:

library(caret)

ols_cv <- train(eb_econgood ~ gdpgrowth + unemployment,
                  data = EB, 
                  method = "lm",
                  trControl=trainControl(
                    method = "cv",
                    number=5,
                    savePredictions = TRUE,
                    verboseIter = TRUE)
)

How well is the model doing?

ols_cv$results
  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 21.62361 0.3021111 18.06207 1.395237 0.09036339 1.717685
> 

The mean absolute error is 18 percentage points - a lot more than would seem acceptable.

Let's inspect ols_cv$pred to check some predictions using the first hold-out subsample of the data:

> ols_cv$pred
           pred        obs rowIndex intercept Resample
1    40.9548193 50.9471588        5      TRUE    Fold1
2    46.6482610 61.2774467        9      TRUE    Fold1
3    47.6249763 82.8571396       13      TRUE    Fold1

What is the 13th observation in the dataset? That's Austria in 2007: at the time, 82.9% of respondents said that the economy was doing well:

> EB[13,]
# A tibble: 1 x 14
     X1 country  year iso3  eb_econgood gdpgrowth unemployment inflation   lfp lfp_men manufacturing
  <int> <chr>   <int> <chr>       <dbl>     <dbl>        <dbl>     <dbl> <dbl>   <dbl>         <dbl>
1    13 Austria  2007 AUT          82.9      3.62         4.86      2.25  60.5    68.8          20.5
# ... with 3 more variables: trade <dbl>, gini <dbl>, wage_growth <dbl>

Austria's 2007 data was not used to estimate the first (Fold1) model, so the prediction reported in ols_cv$pred (47.6%) is out-of-sample.

If instead we look at the prediction from the final model (which caret re-estimates on all observations), we see that it is slightly closer to the observed 82.9%:

> predict(ols_cv)[13]
      13 
48.21528 
> 

References

  • Matt Taddy's slides on in-sample vs. OOS inference should be required reading: see the github repo.
  • Here are some other excellent slides on cross-validation.
