Lit Review: Comparing resampling methods

machine learning
literature review
Author

Sunny Hospital

Published

March 5, 2024

Article: Validation of machine learning ridge regression models using Monte Carlo, bootstrap, and variations in cross-validation Link to the article

The article compares performance of the commonly used resampling techniques in machine/statistical learning using ridge regression models. There are many conflicting suggestions for methods and parameter values for fine tuning hyper parameters.

Using simulations, the author tries to answer the following questions:

  1. Which of the four resampling methods is most effective in selecting a suitable regulatiziation paremter \(\lambda\)?
  2. Does increasing the number of repetitions - from 10 to 50 - improve the performance of the resampling method?
  3. Keeping the number of repetitions constant, which approach, single-run cross-validation or repated cross-validation, performs better?
  4. For k=fold CV, what is an appropriate fold size k?
  5. Which randomization approach is more effective, Monte Carlo (sampling without replacement) or bootstrap (with replacement)?

Ridge regression model

In regression model where Y is the target (response) variable and X_1, X_2, ..XP are feature (independent) variables, our goal is to find _beta coefficients that minimize the SSE in OLS. Linear model could have an overfitting problem.

To deal with overfitting issue in linear models, the ridge regression model was used to apply regularization. The penality term was added to the SSE equation.

\[ \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} ({\beta}_j^2) \]

The ridge regression model and MSE are used to find the lambda estimate. Data set is split into a training and validation sets where training set is used for parameter estimation and the validation set for evaluating overfit. One split of a dataset could create bias in MSE thus resampling methods are used to obtain MSE estimates.

Regularization results in shiriking the \(\beta_j\) toward 0 as \(\lambda\) becomes large (or \(\beta\) is same when \(\lambda\) =0).

Resampling methods

Monte Carlo

(Other names: repeated learning-testing, repeated holdout, random subsampling)

Monte Carlo sampling randomly shuffles the dataset (randomizing rows) and select the first 75% to be a training set and the remaining 25% as a validation set. This is sampling without replacement.
The validation error is averaged over n repetitions.

Bootstrap

Similar to Monte Carlo sampling where data are randomly selected from the dataset but with replacement.
The validation error is averaged over n repetitions.

Sunnys note

commonly (or by default) equal sampling probability?

k-fold CV (Cross-validation)

A dataset is divided into k partition at random (most commonly k=5 or 10). One partition is held out (set aside) as a validation set, and the remaining k-1 partitions of data are used as a training set. __The validation error_ is averaged over k repetitions.

Most extreme case of k-fold CV, known as LOOCV (Leave One Out CV) is k = n where 1 observation is held out at each iteration.

repeated k-fold CV

k-fold CV is done over n repetition and obtain \({n} x {k}\) validation errors.
The validation error is averaged over \({n} \times {k}\)

Note

most commonly use k=fold CV only once (rep =1) and repeated method is suggested as a way of obtaining more accurate and reliable estimates of error rates