---
title: "Leakage-Safe Covariate Handling in CPM"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Leakage-Safe Covariate Handling in CPM}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Why This Matters

When covariates are regressed out on the full dataset before cross-validation,
held-out test information leaks into the training pipeline. This can inflate
or distort out-of-sample performance.

In `cpmr`, leakage-safe behavior means:

- fit covariate regression on each fold's training split only;
- apply the learned regression parameters to both train and test data inside the same fold;
- keep other parameter-learning steps (for example scaling or model selection) inside CV as well.

## Minimal Example with Synthetic Data

```{r}
library(cpmr)

set.seed(123)
n <- 80
p <- 120

# Synthetic connectivity matrix (rows = subjects, cols = edges)
conmat <- matrix(rnorm(n * p), nrow = n, ncol = p)

# A synthetic covariate (for example age-like nuisance variable)
covariates <- matrix(rnorm(n), ncol = 1)

# Build behavior with both signal and covariate effects
edge_signal <- rowMeans(conmat[, 1:10, drop = FALSE])
behav <- 0.7 * edge_signal + 0.6 * covariates[, 1] + rnorm(n, sd = 0.5)

# Leakage-safe call: covariates are handled fold-wise inside CV
fit <- cpm(
  conmat = conmat,
  behav = behav,
  covariates = covariates,
  kfolds = 5
)

summary(fit)
```

## Migration from `confounds` to `covariates`

`covariates` is now the primary argument name.

```{r}
# Preferred
fit_new <- cpm(conmat, behav, covariates = covariates, kfolds = 5)
```

The old name `confounds` is still accepted as a deprecated alias for backward
compatibility.

```{r}
# Backward-compatible alias (deprecated)
fit_old <- cpm(conmat, behav, confounds = covariates, kfolds = 5)
```

Do not pass both in the same call.

```{r, error = TRUE}
cpm(conmat, behav, covariates = covariates, confounds = covariates)
```

## Anti-Leakage Checklist

Use this checklist when building CPM workflows:

- Split train/test folds before any parameter-learning step.
- Fit covariate regression on the training split only.
- Apply training-fitted covariate regression to the fold's test split.
- Keep scaling and feature selection inside each fold.
- If comparing multiple settings, use nested CV for model selection.
- Report the CV design and covariate strategy explicitly.

## References

- Snoek L, Miletic S, Scholte HS (2019). How to control for confounds in decoding analyses of neuroimaging data. NeuroImage. https://doi.org/10.1016/j.neuroimage.2018.09.074
- Scheinost D, Noble S, Horien C, et al. (2019). Ten simple rules for predictive modeling of individual differences in neuroimaging. NeuroImage. https://doi.org/10.1016/j.neuroimage.2019.02.057
- Rosenblatt M, Tejavibulya L, Jiang R, Noble S, Scheinost D (2024). Data leakage inflates prediction performance in connectome-based machine learning models. Nature Communications. https://doi.org/10.1038/s41467-024-46150-w