Corrections Module#

Batch-effect correctors. Every class is a scikit-learn compatible transformer with fit / transform / fit_transform - batch and covariates are passed at construction time and realigned to X.index on each call, so correctors are safe inside sklearn.pipeline.Pipeline and cross-validation without leakage.

Base Class#

class maldibatchkit.BaseBatchCorrector(batch)[source]#

Bases: BaseEstimator, TransformerMixin

Base class for batch correctors that store batch at construction.

Subclasses must implement _fit_impl() and _transform_impl(). They should store fitted attributes as ..._ members so sklearn’s check_is_fitted works out of the box.

Parameters:

batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Batch labels for each sample. Passed once at construction; the correct subset is aligned to X.index on fit / transform.

Notes

Subclasses typically override __init__ to add method-specific hyperparameters but should always call super().__init__(batch=batch) and leave batch stored under self.batch so get_params() / set_params() work.

__init__(batch)[source]#
Parameters:

batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])

Return type:

None

fit(X, y=None)[source]#

Fit the corrector on the training rows supplied.

Parameters:
Returns:

self – Fitted estimator.

Return type:

BaseBatchCorrector

transform(X)[source]#

Apply the fitted correction to X.

Returns a pandas.DataFrame when the input was a DataFrame, or a numpy.ndarray otherwise.

Parameters:

X (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])

Return type:

Any

get_feature_names_out(input_features=None)[source]#

Return the feature names seen during fit().

Parameters:

input_features (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]] | None)

Return type:

ndarray[tuple[Any, ...], dtype[Any]]

All shipped correctors inherit from this class. Custom correctors should follow the same pattern - see Extending MaldiBatchKit for a full walkthrough.

ComBat Variants#

The core ComBat implementation lives in combatlearn; MaldiBatchKit re-exports the sklearn-compatible transformer and adds a thin MALDI-specific preset for the species-as-covariate case.

class maldibatchkit.ComBat(batch, *, discrete_covariates=None, continuous_covariates=None, method='johnson', parametric=True, mean_only=False, reference_batch=None, eps=1e-08, covbat_cov_thresh=0.9)[source]#

Bases: BaseEstimator, TransformerMixin

Pipeline-friendly wrapper around ComBatModel.

Stores batch (and optional covariates) passed at construction and appropriately uses them for separate fit and transform.

Parameters:
__init__(batch, *, discrete_covariates=None, continuous_covariates=None, method='johnson', parametric=True, mean_only=False, reference_batch=None, eps=1e-08, covbat_cov_thresh=0.9)[source]#
Parameters:
Return type:

None

fit(X, y=None)[source]#

Fit the ComBat model.

Parameters:
Returns:

self – Fitted estimator.

Return type:

ComBat

transform(X)[source]#

Transform the data using fitted ComBat parameters.

Parameters:

X (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Input data to transform.

Returns:

X_transformed – Batch-corrected data.

Return type:

DataFrame

get_feature_names_out(input_features=None)[source]#

Get output feature names for transform.

Parameters:

input_features (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]] | None) – Ignored. Present for API compatibility.

Returns:

feature_names_out – Feature names.

Return type:

ndarray[tuple[Any, ...], dtype[Any]]

Raises:

sklearn.exceptions.NotFittedError – If the estimator is not fitted.

Species-Aware ComBat#

class maldibatchkit.SpeciesAwareComBat(batch, *, species, continuous_covariates=None, parametric=True, mean_only=False, reference_batch=None, eps=1e-08)[source]#

Bases: ComBat

ComBat-Fortin preset with species as a protected biological covariate.

This is a thin convenience wrapper: all the work is delegated to combatlearn.ComBat with method='fortin' and species plugged in as the discrete_covariates argument. It exists only so MALDI users can write SpeciesAwareComBat(batch=..., species=...) instead of remembering which keyword corresponds to the covariate slot.

Parameters:

Notes

This class is deliberately minimal - it is not a new algorithm. Calling SpeciesAwareComBat(batch=b, species=s).fit_transform(X) is exactly equivalent to:

ComBat(batch=b, discrete_covariates=s, method='fortin').fit_transform(X)

Examples

>>> from maldibatchkit import SpeciesAwareComBat
>>> corrector = SpeciesAwareComBat(batch=batches, species=species)
>>> X_corrected = corrector.fit_transform(X)
__init__(batch, *, species, continuous_covariates=None, parametric=True, mean_only=False, reference_batch=None, eps=1e-08)[source]#
Parameters:
Return type:

None

Quality-Weighted ComBat#

class maldibatchkit.QualityWeightedComBat(batch, *, quality, parametric=True, reference_batch=None, eps=1e-08, max_iter=50, tol=0.0001)[source]#

Bases: BaseBatchCorrector

Weighted empirical-Bayes extension of Johnson-ComBat.

Parameters:
  • batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Batch labels.

  • quality (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Non-negative per-sample quality scores (typically SNR). Higher values mean a sample should contribute more to per-batch estimation. Internally rescaled to sum to n_samples.

  • parametric (bool) – Use the parametric EB fixed-point iteration (as in Johnson 2007). False falls back to the data estimates \(\\hat\\gamma\) / \(\\hat\\delta^2\) directly - this is useful as a sanity check.

  • reference_batch (Any | None) – Batch level to leave unchanged.

  • eps (float) – Numerical jitter.

  • max_iter (int) – Hard cap on the parametric fixed-point iterations.

  • tol (float) – Convergence tolerance on the max absolute change in \(\\gamma^*, \\delta^*\).

Variables:
  • batch_levels (np.ndarray) – Batch levels observed at fit time.

  • grand_mean (np.ndarray) – Weighted feature-wise grand mean (\(\\hat\\alpha\)).

  • pooled_var (np.ndarray) – Weighted feature-wise pooled variance (\(\\hat\\sigma^2\)).

  • gamma_star (np.ndarray) – Posterior batch means, shape (n_batches, n_features).

  • delta_star (np.ndarray) – Posterior batch variances, shape (n_batches, n_features).

  • effective_batch_sizes (np.ndarray) – \(\\bar w_i\) per batch (weighted sample counts).

  • n_iter (int) – Iterations actually taken by the parametric EB loop.

Examples

>>> from maldibatchkit import QualityWeightedComBat
>>> corrector = QualityWeightedComBat(batch=batches, quality=snr)
>>> X_corrected = corrector.fit_transform(X)

References

Johnson, W.E., Li, C. and Rabinovic, A. (2007). “Adjusting batch effects in microarray expression data using empirical Bayes methods.” Biostatistics 8(1): 118-127. - the unweighted predecessor of the scheme implemented here.

__init__(batch, *, quality, parametric=True, reference_batch=None, eps=1e-08, max_iter=50, tol=0.0001)[source]#
Parameters:
Return type:

None

Linear-Model Corrections#

class maldibatchkit.Limma(batch, *, design=None, eps=1e-08)[source]#

Bases: BaseBatchCorrector

Linear-model batch subtraction following limma::removeBatchEffect.

Parameters:
Variables:
  • batch_levels (np.ndarray) – Batch levels observed at fit time (in sorted order).

  • gamma (np.ndarray) – Estimated batch coefficients in the sum-to-zero contrast basis, of shape (n_contrasts, n_features).

  • contrasts (np.ndarray) – Sum-to-zero contrast matrix used at fit time.

Notes

  • Unknown batch levels at transform raise a ValueError. This is intentional: Limma’s subtraction is not defined for a batch that was not part of the design.

  • The design matrix is augmented with a constant column internally so the intercept is absorbed into the OLS fit; only the batch coefficients are subtracted on output.

References

Ritchie, M.E. et al. (2015) “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research 43(7): e47.

Examples

>>> from maldibatchkit import Limma
>>> corrector = Limma(batch=batches, design=species_dummies)
>>> X_corrected = corrector.fit_transform(X)
__init__(batch, *, design=None, eps=1e-08)[source]#
Parameters:
Return type:

None

Single-Cell-Style Integration#

class maldibatchkit.Harmony(batch, *, covariates=None, theta=2.0, max_iter=20, nclust=None, n_components=None, random_state=None, verbose=True, scale_before_pca=True)[source]#

Bases: BaseBatchCorrector

Sklearn-compatible Harmony with a closed-form transform.

Parameters:
  • batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Batch labels.

  • covariates (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]] | None) – Additional categorical covariates forwarded to harmonypy.run_harmony via the vars_use mechanism.

  • theta (float | list[float]) – Diversity clustering penalty in Harmony. Higher values encourage more aggressive batch mixing.

  • max_iter (int) – Harmony iteration cap.

  • nclust (int | None) – Number of Harmony clusters. When None, harmonypy’s default heuristic is used.

  • n_components (int | None) – Number of PCA components used for the built-in dimensionality- reduction stage. None (the default) picks min(50, n_samples - 1, n_features).

  • random_state (int | None) – Seed forwarded to harmonypy and to the PCA stage.

  • verbose (bool) – If False, suppress harmonypy’s own progress messages during fit. harmonypy configures its module-level logger eagerly at import time, so simply passing verbose=False to run_harmony is not enough; this flag also raises the harmonypy logger above INFO for the duration of the call. The user’s broader logging config is left untouched.

  • scale_before_pca (bool) – If True (the default), fit a StandardScaler on the training rows and apply it before the internal PCA. This gives every feature equal weight in the principal components and matches the convention from single-cell genomics workflows (Seurat / Scanpy ScaleData followed by PCA), where Harmony was originally developed. Without scaling, MALDI-TOF features with high intensity (a handful of intense peaks) dominate the PCs and Harmony’s soft-cluster geometry inherits that bias. The scaler is frozen at fit (like the PCA, the centroids Y, and the per-batch ridge W_batch) and reused verbatim on unseen rows at transform time, with the corrected output un-scaled back to the original measurement scale. Set to False to recover the pre-v0.2 behaviour (centred but not scaled before PCA).

Variables:
  • batch_levels (np.ndarray) – Batch levels in the order used to encode Phi_moe columns.

  • Y (np.ndarray of shape (n_components, K)) – L2-normalised cluster centroids frozen at fit time (in PCA space).

  • sigma (np.ndarray of shape (K,)) – Soft-assignment bandwidths per cluster.

  • W_batch (np.ndarray of shape (K, B, n_components)) – Per-cluster, per-batch ridge-regression correction vectors, in the same PCA space as Y_.

  • pca (sklearn.decomposition.PCA) – Fitted PCA, re-used verbatim on unseen rows at transform time.

  • X_fit (np.ndarray) – Training feature matrix (original space) cached for same-matrix short-circuiting.

  • n_clusters (int) – Number of clusters actually used by Harmony.

  • n_iter (int) – Harmony iterations observed at fit time.

Notes

  • The closed-form output on X_train is not quite bit-identical to harmonypy’s iterative correction because Harmony’s soft assignment is refined by a batch-diversity penalty during training (the update_R loop). We deliberately use the raw cosine-softmax assignment at transform time since the diversity correction is a training-time regulariser tied to the training batch distribution and would not apply to a held-out sample. When transform is called on the exact training matrix we short-circuit to the harmonypy output so that fit_transform matches upstream byte-for-byte.

  • Unseen batch levels at transform raise ValueError, mirroring ComBat / Limma.

  • Requires harmonypy; it is a core dependency of MaldiBatchKit, range-pinned to >=0.2.0,<2. The 2.x line is a C++ rewrite that drops the private attributes our closed-form transform relies on (_Phi_moe, _lamb, sigma); we will revisit this constraint once those are re-exposed upstream.

Examples

>>> from maldibatchkit import Harmony
>>> corrector = Harmony(batch=batches, random_state=0)
>>> X_train_c = corrector.fit_transform(X_train)
>>> X_test_c  = corrector.transform(X_test)   # closed-form on new rows
__init__(batch, *, covariates=None, theta=2.0, max_iter=20, nclust=None, n_components=None, random_state=None, verbose=True, scale_before_pca=True)[source]#
Parameters:
Return type:

None

Note

Harmony is fit-transform only, matching the behaviour of the upstream harmonypy reference - there is no clean test-time transform step.

Baselines#

Simple, auditable correctors useful as comparison points for more sophisticated methods.

class maldibatchkit.MedianCentering(batch)[source]#

Bases: BaseBatchCorrector

Subtract per-batch medians from each feature.

The fitted parameter is a (n_batches, n_features) table of medians. At transform time, the median for the batch each sample belongs to is subtracted. Batches that were absent at fit time fall back to the grand median learned from training data (with a warning); this keeps the transformer defined on unseen folds without leaking information from the test set.

Parameters:

batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Batch labels.

Variables:
  • batch_medians (pd.DataFrame) – Per-batch feature medians learned from the training data (index: batch level, columns: feature names).

  • grand_median (pd.Series) – Feature-wise median across all training rows. Used as a fallback for unseen batch levels at transform time.

Examples

>>> from maldibatchkit import MedianCentering
>>> corrector = MedianCentering(batch=batches)
>>> X_corrected = corrector.fit_transform(X)
class maldibatchkit.ZScorePerBatch(batch, *, eps=1e-08)[source]#

Bases: BaseBatchCorrector

Per-batch z-score normalisation.

For each training batch we learn a mean and standard deviation per feature. At transform time we subtract the mean and divide by the standard deviation of the sample’s batch (again falling back to grand statistics for unseen batches).

Parameters:
Variables:
  • batch_means (pd.DataFrame) – Per-batch feature means.

  • batch_stds (pd.DataFrame) – Per-batch feature standard deviations (floored at eps).

  • grand_mean (pd.Series) – Feature-wise mean across all training rows.

  • grand_std (pd.Series) – Feature-wise standard deviation across all training rows.

Examples

>>> from maldibatchkit import ZScorePerBatch
>>> corrector = ZScorePerBatch(batch=batches)
>>> X_corrected = corrector.fit_transform(X)
__init__(batch, *, eps=1e-08)[source]#
Parameters:
Return type:

None

class maldibatchkit.ReferenceScaling(batch, *, reference_batch=None, eps=1e-08)[source]#

Bases: BaseBatchCorrector

Rescale each batch so its per-feature mean matches a reference batch.

Every non-reference batch is multiplied by reference_mean / batch_mean (feature-wise), which is a simple multiplicative drift correction useful for mass-spectrometry intensities where batch-specific gain terms are plausible. The reference batch itself is left unchanged.

Parameters:
  • batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Batch labels.

  • reference_batch (Any | None) – Batch level to use as the reference. If None, the batch with the most training samples is chosen.

  • eps (float) – Floor on the denominator mean to avoid division by zero.

Variables:
  • reference_batch (Any) – The batch level actually used as reference at fit time.

  • scale_factors (pd.DataFrame) – Per-batch feature scale factors (reference_mean / batch_mean).

__init__(batch, *, reference_batch=None, eps=1e-08)[source]#
Parameters:
Return type:

None

class maldibatchkit.NoOpCorrector(batch)[source]#

Bases: BaseBatchCorrector

Pass-through corrector that returns X unchanged.

Parameters:

batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Batch labels. Kept on the signature so NoOpCorrector is a drop-in for any other corrector, but never used for correction.

Variables:
  • feature_names_in (np.ndarray) – Feature names captured at fit time.

  • n_features_in (int) – Number of features captured at fit time.

Examples

>>> from maldibatchkit import NoOpCorrector
>>> corrector = NoOpCorrector(batch=batches)
>>> X_out = corrector.fit_transform(X)
>>> (X_out == X).all().all()
True
__init__(batch)[source]#
Parameters:

batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])

Return type:

None

Meta-Corrector#

AutoCorrector exposes method as a settable hyperparameter so a single GridSearchCV can sweep across corrector families and let the downstream classifier score pick the winner. See Choosing a corrector for an end-to-end recipe.

class maldibatchkit.AutoCorrector(batch, *, method='combat-fortin', discrete_covariates=None, continuous_covariates=None, quality=None, species=None, reference_batch=None, method_kwargs=None)[source]#

Bases: BaseBatchCorrector

Meta-corrector with a swappable method hyperparameter.

Parameters:
Variables:
  • inner (BaseBatchCorrector) – The instantiated inner corrector, fitted to the training rows.

  • n_features_in (feature_names_in,) – Forwarded from the inner corrector.

Notes

Fitted attributes of the inner corrector (e.g. gamma_star_ on a ComBat inner) are reachable on the AutoCorrector itself via attribute fall-through (__getattr__), so check_is_fitted and downstream inspection work without further glue.

Examples

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.pipeline import Pipeline
>>> from maldibatchkit import AutoCorrector
>>> pipe = Pipeline([
...     ("correct", AutoCorrector(batch=b)),
...     ("clf", LogisticRegression()),
... ])
>>> grid = GridSearchCV(
...     pipe,
...     param_grid={"correct__method": ["noop", "combat-fortin", "harmony"]},
...     scoring="roc_auc",
... )
>>> grid.fit(X, y)
__init__(batch, *, method='combat-fortin', discrete_covariates=None, continuous_covariates=None, quality=None, species=None, reference_batch=None, method_kwargs=None)[source]#
Parameters:
Return type:

None

MALDI-Specific Corrections#

class maldibatchkit.BatchAwareWarping(batch, *, reference='median', method='shift', n_segments=5, max_shift=50, dtw_radius=10, smooth_sigma=2.0, n_jobs=1)[source]#

Bases: BaseBatchCorrector

Per-batch m/z warping to a single global reference.

This transformer reuses maldiamrkit.alignment.Warping under the hood, but fits one warper per batch. At fit time we learn a global reference spectrum from all training rows, then fit a dedicated Warping instance per batch with that shared reference. At transform time each sample is routed through the warper of its batch.

Why this matters#

Mass-calibration drift is usually a per-acquisition-session phenomenon. Running a single global Warping can either be too aggressive (squashing real biological peaks in well-calibrated batches) or too conservative (leaving large shifts in a miscalibrated batch). Per-batch warpers share the reference but learn a batch-specific transformation.

param batch:

Batch labels.

type batch:

DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]

param reference:

How to build the global reference. "median" uses the median spectrum across all training rows. "batch_mean" uses the mean of per-batch medians (reduces bias towards large batches).

type reference:

str

param method:

Warping method passed through to maldiamrkit.alignment.Warping.

type method:

str

param n_segments:

Piecewise warping segments.

type n_segments:

int

param max_shift:

Maximum shift in bins.

type max_shift:

int

param dtw_radius:

DTW radius constraint.

type dtw_radius:

int

param smooth_sigma:

Gaussian smoothing for piecewise shifts.

type smooth_sigma:

float

param n_jobs:

Parallel jobs forwarded to Warping.

type n_jobs:

int

ivar reference_:

The global reference spectrum.

vartype reference_:

np.ndarray

ivar warpers_:

One fitted Warping per batch level.

vartype warpers_:

dict[Any, Warping]

Notes

For batches that do not appear at fit time but show up during transform, we fall back to a global warper trained on all training rows (stored as warpers_ under the "__global__" key). This keeps the transformer defined on test folds without peeking at test data.

Examples

>>> from maldibatchkit import BatchAwareWarping
>>> warper = BatchAwareWarping(batch=batches, method="piecewise")
>>> X_aligned = warper.fit_transform(X)
__init__(batch, *, reference='median', method='shift', n_segments=5, max_shift=50, dtw_radius=10, smooth_sigma=2.0, n_jobs=1)[source]#
Parameters:
Return type:

None

Parameters:
  • batch (ArrayLike)

  • reference (str)

  • method (str)

  • n_segments (int)

  • max_shift (int)

  • dtw_radius (int)

  • smooth_sigma (float)

  • n_jobs (int)

BatchAwareWarping reuses maldiamrkit.alignment.Warping internally to warp each batch onto a shared global reference. Pair it with an intensity-domain corrector (e.g. ComBat) inside a sklearn.pipeline.Pipeline for a full harmonisation:

from sklearn.pipeline import Pipeline
from maldibatchkit import BatchAwareWarping, ComBat

pipe = Pipeline([
    ("warp", BatchAwareWarping(batch=batch, method="piecewise")),
    ("combat", ComBat(batch=batch, method="fortin",
                      discrete_covariates=species)),
])
X_corrected = pipe.fit_transform(X)