Corrections Module#
Batch-effect correctors. Every class is a scikit-learn compatible
transformer with fit / transform / fit_transform - batch
and covariates are passed at construction time and realigned to
X.index on each call, so correctors are safe inside
sklearn.pipeline.Pipeline and cross-validation without leakage.
Base Class#
- class maldibatchkit.BaseBatchCorrector(batch)[source]#
Bases:
BaseEstimator,TransformerMixinBase class for batch correctors that store
batchat construction.Subclasses must implement
_fit_impl()and_transform_impl(). They should store fitted attributes as..._members so sklearn’scheck_is_fittedworks out of the box.- Parameters:
batch (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]]) – Batch labels for each sample. Passed once at construction; the correct subset is aligned toX.indexonfit/transform.
Notes
Subclasses typically override
__init__to add method-specific hyperparameters but should always callsuper().__init__(batch=batch)and leavebatchstored underself.batchsoget_params()/set_params()work.- transform(X)[source]#
Apply the fitted correction to
X.Returns a
pandas.DataFramewhen the input was a DataFrame, or anumpy.ndarrayotherwise.
All shipped correctors inherit from this class. Custom correctors should follow the same pattern - see Extending MaldiBatchKit for a full walkthrough.
ComBat Variants#
The core ComBat implementation lives in combatlearn; MaldiBatchKit re-exports the sklearn-compatible transformer and adds a thin MALDI-specific preset for the species-as-covariate case.
- class maldibatchkit.ComBat(batch, *, discrete_covariates=None, continuous_covariates=None, method='johnson', parametric=True, mean_only=False, reference_batch=None, eps=1e-08, covbat_cov_thresh=0.9)[source]#
Bases:
BaseEstimator,TransformerMixinPipeline-friendly wrapper around ComBatModel.
Stores batch (and optional covariates) passed at construction and appropriately uses them for separate fit and transform.
- Parameters:
batch (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]]) – Batch labels for each sample.discrete_covariates (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None) – Categorical covariates to protect (Fortin/Chen only).continuous_covariates (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None) – Continuous covariates to protect (Fortin/Chen only).method (
str) – ComBat variant to use.parametric (
bool) – Use parametric empirical Bayes.mean_only (
bool) – Adjust only the mean (ignore variance).reference_batch (
str|None) – Batch level to leave unchanged.eps (
float) – Numerical jitter for stability.covbat_cov_thresh (
float|int) – CovBat variance threshold for PCs.
- __init__(batch, *, discrete_covariates=None, continuous_covariates=None, method='johnson', parametric=True, mean_only=False, reference_batch=None, eps=1e-08, covbat_cov_thresh=0.9)[source]#
- Parameters:
batch (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]])discrete_covariates (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None)continuous_covariates (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None)method (
str)parametric (
bool)mean_only (
bool)eps (
float)
- Return type:
None
- get_feature_names_out(input_features=None)[source]#
Get output feature names for transform.
Species-Aware ComBat#
- class maldibatchkit.SpeciesAwareComBat(batch, *, species, continuous_covariates=None, parametric=True, mean_only=False, reference_batch=None, eps=1e-08)[source]#
Bases:
ComBatComBat-Fortin preset with species as a protected biological covariate.
This is a thin convenience wrapper: all the work is delegated to
combatlearn.ComBatwithmethod='fortin'andspeciesplugged in as thediscrete_covariatesargument. It exists only so MALDI users can writeSpeciesAwareComBat(batch=..., species=...)instead of remembering which keyword corresponds to the covariate slot.- Parameters:
batch (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]]) – Batch labels for each sample.species (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]]) – Species labels for each sample. Passed tocombatlearn’s Fortin variant asdiscrete_covariatesso per-species biological structure is preserved during correction.continuous_covariates (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None) – Additional continuous covariates to protect.parametric (
bool) – Use parametric empirical Bayes.mean_only (
bool) – Adjust only the mean (ignore variance).reference_batch (
str|None) – Batch level to leave unchanged.eps (
float) – Numerical jitter for stability.
Notes
This class is deliberately minimal - it is not a new algorithm. Calling
SpeciesAwareComBat(batch=b, species=s).fit_transform(X)is exactly equivalent to:ComBat(batch=b, discrete_covariates=s, method='fortin').fit_transform(X)
Examples
>>> from maldibatchkit import SpeciesAwareComBat >>> corrector = SpeciesAwareComBat(batch=batches, species=species) >>> X_corrected = corrector.fit_transform(X)
Quality-Weighted ComBat#
- class maldibatchkit.QualityWeightedComBat(batch, *, quality, parametric=True, reference_batch=None, eps=1e-08, max_iter=50, tol=0.0001)[source]#
Bases:
BaseBatchCorrectorWeighted empirical-Bayes extension of Johnson-ComBat.
- Parameters:
batch (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]]) – Batch labels.quality (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]]) – Non-negative per-sample quality scores (typically SNR). Higher values mean a sample should contribute more to per-batch estimation. Internally rescaled to sum ton_samples.parametric (
bool) – Use the parametric EB fixed-point iteration (as in Johnson 2007).Falsefalls back to the data estimates \(\\hat\\gamma\) / \(\\hat\\delta^2\) directly - this is useful as a sanity check.reference_batch (
Any|None) – Batch level to leave unchanged.eps (
float) – Numerical jitter.max_iter (
int) – Hard cap on the parametric fixed-point iterations.tol (
float) – Convergence tolerance on the max absolute change in \(\\gamma^*, \\delta^*\).
- Variables:
batch_levels (np.ndarray) – Batch levels observed at fit time.
grand_mean (np.ndarray) – Weighted feature-wise grand mean (\(\\hat\\alpha\)).
pooled_var (np.ndarray) – Weighted feature-wise pooled variance (\(\\hat\\sigma^2\)).
gamma_star (np.ndarray) – Posterior batch means, shape
(n_batches, n_features).delta_star (np.ndarray) – Posterior batch variances, shape
(n_batches, n_features).effective_batch_sizes (np.ndarray) – \(\\bar w_i\) per batch (weighted sample counts).
n_iter (int) – Iterations actually taken by the parametric EB loop.
Examples
>>> from maldibatchkit import QualityWeightedComBat >>> corrector = QualityWeightedComBat(batch=batches, quality=snr) >>> X_corrected = corrector.fit_transform(X)
References
Johnson, W.E., Li, C. and Rabinovic, A. (2007). “Adjusting batch effects in microarray expression data using empirical Bayes methods.” Biostatistics 8(1): 118-127. - the unweighted predecessor of the scheme implemented here.
Linear-Model Corrections#
- class maldibatchkit.Limma(batch, *, design=None, eps=1e-08)[source]#
Bases:
BaseBatchCorrectorLinear-model batch subtraction following
limma::removeBatchEffect.- Parameters:
batch (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]]) – Batch labels.design (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None) – Protected covariates (the “design of interest” in Limma). Any effect explained bydesignis left in the output.eps (
float) – Ridge-like regularisation added to the diagonal of the normal equations to keep things numerically stable on rank-deficient designs.
- Variables:
batch_levels (np.ndarray) – Batch levels observed at
fittime (in sorted order).gamma (np.ndarray) – Estimated batch coefficients in the sum-to-zero contrast basis, of shape
(n_contrasts, n_features).contrasts (np.ndarray) – Sum-to-zero contrast matrix used at fit time.
Notes
Unknown batch levels at
transformraise aValueError. This is intentional: Limma’s subtraction is not defined for a batch that was not part of the design.The design matrix is augmented with a constant column internally so the intercept is absorbed into the OLS fit; only the batch coefficients are subtracted on output.
References
Ritchie, M.E. et al. (2015) “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research 43(7): e47.
Examples
>>> from maldibatchkit import Limma >>> corrector = Limma(batch=batches, design=species_dummies) >>> X_corrected = corrector.fit_transform(X)
Single-Cell-Style Integration#
- class maldibatchkit.Harmony(batch, *, covariates=None, theta=2.0, max_iter=20, nclust=None, n_components=None, random_state=None, verbose=True, scale_before_pca=True)[source]#
Bases:
BaseBatchCorrectorSklearn-compatible Harmony with a closed-form
transform.- Parameters:
batch (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]]) – Batch labels.covariates (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None) – Additional categorical covariates forwarded toharmonypy.run_harmonyvia thevars_usemechanism.theta (
float|list[float]) – Diversity clustering penalty in Harmony. Higher values encourage more aggressive batch mixing.max_iter (
int) – Harmony iteration cap.nclust (
int|None) – Number of Harmony clusters. WhenNone,harmonypy’s default heuristic is used.n_components (
int|None) – Number of PCA components used for the built-in dimensionality- reduction stage.None(the default) picksmin(50, n_samples - 1, n_features).random_state (
int|None) – Seed forwarded to harmonypy and to the PCA stage.verbose (
bool) – IfFalse, suppressharmonypy’s own progress messages duringfit.harmonypyconfigures its module-level logger eagerly at import time, so simply passingverbose=Falsetorun_harmonyis not enough; this flag also raises theharmonypylogger above INFO for the duration of the call. The user’s broader logging config is left untouched.scale_before_pca (
bool) – IfTrue(the default), fit aStandardScaleron the training rows and apply it before the internal PCA. This gives every feature equal weight in the principal components and matches the convention from single-cell genomics workflows (Seurat / ScanpyScaleDatafollowed by PCA), where Harmony was originally developed. Without scaling, MALDI-TOF features with high intensity (a handful of intense peaks) dominate the PCs and Harmony’s soft-cluster geometry inherits that bias. The scaler is frozen atfit(like the PCA, the centroidsY, and the per-batch ridgeW_batch) and reused verbatim on unseen rows attransformtime, with the corrected output un-scaled back to the original measurement scale. Set toFalseto recover the pre-v0.2 behaviour (centred but not scaled before PCA).
- Variables:
batch_levels (np.ndarray) – Batch levels in the order used to encode
Phi_moecolumns.Y (np.ndarray of shape (n_components, K)) – L2-normalised cluster centroids frozen at fit time (in PCA space).
sigma (np.ndarray of shape (K,)) – Soft-assignment bandwidths per cluster.
W_batch (np.ndarray of shape (K, B, n_components)) – Per-cluster, per-batch ridge-regression correction vectors, in the same PCA space as
Y_.pca (sklearn.decomposition.PCA) – Fitted PCA, re-used verbatim on unseen rows at transform time.
X_fit (np.ndarray) – Training feature matrix (original space) cached for same-matrix short-circuiting.
n_clusters (int) – Number of clusters actually used by Harmony.
n_iter (int) – Harmony iterations observed at fit time.
Notes
The closed-form output on
X_trainis not quite bit-identical toharmonypy’s iterative correction because Harmony’s soft assignment is refined by a batch-diversity penalty during training (theupdate_Rloop). We deliberately use the raw cosine-softmax assignment at transform time since the diversity correction is a training-time regulariser tied to the training batch distribution and would not apply to a held-out sample. Whentransformis called on the exact training matrix we short-circuit to the harmonypy output so thatfit_transformmatches upstream byte-for-byte.Unseen batch levels at transform raise
ValueError, mirroringComBat/Limma.Requires
harmonypy; it is a core dependency of MaldiBatchKit, range-pinned to>=0.2.0,<2. The 2.x line is a C++ rewrite that drops the private attributes our closed-form transform relies on (_Phi_moe,_lamb,sigma); we will revisit this constraint once those are re-exposed upstream.
Examples
>>> from maldibatchkit import Harmony >>> corrector = Harmony(batch=batches, random_state=0) >>> X_train_c = corrector.fit_transform(X_train) >>> X_test_c = corrector.transform(X_test) # closed-form on new rows
Note
Harmony is fit-transform only, matching the
behaviour of the upstream harmonypy reference - there is no
clean test-time transform step.
Baselines#
Simple, auditable correctors useful as comparison points for more sophisticated methods.
- class maldibatchkit.MedianCentering(batch)[source]#
Bases:
BaseBatchCorrectorSubtract per-batch medians from each feature.
The fitted parameter is a
(n_batches, n_features)table of medians. Attransformtime, the median for the batch each sample belongs to is subtracted. Batches that were absent atfittime fall back to the grand median learned from training data (with a warning); this keeps the transformer defined on unseen folds without leaking information from the test set.- Parameters:
batch (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]]) – Batch labels.- Variables:
batch_medians (pd.DataFrame) – Per-batch feature medians learned from the training data (index: batch level, columns: feature names).
grand_median (pd.Series) – Feature-wise median across all training rows. Used as a fallback for unseen batch levels at transform time.
Examples
>>> from maldibatchkit import MedianCentering >>> corrector = MedianCentering(batch=batches) >>> X_corrected = corrector.fit_transform(X)
- class maldibatchkit.ZScorePerBatch(batch, *, eps=1e-08)[source]#
Bases:
BaseBatchCorrectorPer-batch z-score normalisation.
For each training batch we learn a mean and standard deviation per feature. At
transformtime we subtract the mean and divide by the standard deviation of the sample’s batch (again falling back to grand statistics for unseen batches).- Parameters:
- Variables:
batch_means (pd.DataFrame) – Per-batch feature means.
batch_stds (pd.DataFrame) – Per-batch feature standard deviations (floored at
eps).grand_mean (pd.Series) – Feature-wise mean across all training rows.
grand_std (pd.Series) – Feature-wise standard deviation across all training rows.
Examples
>>> from maldibatchkit import ZScorePerBatch >>> corrector = ZScorePerBatch(batch=batches) >>> X_corrected = corrector.fit_transform(X)
- class maldibatchkit.ReferenceScaling(batch, *, reference_batch=None, eps=1e-08)[source]#
Bases:
BaseBatchCorrectorRescale each batch so its per-feature mean matches a reference batch.
Every non-reference batch is multiplied by
reference_mean / batch_mean(feature-wise), which is a simple multiplicative drift correction useful for mass-spectrometry intensities where batch-specific gain terms are plausible. The reference batch itself is left unchanged.- Parameters:
- Variables:
reference_batch (Any) – The batch level actually used as reference at
fittime.scale_factors (pd.DataFrame) – Per-batch feature scale factors (
reference_mean / batch_mean).
- class maldibatchkit.NoOpCorrector(batch)[source]#
Bases:
BaseBatchCorrectorPass-through corrector that returns
Xunchanged.- Parameters:
batch (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]]) – Batch labels. Kept on the signature soNoOpCorrectoris a drop-in for any other corrector, but never used for correction.- Variables:
feature_names_in (np.ndarray) – Feature names captured at
fittime.n_features_in (int) – Number of features captured at
fittime.
Examples
>>> from maldibatchkit import NoOpCorrector >>> corrector = NoOpCorrector(batch=batches) >>> X_out = corrector.fit_transform(X) >>> (X_out == X).all().all() True
Meta-Corrector#
AutoCorrector exposes method as a settable
hyperparameter so a single GridSearchCV can sweep across corrector
families and let the downstream classifier score pick the winner. See
Choosing a corrector for an end-to-end recipe.
- class maldibatchkit.AutoCorrector(batch, *, method='combat-fortin', discrete_covariates=None, continuous_covariates=None, quality=None, species=None, reference_batch=None, method_kwargs=None)[source]#
Bases:
BaseBatchCorrectorMeta-corrector with a swappable
methodhyperparameter.- Parameters:
batch (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]]) – Batch labels.method (
Any) – Either a registered string alias (seeMETHOD_REGISTRY) or aBaseBatchCorrectorsubclass.discrete_covariates (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None) – Categorical covariates. Forwarded under the kwarg the inner method actually understands (see module docstring).continuous_covariates (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None) – Continuous covariates. Forwarded to ComBat variants; dropped with a warning for methods that don’t accept them.quality (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None) – Per-sample quality weights. Only used byQualityWeightedComBat.species (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None) – Species labels. Only used bySpeciesAwareComBat.reference_batch (
Any|None) – Reference batch level. Forwarded where the inner method accepts it.method_kwargs (
dict[str,Any] |None) – Extra kwargs merged into the inner method’s__init__call.method_kwargsentries override any conflicting argument synthesised from the named parameters above; entries whose key is not accepted by the inner__init__are dropped silently.
- Variables:
inner (BaseBatchCorrector) – The instantiated inner corrector, fitted to the training rows.
n_features_in (feature_names_in,) – Forwarded from the inner corrector.
Notes
Fitted attributes of the inner corrector (e.g.
gamma_star_on a ComBat inner) are reachable on theAutoCorrectoritself via attribute fall-through (__getattr__), socheck_is_fittedand downstream inspection work without further glue.Examples
>>> from sklearn.linear_model import LogisticRegression >>> from sklearn.model_selection import GridSearchCV >>> from sklearn.pipeline import Pipeline >>> from maldibatchkit import AutoCorrector >>> pipe = Pipeline([ ... ("correct", AutoCorrector(batch=b)), ... ("clf", LogisticRegression()), ... ]) >>> grid = GridSearchCV( ... pipe, ... param_grid={"correct__method": ["noop", "combat-fortin", "harmony"]}, ... scoring="roc_auc", ... ) >>> grid.fit(X, y)
- __init__(batch, *, method='combat-fortin', discrete_covariates=None, continuous_covariates=None, quality=None, species=None, reference_batch=None, method_kwargs=None)[source]#
- Parameters:
batch (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]])method (
Any)discrete_covariates (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None)continuous_covariates (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None)quality (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None)species (
DataFrame|Series|ndarray[tuple[Any,...],dtype[Any]] |None)
- Return type:
None
MALDI-Specific Corrections#
- class maldibatchkit.BatchAwareWarping(batch, *, reference='median', method='shift', n_segments=5, max_shift=50, dtw_radius=10, smooth_sigma=2.0, n_jobs=1)[source]#
Bases:
BaseBatchCorrectorPer-batch m/z warping to a single global reference.
This transformer reuses
maldiamrkit.alignment.Warpingunder the hood, but fits one warper per batch. Atfittime we learn a global reference spectrum from all training rows, then fit a dedicatedWarpinginstance per batch with that shared reference. Attransformtime each sample is routed through the warper of its batch.Why this matters#
Mass-calibration drift is usually a per-acquisition-session phenomenon. Running a single global
Warpingcan either be too aggressive (squashing real biological peaks in well-calibrated batches) or too conservative (leaving large shifts in a miscalibrated batch). Per-batch warpers share the reference but learn a batch-specific transformation.- param batch:
Batch labels.
- type batch:
- param reference:
How to build the global reference.
"median"uses the median spectrum across all training rows."batch_mean"uses the mean of per-batch medians (reduces bias towards large batches).- type reference:
- param method:
Warping method passed through to
maldiamrkit.alignment.Warping.- type method:
- param n_segments:
Piecewise warping segments.
- type n_segments:
- param max_shift:
Maximum shift in bins.
- type max_shift:
- param dtw_radius:
DTW radius constraint.
- type dtw_radius:
- param smooth_sigma:
Gaussian smoothing for piecewise shifts.
- type smooth_sigma:
- param n_jobs:
Parallel jobs forwarded to
Warping.- type n_jobs:
- ivar reference_:
The global reference spectrum.
- vartype reference_:
np.ndarray
- ivar warpers_:
One fitted
Warpingper batch level.- vartype warpers_:
dict[Any, Warping]
Notes
For batches that do not appear at
fittime but show up duringtransform, we fall back to a global warper trained on all training rows (stored aswarpers_under the"__global__"key). This keeps the transformer defined on test folds without peeking at test data.Examples
>>> from maldibatchkit import BatchAwareWarping >>> warper = BatchAwareWarping(batch=batches, method="piecewise") >>> X_aligned = warper.fit_transform(X)
BatchAwareWarping reuses maldiamrkit.alignment.Warping
internally to warp each batch onto a shared global reference. Pair it
with an intensity-domain corrector (e.g. ComBat)
inside a sklearn.pipeline.Pipeline for a full harmonisation:
from sklearn.pipeline import Pipeline
from maldibatchkit import BatchAwareWarping, ComBat
pipe = Pipeline([
("warp", BatchAwareWarping(batch=batch, method="piecewise")),
("combat", ComBat(batch=batch, method="fortin",
discrete_covariates=species)),
])
X_corrected = pipe.fit_transform(X)