Quickstart Guide
This guide walks through the MaldiBatchKit API from a single ComBat call to full train/test workflows, MaldiSet integration, and diagnostics.
All correctors expect a feature matrix X (rows = samples, columns =
m/z bins) and a batch vector aligned to X.index. When X is a
pandas.DataFrame, covariates can be passed as Series or DataFrames
with matching indices. The corrector realigns them on every fit and
transform call, so the same object is safe to use inside sklearn pipelines.
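The alignment requirement can be illustrated with plain pandas. The sketch below uses made-up sample IDs, bin names, and site labels, and it does not call MaldiBatchKit; it only shows what "aligned to X.index" means when metadata arrives in a different row order.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 spectra, 4 m/z bins.
rng = np.random.default_rng(0)
sample_ids = [f"s{i}" for i in range(6)]
X = pd.DataFrame(
    rng.random((6, 4)),
    index=sample_ids,
    columns=["mz_2000", "mz_2003", "mz_2006", "mz_2009"],
)

# Metadata arrives in a different row order than X.
batch = pd.Series(
    ["site_B", "site_A", "site_B", "site_A", "site_A", "site_B"],
    index=["s5", "s0", "s2", "s3", "s1", "s4"],
    name="batch",
)

# Realign by label, then check that no sample lost its batch.
batch_aligned = batch.reindex(X.index)
assert batch_aligned.notna().all(), "some samples have no batch label"
print(batch_aligned.tolist())
```

Checking for missing labels after reindexing is a cheap guard against silently correcting samples with the wrong batch.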
Vanilla ComBat
Apply Johnson et al. (2007) ComBat in a single harmonisation pass:
from maldibatchkit import ComBat
corrector = ComBat(batch=batch, method="johnson")
X_corrected = corrector.fit_transform(X)
Covariate-Aware ComBat (Fortin)
Protect biological covariates with the Fortin et al. (2018) formulation:
from maldibatchkit import ComBat
corrector = ComBat(
    batch=batch,
    discrete_covariates=species,
    continuous_covariates=age,
    method="fortin",
)
X_corrected = corrector.fit_transform(X)
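Protecting a covariate only works when it is not completely confounded with batch: if a species occurs in a single batch, its biological signal cannot be separated from that batch's effect. A quick cross-tabulation before fitting can reveal this. The sketch below uses hypothetical labels and plain pandas; it is a general sanity check, not a MaldiBatchKit feature.

```python
import pandas as pd

# Hypothetical metadata for eight spectra.
batch = pd.Series(["A", "A", "A", "A", "B", "B", "B", "B"], name="batch")
species = pd.Series(
    ["E. coli", "E. coli", "K. pneumoniae", "K. pneumoniae",
     "E. coli", "E. coli", "E. coli", "K. pneumoniae"],
    name="species",
)

# Number of samples in each (batch, species) cell.
counts = pd.crosstab(batch, species)
print(counts)

# A species seen in only one batch cannot be separated from that batch.
batches_per_species = (counts > 0).sum(axis=0)
confounded = batches_per_species[batches_per_species < 2].index.tolist()
print("Species confined to one batch:", confounded)
```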
Species-Aware Preset
SpeciesAwareComBat is a convenience wrapper for the common case of
“Fortin with species as the protected covariate”:
from maldibatchkit import SpeciesAwareComBat
corrector = SpeciesAwareComBat(batch=batch, species=species)
X_corrected = corrector.fit_transform(X)
Quality-Weighted ComBat
Weight samples by a per-sample quality score (typically SNR) so low-quality spectra contribute less to the shrinkage prior:
from maldibatchkit import QualityWeightedComBat
corrector = QualityWeightedComBat(
    batch=batch,
    quality=snr,
    parametric=True,
    max_iter=30,
)
X_corrected = corrector.fit_transform(X)
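The effect of quality weighting can be seen on a single m/z bin. The sketch below is only an illustration of the weighting principle with made-up numbers; how QualityWeightedComBat uses the scores internally is not specified in this guide, so none of this should be read as its implementation.

```python
import numpy as np

# Hypothetical intensities of one m/z bin in one batch, with SNR-like scores.
intensity = np.array([1.0, 1.0, 5.0])
quality = np.array([10.0, 10.0, 1.0])  # the third spectrum is noisy

plain_mean = intensity.mean()
weighted_mean = np.average(intensity, weights=quality)

# Weighted variance around the weighted mean.
weighted_var = np.average((intensity - weighted_mean) ** 2, weights=quality)

print(f"plain mean:    {plain_mean:.3f}")
print(f"weighted mean: {weighted_mean:.3f}")
print(f"weighted var:  {weighted_var:.3f}")
```

The noisy spectrum pulls the plain mean upward, while the weighted estimate stays close to the two high-quality spectra.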
Train/Test Without Leakage
Every corrector follows the standard sklearn fit / transform
split. Call fit on the training data only, then transform each
split with the fitted parameters:
from sklearn.model_selection import train_test_split
from maldibatchkit import ComBat
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=batch
)
corrector = ComBat(
    batch=batch, method="fortin", discrete_covariates=species,
)
corrector.fit(X_train)                     # learns on train only
X_train_c = corrector.transform(X_train)
X_test_c = corrector.transform(X_test)     # same parameters on test
batch is indexed by the same sample IDs as X, so the corrector
slices the right subset on each call.
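The index-based slicing can be sketched with a toy corrector. ToyBatchCenterer below is a hypothetical stand-in, not a MaldiBatchKit class: it holds the full-length batch Series, looks up the labels of whichever rows it receives, learns per-batch means in fit, and reuses them unchanged in transform.

```python
import numpy as np
import pandas as pd

class ToyBatchCenterer:
    """Hypothetical stand-in for a corrector: subtracts per-batch feature means."""

    def __init__(self, batch):
        self.batch = batch  # full-length Series indexed by sample ID

    def fit(self, X):
        labels = self.batch.loc[X.index]         # slice to the rows being fitted
        self.means_ = X.groupby(labels).mean()   # one row of means per batch
        return self

    def transform(self, X):
        labels = self.batch.loc[X.index]         # slice again for this split
        # Raises KeyError if a batch was never seen during fit.
        offsets = self.means_.loc[labels.to_numpy()].to_numpy()
        return X - offsets

rng = np.random.default_rng(0)
ids = [f"s{i}" for i in range(8)]
X = pd.DataFrame(rng.normal(size=(8, 3)), index=ids, columns=["a", "b", "c"])
batch = pd.Series(["A", "B"] * 4, index=ids)
X.loc[batch == "B"] += 5.0                       # simulated batch offset

X_train, X_test = X.iloc[:6], X.iloc[6:]
corrector = ToyBatchCenterer(batch).fit(X_train)  # parameters from train only
X_train_c = corrector.transform(X_train)
X_test_c = corrector.transform(X_test)            # reused unchanged on test
```

Because the labels are looked up by sample ID, the same corrector object works on any subset of rows without the caller having to split the batch vector by hand.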
Inside an Sklearn Pipeline
All correctors expose fit / transform / fit_transform and are
drop-in pipeline steps:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from maldibatchkit import ComBat
pipe = Pipeline([
    ("combat", ComBat(batch=batch, method="fortin",
                      discrete_covariates=species)),
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=200)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"AUROC: {scores.mean():.3f} +/- {scores.std():.3f}")
Linear-Model Subtraction (Limma)
Limma is a pure-Python implementation of limma::removeBatchEffect
that subtracts an OLS fit of the batch indicators while protecting the
columns of design:
from maldibatchkit import Limma
corrector = Limma(batch=batch, design=species_design)
X_corrected = corrector.fit_transform(X)
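The idea behind this kind of correction can be sketched in a few lines of NumPy. The function below is an illustrative reimplementation under stated assumptions (samples in rows, sum-to-zero coding of the batch indicators as in the R function, a joint least-squares fit); remove_batch_effect is a hypothetical name, and the library's Limma class may differ in detail.

```python
import numpy as np

def remove_batch_effect(X, batch, design=None):
    """OLS batch subtraction on a samples-by-features array (illustrative)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if design is None:
        design = np.ones((n, 1))                 # intercept only
    levels, codes = np.unique(batch, return_inverse=True)
    k = len(levels)
    # Sum-to-zero coding: k - 1 columns, last level coded as -1 everywhere.
    B = np.zeros((n, k - 1))
    for j in range(k - 1):
        B[codes == j, j] = 1.0
    B[codes == k - 1, :] = -1.0
    # Joint fit of the protected design columns and the batch columns.
    coef, *_ = np.linalg.lstsq(np.hstack([design, B]), X, rcond=None)
    batch_coef = coef[design.shape[1]:]
    return X - B @ batch_coef                    # subtract only the batch part

batch = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
species = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
design = np.column_stack([np.ones(8), species])  # intercept + species
# Noise-free toy signal: species effect of 3, batch B offset of 2.
X = (3.0 * species + 2.0 * (batch == "B"))[:, None] * np.ones((1, 2))
X_corrected = remove_batch_effect(X, batch, design)
print(X_corrected[:, 0])
```

In this toy case the batch offset is split evenly around the overall mean, so samples of the same species end up with identical values in both batches while the species difference of 3 is preserved.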
Harmony
Harmony wraps harmonypy for iterative
soft-clustering integration. Like the harmonypy reference implementation, the
corrector is fit_transform only: there is no clean way to apply it to held-out data at test time.
from maldibatchkit import Harmony
corrector = Harmony(
    batch=batch,
    covariates=extra_vars,
    theta=2.0,
    max_iter=20,
    random_state=0,
)
X_corrected = corrector.fit_transform(X)
Simple Baselines
Auditable, no-covariate alternatives useful for sanity checks:
from maldibatchkit import MedianCentering, ZScorePerBatch, ReferenceScaling
median = MedianCentering(batch=batch).fit_transform(X)
zscored = ZScorePerBatch(batch=batch).fit_transform(X)
ref = ReferenceScaling(batch=batch, reference_batch="site_A").fit_transform(X)
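Part of what makes such baselines auditable is that they can be reproduced with a pandas groupby. The sketch below shows one plausible reading of median centering and per-batch z-scoring on synthetic data; the library classes may differ in detail (for example, whether a global median is added back or which degrees of freedom the standard deviation uses), so treat it as an assumption-laden cross-check rather than the reference behaviour.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(6, 3)), columns=["mz_a", "mz_b", "mz_c"])
batch = pd.Series(["site_A"] * 3 + ["site_B"] * 3, name="batch")
X.loc[batch == "site_B"] += 4.0                  # simulated batch offset

grouped = X.groupby(batch)

# Median centering: subtract each batch's per-feature median.
median_centered = X - grouped.transform("median")

# Per-batch z-score: centre and scale each feature within its batch
# (pandas uses the sample standard deviation, ddof=1).
zscored = (X - grouped.transform("mean")) / grouped.transform("std")

print(median_centered.groupby(batch).median())
print(zscored.groupby(batch).std())
```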
MALDI-Specific: Batch-Aware Warping
BatchAwareWarping is a MALDI-specific corrector
that warps each batch onto a shared global reference before any
intensity-domain step:
from maldibatchkit import BatchAwareWarping
warper = BatchAwareWarping(
    batch=batch,
    reference="median",
    method="piecewise",
    n_segments=8,
    max_shift=10,
    n_jobs=-1,
)
X_warped = warper.fit_transform(X)
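To give a feel for what alignment along the m/z axis involves, the sketch below estimates a single global integer shift between a drifted spectrum and a reference by maximising their overlap within a bounded search window. This is a deliberately simpler technique than the piecewise warping configured above, and estimate_shift is a hypothetical helper, not part of MaldiBatchKit.

```python
import numpy as np

def estimate_shift(spectrum, reference, max_shift=10):
    """Integer bin shift in [-max_shift, max_shift] that best aligns spectrum to reference."""
    shifts = np.arange(-max_shift, max_shift + 1)
    # np.roll wraps around, which is harmless here because the edges are ~0.
    scores = [np.dot(np.roll(spectrum, s), reference) for s in shifts]
    return int(shifts[np.argmax(scores)])

bins = np.arange(200)
reference = np.exp(-0.5 * ((bins - 100) / 2.0) ** 2)  # peak at bin 100
drifted = np.roll(reference, 3)                        # same peak at bin 103

shift = estimate_shift(drifted, reference)
aligned = np.roll(drifted, shift)
print("estimated shift:", shift)
```

A piecewise method would, in principle, estimate such a shift separately for each segment of the m/z axis, which lets it follow drift that varies along the spectrum.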
BatchAwareWarping is safe to chain with a downstream intensity-domain corrector:
from sklearn.pipeline import Pipeline
from maldibatchkit import BatchAwareWarping, ComBat
pipe = Pipeline([
    ("warp", BatchAwareWarping(batch=batch, method="piecewise")),
    ("combat", ComBat(batch=batch, method="fortin",
                      discrete_covariates=species)),
])
X_corrected = pipe.fit_transform(X)
Diagnostics
The diagnostics subpackage provides generic
batch-mixing metrics (kBET, LISI, silhouette) and MALDI-specific
summaries (per-batch peak drift, TIC CoV, spectrum count):
from maldibatchkit.diagnostics import (
    silhouette_batch, kbet, lisi,
    peak_position_drift, tic_cov_per_batch,
    diagnostic_report,
)
# Individual metrics
sil = silhouette_batch(X_corrected, batch)
kbet_stats = kbet(X_corrected, batch)
lisi_mean = lisi(X_corrected, batch, perplexity=30.0)
# One-shot before/after summary
report = diagnostic_report(
    X, X_corrected, batch,
    mz_values=mz, top_k_peaks=40,
)
print(report)
The report is a tidy DataFrame with metric / scope /
value_before / value_after / delta columns, suitable for
further pandas manipulation or for the bundled plotting helpers.
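The tidy layout can be mimicked by hand for a single metric. The sketch below assumes that "TIC CoV" means the coefficient of variation of each spectrum's total ion current (row sum) within a batch and that delta is value_after minus value_before; both are assumptions about the report, and tic_cov is a hypothetical helper rather than the library's tic_cov_per_batch.

```python
import numpy as np
import pandas as pd

def tic_cov(X, batch):
    """Coefficient of variation of total ion current (row sums) within each batch."""
    tic = X.sum(axis=1)
    stats = tic.groupby(batch).agg(["mean", "std"])
    return stats["std"] / stats["mean"]

rng = np.random.default_rng(0)
X_before = pd.DataFrame(rng.uniform(1.0, 2.0, size=(8, 5)))
batch = pd.Series(["A"] * 4 + ["B"] * 4, name="batch")
# Hypothetical "correction": rescale every spectrum to unit TIC.
X_after = X_before.div(X_before.sum(axis=1), axis=0)

before, after = tic_cov(X_before, batch), tic_cov(X_after, batch)
report = pd.DataFrame({
    "metric": "tic_cov",
    "scope": list(before.index),
    "value_before": before.to_numpy(),
    "value_after": after.to_numpy(),
})
report["delta"] = report["value_after"] - report["value_before"]  # assumed sign
print(report)
```

With one row per metric and scope, filtering or pivoting the result is a one-line pandas operation.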
Visualization
from maldibatchkit.viz import (
    plot_batch_umap,
    plot_peak_shift,
    plot_diagnostic_summary,
)
# Side-by-side UMAP (requires maldibatchkit[viz])
plot_batch_umap(X, X_corrected, batch, random_state=0)
# Per-batch peak overlays against a reference spectrum
plot_peak_shift(batch, X_corrected, mz_values=mz)
# Before/after bar chart built from a diagnostic_report
plot_diagnostic_summary(report, scope="overall")
MaldiSet Integration
MaldiSetAdapter lets you correct a
maldiamrkit.MaldiSet directly, returning a new MaldiSet
whose X is harmonised and whose AMR labels / metadata are untouched:
from maldiamrkit import MaldiSet
from maldibatchkit.integrations import MaldiSetAdapter
from maldibatchkit import SpeciesAwareComBat
ds = MaldiSet.from_directory(
    "spectra/", "metadata.csv",
    aggregate_by=dict(antibiotics="Ceftriaxone"),
)
adapter = MaldiSetAdapter(
    batch_column="Batch",
    species_column="Species",
    quality_column="SNR",
)
corrected_ds = adapter.correct(ds, SpeciesAwareComBat)
corrected_ds.X # harmonised feature matrix
corrected_ds.y # AMR labels, unchanged
Command-Line Interface
MaldiBatchKit ships a Typer-based CLI organised as
maldibatchkit correct <method> plus maldibatchkit diagnose.
Each method has its own subcommand with only the flags it actually uses:
# Vanilla Johnson ComBat
maldibatchkit correct combat \
    -i X.csv --batch-csv batch.csv -o X_corrected.csv
# Species-aware preset
maldibatchkit correct species-combat \
    -i X.csv --batch-csv batch.csv --species-csv species.csv \
    -o X_corrected.csv
# Diagnostic report
maldibatchkit diagnose \
    -i X.csv --corrected X_corrected.csv --batch-csv batch.csv \
    --mz-csv mz.csv -o report.csv
See the CLI Reference for the full command tree, NPZ I/O format, and covariate CSV conventions.