Quickstart Guide

This guide walks through the MaldiBatchKit API from a single ComBat call to full train/test workflows, MaldiSet integration, and diagnostics.

All correctors expect a feature matrix X (rows = samples, columns = m/z bins) and a batch vector aligned to X.index. When X is a pandas.DataFrame, covariates can be passed as Series or DataFrames indexed by the same sample IDs. The corrector realigns them to X.index on every fit and transform call, so the same object is safe to reuse inside sklearn pipelines.
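As a minimal sketch of the expected inputs, the snippet below builds a toy X, batch, species and age with pandas. The sample IDs, bin labels, site names and species are hypothetical placeholders, not data shipped with the library.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical sample IDs and m/z bin labels, for illustration only.
sample_ids = [f"s{i:03d}" for i in range(12)]
mz_bins = [str(mz) for mz in range(2000, 2030, 3)]  # 10 bins

# Feature matrix: rows = samples, columns = m/z bins.
X = pd.DataFrame(
    rng.gamma(2.0, 1.0, size=(12, 10)), index=sample_ids, columns=mz_bins
)

# Batch vector and covariates share X's index.
batch = pd.Series(["site_A"] * 6 + ["site_B"] * 6, index=sample_ids, name="batch")
species = pd.Series(["E. coli", "K. pneumoniae"] * 6, index=sample_ids, name="species")
age = pd.Series(rng.integers(20, 80, size=12), index=sample_ids, name="age")

# Row order does not matter as long as the index labels match:
# selecting by X.index restores the original order.
shuffled_batch = batch.sample(frac=1.0, random_state=0)
realigned = shuffled_batch.loc[X.index]
```

The last two lines show the kind of label-based realignment described above, done by hand with plain pandas.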

Vanilla ComBat

Run the original ComBat of Johnson et al. (2007) as a one-off harmonisation of X:

from maldibatchkit import ComBat

corrector = ComBat(batch=batch, method="johnson")
X_corrected = corrector.fit_transform(X)
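For intuition, ComBat treats each batch as an additive (location) and multiplicative (scale) distortion of every feature. The sketch below applies that location/scale adjustment with plain pandas and deliberately omits the empirical Bayes shrinkage that distinguishes ComBat proper. The function name location_scale_adjust is hypothetical and this is not the library's implementation.

```python
import numpy as np
import pandas as pd


def location_scale_adjust(X: pd.DataFrame, batch: pd.Series) -> pd.DataFrame:
    """Per-batch location/scale adjustment WITHOUT empirical Bayes shrinkage."""
    batch = batch.loc[X.index]                      # align labels to X
    grand_mean = X.mean(axis=0)
    batch_means = X.groupby(batch).transform("mean")
    # Pooled within-batch standard deviation per feature.
    pooled_sd = np.sqrt(((X - batch_means) ** 2).mean(axis=0))
    Z = (X - grand_mean) / pooled_sd                # standardised data
    gamma = Z.groupby(batch).transform("mean")      # additive batch effect
    delta = Z.groupby(batch).transform("std")       # multiplicative batch effect
    return (Z - gamma) / delta * pooled_sd + grand_mean


# Toy data: batch B is shifted by +3 and scaled by 2.
rng = np.random.default_rng(0)
ids = [f"s{i}" for i in range(40)]
batch = pd.Series(np.repeat(["A", "B"], 20), index=ids, name="batch")
X = pd.DataFrame(
    rng.normal(size=(40, 5)), index=ids, columns=[f"mz_{j}" for j in range(5)]
)
X.loc[batch == "B"] = X.loc[batch == "B"] * 2.0 + 3.0

X_adj = location_scale_adjust(X, batch)
```

After the adjustment both batches share the same per-feature mean and standard deviation. ComBat additionally shrinks the per-batch estimates towards a common prior, which matters most when batches are small.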

Covariate-Aware ComBat (Fortin)

Protect biological covariates from being removed along with the batch effect, following Fortin et al. (2018):

from maldibatchkit import ComBat

corrector = ComBat(
    batch=batch,
    discrete_covariates=species,
    continuous_covariates=age,
    method="fortin",
)
X_corrected = corrector.fit_transform(X)
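Why protection matters can be seen on a toy example where species composition differs between batches. The sketch below uses a single feature with no batch effect at all and compares covariate-blind per-batch centring against a joint least-squares fit that includes species. It illustrates the idea only and is not the Fortin algorithm or the library's code.

```python
import numpy as np

# 200 spectra, one feature, NO batch effect; species 1 adds +5 intensity.
batch = np.repeat([0, 1], 100)
species = np.concatenate(
    [np.repeat([0, 1], [80, 20]), np.repeat([0, 1], [20, 80])]
)  # batch 0 is mostly species 0, batch 1 mostly species 1
y = 5.0 * species


def species_gap(values):
    """Difference between the species means of a corrected feature."""
    return values[species == 1].mean() - values[species == 0].mean()


# Covariate-blind: centre each batch on its own mean.
batch_mean = np.array([y[batch == b].mean() for b in (0, 1)])
naive = y - batch_mean[batch]

# Covariate-aware: fit intercept + species + batch jointly,
# then subtract only the batch term.
design = np.column_stack([np.ones_like(y), species, batch]).astype(float)
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
aware = y - coef[2] * batch

gap_naive = species_gap(naive)
gap_aware = species_gap(aware)
```

Per-batch centring shrinks the true species gap from 5 to about 3.2 because it mistakes the composition difference for a batch effect, while the joint fit leaves the gap at 5.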

Species-Aware Preset

SpeciesAwareComBat is a convenience wrapper for the common case of “Fortin with species as the protected covariate”:

from maldibatchkit import SpeciesAwareComBat

corrector = SpeciesAwareComBat(batch=batch, species=species)
X_corrected = corrector.fit_transform(X)

Quality-Weighted ComBat

Weight samples by a per-sample quality score (typically SNR) so low-quality spectra contribute less to the shrinkage prior:

from maldibatchkit import QualityWeightedComBat

corrector = QualityWeightedComBat(
    batch=batch,
    quality=snr,
    parametric=True,
    max_iter=30,
)
X_corrected = corrector.fit_transform(X)
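The effect of quality weighting can be sketched on a single batch-level estimate. How QualityWeightedComBat maps quality scores to weights is not described here, so the snippet assumes weights proportional to SNR purely for illustration.

```python
import numpy as np

# Intensities of one m/z bin in one batch; the last spectrum is a noisy outlier.
intensity = np.array([10.0, 10.5, 9.5, 10.2, 25.0])
snr = np.array([40.0, 35.0, 38.0, 42.0, 2.0])

# ASSUMPTION: weights proportional to SNR (the library may use another mapping).
weights = snr / snr.sum()

plain_mean = intensity.mean()
weighted_mean = np.average(intensity, weights=weights)
```

The unweighted mean is pulled towards the outlier, whereas the weighted mean stays close to the four high-SNR spectra.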

Train/Test Without Leakage

Every corrector follows the standard sklearn fit / transform split - call fit on the training data only, then transform each split with the fitted parameters:

from sklearn.model_selection import train_test_split
from maldibatchkit import ComBat

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=batch
)

corrector = ComBat(
    batch=batch, method="fortin", discrete_covariates=species,
)
corrector.fit(X_train)              # learns on train only
X_train_c = corrector.transform(X_train)
X_test_c = corrector.transform(X_test)   # same parameters on test

Because batch and species are indexed by the same sample IDs as X, the corrector selects the matching subset on each fit and transform call, even though it was constructed with the full vectors.
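The fit/transform contract and the index-based slicing can be sketched with a toy corrector. BatchMeanCenterer below is a hypothetical stand-in written for this illustration; it is not part of MaldiBatchKit and only centres per-batch means.

```python
import numpy as np
import pandas as pd


class BatchMeanCenterer:
    """Toy corrector: removes per-batch feature means learned in fit()."""

    def __init__(self, batch: pd.Series):
        self.batch = batch  # full batch vector, indexed by sample ID

    def fit(self, X: pd.DataFrame):
        b = self.batch.loc[X.index]          # slice to the rows actually seen
        self.grand_mean_ = X.mean(axis=0)
        self.batch_means_ = X.groupby(b).mean()
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        b = self.batch.loc[X.index]
        # Look up the TRAIN-time mean of each row's batch.
        offsets = self.batch_means_.loc[b.to_numpy()].to_numpy()
        return X - offsets + self.grand_mean_


rng = np.random.default_rng(0)
ids = [f"s{i}" for i in range(40)]
batch = pd.Series(np.repeat(["A", "B"], 20), index=ids)
X = pd.DataFrame(rng.normal(size=(40, 3)), index=ids, columns=["mz1", "mz2", "mz3"])
X.loc[batch == "B"] += 4.0                   # additive batch effect

train_ids = ids[:15] + ids[20:35]
test_ids = ids[15:20] + ids[35:]

corrector = BatchMeanCenterer(batch).fit(X.loc[train_ids])
X_train_c = corrector.transform(X.loc[train_ids])
X_test_c = corrector.transform(X.loc[test_ids])   # reuses train parameters
```

The test rows never influence batch_means_ or grand_mean_, which is the property that prevents leakage.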

Inside an Sklearn Pipeline

All correctors expose fit / transform / fit_transform and are drop-in pipeline steps:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from maldibatchkit import ComBat

pipe = Pipeline([
    ("combat", ComBat(batch=batch, method="fortin",
                      discrete_covariates=species)),
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=200)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"AUROC: {scores.mean():.3f} +/- {scores.std():.3f}")

Linear-Model Subtraction (Limma)

Limma is a pure-Python implementation of limma::removeBatchEffect that subtracts an OLS fit of the batch indicators while protecting the columns of design:

from maldibatchkit import Limma

corrector = Limma(batch=batch, design=species_design)
X_corrected = corrector.fit_transform(X)
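The underlying computation can be sketched in a few lines of NumPy: fit ordinary least squares on the protected design plus sum-to-zero batch indicators, then subtract only the batch part of the fit. The function remove_batch_effect below is a hypothetical illustration with samples in rows; it is not the library's Limma class and assumes design already contains an intercept column.

```python
import numpy as np


def remove_batch_effect(X, batch, design=None):
    """Subtract the OLS batch term while keeping the `design` columns."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    n = X.shape[0]
    if design is None:
        design = np.ones((n, 1))             # intercept only
    design = np.asarray(design, dtype=float)

    # Sum-to-zero coding: k - 1 columns, last level coded as -1 everywhere.
    levels = np.unique(batch)
    codes = np.zeros((n, len(levels) - 1))
    for j, level in enumerate(levels[:-1]):
        codes[batch == level, j] = 1.0
    codes[batch == levels[-1], :] = -1.0

    # Joint fit, then remove only the batch contribution.
    full = np.hstack([design, codes])
    coef, *_ = np.linalg.lstsq(full, X, rcond=None)
    batch_coef = coef[design.shape[1]:]
    return X - codes @ batch_coef


rng = np.random.default_rng(0)
batch = np.repeat(["b1", "b2", "b3"], [8, 12, 10])
offsets = {"b1": 0.0, "b2": 2.0, "b3": -1.5}
shift = np.array([offsets[b] for b in batch])[:, None]
X = rng.normal(size=(30, 4)) + shift

X_corrected = remove_batch_effect(X, batch)
batch_means = np.vstack(
    [X_corrected[batch == b].mean(axis=0) for b in np.unique(batch)]
)
```

With an intercept-only design every batch ends up with the same per-feature mean. Passing a design with extra columns (for example a species indicator) keeps the variation explained by those columns in the corrected data.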

Harmony

Harmony wraps harmonypy for iterative soft-clustering integration. Like the harmonypy reference implementation, the corrector supports fit_transform only: there is no clean way to apply a fitted model to unseen test data.

from maldibatchkit import Harmony

corrector = Harmony(
    batch=batch,
    covariates=extra_vars,
    theta=2.0,
    max_iter=20,
    random_state=0,
)
X_corrected = corrector.fit_transform(X)

Simple Baselines

Auditable, no-covariate alternatives useful for sanity checks:

from maldibatchkit import MedianCentering, ZScorePerBatch, ReferenceScaling

median = MedianCentering(batch=batch).fit_transform(X)
zscored = ZScorePerBatch(batch=batch).fit_transform(X)
ref = ReferenceScaling(batch=batch, reference_batch="site_A").fit_transform(X)
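Two of these baselines are simple enough to reproduce with a pandas groupby, which is a convenient way to audit them. The sketch below shows one plausible reading of median centring and per-batch z-scoring. The exact conventions of MedianCentering and ZScorePerBatch (for example whether a global median is added back) are assumptions here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ids = [f"s{i}" for i in range(30)]
batch = pd.Series(np.repeat(["site_A", "site_B", "site_C"], 10), index=ids)
X = pd.DataFrame(
    rng.normal(size=(30, 4)), index=ids, columns=[f"mz_{j}" for j in range(4)]
)

grouped = X.groupby(batch)

# Median centring: subtract each batch's per-feature median.
median_centred = X - grouped.transform("median")

# Per-batch z-score: zero mean and unit standard deviation within each batch.
zscored = (X - grouped.transform("mean")) / grouped.transform("std")
```

Comparing such hand-rolled results with the library output on a small matrix is a quick sanity check before moving to the model-based correctors.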

MALDI-Specific: Batch-Aware Warping

BatchAwareWarping is a MALDI-specific corrector that warps each batch onto a shared global reference before any intensity-domain step:

from maldibatchkit import BatchAwareWarping

warper = BatchAwareWarping(
    batch=batch,
    reference="median",
    method="piecewise",
    n_segments=8,
    max_shift=10,
    n_jobs=-1,
)
X_warped = warper.fit_transform(X)

It is safe to chain with a downstream intensity-domain corrector:

from sklearn.pipeline import Pipeline
from maldibatchkit import BatchAwareWarping, ComBat

pipe = Pipeline([
    ("warp", BatchAwareWarping(batch=batch, method="piecewise")),
    ("combat", ComBat(batch=batch, method="fortin",
                      discrete_covariates=species)),
])
X_corrected = pipe.fit_transform(X)
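To illustrate what aligning onto a reference means along the m/z axis, the sketch below estimates a single global bin shift by cross-correlation within a max_shift window. This is a much simpler stand-in than piecewise warping and is not how BatchAwareWarping is implemented. The function estimate_shift is hypothetical.

```python
import numpy as np


def estimate_shift(spectrum, reference, max_shift=10):
    """Integer bin shift (within +/- max_shift) that best matches `reference`."""
    shifts = np.arange(-max_shift, max_shift + 1)
    # Undo each candidate shift and score the overlap with the reference.
    scores = [np.dot(np.roll(spectrum, -s), reference) for s in shifts]
    return int(shifts[int(np.argmax(scores))])


# Synthetic reference with one Gaussian peak, far from the array edges
# so that the circular behaviour of np.roll does not matter.
bins = np.arange(200)
reference = np.exp(-0.5 * ((bins - 100) / 3.0) ** 2)
shifted = np.roll(reference, 4)              # batch drifted by +4 bins

shift = estimate_shift(shifted, reference)
aligned = np.roll(shifted, -shift)
```

A piecewise method would estimate one such shift per m/z segment and interpolate between them, which allows drift that varies across the mass range.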

Diagnostics

The diagnostics subpackage provides generic batch-mixing metrics (kBET, LISI, silhouette) and MALDI-specific summaries (per-batch peak drift, TIC CoV, spectrum count):

from maldibatchkit.diagnostics import (
    silhouette_batch, kbet, lisi,
    peak_position_drift, tic_cov_per_batch,
    diagnostic_report,
)

# Individual metrics
sil = silhouette_batch(X_corrected, batch)
kbet_stats = kbet(X_corrected, batch)
lisi_mean = lisi(X_corrected, batch, perplexity=30.0)

# One-shot before/after summary
report = diagnostic_report(
    X, X_corrected, batch,
    mz_values=mz, top_k_peaks=40,
)
print(report)

The report is a tidy DataFrame with metric / scope / value_before / value_after / delta columns - suitable for further pandas manipulation or for the bundled plotting helpers.
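The sketch below builds a table with the same column layout for one metric, the coefficient of variation of the total ion current (TIC), to show how such a report can be consumed. The helper names, the scope labels and the sign convention delta = value_after - value_before are assumptions made for this example, not the output of diagnostic_report.

```python
import numpy as np
import pandas as pd


def tic_cov(X: pd.DataFrame) -> float:
    """Coefficient of variation of the per-spectrum total ion current."""
    tic = X.sum(axis=1)
    return float(tic.std() / tic.mean())


def before_after_table(X_before, X_after, batch):
    scopes = {"overall": X_before.index}
    for name in batch.unique():
        scopes[str(name)] = batch.index[(batch == name).to_numpy()]
    rows = []
    for scope, ids in scopes.items():
        before, after = tic_cov(X_before.loc[ids]), tic_cov(X_after.loc[ids])
        rows.append({
            "metric": "tic_cov", "scope": scope,
            "value_before": before, "value_after": after,
            "delta": after - before,         # ASSUMED sign convention
        })
    return pd.DataFrame(rows)


rng = np.random.default_rng(0)
ids = [f"s{i}" for i in range(40)]
batch = pd.Series(np.repeat(["A", "B"], 20), index=ids)
X = pd.DataFrame(rng.gamma(2.0, 1.0, size=(40, 50)), index=ids)
X.loc[batch == "B"] *= 3.0                   # batch B is three times brighter

# Toy correction: rescale each batch to the overall mean TIC.
tic = X.sum(axis=1)
scale = tic.mean() / tic.groupby(batch).transform("mean")
X_after = X.mul(scale, axis=0)

report = before_after_table(X, X_after, batch)
improved = report[report["delta"] < 0]       # rows where the metric dropped
```

Because the table is tidy, filtering by scope or metric and pivoting for plots are one-line pandas operations.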

Visualization

from maldibatchkit.viz import (
    plot_batch_umap,
    plot_peak_shift,
    plot_diagnostic_summary,
)

# Side-by-side UMAP (requires maldibatchkit[viz])
plot_batch_umap(X, X_corrected, batch, random_state=0)

# Per-batch peak overlays against a reference spectrum
plot_peak_shift(batch, X_corrected, mz_values=mz)

# Before/after bar chart built from a diagnostic_report
plot_diagnostic_summary(report, scope="overall")

MaldiSet Integration

MaldiSetAdapter lets you correct a maldiamrkit.MaldiSet directly, returning a new MaldiSet whose X is harmonised and whose AMR labels / metadata are untouched:

from maldiamrkit import MaldiSet
from maldibatchkit.integrations import MaldiSetAdapter
from maldibatchkit import SpeciesAwareComBat

ds = MaldiSet.from_directory(
    "spectra/", "metadata.csv",
    aggregate_by=dict(antibiotics="Ceftriaxone"),
)

adapter = MaldiSetAdapter(
    batch_column="Batch",
    species_column="Species",
    quality_column="SNR",
)
corrected_ds = adapter.correct(ds, SpeciesAwareComBat)

corrected_ds.X      # harmonised feature matrix
corrected_ds.y      # AMR labels, unchanged

Command-Line Interface

MaldiBatchKit ships a Typer-based CLI organised as maldibatchkit correct <method> plus maldibatchkit diagnose. Each method has its own subcommand with only the flags it actually uses:

# Vanilla Johnson ComBat
maldibatchkit correct combat \
    -i X.csv --batch-csv batch.csv -o X_corrected.csv

# Species-aware preset
maldibatchkit correct species-combat \
    -i X.csv --batch-csv batch.csv --species-csv species.csv \
    -o X_corrected.csv

# Diagnostic report
maldibatchkit diagnose \
    -i X.csv --corrected X_corrected.csv --batch-csv batch.csv \
    --mz-csv mz.csv -o report.csv

See the CLI Reference for the full command tree, NPZ I/O format, and covariate CSV conventions.