Quickstart Guide
================

This guide walks through the MaldiBatchKit API from a single ComBat call
to full train/test workflows, MaldiSet integration, and diagnostics.

All correctors expect a feature matrix ``X`` (rows = samples, columns =
m/z bins) and a ``batch`` vector aligned to ``X.index``. When ``X`` is a
``pandas.DataFrame``, covariates can be passed as ``Series`` or
``DataFrame`` objects with matching indices. The corrector realigns them
on each ``fit`` and ``transform`` call, so the same object is safe inside
sklearn pipelines.

Vanilla ComBat
--------------

Run Johnson et al. (2007) ComBat as a single harmonisation step:

.. code-block:: python

    from maldibatchkit import ComBat

    corrector = ComBat(batch=batch, method="johnson")
    X_corrected = corrector.fit_transform(X)

Covariate-Aware ComBat (Fortin)
-------------------------------

Protect biological covariates with the Fortin et al. (2018) variant:

.. code-block:: python

    from maldibatchkit import ComBat

    corrector = ComBat(
        batch=batch,
        discrete_covariates=species,
        continuous_covariates=age,
        method="fortin",
    )
    X_corrected = corrector.fit_transform(X)

Species-Aware Preset
--------------------

:class:`~maldibatchkit.SpeciesAwareComBat` is a convenience wrapper for
the common case of "Fortin with species as the protected covariate":

.. code-block:: python

    from maldibatchkit import SpeciesAwareComBat

    corrector = SpeciesAwareComBat(batch=batch, species=species)
    X_corrected = corrector.fit_transform(X)

Quality-Weighted ComBat
-----------------------

Weight samples by a per-sample quality score (typically SNR) so that
low-quality spectra contribute less to the shrinkage prior:

.. code-block:: python

    from maldibatchkit import QualityWeightedComBat

    corrector = QualityWeightedComBat(
        batch=batch,
        quality=snr,
        parametric=True,
        max_iter=30,
    )
    X_corrected = corrector.fit_transform(X)

Train/Test Without Leakage
--------------------------

Every corrector follows the standard sklearn ``fit`` / ``transform``
split: call ``fit`` on the training data only, then ``transform`` each
split with the fitted parameters:

.. code-block:: python

    from sklearn.model_selection import train_test_split
    from maldibatchkit import ComBat

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=batch
    )

    corrector = ComBat(
        batch=batch,
        method="fortin",
        discrete_covariates=species,
    )
    corrector.fit(X_train)                    # learns on train only
    X_train_c = corrector.transform(X_train)
    X_test_c = corrector.transform(X_test)    # same parameters on test

``batch`` is indexed by the same sample IDs as ``X``, so the corrector
slices the right subset on each call.

Inside an Sklearn Pipeline
--------------------------

All correctors expose ``fit`` / ``transform`` / ``fit_transform`` and are
drop-in pipeline steps:

.. code-block:: python

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from maldibatchkit import ComBat

    pipe = Pipeline([
        ("combat", ComBat(batch=batch, method="fortin",
                          discrete_covariates=species)),
        ("scaler", StandardScaler()),
        ("clf", RandomForestClassifier(n_estimators=200)),
    ])

    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"AUROC: {scores.mean():.3f} +/- {scores.std():.3f}")

Linear-Model Subtraction (Limma)
--------------------------------

``Limma`` is a pure-Python implementation of ``limma::removeBatchEffect``
that subtracts an OLS fit of the batch indicators while protecting the
columns of ``design``:

.. code-block:: python

    from maldibatchkit import Limma

    corrector = Limma(batch=batch, design=species_design)
    X_corrected = corrector.fit_transform(X)

Harmony
-------

:class:`~maldibatchkit.Harmony` wraps ``harmonypy`` for iterative
soft-clustering integration. Like the ``harmonypy`` reference, the
corrector is fit-transform only (no clean test-time application):

.. code-block:: python

    from maldibatchkit import Harmony

    corrector = Harmony(
        batch=batch,
        covariates=extra_vars,
        theta=2.0,
        max_iter=20,
        random_state=0,
    )
    X_corrected = corrector.fit_transform(X)

Simple Baselines
----------------

Auditable, no-covariate alternatives useful for sanity checks:

.. code-block:: python

    from maldibatchkit import MedianCentering, ZScorePerBatch, ReferenceScaling

    median = MedianCentering(batch=batch).fit_transform(X)
    zscored = ZScorePerBatch(batch=batch).fit_transform(X)
    ref = ReferenceScaling(batch=batch, reference_batch="site_A").fit_transform(X)

MALDI-Specific: Batch-Aware Warping
-----------------------------------

:class:`~maldibatchkit.BatchAwareWarping` is a MALDI-specific corrector
that warps each batch onto a shared global reference before any
intensity-domain step:

.. code-block:: python

    from maldibatchkit import BatchAwareWarping

    warper = BatchAwareWarping(
        batch=batch,
        reference="median",
        method="piecewise",
        n_segments=8,
        max_shift=10,
        n_jobs=-1,
    )
    X_warped = warper.fit_transform(X)

It is safe to chain with a downstream intensity-domain corrector:

.. code-block:: python

    from sklearn.pipeline import Pipeline
    from maldibatchkit import BatchAwareWarping, ComBat

    pipe = Pipeline([
        ("warp", BatchAwareWarping(batch=batch, method="piecewise")),
        ("combat", ComBat(batch=batch, method="fortin",
                          discrete_covariates=species)),
    ])
    X_corrected = pipe.fit_transform(X)

Diagnostics
-----------

The :mod:`~maldibatchkit.diagnostics` subpackage provides generic
batch-mixing metrics (kBET, LISI, silhouette) and MALDI-specific
summaries (per-batch peak drift, TIC CoV, spectrum count):

.. code-block:: python

    from maldibatchkit.diagnostics import (
        silhouette_batch,
        kbet,
        lisi,
        peak_position_drift,
        tic_cov_per_batch,
        diagnostic_report,
    )

    # Individual metrics
    sil = silhouette_batch(X_corrected, batch)
    kbet_stats = kbet(X_corrected, batch)
    lisi_mean = lisi(X_corrected, batch, perplexity=30.0)

    # One-shot before/after summary
    report = diagnostic_report(
        X, X_corrected, batch,
        mz_values=mz,
        top_k_peaks=40,
    )
    print(report)

The report is a tidy DataFrame with ``metric`` / ``scope`` /
``value_before`` / ``value_after`` / ``delta`` columns, suitable for
further pandas manipulation or for the bundled plotting helpers.

Visualization
-------------

.. code-block:: python

    from maldibatchkit.viz import (
        plot_batch_umap,
        plot_peak_shift,
        plot_diagnostic_summary,
    )

    # Side-by-side UMAP (requires maldibatchkit[viz])
    plot_batch_umap(X, X_corrected, batch, random_state=0)

    # Per-batch peak overlays against a reference spectrum
    plot_peak_shift(batch, X_corrected, mz_values=mz)

    # Before/after bar chart built from a diagnostic_report
    plot_diagnostic_summary(report, scope="overall")

MaldiSet Integration
--------------------

:class:`~maldibatchkit.integrations.MaldiSetAdapter` lets you correct a
:class:`maldiamrkit.MaldiSet` directly, returning a new ``MaldiSet``
whose ``X`` is harmonised and whose AMR labels / metadata are untouched:

.. code-block:: python

    from maldiamrkit import MaldiSet
    from maldibatchkit.integrations import MaldiSetAdapter
    from maldibatchkit import SpeciesAwareComBat

    ds = MaldiSet.from_directory(
        "spectra/", "metadata.csv",
        aggregate_by=dict(antibiotics="Ceftriaxone"),
    )

    adapter = MaldiSetAdapter(
        batch_column="Batch",
        species_column="Species",
        quality_column="SNR",
    )
    corrected_ds = adapter.correct(ds, SpeciesAwareComBat)

    corrected_ds.X  # harmonised feature matrix
    corrected_ds.y  # AMR labels, unchanged

Command-Line Interface
----------------------

MaldiBatchKit ships a Typer-based CLI organised as
``maldibatchkit correct <method>`` plus ``maldibatchkit diagnose``. Each
method has its own subcommand with only the flags it actually uses:

.. code-block:: bash

    # Vanilla Johnson ComBat
    maldibatchkit correct combat \
        -i X.csv --batch-csv batch.csv -o X_corrected.csv

    # Species-aware preset
    maldibatchkit correct species-combat \
        -i X.csv --batch-csv batch.csv --species-csv species.csv \
        -o X_corrected.csv

    # Diagnostic report
    maldibatchkit diagnose \
        -i X.csv --corrected X_corrected.csv --batch-csv batch.csv \
        --mz-csv mz.csv -o report.csv

See the :doc:`CLI Reference ` for the full command tree, NPZ I/O format,
and covariate CSV conventions.
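
Appendix: The Fit/Transform Split in Miniature
----------------------------------------------

The train/test workflow described earlier in this guide rests on two
ideas: parameters are estimated from training samples only, and the batch
label of each sample is looked up by its sample ID. The sketch below
illustrates both with plain per-batch median centering on toy
dictionaries. It is not MaldiBatchKit code and does not reproduce how the
library's correctors are implemented; ``fit_batch_medians`` and
``apply_batch_medians`` are hypothetical helper names chosen for this
example, and the data are made up.

```python
from statistics import median


def fit_batch_medians(X, batch):
    """Learn one median per (batch, feature) from the samples in X only."""
    rows_by_batch = {}
    for sample_id, row in X.items():
        # The batch label is looked up by sample ID, not by position.
        rows_by_batch.setdefault(batch[sample_id], []).append(row)
    # zip(*rows) iterates over the feature columns within one batch.
    return {
        label: [median(col) for col in zip(*rows)]
        for label, rows in rows_by_batch.items()
    }


def apply_batch_medians(X, batch, medians):
    """Subtract the stored batch medians; nothing is re-estimated here."""
    return {
        sample_id: [v - m for v, m in zip(row, medians[batch[sample_id]])]
        for sample_id, row in X.items()
    }


# Toy data (hypothetical): two features, batch "B" carries a +10 offset.
batch = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B", "s6": "B"}
X = {
    "s1": [1.0, 2.0], "s2": [2.0, 3.0], "s3": [3.0, 4.0],
    "s4": [11.0, 12.0], "s5": [12.0, 13.0], "s6": [13.0, 14.0],
}

X_train = {i: X[i] for i in ("s1", "s2", "s4", "s5")}
X_test = {i: X[i] for i in ("s3", "s6")}

medians = fit_batch_medians(X_train, batch)               # train rows only
X_train_c = apply_batch_medians(X_train, batch, medians)
X_test_c = apply_batch_medians(X_test, batch, medians)    # same parameters
```

Because the test samples never enter ``fit_batch_medians``, the held-out
rows cannot influence the estimated offsets. In this toy case the two
test samples, which differ only by the batch offset, end up with
identical corrected values.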