Metrics Module
Batch-aware downstream classifier metrics. Each function computes a sklearn metric per batch on the held-out fold, then aggregates the per-batch scalars into a single number according to a chosen weighting scheme. They are the right way to score a corrector when the goal is a model that performs well on every site, not one that just maximises the pooled metric.
Weighting modes
Every batch_*_score function accepts weights=:
- 'uniform' - each batch contributes one entry with equal weight (w_i = 1 / n_batches), regardless of its sample count. The right default when the goal is a corrector that generalises across sites.
- 'balanced' - weights inversely proportional to per-batch size (w_i ∝ 1 / n_i). The smallest site gets the loudest voice, mirroring sklearn's class_weight='balanced' formula applied at the per-batch level. Use when minority sites are the hardest distributions to learn and the corrector must protect them.
- 'size' - per-batch weights proportional to sample counts. The relationship to the standard pooled value depends on the metric family:
  - Additive correct / total metrics (accuracy, error rate): 'size' exactly recovers the pooled value.
  - Class-conditional rates (sensitivity, specificity): recovering the pooled value needs class-conditional weights (positives-per-batch for sensitivity, negatives-per-batch for specificity), not total-sample weights. The two are equal only when class prevalence is constant across batches.
  - Non-linear metrics (F1, precision, balanced accuracy, MCC): ratios of sums or non-linear functions of confusion-matrix counts. No per-batch weighting recovers the pooled value exactly.
  - Pairwise / rank-based metrics (AUROC, average precision): pooled values count cross-batch positive-vs-negative pairs, which no per-batch reducer can see by construction - those pairs are gone, not approximated.
  In every non-additive case 'size' still gives the dominant site the loudest voice - useful when you want to score the classifier as a single-site practitioner would.
- mapping {batch_label: weight} or array aligned with np.unique(batch) - explicit per-batch weights for any custom policy.
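The weighting modes can be illustrated with a small NumPy sketch. It is not the library's implementation: `batch_weights` and `sensitivity` are hypothetical helpers and the toy labels are made up. Under those assumptions it shows that 'size' weights reproduce pooled accuracy, whereas pooled sensitivity is recovered only by positives-per-batch weights.

```python
import numpy as np


def batch_weights(batch, mode):
    """Return (labels, weights), with weights normalised to sum to 1.

    Hypothetical helper for illustration only.
    """
    labels, counts = np.unique(batch, return_counts=True)
    if mode == "uniform":
        raw = np.ones(len(labels))
    elif mode == "balanced":
        raw = 1.0 / counts
    elif mode == "size":
        raw = counts.astype(float)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return labels, raw / raw.sum()


def sensitivity(t, p):
    """Fraction of true positives among the positive samples."""
    return float(np.mean(p[t == 1] == 1))


# Toy data: site A has 6 samples, site B has 2, with different prevalence.
batch = np.array(["A"] * 6 + ["B"] * 2)
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0])

labels, w_size = batch_weights(batch, "size")
_, w_uniform = batch_weights(batch, "uniform")
_, w_balanced = batch_weights(batch, "balanced")

# Additive metric: size-weighted per-batch accuracy equals pooled accuracy.
acc = np.array([np.mean(y_true[batch == b] == y_pred[batch == b]) for b in labels])
pooled_acc = float(np.mean(y_true == y_pred))
size_acc = float(w_size @ acc)

# Class-conditional rate: total-sample weights miss the pooled value,
# positives-per-batch weights recover it.
sens = np.array([sensitivity(y_true[batch == b], y_pred[batch == b]) for b in labels])
n_pos = np.array([np.sum(y_true[batch == b] == 1) for b in labels], dtype=float)
pooled_sens = sensitivity(y_true, y_pred)
size_sens = float(w_size @ sens)
pos_weighted_sens = float((n_pos / n_pos.sum()) @ sens)

print(pooled_acc, size_acc)
print(pooled_sens, size_sens, pos_weighted_sens)
```

The same per-batch values reduced with `w_uniform` or `w_balanced` give the smaller site B more influence than its sample count alone would.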
Within each batch, the wrapped sklearn metric’s own averaging parameter
(average='binary', 'macro', 'weighted', 'micro') is
preserved untouched - weights= only controls the across-batch
reducer.
Degenerate folds (a batch containing only one class for AUROC / average precision) are dropped from the aggregate with a warning, and the remaining weights are renormalised.
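A minimal sketch of the drop-and-renormalise behaviour described above, assuming raw per-batch weights supplied as a mapping. `auroc` is a small pairwise implementation used to keep the example dependency-free (it is not sklearn's), and `per_batch_auroc` is a hypothetical helper rather than the library's code.

```python
import warnings

import numpy as np


def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def per_batch_auroc(y_true, y_score, batch, weights):
    """Weighted mean of per-batch AUROC, skipping single-class batches.

    `weights` maps batch label -> raw (unnormalised) weight.
    """
    values, kept = [], []
    for b in np.unique(batch):
        mask = batch == b
        if np.unique(y_true[mask]).size < 2:
            # AUROC is undefined here: warn and leave the batch out.
            warnings.warn(f"batch {b!r} has a single class; dropped")
            continue
        values.append(auroc(y_true[mask], y_score[mask]))
        kept.append(weights[b])
    if not values:
        return float("nan")
    w = np.asarray(kept, dtype=float)
    w /= w.sum()  # renormalise over the surviving batches
    return float(w @ np.asarray(values))


# Toy data: batch C contains positives only, so it is degenerate for AUROC.
batch = np.array(["A", "A", "A", "A", "B", "B", "C", "C"])
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.9, 0.2, 0.4, 0.6, 0.8, 0.3, 0.7, 0.5])

score = per_batch_auroc(y_true, y_score, batch, {"A": 1.0, "B": 1.0, "C": 1.0})
pooled = auroc(y_true, y_score)  # includes cross-batch pairs
print(score, pooled)
```

On this toy data the pooled value differs from the per-batch aggregate because it also counts positive-vs-negative pairs that span batches.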
Per-batch metric functions
- maldibatchkit.metrics.batch_roc_auc_score(y_true, y_score, *, batch, weights='uniform', **kwargs)
Per-batch AUROC, aggregated with weights.
- Parameters:
  - y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Ground-truth binary labels.
  - y_score (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Probability of the positive class (or any monotone score).
  - batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Per-sample batch labels.
  - weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list]) – See module docstring.
  - **kwargs (Any) – Forwarded to sklearn.metrics.roc_auc_score() (average, multi_class, …).
- Return type:
- maldibatchkit.metrics.batch_average_precision_score(y_true, y_score, *, batch, weights='uniform', **kwargs)
Per-batch average precision (PR-AUC), aggregated with weights.
- Parameters:
  - y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - y_score (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list])
  - kwargs (Any)
- Return type:
- maldibatchkit.metrics.batch_balanced_accuracy_score(y_true, y_pred, *, batch, weights='uniform', **kwargs)
Per-batch balanced accuracy, aggregated with weights.
- Parameters:
  - y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - y_pred (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list])
  - kwargs (Any)
- Return type:
- maldibatchkit.metrics.batch_matthews_corrcoef(y_true, y_pred, *, batch, weights='uniform')
Per-batch Matthews correlation coefficient, aggregated with weights.
MCC is in [-1, 1]; 0 means random prediction. A batch where y_pred is constant or only one class is present yields MCC = 0 and is not considered degenerate (sklearn returns 0 with a warning). This differs from AUROC, where the metric is undefined.
- Parameters:
- Return type:
- maldibatchkit.metrics.batch_f1_score(y_true, y_pred, *, batch, weights='uniform', average='binary', **kwargs)
Per-batch F1 score, aggregated with weights.
- Parameters:
  - average (str | None) – Forwarded to sklearn.metrics.f1_score(). Controls how the F1 is averaged within each batch's class set ('binary', 'macro', 'weighted', 'micro'). The across-batch reduction is controlled by weights=.
  - y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - y_pred (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list])
  - kwargs (Any)
- Return type:
- maldibatchkit.metrics.batch_precision_score(y_true, y_pred, *, batch, weights='uniform', average='binary', pos_label=1, **kwargs)
Per-batch precision, aggregated with weights.
Set pos_label=0 (or average='macro' / 'micro' / 'weighted') for negative-class or aggregated precision; see sklearn.metrics.precision_score().
- Parameters:
  - y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - y_pred (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list])
  - kwargs (Any)
- Return type:
- maldibatchkit.metrics.batch_recall_score(y_true, y_pred, *, batch, weights='uniform', average='binary', pos_label=1, **kwargs)
Per-batch recall, aggregated with weights.
- Parameters:
  - y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - y_pred (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
  - weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list])
  - kwargs (Any)
- Return type:
Scorer factory
make_batch_scorer() wraps a per-batch
metric and an aggregation into a sklearn-compatible
scorer(estimator, X, y) callable suitable for
GridSearchCV(scoring=...). It captures the full batch Series at
factory time and slices it by y.index for every fold; pass y
as a pandas.Series indexed by the sample IDs used by X and
batch for safe alignment.
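The index-based slicing described above can be sketched as follows. This is a simplified, hypothetical stand-in (`make_simple_batch_scorer`, `AlwaysOne`, and `accuracy` are illustrative names, and only uniform weighting is implemented), not the actual `make_batch_scorer()` code. It assumes `batch` and `y` are `pandas.Series` sharing the same sample IDs.

```python
import numpy as np
import pandas as pd


def make_simple_batch_scorer(batch, metric):
    """Hypothetical, simplified batch-aware scorer factory (uniform weights)."""
    batch = pd.Series(batch)  # captured once, at factory time

    def scorer(estimator, X, y):
        # Align by sample ID rather than by position, so shuffled folds work.
        fold_batch = batch.loc[y.index]
        y_pred = pd.Series(estimator.predict(X), index=y.index)
        per_batch = [
            metric(y[fold_batch == b], y_pred[fold_batch == b])
            for b in fold_batch.unique()
        ]
        return float(np.mean(per_batch))

    return scorer


class AlwaysOne:
    """Dummy estimator that predicts the positive class for every sample."""

    def predict(self, X):
        return np.ones(len(X), dtype=int)


def accuracy(t, p):
    return float(np.mean(t.to_numpy() == p.to_numpy()))


ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
batch = pd.Series(["A", "A", "A", "A", "B", "B"], index=ids)
y = pd.Series([1, 1, 1, 0, 0, 0], index=ids)
X = pd.DataFrame({"f": np.arange(6.0)}, index=ids)

scorer = make_simple_batch_scorer(batch, accuracy)
fold = ["s6", "s2", "s4"]  # a shuffled validation fold
fold_score = scorer(AlwaysOne(), X.loc[fold], y.loc[fold])
print(fold_score)
```

If `y` were a bare NumPy array there would be no index to slice by, which is why passing an indexed `pandas.Series` is the safe choice.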
- maldibatchkit.metrics.make_batch_scorer(batch, metric='roc_auc', *, weights='uniform', response_method=None, greater_is_better=True, pos_class_index=1, **metric_kwargs)
Return a sklearn-compatible scorer(estimator, X, y).
- Parameters:
  - batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Per-sample batch labels for the full dataset. The scorer slices this by y.index on each fold; pass a pandas.Series indexed by the sample IDs used by X and y for safe alignment.
  - metric (str | Callable[..., float]) – Registered alias ('roc_auc', 'average_precision', 'balanced_accuracy', 'mcc' / 'matthews_corrcoef', 'f1', 'precision', 'recall'), or a callable with the same signature as the batch_*_score functions ((y_true, y_pred_or_score, *, batch, weights, **kw)).
  - weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list]) – Per-batch aggregation weighting (see module docstring).
  - response_method (str | None) – How the scorer should produce predictions from the estimator. Defaults to the metric's natural choice ('predict_proba' for AUROC / AP, 'predict' for the rest). Required when metric is a callable.
  - greater_is_better (bool) – If False, the returned scorer negates the metric so larger is always better (sklearn convention).
  - pos_class_index (int) – Column index in predict_proba(...) to use as the positive score when response_method='predict_proba'.
  - **metric_kwargs (Any) – Forwarded verbatim to the underlying metric (e.g. average='macro', pos_label=0).
- Returns:
  scorer – scorer(estimator, X, y) -> float, signed by greater_is_better.
- Return type:
Examples
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> from maldibatchkit import AutoCorrector
>>> from maldibatchkit.metrics import make_batch_scorer
>>>
>>> scorer = make_batch_scorer(batch=batch, metric='roc_auc',
...                            weights='uniform')
>>> grid = GridSearchCV(
...     Pipeline([
...         ('correct', AutoCorrector(batch=batch,
...                                   discrete_covariates=species)),
...         ('clf', LogisticRegression()),
...     ]),
...     param_grid={'correct__method': ['noop', 'combat-fortin', 'harmony']},
...     scoring=scorer,
... )
Example
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

from maldibatchkit import AutoCorrector
from maldibatchkit.metrics import make_batch_scorer

# Per-batch AUROC, each site weighted equally.
scorer = make_batch_scorer(batch, metric="roc_auc", weights="uniform")

pipe = Pipeline([
    ("correct", AutoCorrector(batch=batch,
                              discrete_covariates=species)),
    ("clf", LogisticRegression(max_iter=1000)),
])

grid = GridSearchCV(
    pipe,
    param_grid={"correct__method": ["noop", "combat-fortin", "harmony"]},
    scoring=scorer,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
Comparing weights='size' (dominant-site biased; equal to the pooled
value only for additive metrics) and weights='uniform' (every site
counts equally) frequently picks different correctors. Use the
comparison to make the trade-off explicit before deploying a corrector.
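A toy illustration of how the two reducers can disagree. The site sizes and per-site scores below are invented for the example (they are not results from any corrector), and `aggregate` is a hypothetical helper.

```python
import numpy as np

# Samples per site: one dominant site and two small ones (made-up numbers).
n_per_site = np.array([800, 100, 100])

# Hypothetical per-site scores for two candidate correctors.
per_site_scores = {
    "corrector_1": np.array([0.95, 0.60, 0.62]),  # excels on the big site only
    "corrector_2": np.array([0.88, 0.82, 0.80]),  # more even across sites
}


def aggregate(per_site, weights):
    """Weighted mean of per-site scores; weights are normalised to sum to 1."""
    w = np.asarray(weights, dtype=float)
    return float((w / w.sum()) @ per_site)


size_pick = max(
    per_site_scores,
    key=lambda name: aggregate(per_site_scores[name], n_per_site),
)
uniform_pick = max(
    per_site_scores,
    key=lambda name: aggregate(per_site_scores[name], np.ones(len(n_per_site))),
)

print("size:", size_pick, "| uniform:", uniform_pick)
```

With these numbers the size-weighted reducer favours the candidate that does best on the dominant site, while the uniform reducer favours the one that is more consistent across sites.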