Metrics Module

Batch-aware downstream classifier metrics. Each function computes a sklearn metric per batch on the held-out fold, then aggregates the per-batch scalars into a single number according to a chosen weighting scheme. They are the right way to score a corrector when the goal is a model that performs well on every site, not one that just maximises the pooled metric.

Weighting modes

Every batch_*_score function accepts weights=:

'uniform' - each batch contributes one entry with equal weight (w_i = 1 / n_batches), regardless of its sample count. The right default when the goal is a corrector that generalises across sites.
'balanced' - weights inversely proportional to per-batch size (w_i ∝ 1 / n_i). The smallest site gets the loudest voice, mirroring sklearn’s class_weight='balanced' formula applied at the per-batch level. Use when minority sites are the hardest distributions to learn and the corrector must protect them.
'size' - per-batch weights proportional to sample counts. The relationship to the standard pooled value depends on the metric family:
- Additive correct / total metrics (accuracy, error rate): 'size' exactly recovers the pooled value.
- Class-conditional rates (sensitivity, specificity): recovering the pooled value needs class-conditional weights (positives-per-batch for sensitivity, negatives-per-batch for specificity), not total-sample weights. The two weightings coincide only when class prevalence is constant across batches.
- Non-linear metrics (F1, precision, balanced accuracy, MCC): ratios of sums or non-linear functions of confusion-matrix counts. No per-batch weighting recovers the pooled value exactly.
- Pairwise / rank-based metrics (AUROC, average precision): pooled values count cross-batch positive-vs-negative pairs, which no per-batch reducer can see by construction - those pairs are gone, not approximated.
In every non-additive case 'size' still gives the dominant site the loudest voice - useful when you want to score the classifier as a single-site practitioner would.
mapping {batch_label: weight} or array aligned with np.unique(batch) - explicit per-batch weights for any custom policy.
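The three named modes can be sketched in plain Python. This is only an illustration of the formulas above, not the library's implementation: `batch_weights`, `per_batch_accuracy` and `aggregate` are hypothetical helpers, and normalising the weights to sum to 1 is an assumption.

```python
from collections import Counter


def batch_weights(batch, mode):
    """Return {batch_label: weight}, normalised to sum to 1 (hypothetical helper)."""
    counts = Counter(batch)
    if mode == "uniform":
        raw = {b: 1.0 for b in counts}                  # w_i = 1 / n_batches
    elif mode == "balanced":
        raw = {b: 1.0 / n for b, n in counts.items()}   # w_i proportional to 1 / n_i
    elif mode == "size":
        raw = {b: float(n) for b, n in counts.items()}  # w_i proportional to n_i
    else:
        raise ValueError(f"unknown weighting mode: {mode!r}")
    total = sum(raw.values())
    return {b: w / total for b, w in raw.items()}


def per_batch_accuracy(y_true, y_pred, batch):
    """Accuracy computed separately inside each batch."""
    hits, totals = Counter(), Counter()
    for t, p, b in zip(y_true, y_pred, batch):
        totals[b] += 1
        hits[b] += int(t == p)
    return {b: hits[b] / totals[b] for b in totals}


def aggregate(per_batch, weights):
    """Weighted mean of the per-batch scalars."""
    return sum(weights[b] * per_batch[b] for b in per_batch)


# Toy fold: site A has 8 samples (7 correct), site B has 2 (0 correct).
batch = ["A"] * 8 + ["B"] * 2
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]

scores = per_batch_accuracy(y_true, y_pred, batch)
pooled = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
results = {m: aggregate(scores, batch_weights(batch, m))
           for m in ("uniform", "balanced", "size")}
```

On this toy fold 'size' reproduces the pooled accuracy of 0.7, while 'uniform' gives 0.4375 and 'balanced' gives 0.175, because the small failing site counts for progressively more.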

Within each batch, the wrapped sklearn metric’s own averaging parameter (average='binary', 'macro', 'weighted', 'micro') is preserved untouched - weights= only controls the across-batch reducer.

Degenerate batches (a batch that contains only one class in the held-out fold, for which AUROC / average precision are undefined) are dropped from the aggregate with a warning, and the remaining weights are renormalised.
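A minimal sketch of the drop-and-renormalise step, assuming an undefined per-batch value is represented as `None`. The helper name and the choice to return NaN when every batch is degenerate are assumptions made for illustration; the library's actual behaviour in that edge case is not stated here.

```python
import warnings


def aggregate_skipping_degenerate(per_batch, weights):
    """Weighted mean over batches whose metric is defined (hypothetical helper).

    per_batch maps batch label -> metric value, or None where undefined.
    """
    valid = {b: v for b, v in per_batch.items() if v is not None}
    dropped = sorted(set(per_batch) - set(valid))
    if dropped:
        warnings.warn(f"dropping degenerate batches: {dropped}")
    if not valid:
        return float("nan")  # assumption: nothing left to aggregate
    # Renormalise so the surviving weights sum to 1 again.
    total = sum(weights[b] for b in valid)
    return sum(weights[b] / total * v for b, v in valid.items())


# Site B holds a single class in this fold, so its AUROC is undefined.
per_batch_auc = {"A": 0.9, "B": None, "C": 0.7}
uniform = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = aggregate_skipping_degenerate(per_batch_auc, uniform)
```

After the drop, A and C each carry weight 0.5, so the aggregate is the plain mean of 0.9 and 0.7.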

Per-batch metric functions

maldibatchkit.metrics.batch_roc_auc_score(y_true, y_score, *, batch, weights='uniform', **kwargs)

Per-batch AUROC, aggregated with weights.

Parameters:

y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Ground-truth binary labels.
y_score (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Probability of the positive class (or any monotone score).
batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Per-sample batch labels.
weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list]) – See module docstring.
**kwargs (Any) – Forwarded to sklearn.metrics.roc_auc_score() (average, multi_class, …).

Return type:

float
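To see why a per-batch AUROC cannot reproduce the pooled value, the sketch below computes AUROC by explicit pair counting. It is a dependency-free illustration, not a call into sklearn or this library; the `auroc` helper and the toy scores are made up for the example.

```python
def auroc(y_true, y_score):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Both sites rank their own samples perfectly, but site B's scores are
# shifted upwards, as an uncorrected batch effect might do.
site_a = ([0, 0, 1, 1], [0.1, 0.2, 0.3, 0.4])
site_b = ([0, 0, 1, 1], [0.5, 0.6, 0.7, 0.8])

auc_a = auroc(*site_a)
auc_b = auroc(*site_b)
per_batch_mean = (auc_a + auc_b) / 2

pooled_auc = auroc(site_a[0] + site_b[0], site_a[1] + site_b[1])
```

Each site scores 1.0 on its own, so every weighting of the per-batch values gives 1.0. The pooled AUROC is 0.75, because 4 of the 8 cross-batch pairs (site A positives against site B negatives) are ranked the wrong way, and those pairs never enter a per-batch computation.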

maldibatchkit.metrics.batch_average_precision_score(y_true, y_score, *, batch, weights='uniform', **kwargs)

Per-batch average precision (PR-AUC), aggregated with weights.

Parameters:

y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
y_score (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list])
kwargs (Any)

Return type:

float

maldibatchkit.metrics.batch_balanced_accuracy_score(y_true, y_pred, *, batch, weights='uniform', **kwargs)

Per-batch balanced accuracy, aggregated with weights.

Parameters:

y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
y_pred (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list])
kwargs (Any)

Return type:

float

maldibatchkit.metrics.batch_matthews_corrcoef(y_true, y_pred, *, batch, weights='uniform')

Per-batch Matthews correlation coefficient, aggregated with weights.

MCC is in [-1, 1]; 0 means random prediction. A batch where y_pred is constant or only one class is present yields MCC = 0 and is not considered degenerate (sklearn returns 0 with a warning) – this differs from AUROC where the metric is undefined.

Parameters:

y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
y_pred (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list])

Return type:

float
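The zero-denominator convention described above can be illustrated with a small binary MCC written from confusion-matrix counts. This is a sketch of the convention only; `mcc_binary` is a hypothetical helper and not how sklearn or this library computes the metric internally.

```python
import math


def mcc_binary(y_true, y_pred):
    """Binary MCC from confusion counts; returns 0.0 when the denominator is 0."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Constant predictions or a single true class: treated as "no signal".
        return 0.0
    return (tp * tn - fp * fn) / denom


constant_pred = mcc_binary([1, 0, 1, 0], [1, 1, 1, 1])  # y_pred is constant
single_class = mcc_binary([1, 1, 1, 1], [1, 0, 1, 0])   # only one class present
perfect = mcc_binary([1, 1, 0, 0], [1, 1, 0, 0])
inverted = mcc_binary([1, 1, 0, 0], [0, 0, 1, 1])
```

Because such a batch yields a well-defined 0 rather than an undefined value, it stays in the aggregate and pulls the weighted mean towards 0.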

maldibatchkit.metrics.batch_f1_score(y_true, y_pred, *, batch, weights='uniform', average='binary', **kwargs)

Per-batch F1 score, aggregated with weights.

Parameters:

average (str | None) – Forwarded to sklearn.metrics.f1_score(). Controls how the F1 is averaged within each batch’s class set ('binary', 'macro', 'weighted', 'micro'). The across-batch reduction is controlled by weights=.
y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
y_pred (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list])
kwargs (Any)

Return type:

float
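F1 is a ratio of sums, so even size-proportional weights do not return the pooled F1. The sketch below checks this on made-up confusion counts for two sites; the counts and the `f1` helper are illustrative assumptions.

```python
def f1(tp, fp, fn):
    """Binary F1 from confusion counts: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical per-site confusion counts (tn is only needed for the site size).
sites = {
    "A": {"tp": 40, "fp": 10, "fn": 10, "tn": 40},  # 100 samples, F1 = 0.8
    "B": {"tp": 1, "fp": 4, "fn": 5, "tn": 10},     # 20 samples, F1 = 2/11
}

per_batch_f1 = {b: f1(c["tp"], c["fp"], c["fn"]) for b, c in sites.items()}
sizes = {b: sum(c.values()) for b, c in sites.items()}

# Size-weighted mean of the per-batch F1 values.
size_weighted = sum(sizes[b] * per_batch_f1[b] for b in sites) / sum(sizes.values())

# Pooled F1: add up the counts first, then apply the formula once.
pooled_f1 = f1(
    sum(c["tp"] for c in sites.values()),
    sum(c["fp"] for c in sites.values()),
    sum(c["fn"] for c in sites.values()),
)
```

Here the size-weighted value is roughly 0.697 and the pooled value roughly 0.739: close, both dominated by site A, but not equal.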

maldibatchkit.metrics.batch_precision_score(y_true, y_pred, *, batch, weights='uniform', average='binary', pos_label=1, **kwargs)

Per-batch precision, aggregated with weights.

Set pos_label=0 (or average='macro' / 'micro' / 'weighted') for negative-class or aggregated precision; see sklearn.metrics.precision_score().

Parameters:

y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
y_pred (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list])
average (str | None)
pos_label (int | str)
kwargs (Any)

Return type:

float

maldibatchkit.metrics.batch_recall_score(y_true, y_pred, *, batch, weights='uniform', average='binary', pos_label=1, **kwargs)

Per-batch recall, aggregated with weights.

Parameters:

y_true (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
y_pred (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]])
weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list])
average (str | None)
pos_label (int | str)
kwargs (Any)

Return type:

float
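For recall (sensitivity), the pooled value is recovered by weighting each batch by its number of positives rather than its total size. The sketch below uses made-up counts for two equally sized sites with different prevalence; passing the positive counts as an explicit `weights=` mapping is one way this might be applied, stated here as an assumption.

```python
# Hypothetical per-site counts: both sites have 100 samples,
# but site A has 10 positives and site B has 50.
sites = {
    "A": {"tp": 9, "fn": 1, "n": 100},
    "B": {"tp": 25, "fn": 25, "n": 100},
}

recall = {b: c["tp"] / (c["tp"] + c["fn"]) for b, c in sites.items()}
positives = {b: c["tp"] + c["fn"] for b, c in sites.items()}

# Pooled recall: total true positives over total positives.
pooled_recall = sum(c["tp"] for c in sites.values()) / sum(positives.values())

# Weighting by total sample count (the 'size' rule) ignores prevalence.
total_n = sum(c["n"] for c in sites.values())
size_weighted = sum(sites[b]["n"] * recall[b] for b in sites) / total_n

# Weighting by positives per batch recovers the pooled value.
positive_weighted = sum(positives[b] * recall[b] for b in sites) / sum(positives.values())
```

With these counts the size-weighted recall is 0.7, while both the pooled and the positive-weighted recall are 34/60, about 0.567.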

Scorer factory

make_batch_scorer() wraps a per-batch metric and an aggregation into a sklearn-compatible scorer(estimator, X, y) callable suitable for GridSearchCV(scoring=...). It captures the full batch Series at factory time and slices it by y.index for every fold; pass y as a pandas.Series indexed by the sample IDs used by X and batch for safe alignment.
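The capture-then-slice pattern can be sketched without pandas by keying everything on sample IDs. This is a simplified stand-in, not the library's scorer: `make_scorer_sketch` and `uniform_batch_accuracy` are hypothetical, dicts play the role of an indexed Series, and a plain `predict` function replaces the estimator.

```python
def uniform_batch_accuracy(y_true, y_pred, batch):
    """Mean of per-batch accuracies, each batch weighted equally."""
    per_batch = {}
    for t, p, b in zip(y_true, y_pred, batch):
        hits, n = per_batch.get(b, (0, 0))
        per_batch[b] = (hits + int(t == p), n + 1)
    return sum(h / n for h, n in per_batch.values()) / len(per_batch)


def make_scorer_sketch(batch_by_id, metric, greater_is_better=True):
    """Capture the full ID -> batch mapping once; look it up on every fold."""
    sign = 1.0 if greater_is_better else -1.0

    def scorer(predict, X_fold, y_fold):
        ids = list(y_fold)  # the fold's sample IDs, in fold order
        y_true = [y_fold[i] for i in ids]
        y_pred = [predict(X_fold[i]) for i in ids]
        # Analogue of slicing the batch Series by y.index: lookup by ID,
        # so shuffled or non-contiguous folds stay aligned.
        fold_batch = [batch_by_id[i] for i in ids]
        return sign * metric(y_true, y_pred, fold_batch)

    return scorer


batch_by_id = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}

# A shuffled held-out fold containing samples s3 and s1 only.
X_fold = {"s3": 1.0, "s1": -1.0}
y_fold = {"s3": 1, "s1": 1}


def predict(x):
    return int(x > 0)


score = make_scorer_sketch(batch_by_id, uniform_batch_accuracy)(predict, X_fold, y_fold)
loss = make_scorer_sketch(batch_by_id, uniform_batch_accuracy,
                          greater_is_better=False)(predict, X_fold, y_fold)
```

Looking batches up by ID rather than by position is what keeps the fold aligned; a positional slice of the first two entries would have assigned both samples to site A.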

maldibatchkit.metrics.make_batch_scorer(batch, metric='roc_auc', *, weights='uniform', response_method=None, greater_is_better=True, pos_class_index=1, **metric_kwargs)

Return a sklearn-compatible scorer(estimator, X, y).

Parameters:

batch (DataFrame | Series | ndarray[tuple[Any, ...], dtype[Any]]) – Per-sample batch labels for the full dataset. The scorer slices this by y.index on each fold; pass a pandas.Series indexed by the sample IDs used by X and y for safe alignment.
metric (str | Callable[..., float]) – Registered alias ('roc_auc', 'average_precision', 'balanced_accuracy', 'mcc' / 'matthews_corrcoef', 'f1', 'precision', 'recall'), or a callable with the same signature as the batch_*_score functions ((y_true, y_pred_or_score, *, batch, weights, **kw)).
weights (Union[Literal['uniform', 'balanced', 'size'], Mapping[Any, float], ndarray, list]) – Per-batch aggregation weighting (see module docstring).
response_method (str | None) – How the scorer should produce predictions from the estimator. Defaults to the metric’s natural choice ('predict_proba' for AUROC / AP, 'predict' for the rest). Required when metric is a callable.
greater_is_better (bool) – If False, the returned scorer negates the metric so larger is always better (sklearn convention).
pos_class_index (int) – Column index in predict_proba(...) to use as the positive score when response_method='predict_proba'.
**metric_kwargs (Any) – Forwarded verbatim to the underlying metric (e.g. average='macro', pos_label=0).

Returns:

scorer – scorer(estimator, X, y) -> float, signed by greater_is_better.

Return type:

Callable[..., float]

Examples

>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> from maldibatchkit import AutoCorrector
>>> from maldibatchkit.metrics import make_batch_scorer
>>>
>>> scorer = make_batch_scorer(batch=batch, metric='roc_auc',
...                            weights='uniform')
>>> grid = GridSearchCV(
...     Pipeline([
...         ('correct', AutoCorrector(batch=batch,
...                                   discrete_covariates=species)),
...         ('clf', LogisticRegression()),
...     ]),
...     param_grid={'correct__method': ['noop', 'combat-fortin', 'harmony']},
...     scoring=scorer,
... )

Example

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

from maldibatchkit import AutoCorrector
from maldibatchkit.metrics import make_batch_scorer

# Assumes X, y, batch and species are already loaded and share one sample-ID index.
# Per-batch AUROC, each site weighted equally.
scorer = make_batch_scorer(batch, metric="roc_auc", weights="uniform")

pipe = Pipeline([
    ("correct", AutoCorrector(batch=batch,
                              discrete_covariates=species)),
    ("clf", LogisticRegression(max_iter=1000)),
])

grid = GridSearchCV(
    pipe,
    param_grid={"correct__method": ["noop", "combat-fortin", "harmony"]},
    scoring=scorer,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

Scoring with weights='size' (dominant-site biased, closest to the pooled view) and with weights='uniform' (every site counts equally) frequently selects different correctors. Run both and compare the results to make the trade-off explicit before deploying a corrector.
