Metrics Module
==============

Batch-aware downstream classifier metrics. Each function computes a
sklearn metric **per batch** on the held-out fold, then aggregates the
per-batch scalars into a single number according to a chosen weighting
scheme. They are the right way to score a corrector when the goal is a
model that performs well *on every site*, not one that just maximises
the pooled metric.

.. contents::
   :local:

Weighting modes
---------------

Every ``batch_*`` metric function accepts ``weights=``:

* ``'uniform'`` - each batch contributes one entry with equal weight
  (``w_i = 1 / n_batches``), regardless of its sample count. The right
  default when the goal is a corrector that generalises across sites.
* ``'balanced'`` - weights inversely proportional to per-batch size
  (``w_i ∝ 1 / n_i``). The **smallest site gets the loudest voice**,
  mirroring sklearn's ``class_weight='balanced'`` formula applied at
  the per-batch level. Use when minority sites are the hardest
  distributions to learn and the corrector must protect them.
* ``'size'`` - per-batch weights proportional to sample counts.
  The relationship to the standard pooled value depends on the
  metric family:

  - **Additive** ``correct / total`` **metrics** (accuracy, error
    rate): ``'size'`` exactly recovers the pooled value.
  - **Class-conditional rates** (sensitivity, specificity): recovering
    the pooled value needs *class-conditional* weights (positives per
    batch for sensitivity, negatives per batch for specificity), not
    total-sample weights. The two coincide only when class prevalence
    is constant across batches.
  - **Non-linear metrics** (F1, precision, balanced accuracy, MCC):
    ratios of sums or non-linear functions of confusion-matrix counts.
    No weighting fixed in advance from batch sizes or class counts
    recovers the pooled value in general.
  - **Pairwise / rank-based metrics** (AUROC, average precision):
    pooled values count *cross-batch* positive-vs-negative pairs,
    which any per-batch reducer **cannot see by construction** -
    those pairs are gone, not approximated.

  In every non-additive case ``'size'`` still gives the dominant
  site the loudest voice - useful when you want to score the
  classifier as a single-site practitioner would.
* mapping ``{batch_label: weight}`` or array aligned with
  ``np.unique(batch)`` - explicit per-batch weights for any custom
  policy.
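
As a rough illustration of the three named modes, the sketch below
computes normalised per-batch weights with NumPy and checks two of the
claims above on toy data: size weights recover pooled accuracy, whereas
no weighting of per-batch AUROC recovers the pooled AUROC once scores
are shifted between sites. ``batch_weights`` and ``pairwise_auroc`` are
hypothetical helpers written for this example, not part of
``maldibatchkit``, and normalising the weights to sum to one is an
assumption.

.. code-block:: python

   import numpy as np


   def batch_weights(batch, mode="uniform"):
       """Hypothetical helper: labels and weights normalised to sum to 1."""
       labels, counts = np.unique(batch, return_counts=True)
       if mode == "uniform":
           raw = np.ones(len(labels))      # w_i = 1 / n_batches
       elif mode == "balanced":
           raw = 1.0 / counts              # w_i proportional to 1 / n_i
       elif mode == "size":
           raw = counts.astype(float)      # w_i proportional to n_i
       else:
           raise ValueError(f"unknown mode: {mode!r}")
       return labels, raw / raw.sum()


   def pairwise_auroc(y_true, y_score):
       """AUROC as the fraction of positive-vs-negative pairs ranked correctly."""
       pos = y_score[y_true == 1]
       neg = y_score[y_true == 0]
       diff = pos[:, None] - neg[None, :]
       return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


   # Accuracy: site "A" has 6 samples, site "B" has 2.
   batch = np.array(["A"] * 6 + ["B"] * 2)
   y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
   y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 0])

   labels, w_size = batch_weights(batch, "size")
   per_batch_acc = np.array(
       [np.mean(y_true[batch == b] == y_pred[batch == b]) for b in labels]
   )
   size_weighted_acc = float(w_size @ per_batch_acc)
   pooled_acc = float(np.mean(y_true == y_pred))
   print(size_weighted_acc, pooled_acc)  # the two values agree

   # AUROC: each site ranks perfectly, but site "B" scores are shifted upwards.
   auc_batch = np.array(["A", "A", "B", "B"])
   auc_true = np.array([0, 1, 0, 1])
   auc_score = np.array([0.1, 0.3, 0.5, 0.9])

   per_batch_auc = [
       pairwise_auroc(auc_true[auc_batch == b], auc_score[auc_batch == b])
       for b in np.unique(auc_batch)
   ]
   pooled_auc = pairwise_auroc(auc_true, auc_score)
   print(per_batch_auc, pooled_auc)  # per-batch values are both 1.0, pooled is lower

In the AUROC part, the pooled value drops because a site "A" positive
is outranked by a site "B" negative. That cross-batch pair never enters
any per-batch computation, so no choice of weights can bring it back.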

Within each batch, the wrapped sklearn metric's own averaging parameter
(``average='binary'``, ``'macro'``, ``'weighted'``, ``'micro'``) is
preserved untouched - ``weights=`` only controls the *across-batch*
reducer.

Degenerate batches (a batch that contains only one class in the
held-out fold, for AUROC / average precision) are **dropped from the
aggregate with a warning**, and the remaining weights are renormalised.
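
The drop-and-renormalise behaviour for single-class batches can be
sketched as follows. This is a minimal illustration built on
``sklearn.metrics.roc_auc_score``, not the library's implementation:
``weighted_batch_auroc`` is a hypothetical name, the warning text is
invented, and returning ``nan`` when every batch is dropped is an
assumption.

.. code-block:: python

   import warnings

   import numpy as np
   from sklearn.metrics import roc_auc_score


   def weighted_batch_auroc(y_true, y_score, batch, weights):
       """Hypothetical reducer: skip single-class batches, renormalise the rest."""
       scores, kept_weights = [], []
       for label in np.unique(batch):
           mask = batch == label
           if np.unique(y_true[mask]).size < 2:
               warnings.warn(f"batch {label} has one class; dropped from aggregate")
               continue
           scores.append(roc_auc_score(y_true[mask], y_score[mask]))
           kept_weights.append(weights[label])
       if not scores:
           return float("nan")  # assumption: nothing left to aggregate
       w = np.asarray(kept_weights, dtype=float)
       return float(np.dot(w / w.sum(), scores))


   batch = np.array(["A"] * 4 + ["B"] * 4 + ["C"] * 2)
   y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 1])  # site "C" has positives only
   y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7, 0.6, 0.5])
   weights = {"A": 0.6, "B": 0.2, "C": 0.2}

   with warnings.catch_warnings(record=True) as caught:
       warnings.simplefilter("always")
       value = weighted_batch_auroc(y_true, y_score, batch, weights)

   # "C" is dropped; the weights of "A" and "B" are renormalised to 0.75 and 0.25.
   print(value, len(caught))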

Per-batch metric functions
--------------------------

.. autofunction:: maldibatchkit.metrics.batch_roc_auc_score

.. autofunction:: maldibatchkit.metrics.batch_average_precision_score

.. autofunction:: maldibatchkit.metrics.batch_balanced_accuracy_score

.. autofunction:: maldibatchkit.metrics.batch_matthews_corrcoef

.. autofunction:: maldibatchkit.metrics.batch_f1_score

.. autofunction:: maldibatchkit.metrics.batch_precision_score

.. autofunction:: maldibatchkit.metrics.batch_recall_score

Scorer factory
--------------

:func:`~maldibatchkit.metrics.make_batch_scorer` wraps a per-batch
metric and an aggregation into a sklearn-compatible
``scorer(estimator, X, y)`` callable suitable for
``GridSearchCV(scoring=...)``. It captures the full ``batch`` Series at
factory time and slices it by ``y.index`` for every fold; pass ``y``
as a :class:`pandas.Series` indexed by the sample IDs used by ``X`` and
``batch`` for safe alignment.

.. autofunction:: maldibatchkit.metrics.make_batch_scorer
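
The capture-and-slice pattern described above can be sketched like
this. It is an illustration under stated assumptions, not the body of
``make_batch_scorer``: ``make_uniform_auroc_scorer`` and the toy
``FixedScores`` estimator are hypothetical, the reducer is fixed to a
uniform mean of per-batch AUROC, and single-class batches are not
handled.

.. code-block:: python

   import numpy as np
   import pandas as pd
   from sklearn.metrics import roc_auc_score


   def make_uniform_auroc_scorer(batch):
       """Hypothetical stand-in: uniform mean of per-batch AUROC."""

       def scorer(estimator, X, y):
           fold_batch = batch.loc[y.index]  # align batch labels to this fold
           y_score = pd.Series(estimator.predict_proba(X)[:, 1], index=y.index)
           per_batch = [
               roc_auc_score(y[fold_batch == b], y_score[fold_batch == b])
               for b in fold_batch.unique()
           ]
           return float(np.mean(per_batch))

       return scorer


   class FixedScores:
       """Toy estimator that returns column "f" as the positive-class probability."""

       def predict_proba(self, X):
           p = X["f"].to_numpy()
           return np.column_stack([1.0 - p, p])


   ids = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
   batch = pd.Series(["A"] * 4 + ["B"] * 4, index=ids)
   y = pd.Series([0, 0, 1, 1, 0, 1, 0, 1], index=ids)
   X = pd.DataFrame({"f": [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]}, index=ids)

   scorer = make_uniform_auroc_scorer(batch)
   fold = ["s5", "s1", "s3", "s6", "s2"]  # a shuffled held-out fold
   fold_score = scorer(FixedScores(), X.loc[fold], y.loc[fold])
   print(fold_score)

Because the batch labels are looked up by ``y.index`` rather than by
position, the shuffled fold still pairs every sample with its own site.
With a plain NumPy ``y`` there is no index to look up, which is why the
text above recommends passing ``y`` as a :class:`pandas.Series`.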

Example
-------

.. code-block:: python

   from sklearn.linear_model import LogisticRegression
   from sklearn.model_selection import GridSearchCV, StratifiedKFold
   from sklearn.pipeline import Pipeline

   from maldibatchkit import AutoCorrector
   from maldibatchkit.metrics import make_batch_scorer

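   # X, y, batch and species are assumed to be loaded already and to
   # share the same sample-ID index.
   # Per-batch AUROC, each site weighted equally.
   scorer = make_batch_scorer(batch, metric="roc_auc", weights="uniform")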

   pipe = Pipeline([
       ("correct", AutoCorrector(batch=batch,
                                 discrete_covariates=species)),
       ("clf", LogisticRegression(max_iter=1000)),
   ])

   grid = GridSearchCV(
       pipe,
       param_grid={"correct__method": ["noop", "combat-fortin", "harmony"]},
       scoring=scorer,
       cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
   )
   grid.fit(X, y)
   print(grid.best_params_, grid.best_score_)

Scoring with ``weights='size'`` (sample-weighted, biased towards the
dominant site) and with ``weights='uniform'`` (every site counts
equally) frequently selects **different** correctors. Run both to make
the trade-off explicit before deploying a corrector.