Probabilistic predictions can be evaluated through comparisons with observed
label frequencies, that is, through the lens of calibration. Recent scholarship
on algorithmic fairness has begun to examine a growing variety of
calibration-based objectives under the name of multi-calibration, but the range
of objectives considered remains fairly restricted. In this paper, we explore and analyse forms of
evaluation through calibration by making explicit the choices involved in
designing calibration scores. We organise these into three grouping choices and
a choice concerning the agglomeration of group errors. This provides a
framework for comparing previously proposed calibration scores and helps to
formulate novel ones with desirable mathematical properties. In particular, we
explore the possibility of grouping datapoints based on their input features
rather than on predictions and formally demonstrate advantages of such
approaches. We also characterise the space of suitable agglomeration functions
for group errors, generalising previously proposed calibration scores.
Complementary to such population-level scores, we explore calibration scores at
the individual level and analyse their relationship to choices of grouping. We
draw on these insights to introduce and axiomatise fairness deviation measures
for population-level scores. We demonstrate that with appropriate choices of
grouping, these novel global fairness scores can provide notions of (sub-)group
or individual fairness.
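
As a rough illustration of the design choices described above, the following Python sketch computes a calibration score from two separate ingredients: a grouping of datapoints and an agglomeration of the resulting group errors. It is a minimal sketch under stated assumptions, not the paper's definitions: the function names (`group_errors`, `weighted_mean`, `worst_case`, `bin_by_prediction`), the absolute-difference group error, and the toy data are all hypothetical choices made for this example.

```python
from collections import defaultdict


def group_errors(preds, labels, group_keys):
    """Per-group error: |mean prediction - observed label frequency|.

    Returns a dict mapping group key -> (error, group size).
    The absolute difference is an illustrative choice of group error.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [prediction sum, label sum, count]
    for p, y, g in zip(preds, labels, group_keys):
        acc = sums[g]
        acc[0] += p
        acc[1] += y
        acc[2] += 1
    return {g: (abs(ps / n - ys / n), n) for g, (ps, ys, n) in sums.items()}


def weighted_mean(errors):
    """Agglomeration choice 1: size-weighted average of group errors."""
    total = sum(n for _, n in errors.values())
    return sum(e * n for e, n in errors.values()) / total


def worst_case(errors):
    """Agglomeration choice 2: maximum group error."""
    return max(e for e, _ in errors.values())


def bin_by_prediction(preds, n_bins=2):
    """Grouping choice: equal-width bins over the predicted probability."""
    return [min(int(p * n_bins), n_bins - 1) for p in preds]


# Toy data (hypothetical): a constant predictor and one categorical input feature.
preds = [0.5, 0.5, 0.5, 0.5]
labels = [1, 1, 0, 0]
feature = ["a", "a", "b", "b"]

# Grouping on predictions: one bin, mean prediction equals label frequency.
score_by_prediction = weighted_mean(group_errors(preds, labels, bin_by_prediction(preds)))

# Grouping on the input feature: each group is off by 0.5.
score_by_feature = weighted_mean(group_errors(preds, labels, feature))
worst_by_feature = worst_case(group_errors(preds, labels, feature))

print(score_by_prediction, score_by_feature, worst_by_feature)
```

On this toy data the prediction-based grouping reports no calibration error, while the feature-based grouping exposes an error of 0.5 in each group. This only illustrates that the grouping choice can change the score; it does not reproduce the formal results of the paper.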