Estimating the underlying distribution from \textit{iid} samples is a
classical and important problem in statistics. When the alphabet size is large
compared to the number of samples, a portion of the distribution is highly likely
to be unobserved or sparsely observed. The missing mass, defined as the sum of
probabilities $\text{Pr}(x)$ over the missing letters $x$, and the Good-Turing
estimator for missing mass have been important tools in large-alphabet
distribution estimation. In this article, given a positive function $g$ from
$[0,1]$ to the reals, the missing $g$-mass, defined as the sum of
$g(\text{Pr}(x))$ over the missing letters $x$, is introduced and studied. The
missing $g$-mass can be used to investigate the structure of the missing part
of the distribution. Specific applications of special cases such as the
order-$\alpha$ missing mass ($g(p)=p^{\alpha}$) and the missing Shannon entropy
($g(p)=-p\log p$) include estimating the distance from uniformity of the missing
distribution and partially estimating that distribution. Minimax estimation is
studied for the order-$\alpha$ missing mass for integer values of $\alpha$, and
exact minimax convergence rates are obtained. Concentration is studied for a class of
functions $g$, and specific results are derived for the order-$\alpha$ missing mass
and the missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal
worst-case variance factors are derived. Two new notions of concentration,
named strongly sub-Gamma and filtered sub-Gaussian concentration, are
introduced and shown to yield right-tail bounds that are sharper than those
obtained from sub-Gaussian concentration.
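
To make the central definitions concrete (a sketch in standard notation, not
quoted from the article: $N_x$ denotes the number of occurrences of letter $x$
among the $n$ samples and $\Phi_1$ the number of letters observed exactly
once), the missing mass, its Good-Turing estimator, and the missing $g$-mass
can be written as
\[
M_0 = \sum_{x:\, N_x = 0} \text{Pr}(x), \qquad
\hat{M}_0^{\mathrm{GT}} = \frac{\Phi_1}{n}, \qquad
M_g = \sum_{x:\, N_x = 0} g(\text{Pr}(x)),
\]
so that $g(p)=p$ recovers $M_0$, $g(p)=p^{\alpha}$ gives the order-$\alpha$
missing mass, and $g(p)=-p\log p$ gives the missing Shannon entropy.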