## Abstract

High sampling variability complicates estimation of demographic rates in small areas. In addition, many countries have imperfect vital registration systems, with coverage quality that varies significantly between regions. We develop a Bayesian regression model for small-area mortality schedules that simultaneously addresses the problems of small local samples and underreporting of deaths. We combine a relational model for mortality schedules with probabilistic prior information on death registration coverage derived from demographic estimation techniques, such as Death Distribution Methods, and from field audits by public health experts. We test the model on small-area data from Brazil. Incorporating external estimates of vital registration coverage though priors improves small-area mortality estimates by accounting for underregistration and automatically producing measures of uncertainty. Bayesian estimates show that when mortality levels in small areas are compared, noise often dominates signal. Differences in local point estimates of life expectancy are often small relative to uncertainty, even for relatively large areas in a populous country like Brazil.

## Introduction

Small-area mortality estimation is a challenge for demographers and public health researchers for two main reasons. First, there is the universal problem of small populations and high sampling variability in recorded deaths (Bernardinelli and Montomoli 1992; Pletcher 1999; Riggan et al. 1991). With low mortality rates and short periods of exposure, observed event/exposure ratios are very unstable, and estimation of mortality patterns is difficult. In such situations models must fill the gap: smoothing procedures that use known, robust patterns in rate schedules must be combined with available data (e.g., Brass 1971; Wilmoth et al. 2012).

Second, in many countries or regions, vital registration is incomplete, and some deaths go unrecorded in official statistics (Frias et al. 2013; Mathers et al. 2005). Demographers have proposed a variety of methods to estimate the completeness of death registration and adjust mortality estimates accordingly (Bennett and Horiuchi 1981, 1984; Brass 1975; Hill 1987; 2007; Hill et al. 2009; Preston and Hill 1980; Preston et al. 1980; Queiroz et al. 2013, 2017). However, most methods depend on approximate stability of the population’s sex and age composition (Bhat 2002; Murray et al. 2010), an assumption that is not met in countries that have experienced recent, rapid demographic transitions. Migration between subnational areas can also make the methods’ stability assumptions unlikely (Bhat 2002; Bignami-Van Assche 2005; Hill and Queiroz 2010). Finally, standard methods cannot provide uncertainty measures about the completeness of death records (Murray et al. 2010).

Demographers and statistical epidemiologists have improved models of small-area mortality schedules significantly in recent years. New approaches based on Bayesian models that “borrow strength” over different dimensions (age, time, space, and/or sex) allow estimation of life expectancies and mortality rates in regions with sparse, high-quality data (Alexander et al. 2017; Congdon 2009; Jonker et al. 2012; Ocaña-Riola and Mayoral-Cortés 2010; Stephens et al. 2013; Tsimbos et al. 2014).

Although it is not directly related to small-area estimation, a growing literature has examined the use of Bayesian approaches in international demographic forecasting (e.g., Gerland et al. 2014; Raftery et al. 2013; 2014; Ševčíková et al. 2016) and in methods for constructing coherent sets of estimates over large numbers of administrative units (e.g., Alkema et al. 2011, 2013; You et al. 2015). The methods used in these studies have important features in common with our approach—in particular, partial pooling of information across related observational units and incorporating information from external sources through prior distributions for parameters.

Statisticians have also addressed estimation in cases of incomplete or underreported data. Raftery (1988) proposed a Bayesian approach to the general problem of inferring the number of binomial trials from the number of successes. Moreno and Girón (1998) developed a model for estimating crime rates from imperfectly reported data, which is closely related to the model for mortality that we propose in this article. Very recently, de Oliveira et al. (2017) analyzed Brazilian infant mortality data with a model in which death reports may be censored in some geographic regions.

In this article, we propose a Bayesian regression model that specifically addresses the fundamental problems in small-area mortality estimation in countries with potentially defective registration. The model smooths age-specific mortality rates in small samples while also accounting for uncertainty about the level of death registration. Bayesian regression produces estimates of small-area mortality rates and life expectancies, and of the uncertainty in those estimates.

Our model offers several advantages over existing non-Bayesian approaches to estimating age-specific mortality schedules in small areas with vital registration errors. It incorporates two sources of uncertainty: sampling variability and uncertainty about registration coverage. In addition, it uses a novel functional form for mortality schedules that stabilizes small-sample estimates without requiring strong assumptions about age-specific mortality patterns. Finally, estimated rate schedules from the model are continuous, smooth functions, unlike many existing corrections for underregistration that require different methods for infant, child, and adult deaths and may consequently have discontinuities in the estimated rate schedule.

Prior distributions for death coverage can include any available empirical information related to local data quality. In principle, prior information could include expert opinion, demographic estimates, or both. In our specific application to Brazilian data, prior information on age- and sex-specific vital registration comes mainly from standard demographic estimates and an intensive field audit conducted by public health researchers.

## Modeling Strategy

### A Simple Statistical Model for Deaths and Registered Deaths

For each age *x* within an area of interest, we observe exposure (*N*_{x}) and *registered* deaths (*R*_{x}). The *true* number of deaths, *D*_{x}*≥ R*_{x}, is not observed. We assume that true deaths have independent Poisson distributions at each age that depend on age-specific mortality rates (μ_{x}):

*x*has a binomial distribution:

*R*

_{x}) implied by Eqs. (1) and (2) is

^{1}

### Identifiability of Mortality Rates

A distribution *R* ∼ *Poisson*(*N*μπ) for registered deaths implies that the mortality rate is not identifiable from data on *R* and *N* because all (μ, π) pairs that have the same product will have identical likelihoods *L*(*R|N*, μ, π) ∝ *e*^{−Nμπ}(*N*μπ)^{R}. In other words, from the likelihood alone, one cannot distinguish between situations of (high mortality, low registration) and situations of (low mortality, high registration).

In a classical frequentist approach to mortality estimation, this lack of identifiability is fatal. Unless the coverage probability π is known, there is no unique μ that maximizes the likelihood, and it is impossible to estimate the mortality rate from *R* and *N*. In contrast, a Bayesian approach allows the analyst to use probabilistic information about which coverage probabilities are more likely (expressed as a prior distribution, π ∼ *f*_{π}) to produce probabilistic statements about mortality rates μ, given *R* and *N*.

As an intuitive example, consider an age group in which we observe *N* = 1,000 person-years of exposure and *R* = 10 registered deaths. Suppose that a small field audit in this location, conducted five years earlier, found that 12 out of 15 total deaths had been registered. From the field audit information, it is reasonable to use π ∼ *Beta*(12, 3) as an expression of our prior knowledge about local death registration (Lynch 2007:54–57). As seen in the left panel of Fig. 1, this prior implies that the expected registration probability is π = 0.80: there is an 80 % probability that coverage is in the [0.66, 0.92] range indicated by the solid bar, and so forth.

In a Bayesian approach, the mortality rate (μ) is treated as an uncertain value with a statistical distribution, and the prior for π adds probabilistic information about coverage that allows us to infer which mortality rates are more and less likely, given exposure and a count of registered deaths. Based on the field audit information in our example, π = 0.8 is a very likely coverage level, and π = 0.4 is very unlikely. Therefore, *a posteriori* (i.e., after observing *R* = 10 and *N* = 1,000), one can say that μ = (10 / 1,000) / 0.8 = .0125 is a very plausible mortality rate, while μ = (10 / 1,000) / 0.4 = .0250 is much less plausible.

*R*with the prior for π, producing a posterior distribution for μ that summarizes which mortality rates are more plausible and which are less plausible given the observed data:

^{2}

In our specific example, with likelihood *R* ∼ *Poisson*(*N*μπ), prior π ∼ *Beta*(12, 3), and data (*R*, *N*) = (10, 1,000), this distribution is

which appears in the right panel of Fig. 1: the posterior median of the mortality rate is .014, and an 80 % credible interval (10th to 90th percentile) is [.009, .021]. In the full model, we use a parametric relational system for log mortality rate schedules, as described in the next section. However, the main principle is illustrated by this simple example: we combine probabilistic prior knowledge about death registration with a statistical model that relates registered deaths, coverage, and exposure. The result is an *a posteriori* distribution for local mortality rates.

## TOPALS Relational Model for Mortality Schedules

We model mortality by age with the TOPALS relational model (de Beer 2012; Gonzaga and Schmertmann 2016). In a TOPALS model, the log mortality schedule is a sum of two functions: (1) a constant schedule (called the *standard*) that incorporates basic age patterns, and (2) a parametric, piecewise-linear function made up of straight-line segments between designated ages (these ages are called *knots*) that represents differences between the standard and the log mortality schedule in the population of interest.

*x*= 0

*. . .*99,

**λ ∈**

*ℝ*

^{100}in the TOPALS model is

**λ**

^{*}

**∈**

*ℝ*

^{100}is the standard (in our case, derived from national data for Brazil in 2010),

**B**is a 100 × 7 matrix of fixed B-spline linear basis functions (de Boor 2001), and

**α ∈**

*ℝ*

^{7}is a parameter vector.

The seven model parameters **α** = (α_{0} . . . α_{6})′ are the values of the piecewise-linear function at exact ages 0, 1, 10, 20, 40, 70, and 100. For example, $\lambda 40=\lambda 40\u2217+\alpha 4$ and $\lambda 70=\lambda 70\u2217+\alpha 5$. Between knots, the additive offsets to the standard schedule change linearly with age—for example, $\lambda 50=\lambda 50\u2217+23\alpha 4+13\alpha 5$, $\lambda 55=\lambda 55\u2217+12\alpha 4+12\alpha 5$, and $\lambda 60=\lambda 60\u2217+13\alpha 4+23\alpha 5$. The space of possible mortality schedules in a TOPALS model is thus the set of curves that can be constructed by adding piecewise-linear functions to the standard log-rate schedule.^{3}

*x*in a TOPALS model is

*x*th row of

**B**. Under the distributional assumptions outlined earlier, the log likelihood is

where *c* is a constant that does not vary with **α**.

Figure 2 illustrates a fitted TOPALS model for males in the northern Brazilian state of Amapá in 2010. The figure shows a national standard **λ**^{∗} for males (thin black line), observed ln(*R*_{x} / *N*_{x}) values from registered deaths and exposure in Amapá (open circles), the seven TOPALS offset parameters α_{0} . . . α_{6} (the vertical positions of the solid dots), the linear spline offsets **Bα** (thin line connecting the offsets), and the fitted TOPALS model schedule (standard + spline, **λ**^{*} + **Bα**, thick line). In broad terms, this fit suggests that Amapá’s male mortality is higher than the standard at ages below 25 and lower than the standard at ages above 25. More subtly, the fit suggests that downward deviations from the standard are slightly larger at higher ages.

This particular fit for Amapá maximizes Eq. (6) over **α** under the assumption of 100 % death registration ($\pi x$ = 1, *R*_{x} = *D*_{x}) at all ages. It serves only to illustrate the components of a parametric TOPALS model. In practice, we relax the assumption of complete coverage, replacing it with prior distributions for $\pi x$.

## Model Summary

Figure 3 summarizes our statistical approach to estimating mortality and life expectancy in a small area. We use information from other sources to develop priors for age-specific coverage. (In the hypothetical example above, for instance, a field audit suggested a beta distribution with *a* = 12 and *b* = 3). As indicated in the top right of Fig. 3, we use a weak multivariate normal prior on **α** (described in detail later) to stabilize schedule estimates in very small populations.

Age-specific exposure (*N*_{0} . . . *N*_{99}) and registered deaths (*R*_{0} . . . *R*_{99}) are observed. The model combines the prior distributions for coverage and mortality parameters (*f*_{π} and *f*_{α}) with the Poisson likelihood *L*(**R**|**N**, **α**, **π**) = ∏_{x}*L*(*R*_{x}|*N*_{x}, μ_{x}(**α**), $\pi x$) to produce a posterior marginal distribution for **α =** (α_{0} . . . α_{6})′, similar to Eq. (4):

*e*

_{0}, via the simulated posterior distribution of $e0\alpha 1\u2217\u2026e0\alpha K\u2217$:

*I*[ ] is a (0,1) indicator function equal to 1 if the condition in brackets is true.

### Data

We apply the model to small-area estimation in Brazil, which has good census data but incomplete vital registration. Death reporting has improved significantly over the past several decades, but large regional differences in the quality of coverage remain (de Mello Jorge et al. 2007; Instituto Brasileiro de Geografia e Estatística 2013; Paes 2005; Queiroz et al. 2017). Fine-grained mortality (and especially life expectancy) estimates are in increasing demand, but the complications of small samples and differential coverage make this a difficult task. In short, Brazil is a good test case for a model that incorporates underregistration into small-area mortality estimation.

Population and deaths by sex and single-year age come from the 2010 Brazilian Demographic Census and from the Mortality Information System of the Ministry of Health (MS/SVS/CGIAE), respectively. The Mortality Information System is an official registry for deaths, with data collected from health providers (de Mello Jorge et al. 2007).

Population and registered deaths are available for 5,565 Brazilian municipalities,^{4} 100 single-year ages 0 . . . 99, and two sexes. For each of the 1,113,000 combinations of (municipality, age, sex), we recorded the 2010 census population, the number of registered deaths over calendar years 2009–2011, and geographic identifiers. We used the 2010 census populations to estimate age- and sex-specific exposure over 2009–2011 for each (municipality, age, sex) combination. Details are in Gonzaga and Schmertmann (2016).

Brazilian census geography begins with five major regions (North, Northeast, Southeast, South, and Center-West) comprising 27 *states*^{5} that cover the entire national territory. States are subdivided into *mesoregions* (137 total), mesoregions into *microregions* (558), and microregions into *municipalities* (5,565). Our goal in this analysis is to estimate age- and sex-specific mortality schedules and life expectancies for all 27 states and for each of the 558 microregions.^{6}

## Priors

### Using Death Registration Estimates From Previous Studies

Prior information about the likely levels and age patterns of death registration is essential for our model. In the case of Brazil, we identified six published estimates of death registration coverage, by state and sex, for 2010 (Freire et al. 2015; Instituto Brasileiro de Geografia e Estatística 2013; Queiroz 2012; Queiroz et al. 2013, 2017). These estimates all come from death distribution methods (DDM) and refer mainly to deaths at ages 30+. The six estimates differ substantially in some cases. For example, for males aged 30+ in the Northeastern state of Maranhão (widely regarded as the state with the lowest vital registration coverage), the alternative coverage estimates were 50 %, 66 %, 75 %, 78 %, 78 %, and 97 %. The range of these estimates suggests that in addition to being low, coverage levels for adult male deaths in Maranhão are uncertain. At the *municipal* level, we use published estimates of death registration coverage from a research project known as *busca ativa*,^{7} which we translate as *field audit*. The field audit project (Frias et al. 2013; Szwarcwald et al. 2011) randomly selected 133 municipalities in Brazil’s poorest regions and compared the registered deaths in 2009–2010 with total deaths found for the same period from an exhaustive search of notary offices, clinics, official and unofficial cemeteries, and from interviews with health workers, midwives, funeral homes, pharmacies, and others. Based on correlations between municipal-level characteristics and levels of mortality coverage in the study areas, the field audit researchers estimated the likely probability of death registration in all 5,565 Brazilian municipalities for infant deaths (π_{0}) and for all deaths (π_{all}). Field audit estimates can be aggregated to higher levels of geography—in particular, to microregion or state level—by using published information from the project on the estimated number of deaths in each municipality.

When aggregated to the 27 Brazilian states, field audit estimates of infant mortality coverage (π_{0}) range from 65 % (Maranhão) to 100 % in several southern states. The range is naturally wider across the 558 microregions: from 31 % in some remote Amazonian locations to 100 % in many southern microregions. Field audit estimates for death registration probabilities at all ages combined (π_{all}) range from 78 % to 100 % across states and from 44 % to 100 % across microregions.

### Prior Distributions for Coverage

To utilize the available coverage data in Brazil, we assume that in each small area, there are distinct probabilities of death registration for three age intervals:

where *w*_{x} terms represent the proportion of all deaths that occur in the corresponding age group.^{8}

Our choice of age groups was based in part on the availability of age ranges in the external coverage estimates—for example, DDM provides straightforward estimates of state-level π_{2}. However, our choice of age groups is also based on the demographic literature and expert opinions. First, many Brazilian studies indicate that coverage of infant deaths (π_{0}) is lower than coverage in any other age interval (Frias et al. 2013; Instituto Brasileiro de Geografia e Estatística 2013; Paes and Albuquerque 1999). This fits the pattern of other countries with defective vital records, in which infant deaths generally have the lowest coverage rates (Målqvist et al. 2008). Field audits also found high percentages of unreported infant deaths (Frias et al. 2013; Szwarcwald et al. 2011). Second, some Brazilian researchers suggest that external causes of death (such as homicides and transit accidents) have almost complete coverage in all Brazilian regions (Agostinho 2009; Campos and Rodrigues 2004). This is plausible because deaths by violence and transit accident must be reported to the local health department not only by family or relatives but also by the municipal coroner’s office. Although the quality of information varies between regions, the notification procedures are determined by national law, and the path of the death certificate from coroner’s office to the Mortality Information System is identical in all Brazilian states (Borges et al. 2012). Deaths from external causes in Brazil have increased in the last decades, and these deaths are concentrated among young adults (de Mello Jorge et al. 1997; Matos et al. 2013; Soares Filho et al. 2007). Taken together, this evidence suggests that coverage of infant deaths should not he higher than coverage of deaths at ages 1–29: π_{0} ≤ π_{1}.

For infant deaths (π_{0}) and for all deaths (π_{all}), we build priors from field audit estimates. For every area, we aggregate the corresponding municipal field audit estimates, weighted by deaths, to calculate estimated coverage levels, $\pi \u03020$ and $\pi \u0302all$. The associated priors are

where *K*_{0} and *K*_{all} are hyperparameters representing (uncertain) levels of precision from the field audit estimates.^{9}

For mortality coverage at ages 30+, which we denote as π_{2}, external information consists of state-level estimates from the six DDM studies. We use the mean and variance of the DDM estimates to estimate the parameters of a beta distribution by the method of moments. The resulting prior is

where ϕ_{2} is the mean of the six DDM estimates, and *K*_{2} is an estimated precision index.^{10} For example, for Maranhão males, the DDM estimates are (.50, .66, .75, .78, .78, .97), so we use π_{2,Maranhão} ∼ *Beta*(7.00 *×* 0.74, 7.00 *×* 0.26). This prior distribution answers the following question: Given the DDM estimates, how plausible are different levels of coverage for males 30 and older in Maranhão? The answer is probabilistic: a priori, there is a 10 % probability that the coverage level is below .52, a 50 % probability that it is between .64 and .86 (the interquartile range), a 10 % chance that it is above .92, and so on.^{11}

Prior information for π_{2} is at the state level, so when we estimate coverage for substate areas like microregions, the prior applies to the weighted registration probability:

_{i}is the proportion of state deaths at ages 30+ that occur in substate area

*i*∈ {

*a*. . .

*z*}, and π

_{2i}is the death registration probability in area

*i*.

^{12}

Finally, we use qualitative information from the literature by adding a prior that completely rules out any triples of local coverage probabilities that do not match our assumed order:

That is, we insist that in every local area, infant mortality coverage cannot be higher than coverage at other ages and that coverage of deaths at ages 1–29 cannot be lower than at other ages.

### Prior Distribution for TOPALS Parameters

TOPALS parameters **α** ∈ *ℝ*^{7} determine the shape and level of the age-specific mortality schedule. We use a vague prior for **α**, so that local death and exposure data are the primary determinants of rate estimates, via the likelihood function. Our prior for **α** is a multivariate normal distribution derived from two principles: (1) each α_{i} component should have very similar prior probabilities over a wide range of possible values, and (2) very large differences between consecutive components α_{i}*−* α_{i − 1}, *i* = 1 *. . .* 6 are unlikely.

*N*(

**0**, Σ) with covariance matrix

^{13}

The marginal priors for each α_{i} component are uninformative about levels, which are measured on the log-mortality rate scale. For example, the *a priori* distribution for the infant mortality offset is α_{0} ∼ *N* (0, 3.11), implying a 57 % prior probability that α_{0} falls in [–1.39, +1.39]. This is a very vague prior assumption: it says that there is a 57 % probability that a region’s infant mortality rates is between one-fourth and four times the rate in the standard schedule and a 43 % prior probability that rates might be even more extreme. The correlation structure in Σ is also only weakly informative about differences between **α** components; it serves mainly to stabilize estimates in extremely small populations with very few deaths by giving slightly higher prior probabilities to simple up-and-down shifts in the standard schedule (which occur when α_{0} = α_{1} = *· · ·* = α_{6} and differences between consecutive αs are all zero).

## Results

We implemented the full model in *Stan* (Carpenter et al. 2017), a language that allows MCMC sampling from complex posterior distributions. For both sexes, we estimated posterior distributions of mortality parameters (**α**), complete log mortality schedules (**λ**(**α**) = **λ**^{∗} + **Bα**), and life expectancy (*e*_{0}[**λ**(**α**)]) for all 27 Brazilian states and all 558 microregions. We call this the *adjusted* model. In order to learn about the effects of underregistration of deaths, we also estimated the same posterior distributions under the (incorrect) assumption of 100 % registration (π_{0} = π_{1} = π_{2} = 1). We call this second version the *unadjusted* model. We focus here on summary results for male life expectancy. Complete results, together with data and code, are available on the article’s companion website (http://mortality-subregistration.schmert.net/).

## Effects of Imperfect Mortality Coverage π on Estimated State-Level Life Expectancy

Figure 4 illustrates posterior distributions of male life expectancy for two states: the small state of Amapá (total population *≈* 0.6 million) and the Federal District that contains Brasília (*≈* 2.5 million). The figure includes posterior densities before any adjustment for underregistration (on the right in each panel) and after adjustment (on the left).

Life expectancy is a complicated nonlinear function of the underlying **α** parameters, but the Bayesian approach permits easy calculation of uncertainty in *e*_{0} for both the unadjusted and adjusted estimates, via Eq. (8). Figure 4 shows that even if death registration were complete (right-hand densities), there would still be considerable uncertainty about state-level life expectancy because of sampling variability. Sampling uncertainty is naturally greater for Amapá, which has a smaller population, but is nontrivial even for the Federal District, which has a male population of more than 1 million.

Adjusting for underregistration lowers life expectancy estimates. The difference can be very large in areas where coverage is poor. A probabilistic model allows us to estimate the magnitude of these effects. In Amapá, for example, plausible corrections for underregistration of deaths reduce male *e*_{0} by approximately three years, as illustrated by difference between the unadjusted and adjusted posterior medians in the top panel of Fig. 4. The corresponding adjustment is very small for the Federal District, where death registration is nearly complete.

In addition to lowering the mean, uncertainty about vital registration coverage also increases uncertainty about life expectancy. This effect is visible for both states, but it is larger for the state with lower coverage levels (Amapá). As with the decrease in means, the fact that the variance will increase is qualitatively obvious. However, a carefully constructed probabilistic model allows demographers to estimate the increase in uncertainty caused by imperfect vital registration.

Figure 5 shows estimated 2010 male life expectancies for all 27 states, disaggregated by region. It reports the Bayesian model’s death registration adjustments and distributional information using the abbreviated format on the horizontal axes of Fig. 4. The results for Amapá (AP) and the Federal District (DF) in Fig. 4 appear on the 4th line from the top and the 11th line from the bottom, respectively, of Fig. 5.

For each state, Fig. 5 includes the unadjusted male life expectancy (*e*_{0}) calculated directly from registered deaths (open circles) and the adjusted estimates and 80 % posterior interval from our partial-coverage model (two-letter abbreviations and shaded bars). Horizontal distances between open circles and state abbreviations in Fig. 5 illustrate median adjustments in life expectancy due to unregistered deaths.

Accounting for underregistration of deaths leads to large downward adjustments in life expectancy in the North and Northeast regions, and to small or negligible downward adjustments elsewhere. The most extreme adjustment is for the state of Maranhão (MA) in the Northeastern region. Direct use of death registration data would suggest that Maranhão has Brazil’s highest male life expectancy (74.3 years). In contrast, adjusted Bayesian estimates for Maranhão have a median of 70.4 years, almost four years lower. Northern states, such as Pará (PA: 72.2 *→* 68.6 years), Amapá (AP: 72.7 *→* 70.0), and Amazonas (AM: 72.3 *→* 69.8), also require large downward adjustments in *e*_{0} to account for likely levels of unregistered deaths.

### Comparison With Brazil’s Official State-Level Estimates

Bayesian posterior distributions for *e*_{0} come from a range of plausible state-specific registration coverage levels. They are therefore useful as comparative benchmarks for the official state-level estimates from IBGE, Brazil’s census bureau. IBGE uses a complex, multistep procedure to correct for underregistration of deaths (Instituto Brasileiro de Geografia e Estatística 2013). Their procedure combines different indirect methods for mortality and coverage estimation, and differs by region. As a result of this complexity, researchers (including us) have been unable to replicate the official state-level estimates.

A comparison to adjusted and unadjusted means from vital registration data is therefore useful for understanding the official state-level estimates.

Filled circles in Fig. 5 correspond to IBGE estimates for male *e*_{0} in each state. Thus, when a filled circle falls to the left of the open circle, the IBGE estimate is equivalent to assuming that deaths are underregistered. A filled circle to the right of an open circle is equivalent to assuming *over*registration of deaths.

Figure 5 shows that official life expectancy estimates for states in the South and Southeast (bottom panels) are implausibly high. Comparison with registered death data indicates that IBGE estimates are plausible only if the vital registration system substantially overcounts deaths in these states.^{14}

Figure 5 also shows that the IBGE estimates for states in Brazil’s North and Northeast regions are implausibly low. For almost all states in the North and Northeast, the official *e*_{0} estimate for males falls far below the 10th percentile of the adjusted posterior distribution. In other words, likely combinations of vital registration coverage and mortality schedules in these states produce life expectancies that are much higher than the official estimates.

### Signal and Noise in State Life Expectancy Estimates

Comparisons between areas are often important for public policy purposes, including allocation of health expenditures and other resources. Realistic estimates of the uncertainty of life expectancy highlight potential problems with allocating resources based on the most likely point estimates, even for areas with large populations. As noted earlier, states are not especially small areas: the least-populous Brazilian state (Roraima) has nearly half a million residents. Even so, after adjusting for likely underregistration of deaths, determining which of a pair of states has the higher life expectancy can be difficult.

Fig. 6 shows 80 % credible intervals for each state’s male *e*_{0} in the full model, including likely underregistration of deaths. As in the previous plot, the horizontal bars span the interval between the 10th and 90th percentiles of the posterior distribution. Solid dots indicate the posterior medians. States are sorted in order of posterior medians, which we use as best-guess point estimates. The Federal District has the highest estimate (72.3), Santa Catarina has the second-highest (71.9), and so forth. Some states, such as Rio Grande do Sul (median = 70.9) or Paraná (70.3) have very large populations and almost perfect vital registration. Consequently, certainty about *e*_{0} is very high in those states. In contrast, some of Brazil’s sparsely populated northern states such as Roraima (70.6) or Amazonas (69.8) have smaller populations and coverage levels that are much less certain. In these cases, we are much less certain about which state actually has the higher life expectancy.

Samples from the posterior distribution allow us to make probabilistic statements about the ranking of areas, which could be important for allocation of health resources. For example, the posterior medians of *e*_{0} in Amazonas and Roraima are 69.8 and 70.6, respectively, but in 730 of 4,000 samples from the posterior in Eq. (8), the Amazonas life expectancy was higher. Thus, even though the point estimate for Amazonas life expectancy is almost one year lower, the *a posteriori* estimate is $Pe0AM>e0RO\u2248.18$.

Analysis of posterior samples also allows probabilistic statements about which states have the lowest and highest life expectancies. For example, Alagoas had the lowest *e*_{0}(**α**) value in 3,592 of 4,000 samples, implying an approximate 90 % probability that this state has a lower male life expectancy than all others. Alternative candidates for lowest male *e*_{0} are Pará (*p*_{lowest} = 0.06) and Pernambuco (*p*_{lowest} = 0.01). Analogous calculations show that the Federal District (*p*_{highest} = 0.91), Piauí (*p*_{highest} = 0.05), and Rio Grande do Norte (*p*_{highest} = 0.02) are the top candidates for highest male life expectancy.

### Microregion-Level Estimates

We also used the Bayesian TOPALS model to estimate adjusted posterior distributions of **α** and *e*_{0}(**α**) separately by sex for all 558 Brazilian microregions. Results for males appear in Fig. 7, which displays posterior medians, and in Fig. 8, which displays the widths of each microregion’s 80 % posterior probability interval.

The point estimates in Fig. 7 show that high-mortality (low-*e*_{0}) regions for males tend to be concentrated along the Atlantic coast, particularly in large cities. Pockets of high mortality are also found in scattered areas of northern and western Brazil. Point estimates of mortality are especially low in parts of southern Brazil and in the Northeastern states of Piauí and Ceará.

Some of the point estimates in Fig. 7 are much more reliable than others, however. Figure 8 shows a strong North-South gradient in the precision of male *e*_{0} estimates. Posterior distributions (analogous the adjusted densities for states in Fig. 4) are notably narrower in southern microregions. As a result, we can be much more certain about small-area life expectancies in Brazil’s South and Southeast than in other regions.

Differences in the uncertainty of these small-area estimates arise for two reasons. First, at any given level of vital registration coverage, microregions with larger populations will have more reliable mortality estimates because of larger sample sizes. Many regions in northern and western Brazil are very sparsely populated; thus, even if registration coverage were known exactly, local estimates would still be subject to considerable sampling error. Second, the level of underregistration of deaths is also uncertain in many small areas. The Bayesian model accounts for this extra source of uncertainty, resulting in wider posterior distributions for small areas with more uncertain vital registration coverage.

### Signal and Noise in Microregion Life Expectancy Estimates

We have already highlighted the difficulty of ranking or comparing state-level *e*_{0} estimates. For smaller areas, the signal-to-noise ratio changes for the worse, and comparisons become even more difficult. As an example, Fig. 9 shows Bayesian-adjusted estimates of male *e*_{0} for the 32 microregions in the state of Bahia (highlighted with a thick border in Figs. 7 and 8). At this level of geography, rankings of areas and differences in estimated life expectancy are largely overwhelmed by uncertainty. For example, Livramento do Brumado (1st line of Fig. 9) has the highest estimated male *e*_{0}, with a posterior median of 73.3 years. But the posterior probability that this microregion has the highest male life expectancy is only 37 %: it ranked highest in 1,479 of 4,000 posterior samples from the posterior in Eq. (7). Cotegipe (2nd line, with a posterior median of 73.0) has a 29 % chance of being the highest, Jeremoabo (3rd line, 72.2) has a 9 % chance, and so on.

Consider comparing Livramento do Brumado (location 1, on the top line of Fig. 9, estimated *e*_{0} = 73.3 years) and Boquira (location 11, 11th line, 71.6 years). Repeated sampling from the posterior distribution (local priors + local data) produces a large number of plausible ($e01,e011$) pairs for male life expectancies in these two places, illustrated in Fig. 10. Uncertainty is a result of small sample sizes, very small numbers of registered deaths at some ages, and imprecise estimates of local death registration probabilities. Although the point estimate for $e01$ is nearly two years higher than the estimate for $e011$, considerable uncertainty remains about which region has the higher life expectancy. Given data and priors about death registration, the posterior probability that male life expectancy is actually higher in Boquira than in Livramento do Brumado is 11 % ($e011\u2212e01>0$ in 447 of 4,000 samples), and there is even a 3 % probability that it is more than one year higher ($e011\u2212e01>1$ in 107 of 4,000 samples).

The main message of Figs. 9 and 10 lies in the very high overlap between the posterior distributions of many of the microregions. These are not especially small areas: Livramento do Brumado was the least-populous microregion in Bahia, but it had a total population of 97,786 in 2010, and the median total population of Bahian microregions was slightly below 290,000.

Despite these fairly large areas, however, uncertainty dominates most pairwise comparisons. It is clear that at this geographic level, researchers and policy makers should not rely on point estimates to distinguish high- and low-mortality areas—especially if differences in best-guess estimates of median *e*_{0} are less than one year. That result applies even more strongly to smaller areas, such as municipalities.

## Conclusion

A Bayesian model is a natural approach to estimating small-area mortality when the vital registration system is imperfect. In this article, we show that it is straightforward to combine age-specific mortality rates and coverage probabilities in a unified model.

The TOPALS relational model for age-specific rates is a useful component in Bayesian modeling for small areas. Because it is flexible but low dimensional, the TOPALS parametric model for age-specific mortality schedules makes it possible to estimate rates and life expectancies even for very small populations.

It is illuminating to consider correction for underregistration as a statistical, rather than arithmetic, process. An explicitly statistical model naturally emphasizes the uncertainty of results. Brazilian microregions are not especially small areas. Nevertheless, the uncertainty in life expectancy estimates is often much greater than the differences between local point estimates.

One of our main substantive findings concerns the signal-to-noise ratio in small-area estimates. The levels of posterior uncertainty in microregional life expectancy estimates for Brazil make it clear that attempts to estimate mortality at even smaller levels of geography (for example, for the 417 municipalities in Bahia that are subregions of those displayed in Fig. 9) face serious and possibly insurmountable difficulties. Even with good statistical methods for estimating mortality in very small populations, realistic assessment of uncertainty suggests that drawing meaningful distinctions about the mortality of those populations may be extremely difficult. Demographers should stay humble.

## Acknowledgments

This research was supported by the Capes Foundation of Brazil’s Ministry of Education. Marcos R. Gonzaga gratefully acknowledges support from Research Projects 470866/2014-4 (Estimativas de mortalidade e construção de tabelas de vida para pequenas áreas no Brasil, 1980 a 2010 MCTI/CNPQ/MEC/Capes/Ciências Sociais Aplicadas) and 454223/2014-5 (Estimativas de mortalidade e construção de tabelas de vida para pequenas áreas no Brasil, 1980 a 2010/MCTI/CNPQ/Universal 14/2014).

### Appendix: Statistical Distribution of Registered Deaths

*Y*, using a mixture of heterogeneous risks, is (Greene 1997:939–940)

*z*is a multiplicative risk factor with density

*g*(

*z*) over positive real numbers. This mixture model,

*Y*∼

*PoissonMix*(λ,

*g*), describes the distribution of count variable

*Y*∈ {0, 1, 2, . . .} in terms of a scalar parameter λ and a density function

*g*(). It generalizes the Poisson distribution by allowing the mean and variance of

*Y*to differ. In particular, it provides a framework for modeling overdispersion (

*V*(

*Y*)

*> E*(

*Y*)), which is often observed in count data.

The mixture model includes the standard Poisson distribution as a limiting case: as the distribution *g*(*z*) approaches a constant at *z* = 1, *Y*’s distribution approaches a Poisson with *E*(*Y*) = *V* (*Y*) = λ. It also includes the negative binomial distribution: if *g*(*z*) is a gamma density with *E*(*z*) = 1 and *V*(*z*) = 1 / θ, then *Y* has a negative binomial distribution with *E*(*Y*) = λ and *V*(*Y*) = λ + (λ^{2} / θ). Other {λ, *g* ()} mixtures yield other discrete distributions for *Y*.

*D*deaths is

*R, D*) is

*R*) and unregistered deaths (

*U*=

*D − R*), the same expression is

*R*) is therefore

*R*) therefore has exactly the same mathematical form as the marginal distribution of total deaths (

*D*), except that parameter λ is replaced with λπ. That is,

*D*~

*Poisson*or

*D*∼

*NegBinom*, as well as to other Poisson mixtures. Most importantly for this article, it demonstrates that if total deaths have a Poisson distribution with expected value λ =

*N*μ, then registered deaths also have a Poisson distribution, but with expected value λπ =

*N*μπ.

## Notes

^{1}

In the appendix, we also demonstrate that a negative binomial distribution for total deaths implies a negative binomial distribution for registered deaths. A negative binomial model would be appropriate if the data exhibit *overdispersion*—that is, higher variance than predicted by a Poisson model. With the Brazilian data that we use in this article, extensive experimentation produced no evidence of meaningful overdispersion, and posterior distributions of mortality rates were virtually identical with Poisson and negative binomial specifications. We therefore use a Poisson distribution for *D* in this article.

^{2}

By omitting an explicit prior, we assume *a priori* that μ is equally likely to take any positive real value. The omitted (improper) prior is therefore *f*_{μ}(μ) ∝ *I*(μ ≥ 0), where *I*() is a (0,1) indicator function. This yields a proper posterior distribution for (μ, π) and a proper marginal posterior for μ in Eq. (4).

^{3}

Gonzaga and Schmertmann (2016) showed that this property makes the specific choice of a standard schedule **λ**^{*} far less important than in other relational models used in demography. Note that the TOPALS model includes indirect standardization as a special case in which all **α** values are equal and the standard schedule is shifted up or down by the same amount at all ages.

^{4}

In Brazil, municipalities are the smallest areas responsible for registering vital events.

^{5}

For simplicity we consider the Federal District that contains Brasília to be a state.

^{6}

Even microregions are fairly large “small areas.” With a single exception (the remote island of Fernando de Noronha, with a total resident population of only 2630 in 2010), all had resident populations of at least 20,000 in 2010. Rounded to the nearest thousand, the 10th, 50th, and 90th percentiles of microregional population were, respectively, 63,000, 173,000, and 557,000. The largest microregion, metropolitan São Paulo, had a 2010 population of more than 13 million.

^{7}

The complete name in Portuguese is *Busca ativa de óbitos e nascimentos no Nordeste e na Amazônia Legal* (Active search for deaths and births in the Northeast and the Amazonian administrative region).

^{8}

In practice, we used identical weights for each region: *w* = (.035, .109, .856) for males and *w* = (.037, .047, .916) for females. These were calculated from national deaths over 2009–2011.

^{9}

Hyperparameters, *K*, correspond to sample sizes in a field audit. Prior uncertainty about *K* represents uncertainty about the precision of the field audit estimates of π. Our (hyper)priors for *K* are fairly conservative: they imply that the most likely precision of the field audit estimates is equivalent to results from an audit of slightly fewer than *K* = 25 deaths in a region.

^{10}

Denoting the mean and variance of DDM estimates as $x\xaf$ and *s*^{2}, the method of moments estimators (cf. Glen and Leemis 2017:227–228) are $\varphi 2=x\xaf$ and $\Kappa 2=x\xaf1\u2212x\xafs2\u22121$.

^{11}

Priors based on *busca ativa* estimates are constructed from a single coverage estimate for each region by adding a hyperparameter for the estimate’s unknown precision. In contrast, priors from DDM estimates are based on multiple estimates per region and use the variance of those estimates as an index of (im)precision. A third alternative, which we do not use here, is to choose beta distribution parameters ϕ and *K* so that available estimates are all in a specified range of prior probability—for example, a 90 % probability that *π* ∈ [min(*DDM*), max(*DDM*)].

^{12}

Because we have only state-level prior information about death registration in this age group, we can assess only the joint prior probability of a *set* of substate coverage levels, (π_{2α} . . . π_{2z}), by looking at whether their weighted average is likely.

^{13}

This prior distribution arises from two lines in the *Stan* programming language that we use for MCMC sampling. From our first principle (diffuse marginal distributions for each α_{i}), we add **α ~***normal*(0,4) to the model. From the second principle (small differences between consecutive parameter values) we add α_{i} – α_{i – 1} ~ *normal*(0, sqrt(0.5)), as in Gonzaga and Schmertmann (2016). These statements in *Stan* represent changes to the log prior density of any proposed **α** vector, which together yield this specific multivariate normal distribution. The results that we report in this article are extremely insensitive to the choice of priors for **α**.

^{14}

The high estimates for life expectancy in the South and Southeast probably result from IBGE’s *compatibilização* step (Instituto Brasileiro de Geografia e Estatística 2013: tables 6 and 13), in which they adjust national totals by removing deaths from these two regions.

## References

*Estudos e pesquisas. Informação demográfica e socioeconomic*[Sex- and age-specific abbreviated life tables for Brazilian states and major regions in 2010: Studies and research. Demographic and socioeconomic information].

*N*parameter: A hierarchical Bayes approach