## Abstract

Widespread population aging has made it critical to understand death rates at old ages. However, studying mortality at old ages is challenging because the data are sparse: numbers of survivors and deaths get smaller and smaller with age. I show how to address this challenge by using principled model selection techniques to empirically evaluate theoretical mortality models. I test nine models of old-age death rates by fitting them to 360 high-quality data sets on cohort mortality after age 80. Models that allow for the possibility of decelerating death rates tend to fit better than models that assume exponentially increasing death rates. No single model is capable of universally explaining observed old-age mortality patterns, but the log-quadratic model most consistently predicts well. Patterns of model fit differ by country and sex. I discuss possible mechanisms, including sample size, period effects, and regional or cultural factors that may be important keys to understanding patterns of old-age mortality. I introduce *mortfit*, a freely available R package that enables researchers to extend the analysis to other models, age ranges, and data sources.

## Introduction

In the developed world, life expectancy at birth continues to rise, and fertility levels are low. At the same time, much of the developing world has seen declines in birthrates and improvements in child and adult survival (United Nations Population Division 2015). Together, these trends have led demographers to forecast widespread and rapid population aging (Gerland et al. 2014). This shifting demographic landscape raises critical scientific and policy questions about the nature of human longevity (Kinsella and Phillips 2005; Martin and Preston 1994). To understand this phenomenon, researchers have proposed several different theories that aim to explain and predict patterns of death rates at advanced ages.^{1} This study carefully assesses the amount of empirical support for these proposed theories.

Understanding theoretical models of old-age mortality is important for several reasons. First, these models are critical to progress in answering a key question in the science of aging: Can we expect lifespans to increase indefinitely, or is there some upper limit at which our biology renders continued improvements essentially impossible (Dong et al. 2016; Lenart and Vaupel 2017)? To answer this question, researchers need evidence about empirical regularities in old-age death rates. Unfortunately, the sparse numbers of deaths at advanced ages has made establishing these empirical regularities quite difficult. For example, although researchers have widely accepted that death rates accelerate over the course of middle and early-old ages, they have reached less agreement about what happens to mortality at advanced ages: whereas some models of old-age mortality imply that death rates will continue to increase at oldest ages (Dong et al. 2016; Gompertz 1825; Makeham 1860), other models predict that death rates will decelerate or even plateau (Gavrilov and Gavrilova 2011; Horiuchi and Wilmoth 1998; Thatcher et al. 1998). Second, many policy-makers and planners need accurate, mathematical summaries of death rates at advanced ages in order to produce forecasts and projections. For example, old-age mortality forecasts are a critical input to planning for health care and pension systems (Bongaarts 2005; Tabeau 2001). Finally, understanding old-age survival patterns is important for a number of other research questions of great relevance to sociology, economics, and public policy. For example, a mathematical description of old-age mortality is needed to build a behavioral model of how increases in longevity are related to changes in savings and investment behavior (Sheshinski 2007).

## Background and Theory

### Mortality Models

*population hazard function*(Fig. 1; online appendix, section B.1). For a group of people, the

*population hazard function*is defined as follows:

*S*(

*z*) is the

*population survival function*—that is, the proportion of group members that survives to exact age

*z*.

*z*) and the probability of dying between ages

*z*and

*z*+

*k*, conditional on surviving to age

*z*:

Scientifically, it is important to understand both how well a particular theory fits empirical data in an absolute sense and how well a theory fits empirical data relative to other theories. Understanding how well a particular theory fits the data in an absolute sense is critical to its plausibility: a theory that is entirely inconsistent with empirical evidence must be discarded. Understanding how well a theory fits relative to other theories is critical to progress: even a theory that produces adequate fits to empirical data may be replaced by an alternate theory that consistently fits the data better. Relative comparisons of theories might also reveal that there is no consistent pattern to which theory fits data best, and thus multiple theories may have merit; that different theoretical mechanisms are important for different subgroups; or that adequate explanatory theories have not yet been proposed.

### Functional Forms for the Population Hazard

The most prominent models of old-age mortality have been expressed as hazard functions (Table 1). I now briefly review the theory associated with each of these models.^{2} These models can be described in four types.

The first type contains models that can be considered generalizations of the Gompertz model. The Gompertz model is the oldest mathematical description of mortality; it was first proposed as a descriptive theory that could closely capture empirically observed risks of death, which increase rapidly across adult ages (Gompertz 1825). Researchers following Gompertz developed mechanism-based probability models that can produce Gompertzian mortality dynamics (Strehler and Mildvan 1960). Gompertz is considered to provide a good, stylized description of adult mortality throughout the middle to early-old age range, but some researchers argue that it fails to capture the dynamics of oldest-age mortality patterns (e.g., Vaupel et al. 1998). The Makeham model was designed to generalize the Gompertz model by adding a parameter that captures age-invariant background mortality; this background mortality could, for example, describe everyone’s risk of death due to environmental factors that do not vary with age (Makeham 1860). The log-quadratic model, which emerged from biodemographic investigations of lifespan and aging rates, can be seen as a generalization of the Makeham model that adds a parameter allowing increases in death rates to decelerate at advanced ages (Coale and Kisker 1990; Steinsaltz and Wachter 2006; Wilmoth 1995).

The second type contains the Weibull model, which can be motivated by drawing analogies between human biology and theories of reliability rooted in engineering and manufacturing. These reliability theories describe systems comprising multiple components, each of which has some time-dependent risk of failure independent of the other components (Weibull 1951). The Weibull model would make sense if, for example, the human body could be accurately modeled as a set of independent organs, with death resulting when any of the organs fails.

The third type contains the logistic, Beard, Perks, and Kannisto models. These four models arose from research on heterogeneity in mortality that postulated a distribution over individual hazard functions, leading to a population hazard that is a continuous mixture of these individual hazards (Beard 1959; Perks 1932; Vaupel et al. 1979). For example, the Beard model can be derived by assuming that individuals face a Gompertz hazard and that each individual’s hazard is scaled by a Gamma-distributed variate, often called “frailty” (Beard 1959; Horiuchi and Wilmoth 1998; Manton et al. 1981). The Kannisto model is unique in this set because it was not derived from first principles; instead, it was proposed as a heuristic two-parameter approximation of the more complex heterogeneity models (Himes et al. 1994; Thatcher et al. 1998).

Finally, the fourth type contains the Lynch-Brown model. This model was devised to describe and interpret observed crossovers in death rates between black and white populations in the United States (Lynch and Brown 2001; Lynch et al. 2003).

To recap, many theories have been proposed to explain the patterns of mortality at oldest ages. Each theory corresponds to a population hazard function that mathematically describes the predicted risk of death by age for a group of people. Population hazards can be converted into expected numbers of deaths by age, and predicted deaths by age can then be compared with observed deaths by age to determine how well each theory agrees with empirical evidence. Section E of the online appendix provides additional details and references to the literature.

## Data

To empirically test the models in Table 1, I fit each model to data from the Kannisto-Thatcher (K-T) Database on Old Age Mortality. The K-T database is a carefully curated collection of data on mortality at advanced ages, with information on deaths and exposure beyond age 80 for 35 countries, with some of the Scandinavian data going back to the mid-eighteenth century (Kannisto et al. 1994; MPIDR 2014). Because data quality for mortality at advanced ages is a serious concern (e.g., Black et al. 2017), I analyze only the subset of country-cohorts found to be of high quality by the expert review of Jdanov et al. (2008).^{3} Specifically, I retain the cohort data from all countries where more than one-half of the years were of the highest quality, and the remaining years were of the second-highest quality in the assessment the authors provided. Furthermore, I select only those countries and periods for which deaths were reported by single year of age up to 104. This leaves data from Denmark, France, West Germany, Italy, Japan, the Netherlands, Sweden, and Switzerland. In total, the analysis data set has 360 cohorts of data on survivors and deaths by sex and single year of age from 10 countries. Table 2 describes the years and cohorts in the analysis data set, and section A of the online appendix has more detail on how the analysis sample was chosen.

## Methods

### Likelihood and Estimation

*N*

_{z}people from a cohort survive to exact age

*z*and all members of the cohort face the same hazard π(

*z*), then cohort deaths between exact ages

*z*and

*z*+ 1 are distributed binomially:

*D*

_{z}is the number of deaths between ages

*z*and

*z*+ 1, and π(

*z*,

*z*+ 1) is the probability of dying between ages

*z*and

*z*+ 1 (Chiang 1960).

**θ**), an observed sequence of deaths (

**D**=

*D*

_{1},

*D*

_{2}, . . .), and survivors to each age (

**N**=

*N*

_{1},

*N*

_{2}, . . .) is then

*K*is a constant that does not vary with

**θ**. To fit a particular hazard model to a data set (

**D**,

**N**) from a particular cohort, Eq. (4) is maximized as a function of

**θ**. The computation required to perform this maximization is not trivial; therefore, sections B.2 and B.3 of the online appendix provide a more detailed description of the methods used to fit each cohort data set and introduce the freely available R package

*mortfit*, which other researchers can use to develop and fit their own mortality models.

^{4}

Fitting mortality hazards by building a likelihood-based model for the observed deaths by age has three advantages. First, it is uncommon for people to survive to old ages, making absolute numbers of deaths and survivors small and rapidly decreasing by age. Fitting hazards using maximum likelihood naturally accounts for the amount of data available at each age; intuitively, older ages with fewer survivors and deaths have less information about likely parameter values and thus have less influence on parameter estimates than younger ages, for which ample data are observed. A different approach, such as fitting a regression to observed central death rates, can result in parameter estimates that are heavily influenced by noisy observations at the oldest ages. Several studies have argued for the benefits of likelihood-based estimates of mortality models (Manton et al. 1981; Pletcher 1999; Promislow et al. 1999; Steinsaltz 2005; Wang et al. 1998). Second, fitting mortality hazards using likelihood-based inference enables the use of principled model selection techniques to help determine which model is most consistent with the data. The theory for one important technique—Akaike’s information criterion (AIC)—relies on likelihood-based inference. (Later in this article, I discuss why these principled model selection techniques are important.) Finally, I model the counts of deaths by age because those are the data that are actually observed; modeling another quantity, such as central death rates, would require making additional assumptions.

### Model Selection

where *z* ranges over the ages in the data set. This quantity provides a natural measurement of each model’s accuracy—that is, how close each model’s predictions come to the observed numbers of deaths, in absolute terms.

The statistical literature on model selection has revealed that accuracy is only one of two important factors in assessing model fit. The second important factor is generalizability, which is the extent to which a model’s performance on the observed data set can be expected to be replicated on future data sets that arise from the same data-generating process. Overfitting is a crucial consideration here (Burnham and Anderson 2003; Claeskens and Hjort 2008). Models with many free parameters are more flexible, meaning that they can bend to fit the shape of observed patterns in the data; thus, models with many free parameters can be expected to fit any particular data set closely. This flexibility could be advantageous or disadvantageous: a more flexible model would be able to capture more complex relationships between age and the risk of death. However, a more flexible model could also overfit—that is, it might pick up noisy artifacts because of small sample sizes at old ages.

*k*is the number of parameters being estimated. Although Eq. (6) looks simple, it has been motivated by extensive theory. This theory shows that the AIC selects the candidate model that minimizes the expected Kullback-Leibler distance between the distribution of data implied by the model and the one that generated the data (Akaike 1974; Burnham and Anderson 2003; Claeskens and Hjort 2008). The derivation of the AIC reveals that the second term (2

*k*) is a bias adjustment that corrects for overfitting; thus, the AIC results are an indication of how well each hazard function trades off the number of parameters estimated with the accuracy of its fit to the data.

The derivation of the AIC reveals that absolute AIC values are not interpretable; however, two models can be compared by examining the difference between their AICs. Thus, the results I present focus on a quantity called ΔAIC. For a particular model fit, ΔAIC is the difference between (1) the AIC value attained by the model, and (2) the AIC value obtained for the model that best fit the data set. Thus, for the best-fitting model, ΔAIC is 0; for other models, ΔAIC will be greater than 0. The closer ΔAIC is to 0, the better its fit.

The following results focus on the AIC because it accounts for the tension between model complexity and accuracy, it is motivated by rigorous statistical theory, and it is widely used in the natural and social sciences. Alternative theoretical perspectives have been used to develop other principled model-selection criteria. Section C of the online appendix describes two of these alternatives—the Bayesian information criterion (BIC) and *K*-fold cross-validation—and compares the results based on the AIC to analogous results using these two alternative methods. I believe that the theory that motivates the AIC is most relevant to the goal in this study, but more detailed discussions of model selection, including comparisons between the AIC and BIC, can be found elsewhere (Burnham and Anderson 2003, 2004; Claeskens and Hjort 2008; Hastie et al. 2009; Hoeting et al. 1999; Kass and Raftery 1995).

## Results

Each model in Table 1 was fit separately for males and females in each cohort in Table 2. To build intuition, I first illustrate the results for a single cohort and then examine the results for the entire data set and for specific countries.

### An Illustrative Example

Figure 2 shows the observed data and fitted models for Danish males born in 1895. The circles show the observed central death rates, and the area of each circle is proportional to the number of person-years of exposure observed at each age. Figure 2 shows that there is much more information about the cohort’s mortality experience at age 80 than at age 100. By the oldest ages, the amount of data has diminished dramatically, meaning that the data contain relatively little information about the underlying mortality process at oldest ages. The curves in Fig. 2 show the maximum likelihood fit of each model from Table 1. Some models apparently fit the observed data better than others: for example, the Lynch-Brown model is able to bend at oldest ages to account for what may be a deceleration in the observed central death rates, while the Gompertz model appears to be forced to forgo fitting the small amount of data at the highest ages in favor of closely fitting the majority of deaths, which are concentrated at earlier ages.

Table 3 shows several quantitative summaries of the fits shown in Fig. 2. The models are ordered by the rank that they attain using the AIC, where rank 1 is the model that fits the data the best. Table 3 also shows the log-likelihood, the sum of squared errors in the estimated number of deaths at each age (SSE), the SSE rank, and the difference between each model’s AIC and the minimum AIC (ΔAIC). According to the SSE, the fitted deaths from the Lynch-Brown model are closest to the observed deaths for this cohort. However, the four-parameter Lynch-Brown model is very flexible, and the SSE makes no adjustment to protect against overfitting. Table 3 shows that the AIC—which penalizes each model’s complexity to guard against overfitting—suggests that the Kannisto model performs the best. Thus, although the Kannisto model is slightly less accurate than the Lynch-Brown model, this is compensated for by the fact that the Kannisto model uses only two parameters while the Lynch-Brown model uses four. The AIC suggests that if we saw more data from the same cohort, the fitted Kannisto model would be likely to fit this unseen data more accurately than the fitted Lynch-Brown model.

### Results for All Cohorts

Figure 3 shows boxplots that summarize the distribution of ΔAIC across all 360 cohorts, by sex. Note that these results essentially weight the mortality experience of each country by the number of cohorts that it contributes to the sample (Table 2). Several observations emerge from this figure. First, according to the median ΔAIC value, no single model clearly performs the best; instead, several of the models have very similar performance. For example, across both sexes, the log-quadratic, Beard, and Kannisto models have the lowest median ΔAIC values, and those median ΔAIC values are all very close. Second, Fig. 3 shows much more variation in ΔAIC for some models than for others; for example, the interquartile range in ΔAIC across female cohorts is 2 for the Perks model but 15 for the Kannisto model. The Weibull model consistently performed poorly, so I do not discuss it further.

The variation in ΔAIC (Fig. 3) reveals more than one factor to consider when evaluating the fit of each model. It is desirable to find a model that (1) often fits the data well and (2) rarely fits the data poorly. Burnham and Anderson (2003) proposed rules that can help interpret ΔAIC with these two goals in mind: models with “substantial support” from the data have ΔAIC ≤ 2, and models with “essentially no support” from the data have ΔAIC > 10 (Burnham and Anderson 2003:72).

Figure 4 uses these rules to compare the fraction of cohorts for which each model fits the data well (*x*-axis) and the fraction of cohorts for which each model fits the data poorly (*y*-axis). Several observations can be made about Fig. 4. First, according to ΔAIC, the log-quadratic model appears to come closest to satisfying the two goals: it often fits the data well (ΔAIC ≤ 2 for about three-quarters of cohorts), and it rarely fits the data poorly (ΔAIC > 10 for fewer than one-tenth of cohorts); the Beard model performs almost as well as the log-quadratic. Second, several models that do not fit the data extremely well also avoid fitting the data extremely badly. In particular, the most flexible models—Perks, logistic, and Lynch-Brown—do not often fit the data the best (all have ΔAIC ≤ 2 for less than one-half of the cohorts), but these four-parameter models also rarely fit the data very poorly (all have ΔAIC > 10 for less than one-tenth of the cohorts). Finally, Fig. 4 reveals that most of the patterns of model fit are very similar for male and for female cohorts. There are three notable exceptions: the Kannisto, Gompertz, and Makeham models all show bigger differences between the fits for males and females than the other models do. For all three models, the point on Fig. 4 summarizing fits to male cohorts is below and to the right of the point summarizing fits to female cohorts; thus, the Gompertz, Makeham, and Kannisto models systematically do a better job of capturing the patterns in male mortality than female mortality.

Next, I investigate the possibility that there are pairs or groups of models that tend to perform relatively well or to perform relatively poorly for the same cohorts. Figure 5 shows a dendrogram that describes the similarity between patterns of ΔAIC for each model across all 360 cohorts.^{5} Figure 5 suggests that a natural grouping can be obtained by picking a dissimilarity of about 200, which results in three groups of models. The first group contains the log-quadratic, Perks, Beard, logistic, and Lynch-Brown models; the second group contains the Kannisto model; and the third group contains the Gompertz and Makeham models. Figures 3 and 4 show that models from the first group—which consists of functional forms that can bend to accommodate decelerating death rates at older ages—tend to fit the data well: they rarely have ΔAIC ≥ 10, and the log-quadratic and Beard models have the highest fraction of cohorts with ΔAIC < 2. In the second group, the Kannisto model’s relative performance is more ambiguous because of the large amount of variance in the quality of its fits. For some cohorts, Kannisto performs as well as models from the first group; but for other cohorts, Kannisto’s performance is more similar to the Gompertz/Makeham models. Finally, in the third group, the Gompertz and Makeham models clearly tend to perform worse than the more flexible models in the first group, suggesting that the steady exponential increase in death rates observed at middle adult ages tends not to continue into advanced ages.

### Results by Country

I now turn to an analysis of model fits within each country, focusing on the five countries where more than 10 cohorts are observed for each sex: Denmark, France, Italy, the Netherlands, and Sweden. The remaining five countries—Belgium, West Germany, Japan, Scotland, and Switzerland—had 10 or fewer cohorts each; to avoid the risk of making misleading inferences from so few cohorts, I omit them from this country-specific analysis.

Figure 6 again uses the rules of thumb to show the fraction of cohorts for which each model has considerable support (*x*-axis) and the fraction of cohorts for which each model has essentially no support (*y*-axis) within each country and by sex.^{6} The results suggest that there are big differences in how the models fit cohorts from different countries but that within each country, there tend not to be big differences in model fits by sex. The performance of the Gompertz and the Makeham models distinguishes countries from one another. In France and Italy, the Gompertz and Makeham models clearly perform poorly, and more flexible models fit well. For Denmark, the Gompertz and Makeham models do not rank at the top, but they rarely predict death rates very poorly (ΔAIC > 10 in few cohorts). Sweden and the Netherlands are intermediate cases for which Gompertz and Makeham predict poorly less than half of the time.

### Summary and Discussion

The Gompertz model provides a remarkably good description of death rates at young and middle adult ages, and researchers have long debated whether Gomepertzian increases in death rates continue indefinitely with age (e.g., Bebbington et al. 2014; Olshansky and Carnes 1997; Thatcher et al. 1998; Vaupel et al. 1998). A key finding of my analysis is that flexible, non-Gompertzian models tend to fit old-age death rates better than Gompertz models in my sample of 360 high-quality cohorts from the Kannisto-Thatcher database.^{7} This finding supports the idea that death rates tend to decelerate at oldest ages. Critically, I reach this conclusion by using a principled approach to model selection that accounts for the fact that there are very small numbers of deaths at advanced ages.

My findings are consistent with several previous studies that have uncovered evidence of non-Gompertzian patterns in old-age death rates using other methods or data sources (Barbi et al. 2018; Horiuchi and Wilmoth 1998; Thatcher et al. 1998; Vaupel et al. 1998; Yi and Vaupel 2003). However, recent analyses have also found support for Gompertz-type models of old-age mortality in some cases. Bebbington et al. (2014) used the Human Mortality Database to find that old-age mortality patterns for most countries do not follow a Gompertz-type trajectory. However, Bebbington et al. (2014) identified three notable exceptions for which the Gompertz model fit well—Australia, Canada, and the United States—raising the possibility that there may be a group of countries for which the Gompertz model provides a good description of old-age mortality. So far, this hypothesis has mixed support. In the United States, Gavrilov and Gavrilova (2011) and Gavrilova and Gavrilov (2015) analyzed the U.S. Social Security Death Master File and found additional evidence supporting Gompertzian old-age mortality in the United States. In Canada, however, Ouellette (2016) found non-Gompertz mortality trajectories in old-age mortality among French Canadians in Quebec. Further, recent research in actuarial modeling suggests that non-Gompertzian dynamics may appear in Australian old-age mortality (Li et al. 2011). Thus, recent studies have found support for the suggestion that U.S. old-age mortality dynamics may be Gompertzian, but results for Australia and Canada have been mixed. My study cannot directly address the debate in these three countries because the Kannisto-Thatcher data from Australia, Canada, and the United States are not of the highest quality and thus are excluded from this analysis.

Although I can conclude that flexible models from the first group tend to perform better than the Gompertz/Makeham models in the third group, my results do not support the idea that any particular one of these nine models universally describes old-age mortality. The log-quadratic model most frequently provided accurate fits, but even the log-quadratic has substantial support from the data (ΔAIC ≤ 2) in only about three-quarters of the cohorts studied; thus, even the log-quadratic model cannot be said to comprehensively capture all the patterns in my sample.

I also find that patterns of model fit differ by country. Models from the Gompertz group do not reliably fit well (ΔAIC < 2) in any of the countries; however, countries can be differentiated according to the fraction of cohorts in which the Gompertz group models perform poorly (ΔAIC > 10). France and Italy stand out because in these two countries, models from the Gompertz group are clearly inadequate: they do not often fit well, and they frequently fit poorly.^{8} Sweden and the Netherlands are intermediate cases in which models from the Gompertz group tend not to provide the best fit, but the Gompertz group models perform poorly for only about half of the cohorts. Finally, in Denmark, models from the Gompertz group tend not to provide the best fit but perform the best among the countries considered here. (In Denmark, none of the models consistently perform poorly.)

Several possible explanations exist for these differing patterns by country, and these explanations are not mutually exclusive. First, the relatively large cohort sizes in Italy and France may provide enough statistical power to detect deviations from Gompertzian dynamics, whereas smaller cohort sizes in Denmark, Sweden, and the Netherlands make this more difficult, even if underlying mortality dynamics in the smaller countries are non-Gompertzian. Although this possibility seems compelling, initial investigations revealed that artificially sampling successively smaller cohort sizes from the French data did not automatically lead to better performance by the Gompertz models but instead tended to eventually produce better performance by the Kannisto model (online appendix, section D). Thus, although sample size is clearly a factor in the ability to detect deviations from non-Gompertzian mortality dynamics, it seems unlikely that it is the only explanation for the country-specific patterns reported here.

Second, country-specific differences in the timing and rate of change of death rates at oldest ages may lead to different patterns of fit by country. In support of this view, the country groupings suggested by my analysis are consistent with previous studies of levels and changes in adult mortality in Europe. Other researchers have observed a contrast between southern European countries, such as France and Italy, which showed continual improvement in adult death rates since the middle of the twentieth century, and some northern European countries, including Denmark and the Netherlands, whose adult mortality gains have been less even, with periods of relative stagnation (Luy et al. 2011; Meslé and Vallin 2006; Rau et al. 2008). Staetsky (2009) argued that these contrasting patterns in female mortality change can be explained by differences in the timing and intensity of smoking, with women in Denmark and the Netherlands starting to smoke in earlier periods than their counterparts in Italy and France. Interestingly, Staetsky (2009) pointed out that the timing of smoking in Denmark and the Netherlands closely follows the trend in the United States, a country where Bebbington et al. (2014) and Gavrilov and Gavrilova (2011) found support for Gompertz patterns of old-age mortality. Thus, my country-specific results suggest that period effects may have altered the shape of cohort mortality, such that cohorts that experienced continual improvements (France and Italy) show the strongest evidence of non-Gompertz death rates.

Third, these country-specific differences in model fits could be the result of more fundamental differences in the mortality regime of each country. These differences could be produced by country-specific differences in genetic or cultural factors—such as patterns of diet, physical activity, and social support—that are known to be important predictors of adult mortality and are thus likely to be moderators of old-age mortality (Berkman and Syme 1979; Byberg et al. 2009; Gjonca et al. 2000; Haveman-Nies et al. 2003; Vaupel et al. 2003).

Finally, I examine patterns of model fit by sex. Across all cohorts, most models show similar performance for males and for females. The exceptions are the Gompertz, Makeham, and Kannisto models, which systematically tend to fit male cohorts better than female cohorts (Fig. 4). More generally, patterns of model fit vary more across countries than across sexes; this finding is somewhat surprising because adult mortality levels are often found to be more similar among cohorts of the same sex from different countries, as opposed to cohorts of different sexes from the same country (e.g., Luy et al. 2011: fig. 3.4). Understanding this finding is an important topic for future research.

## Conclusion

The results presented in this article point to suggestions for researchers who wish to model old-age mortality in the future. First, researchers who develop and fit mortality models should use a principled model selection technique like the AIC to assess model fit. Second, if a single model is to be used to model old-age mortality hazards, my results suggest that the log-quadratic model is a reasonable choice. Other models can also perform well, although my results show that the Kannisto model—which is often used in practice—should be chosen with caution. Finally, my results suggest that researchers should be cautious about designing mortality studies that use data from only one country, particularly if cohort sizes from the country tend to be small.

This study has several limitations, and the results suggest several directions for future research. Any study of old-age mortality is limited by the quantity and quality of data available. I restrict my analysis to cohorts in the Kannisto-Thatcher database that have been identified as having the highest-quality data, but it is possible that some inaccuracies remain. Severe data inaccuracies could affect model fits and thus my conclusions.

More generally, like many analyses of old-age mortality, I am restricted to analyzing groups of people that have been defined by political boundaries (i.e., countries). This approach is reasonable because people who live in the same country are governed by the same national policy regime; further, people who live in the same country may tend to be genetically similar and may tend to live in similar cultural environments. However, a great deal of variation can exist within a single country. Most models of mortality—even those models that explicitly account for heterogeneity—are most appropriate for one *coherent* population whose members face some continuous distribution of hazards. Large and diverse countries may well be better described as finite mixtures of coherent populations. Thus, it may be productive to construct data sets for cohorts whose members are most likely to share a common characteristic mortality experience, possibly with individual variation.

My design focuses on analyzing high-quality information on deaths and exposure for entire cohorts. Other designs based on collecting more detailed information from samples of people—such as longitudinal health surveys that link respondents to death certificate data—should also be used to assess the empirical support for different theories of old-age mortality. These designs have many advantages, as well as a different set of methodological and data quality challenges; thus, longitudinal surveys can be seen as an important complement to the analysis presented here.

Finally, I hope that this analysis can be a helpful starting point for building more complex models to understand and to forecast old-age mortality. Future work could investigate how to combine these theoretical models into predictive ensembles; researchers could also build a hierarchical model that describes how cohorts in the sample are related to one another. Either or both of these approaches could be used to stochastically forecast future old-age death rates.

## Acknowledgments

The author thanks Vladimir Canudas-Romo, Scott Lynch, Matthew J. Salganik, and three anonymous reviews for helpful comments on early drafts of the manuscript.

## Notes

^{1}

In this article, I use the terms “advanced ages” and “oldest ages” to refer to ages over 80.

^{2}

Although my focus is on old-age mortality, these models can also be applied to other age ranges.

^{3}

Jdanov et al. (2008) developed methods to quantify several different ways that age patterns of deaths and population counts can be irregular. The researchers then systematically applied their methods to each cohort in the K-T database, producing comparable indicators for data quality across cohorts. Finally, they summarized their results using hierarchical clustering, producing four tiers of data quality: best, acceptable, conditionally acceptable, and weak.

^{4}

I also use several R packages, including Daróczi and Tsegelskyi (2015), R Core Team (2014), Wickham (2009), Slowikowski and Irisson (2016), Schloerke et al. (2014), and Wickham et al. (2016).

^{5}

The distance metric used in Fig. A5 in the online appendix is group average difference in ΔAIC values. Hastie et al. (2009) has more details on hierarchical clustering.

^{6}

The figure omits the Weibull model, which consistently fit poorly across all countries (Fig. 4).

^{7}

Section C in the online appendix shows that this conclusion is robust to the choice of AIC, BIC, or cross-validation as the principled model fit technique.

^{8}

This finding that France and Italy do not fit Gompertzian old-age mortality patterns is confirmed even when other methods of assessing model fit, based on different statistical theories, are considered (online appendix, section C).

## References

*La forme de la courbe de mortalité des centenaires canadiens-français*[The shape of the mortality curve of Canadian-French centenarians]