## Abstract

Researchers using the Lee-Carter approach have often assumed that the time-varying index evolves linearly and that the parameters describing the age pattern of mortality decline are time-invariant. However, as several empirical studies suggest, the two assumptions do not seem to hold when the calibration window begins too early. This problem gives rise to the question of identifying the longest calibration window for which the two assumptions hold true. To address this question, we contribute a likelihood ratio–based sequential test to jointly test whether the two assumptions are satisfied. Consistent with the mortality structural changes observed in previous studies, our testing procedure indicates that the starting points of the optimal calibration windows for most populations fall between 1960 and 1990. Using an out-of-sample analysis, we demonstrate that in most cases, models that are fitted to the optimized calibration windows result in more accurate forecasts than models that are fitted to all available data or data beyond 1950. We further apply the proposed testing procedure to data over different age ranges. We find that the optimal calibration windows for age group 0–49 are generally shorter than those for age group 50–89, indicating that mortality at younger ages might have undergone (another) structural change in recent years.

## Introduction

The Lee-Carter model (Lee and Carter 1992) and its various extensions (Booth et al. 2002; Hyndman and Booth 2008; Hyndman et al. 2013; Hyndman and Ullah 2007) have been widely used for producing projections of future mortality rates, which are required when analyzing demographic changes, pension plans, and social security policies. Unlike its predecessors, the Lee-Carter model is a stochastic model, producing not only a central mortality projection but also measures of uncertainty reflecting how the realized mortality may deviate from the central projection. (For a review of the Lee-Carter model and some of its extensions, see Booth and Tickle 2008.)

The key idea behind the Lee-Carter model is that mortality rates at all ages are driven by a single time-varying index, which is often assumed to evolve linearly over time. A collection of response parameters is devised to capture the sensitivities to the time-varying index at different ages so that mortality rates at different ages are permitted to decline at different speeds. The response parameters are assumed to be fixed, which in turn means that the model assumes a time-invariant age pattern of mortality decline. Therefore, the validity of the Lee-Carter model depends heavily on the linearity of the time-varying index and the time-invariance of the age pattern of mortality decline. Nevertheless, several empirical studies have suggested that neither assumption holds true when the Lee-Carter model is fitted to an overly long calibration window. For instance, Booth et al. (2006) and Pitacco et al. (2009) asserted that mortality decline in several developed countries has accelerated in recent decades, indicating a violation of the linearity assumption; Kannisto et al. (1994) and Horiuchi and Wilmoth (1995) noted that the difference between the paces of mortality decline at younger and older ages has become more pronounced, providing evidence against a constant age pattern of mortality decline.

Researchers have demonstrated that different calibration windows may lead to very different mortality forecasts (see, e.g., Hatzopoulos and Haberman 2009; Lee and Miller 2001; Lundström and Qvist 2004). Hence, when using the Lee-Carter model, it is important to choose a calibration window within which the two mentioned assumptions are not violated. A calibration window that is too long may lead to an overly pessimistic forecast if mortality decline has accelerated in the recent past. The opposite may be true if the data within the calibration window incorporate the mortality decline arising from, for example, the breakthrough in treating tuberculosis (Barnes et al. 2011; Mori 2000), which is not so likely to recur. On the other hand, although the two assumptions are more likely to hold within a shorter calibration window, a calibration window that is too short may not contain sufficient information about the historical random deviations. The resulting model may therefore give inadequate provision for uncertainty. The dilemma can be visualized in Fig. 1, which shows the Lee-Carter forecasts of the log central death rates for the Spanish unisex population at age 85, based on three different calibration windows: 1950–2000, 1983–2000, and 1990–2000. The forecast is the least accurate when the longest calibration window is used, possibly because it does not sufficiently reflect the acceleration of mortality decline in the recent past. The shortest calibration window yields the most accurate forecast, but it also results in confidence intervals that seem to be too narrow and fail to capture some of the realized log central death rates. The remaining calibration window seems to be a reasonable compromise.

The question of optimizing the calibration window for the Lee-Carter model has been discussed widely in the literature. Most of the studies on this topic have focused on the assumption of a linear time-varying index. Optimization methods that are based on linearity tests for the time-varying index have been proposed. Booth et al. (2002) constructed a ratio to measure the loss of fit arising from the assumption of a perfectly linear time-varying index and proposed using the longest calibration window for which the ratio is reasonably small; Li et al. (2015) treated the length of the calibration window as a parameter and detected structural changes in the time-varying index by examining the posterior distribution of this additional parameter. Other optimization methods that are based on linearity tests have also been considered (see, e.g., Coelho and Nunes 2011; Li et al. 2011; Sweeting 2011; Van Berkum et al. 2016).

Fewer studies have examined the assumption of a time-invariant age pattern of mortality decline. Lee and Miller (2001) claimed that the assumption of a time-invariant age pattern of mortality decline works well in the second half of the twentieth century for several developed countries, stating that “A simple and satisfactory solution, adopted by Tuljapurkar et al. (2000), is to base the forecast on data since 1950, and to assume fixed bx over that range but not over the whole century.”1 Booth et al. (2002) demonstrated that the age pattern of mortality decline for Australians is roughly time-invariant over the calibration window identified using their proposed optimization method. More recently, Li et al. (2013) proposed an extension of the Lee-Carter model in which the age pattern of mortality decline begins to vary when the realized life expectancy reaches a predetermined threshold and ultimately becomes flat over the age range of 0–85. However, to our knowledge, no formal test for the assumption of a time-invariant age pattern of mortality decline has ever been proposed.

In this article, we attempt to fill the gap in the literature by proposing an optimization method that takes both underlying assumptions into account. To this end, we propose a sequential testing procedure, which is based on a log-likelihood ratio test for the joint hypothesis of a linear time-varying index and a time-invariant age pattern of mortality decline. In the sequential testing procedure, the log-likelihood ratio test is first applied to a very short calibration window, over which the joint hypothesis holds by default, and is then applied to successively longer calibration windows. The procedure continues until the largest calibration window satisfying the joint hypothesis is found. We apply the proposed testing procedure to mortality data from a large set of populations and find that the starting years of the optimal calibration windows for most populations lie within 1960 and 1990. Using an out-of-sample analysis, we demonstrate that in most cases, models that are estimated to the optimized calibration windows result in more accurate forecasts than models that are fitted to all available data or data beyond 1950. We further apply the proposed testing procedure to data over different age ranges. We find that the optimal calibration windows for age group 0–49 are frequently shorter than those for age group 50–89, indicating that mortality at younger ages might have undergone (another) structural change in recent years.

## The Lee-Carter Model

Let us define the following notation, which is used throughout the rest of this article.

• dx,t is the observed number of deaths at age x and in calendar year t;

• Ex,t is the number of exposures at age x and in calendar year t;

• mx,t = dx,t / Ex,t is the central death rate at age x and in calendar year t;

• Nx is the number of ages under consideration; and

• X is the collection of ages under consideration.

The Lee-Carter model can be expressed as
$\log m_{x,t} = a_x + b_x k_t + \varepsilon_{x,t},$
(1)

where ax is a parameter measuring the average level of mortality at age x, kt is a time-varying index capturing the overall level of mortality in year t, bx is a parameter quantifying the sensitivity of log mx,t to kt, and εx,t is the error term. Following Lee and Carter (1992), we stipulate parameter uniqueness by setting ax to the average of log mx,t over the calibration window, the average of kt over the calibration window to 0, and the sum of bx over all x ∈ X to 1.

Following Wilmoth (1993) and Brouhns et al. (2002), we estimate the parameters in Eq. (1) using maximum likelihood. Let Dx,t be the random number of deaths at age x and in year t. The estimation method assumes that for each x and t, Dx,t follows a Poisson distribution with a mean of
$\lambda_{x,t} = e^{a_x + b_x k_t} E_{x,t}.$
(2)
Suppose that the data sample under consideration ends in year T. We let IT(s) be the calibration window running from year Ts + 1 to T. When the model is fitted to IT(s), we have the following log-likelihood under the Poisson death count assumption:
$\ell\left(d_{[T-s+1:T]}, E_{[T-s+1:T]}; b, k\right) = \sum_{x \in X} \sum_{t \in I_T(s)} \left( d_{x,t} \log \lambda_{x,t} - \lambda_{x,t} - \log d_{x,t}! \right).$
(3)

In Eq. (3), d[T − s + 1:T] and E[T − s + 1:T] represent the collection of dx,t and Ex,t for all x ∈ X and t ∈ IT(s), respectively, whereas b and k are the vectors of bx and kt, respectively. By substituting Eq. (2) into Eq. (3) and by maximizing Eq. (3) with respect to b and k, we obtain the maximum likelihood estimates of b and k given the data over IT(s). We denote these ML estimates as bT,s and kT,s, respectively. To generate mortality forecasts, the evolution of kt has to be modeled by a time-series process.
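The maximization of Eq. (3) can be sketched with the unidimensional Newton updates commonly used for Poisson Lee-Carter fitting (in the spirit of Brouhns et al. 2002). The following is a minimal illustration under our own initialization and iteration settings, not the authors' code:

```python
# A minimal sketch of Poisson maximum likelihood estimation for the
# Lee-Carter model: a, k, and b are updated in turn by one-dimensional
# Newton steps, and the identifiability constraints (mean of k = 0,
# sum of b = 1) are re-imposed after each sweep. Names and settings
# are illustrative.
import numpy as np

def fit_lee_carter_poisson(d, E, n_iter=500):
    """d, E: (n_ages, n_years) arrays of death and exposure counts."""
    n_ages, n_years = d.shape
    a = np.log(d.sum(axis=1) / E.sum(axis=1))     # average log death rates
    b = np.full(n_ages, 1.0 / n_ages)
    k = np.linspace(1.0, -1.0, n_years)           # rough downward trend
    for _ in range(n_iter):
        lam = np.exp(a[:, None] + np.outer(b, k)) * E
        a += (d - lam).sum(axis=1) / lam.sum(axis=1)          # Newton step in a
        lam = np.exp(a[:, None] + np.outer(b, k)) * E
        k += ((d - lam) * b[:, None]).sum(axis=0) / \
             (lam * b[:, None] ** 2).sum(axis=0)              # Newton step in k
        lam = np.exp(a[:, None] + np.outer(b, k)) * E
        b += ((d - lam) * k[None, :]).sum(axis=1) / \
             (lam * k[None, :] ** 2).sum(axis=1)              # Newton step in b
        a += b * k.mean()                                     # re-impose the
        k = (k - k.mean()) * b.sum()                          # Lee-Carter
        b /= b.sum()                                          # constraints
    return a, b, k
```

With noise-free toy data generated from known parameters, the sweep recovers the parameters up to the stated constraints.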

Following Lee and Carter (1992), we use a random walk with drift:
$k_t = c + k_{t-1} + \omega_t, \qquad \omega_t \overset{\text{i.i.d.}}{\sim} N\left(0, \sigma_\omega^2\right),$
(4)
where c and $\sigma_\omega^2$ are the drift and (squared) volatility parameters, respectively. We let $\theta = \left(c, \sigma_\omega^2\right)'$ be the vector of the parameters in Eq. (4), and θT,s be the ML estimate of θ when Eq. (4) is fitted to $k^{T,s} = \left(k_{T-s+1}^{T,s}, \ldots, k_T^{T,s}\right)'$. We can obtain θT,s easily by maximizing the following log-likelihood with respect to c and $\sigma_\omega^2$:
$\ell\left(k^{T,s}; \theta\right) = -\frac{s}{2} \log \sigma_\omega^2 - \frac{1}{2\sigma_\omega^2} \sum_{t=T-s+2}^{T} \left(k_t^{T,s} - c - k_{t-1}^{T,s}\right)^2.$
(5)
For convenience, we let B = (b, θ) and use BT,s to represent the ML estimate of B given the data over IT(s). Using Eqs. (1) and (4) and ignoring the error term εx,t in Eq. (1),2 we can express the log central death rate in a future year T + μ as
$\log m_{x,T+\mu} = a_x + b_x \left(k_T + \mu c + \sum_{t=T+1}^{T+\mu} \omega_t\right) = \log m_{x,T} + b_x \left(\mu c + \sum_{t=T+1}^{T+\mu} \omega_t\right),$

for μ = 1, 2, . . . , which depends on B and log mx,T but no other model parameters. This means that BT,s and log mx,T contain all information needed for generating mortality forecasts. In other words, having obtained θT,s in BT,s, kT,s no longer plays a role in the forecasting procedure.
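For a fitted path of k, the estimation of Eq. (4) and the central forecast implied by the display above reduce to a few lines, because the MLEs of the drift and squared volatility are the sample mean and variance of the first differences of k. A sketch with illustrative function names:

```python
# A sketch of fitting the random walk with drift (Eq. (4)) and producing
# the central forecast log m_{x,T+mu} = log m_{x,T} + b_x * mu * c.
# Function names are illustrative, not from the article.
import numpy as np

def fit_rw_drift(k):
    """Closed-form MLEs of the drift c and squared volatility sigma^2."""
    dk = np.diff(k)                   # increments k_t - k_{t-1}
    c = dk.mean()
    sigma2 = ((dk - c) ** 2).mean()
    return c, sigma2

def central_forecast(log_m_T, b, c, horizon):
    """Central (omega_t = 0) forecast for mu = 1, ..., horizon."""
    mu = np.arange(1, horizon + 1)
    return log_m_T[:, None] + np.outer(b, mu * c)
```

Note that, as the text observes, only the drift, the volatility, b, and the jumping-off rates log mx,T enter the forecast; the fitted k path itself is no longer needed.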

## The Sequential Testing Procedure

### Defining Homogeneity

In this section, we introduce the sequential testing procedure that can be used to identify the optimal calibration window for the Lee-Carter model. The testing procedure is built on the concept of homogeneity. We say that the Lee-Carter model is homogeneous over a calibration window IT(s) if for all shorter calibration windows {IT(u), u = s, s − 1, . . .}, b does not depend on u (i.e., b is time-invariant) and k is linear with a gradient and volatility that do not depend on u.

We define the optimal calibration window as the longest calibration window over which the Lee-Carter model is homogeneous. The sequential testing procedure identifies such a calibration window as follows. Let smin and smax be the shortest and longest possible lengths of the calibration window for the model, respectively. By default, IT(smin) is small enough that homogeneity must hold. The testing procedure iteratively tests whether the Lee-Carter model is homogeneous over calibration windows IT(smin + 1), IT(smin + 2), . . . We construct a joint log-likelihood ratio test for this purpose. The procedure stops when homogeneity no longer holds or when IT(smax) is reached, whichever comes first. In practice, smax can be set to the length of the entire data sample or an arbitrarily large integer.

In what follows, we first introduce the test statistic for the joint log-likelihood ratio test. Then we explain in further detail how the sequential testing procedure is implemented. Finally, we conclude this section with a description of how the critical values of the joint log-likelihood ratio test are calibrated.

### The Test Statistic

Let us consider an arbitrary calibration window IT(s), where smin < s ≤ smax. If the Lee-Carter model is homogeneous over IT(s), then the following two properties should hold for any u, where smin ≤ u < s.

• Property 1: Let kT,s(bT,u) be the constrained estimate of k, obtained by maximizing Eq. (3) with respect to k over IT(s) given the constraint b = bT,u. If homogeneity holds over IT(s), then b is time-invariant, and thus kT,s(bT,u) should be close to kT,s.3

• Property 2: If homogeneity holds over IT(s), then k is linear with the same gradient and volatility for any IT(u) with smin ≤ u < s. Consequently, θT,u and θT,s should provide a similar fit to kT,s because θT,u and θT,s are estimates of the same (unknown) parameter vector θ.

It follows from these two properties that if homogeneity holds over IT(s), then the fit of θT,u to kT,s(bT,u) (a constrained fit) should be similar to the fit of θT,s to kT,s (an unconstrained fit). Mathematically, if homogeneity holds over IT(s), then the log-likelihood ratio,
$LR\left(I_T(s), B^{T,s}, B^{T,u}\right) = \left|\ell\left(k^{T,s}; \theta^{T,s}\right) - \ell\left(k^{T,s}\left(b^{T,u}\right); \theta^{T,u}\right)\right|^{1/2},$
(6)

where the log-likelihoods are calculated using Eq. (5), should be small (but not necessarily zero given the randomness in the data sample) for all smin ≤ u < s. We use the log-likelihood ratio as the test statistic for the homogeneity hypothesis. More specifically, if for some smin ≤ u < s the log-likelihood ratio LR(IT(s), BT,s, BT,u) is too large compared with a precalibrated critical value, then we reject the null hypothesis that the Lee-Carter model is homogeneous over IT(s).

It is important to recognize that the proposed log-likelihood ratio test is a joint test for the time-invariance of b and the linearity of k. This is because a violation of either one or both of these Lee-Carter assumptions would result in a violation of Property 1, Property 2, or both. Any one of these three situations would lead to a large log-likelihood ratio and consequently a better chance of rejecting the (joint) null hypothesis.

### Implementing the Test Procedure

We implement the sequential testing procedure as follows. Suppose that the Lee-Carter model is homogeneous over IT(s − 1). We test whether homogeneity still holds over IT(s) with the following log-likelihood ratio test statistic:
$LR\left(I_T(s), B^{T,s}, B^{T,s-1}\right).$

If the test statistic exceeds the critical value ξs, then the homogeneity hypothesis is rejected. In this case, we regard IT(s − 1) as the optimal calibration window (the longest calibration window over which the Lee-Carter model is homogeneous) and BT,s − 1 as the optimal estimate of B. Otherwise, we do not reject the homogeneity hypothesis and repeat the test for the next larger calibration window. Noting that homogeneity holds over the smallest calibration window IT(smin) by default, the procedure begins at s = smin + 1. The procedure ends when the first rejection of the homogeneity hypothesis occurs or when s reaches smax, whichever comes first.
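The procedure just described can be summarized in a short loop. In this sketch, `fit_window` and `lr_stat` are placeholders (our own names) for the Poisson fit over IT(s) and the statistic in Eq. (6):

```python
# A sketch of the sequential testing loop: enlarge the calibration window
# one year at a time and stop at the first rejection. `fit_window(s)` and
# `lr_stat(s, B_new, B_old)` are placeholders for the Poisson fit over
# I_T(s) and the log-likelihood ratio statistic.
def optimal_window(fit_window, lr_stat, xi, s_min, s_max):
    """xi: dict mapping window length s to its critical value xi_s."""
    B_prev = fit_window(s_min)            # homogeneity holds by default
    for s in range(s_min + 1, s_max + 1):
        B_s = fit_window(s)
        if lr_stat(s, B_s, B_prev) > xi[s]:
            return s - 1, B_prev          # first rejection: stop here
        B_prev = B_s
    return s_max, B_prev                  # homogeneity never rejected
```

If, say, the first rejection occurs at s = 15, the loop returns the window length 14 together with the estimate fitted over IT(14).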

### Calibrating the Critical Values

Calibrating the critical values is not a straightforward task. As Chen et al. (2010) and Chen and Niu (2014) noted, the sampling distribution of the log-likelihood ratio in Eq. (6) is not known, even asymptotically. To overcome this challenge, we use a Monte Carlo experiment, which entails three major components: (1) the generation of random pseudo-samples, (2) the calculation of the benchmark expected log-likelihood ratio, and (3) the iterative solution of the critical values from the benchmark expected log-likelihood ratio. These components are explained in the rest of this subsection. We provide only the information necessary for the application in question. See Chen et al. (2010) and Chen and Niu (2014) for the complete theory.

#### Generation of Random Pseudo-Samples

The calibration of critical values requires a large number (say, Ns) of random pseudo-samples of death counts. These random pseudo-samples are generated using the following procedure.

1. Simulate Ns realizations of $k_{T-s_{\max}+1}, \ldots, k_T$ from the following random walk with drift:

$k_t^{(j)} = \begin{cases} c^* + k_{T-s_{\max}}^* + \omega_t^{(j)}, & t = T - s_{\max} + 1, \\ c^* + k_{t-1}^{(j)} + \omega_t^{(j)}, & t = T - s_{\max} + 2, \ldots, T, \end{cases}$

where $\omega_t^{(j)} \overset{\text{i.i.d.}}{\sim} N\left(0, \sigma_\omega^{*2}\right)$ for t = T − smax + 1, . . . , T and j = 1, . . . , Ns. We use $k^{(j)} = \left(k_{T-s_{\max}+1}^{(j)}, \ldots, k_T^{(j)}\right)'$ to represent the jth set of simulated kt.

2. Given k(j), simulate the death count for each x ∈ X and t ∈ IT(smax) from a Poisson distribution with a mean of $e^{a_x^* + b_x^* k_t^{(j)}} E_{x,t}^*$. We use dx,t(j) to represent the jth simulated death count for age x and year t, and $d_{[T-s+1:T]}^{(j)}$ to represent the collection of dx,t(j) over IT(s).

This procedure depends on a collection of hypothetical Lee-Carter parameters, $a_x^*$, $b_x^*$, $c^*$, $\sigma_\omega^{*2}$, and $k_{T-s_{\max}}^*$, as well as some hypothetical (constant) exposure counts, $E_{x,t}^*$. We use b* as shorthand for the vector of $b_x^*$; we use $E_{[T-s+1:T]}^*$ to represent the collection of $E_{x,t}^*$ over IT(s); and we let $\theta^* = \left(c^*, \sigma_\omega^{*2}\right)'$ and $B^* = \left(\theta^{*\prime}, b^{*\prime}\right)'$.
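Steps 1 and 2 can be sketched as follows. In the article, the hypothetical parameters come from a Lee-Carter fit to U.S. data; everything below (names and any values used) is an illustrative stand-in:

```python
# A sketch of the pseudo-sample generator: simulate k from a random walk
# with drift started at k*_{T-smax}, then draw Poisson death counts with
# means exp(a* + b* k_t) * E*. All names are illustrative.
import numpy as np

def simulate_pseudo_sample(a_star, b_star, c_star, sigma2_star,
                           k_start, E_star, rng):
    n_ages, n_years = E_star.shape
    omega = rng.normal(0.0, np.sqrt(sigma2_star), n_years)
    k = k_start + np.cumsum(c_star + omega)      # Step 1: simulate k_t^(j)
    lam = np.exp(a_star[:, None] + np.outer(b_star, k)) * E_star
    return rng.poisson(lam), k                   # Step 2: simulate d_{x,t}^(j)
```

Calling this function Ns times (with a shared random generator) yields the Ns pseudo-samples used in the calibration.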

We obtain the hypothetical parameters by fitting the Lee-Carter model to the mortality data from the U.S. unisex population aged 0–89 in 1976–2010. When generating the Ns = 3,000 random pseudo-samples, we assume smin = 10, smax = 70, and T = 2011. Accordingly, we use the exposure counts for the U.S. unisex population aged 0–89 for the period 1941–2010 as the collection of hypothetical exposure counts.

Note that the hypothetical parameters and exposure counts used in the simulation procedure do not necessarily need to come from the population to which the test for homogeneity is applied. In the upcoming sections on the optimal calibration windows and out-of-sample forecasting performance, the same collection of critical values (based on the same hypothetical parameters and exposure counts) is used to test the homogeneity hypothesis for every population under consideration. In Online Resource 1, we examine the sensitivity of the critical values to the hypothetical parameters and exposure counts, and the results indicate that the sensitivity is minimal.

#### Calculation of the Benchmark Expected Log-Likelihood Ratio

For all Ns pseudo-samples, homogeneity holds over IT(smin),  . . . , IT(smax) because the death counts are consistently generated from the same collection of hypothetical Lee-Carter parameters and exposure counts. Thus, we can define the benchmark expected log-likelihood ratio based on the largest calibration window IT(smax) as follows:
$\Re = E\left[\left|\ell\left(k^{T,s_{\max}}; \theta^{T,s_{\max}}\right) - \ell\left(k^{T,s_{\max}}\left(b^*\right); \theta^*\right)\right|^{1/2}\right] = \frac{1}{N_s} \sum_{j=1}^{N_s} \left|\ell\left(k^{T,s_{\max},(j)}; \theta^{T,s_{\max},(j)}\right) - \ell\left(k^{T,s_{\max}}\left(b^*; j\right); \theta^*\right)\right|^{1/2},$

where the log-likelihoods are calculated using Eq. (5). In the benchmark expected log-likelihood ratio,

• $k^{T,s_{\max},(j)}$ is the (unconstrained) ML estimate of k(j) when Eq. (1) is fitted to the jth set of simulated death counts $d_{[T-s_{\max}+1:T]}^{(j)}$ and the hypothetical exposures $E_{[T-s_{\max}+1:T]}^*$.

• $\theta^{T,s_{\max},(j)}$ is the (unconstrained) ML estimate of θ* when Eq. (4) is fitted to $k^{T,s_{\max},(j)}$.

• $k^{T,s_{\max}}\left(b^*; j\right)$ is the constrained ML estimate of k(j) when Eq. (1) is fitted to $d_{[T-s_{\max}+1:T]}^{(j)}$ and $E_{[T-s_{\max}+1:T]}^*$; it is obtained by maximizing the log-likelihood in Eq. (3) with respect to k given the constraint b = b*.

It follows that $\ell\left(k^{T,s_{\max},(j)}; \theta^{T,s_{\max},(j)}\right)$ measures the fit of Eq. (4) when no constraint is applied (an unconstrained fit), whereas $\ell\left(k^{T,s_{\max}}\left(b^*; j\right); \theta^*\right)$ measures the fit of Eq. (4) when the parameter set B is fixed to the true (hypothetical) parameter set B* (a constrained fit). The benchmark expected log-likelihood ratio ℜ thus measures the expected difference between the constrained and unconstrained fits of Eq. (4) that is due entirely to sampling errors (rather than any violation of homogeneity). Of course, when homogeneity is violated, the expected difference between the constrained and unconstrained fits of Eq. (4) over IT(smax) would be greater than what is measured by ℜ. As such, we can set the critical values by making reference to ℜ.

#### Solving for the Critical Values From the Benchmark Expected Log-Likelihood Ratio

We are now ready to explain how the critical values are obtained. First, let us define the following:
$\tilde{B}^{T,q}\left(\xi_{s_{\min}}, \ldots, \xi_s; j\right) = \left(\tilde{b}^{T,q}\left(\xi_{s_{\min}}, \ldots, \xi_s; j\right)', \tilde{\theta}^{T,q}\left(\xi_{s_{\min}}, \ldots, \xi_s; j\right)'\right)', \quad q = s, s+1, \ldots, s_{\max},$

which we call the homogeneous estimator for the jth pseudo-sample. The argument $\left(\xi_{s_{\min}}, \ldots, \xi_s; j\right)$ emphasizes the dependence of the homogeneous estimator on the critical values. This dependence will become clearer as we explain how the values of the homogeneous estimators are calculated. Intuitively, $\tilde{B}^{T,q}\left(\xi_{s_{\min}}, \ldots, \xi_s; j\right)$ can be regarded as the best possible estimate of B* given the data from the jth pseudo-sample over IT(q) and the critical values $\xi_{s_{\min}}, \ldots, \xi_s$, in the sense that it is based on the longest calibration window over which homogeneity is deemed to hold according to $\xi_{s_{\min}}, \ldots, \xi_s$.

For s = smin, we set $\tilde{B}^{T,q}\left(\xi_{s_{\min}}; j\right) = B^{T,q,(j)}$ for each j and q = smin, . . . , smax; that is, for each pseudo-sample, the homogeneous estimators are set equal to the corresponding ML estimates. Also, we set $\xi_{s_{\min}} = \infty$ because homogeneity holds on IT(smin) by default.

For s = smin + 1, given a fixed value of $\xi_{s_{\min}+1}$, we calculate the values of the homogeneous estimators with the following algorithm.

1. For each j = 1, . . . , Ns, compute the value of

$LR\left(I_T(s_{\min}+1), B^{T,s_{\min}+1,(j)}, \tilde{B}^{T,s_{\min}}\left(\xi_{s_{\min}}; j\right)\right),$

using Eq. (6). This step is feasible because we know that $\tilde{B}^{T,s_{\min}}\left(\xi_{s_{\min}}; j\right) = B^{T,s_{\min},(j)}$.

2. For the jth pseudo-sample, if

$LR\left(I_T(s_{\min}+1), B^{T,s_{\min}+1,(j)}, \tilde{B}^{T,s_{\min}}\left(\xi_{s_{\min}}; j\right)\right) \le \xi_{s_{\min}+1},$
then the unconstrained fit (based entirely on the data over IT(smin + 1)) can be regarded as similar to the constrained fit (with the homogeneous estimator for s = smin being the constraint). In other words, based on the given critical value $\xi_{s_{\min}+1}$, homogeneity over the longer calibration window IT(smin + 1) is not violated. Hence, we update the homogeneous estimator for s = q = smin + 1 by setting it to the ML estimate over IT(smin + 1); that is, we set $\tilde{B}^{T,s_{\min}+1}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}; j\right) = B^{T,s_{\min}+1,(j)}$. Moreover, because the critical values beyond s = smin + 1 are not given in this step, we set $\tilde{B}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}; j\right) = B^{T,q,(j)}$ for every q = smin + 2, . . . , smax because this is the best we can do given $\xi_{s_{\min}}$ and $\xi_{s_{\min}+1}$ only. Overall, the update in this step results in a zero log-likelihood ratio:
$LR\left(I_T(q), B^{T,q,(j)}, \tilde{B}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}; j\right)\right) = 0,$

for q = smin + 1, smin + 2, . . . , smax.

3. Conversely, for the jth pseudo-sample, if

$LR\left(I_T(s_{\min}+1), B^{T,s_{\min}+1,(j)}, \tilde{B}^{T,s_{\min}}\left(\xi_{s_{\min}}; j\right)\right) > \xi_{s_{\min}+1},$
then the difference between the constrained and unconstrained fits can be regarded as too large; hence, based on $\xi_{s_{\min}+1}$, homogeneity no longer holds over IT(smin + 1). In this case, we do not update the homogeneous estimator; that is, we set $\tilde{B}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}; j\right) = \tilde{B}^{T,s_{\min}}\left(\xi_{s_{\min}}; j\right)$ for every q = smin + 1, . . . , smax. The log-likelihood ratio for q = smin + 1, . . . , smax then becomes
$LR\left(I_T(q), B^{T,q,(j)}, \tilde{B}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}; j\right)\right) = LR\left(I_T(q), B^{T,q,(j)}, \tilde{B}^{T,s_{\min}}\left(\xi_{s_{\min}}; j\right)\right).$
In this algorithm, the critical value $\xi_{s_{\min}+1}$ is the only unknown. It can be neither too large nor too small. If $\xi_{s_{\min}+1}$ is too small, then the homogeneity hypothesis may be rejected even when homogeneity holds. On the other hand, if $\xi_{s_{\min}+1}$ is too large, then the homogeneity hypothesis may not be rejected even if homogeneity does not hold, making the power of the test too weak. In the extreme case of $\xi_{s_{\min}+1} = \infty$, the homogeneity hypothesis would never be rejected regardless of whether homogeneity holds. As discussed earlier, we choose $\xi_{s_{\min}+1}$ by making reference to the benchmark expected log-likelihood ratio ℜ. In particular, the chosen value of $\xi_{s_{\min}+1}$ is the minimum value that satisfies all the following inequalities:
$E\left[LR\left(I_T(q), B^{T,q}, \tilde{B}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}\right)\right)\right] = \frac{1}{N_s} \sum_{j=1}^{N_s} LR\left(I_T(q), B^{T,q,(j)}, \tilde{B}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}; j\right)\right) \le \frac{1}{s_{\max} - s_{\min}} \Re, \quad q = s_{\min}+1, \ldots, s_{\max}.$

The coefficient 1/(smax − smin) is included to correct for the uncertainty in estimation caused by the sequential testing procedure (see Chen et al. 2010). With $\xi_{s_{\min}+1}$ having been determined, the values of the homogeneous estimators for s = smin + 1, q = smin + 1, . . . , smax, and all j = 1, . . . , Ns can be finalized.

We are now ready to move on to s = smin + 2. Given the values of the homogeneous estimators for s = smin + 1, we can readily calculate the value of
$LR\left(I_T(s_{\min}+2), B^{T,s_{\min}+2,(j)}, \tilde{B}^{T,s_{\min}+1}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}; j\right)\right)$
for each j = 1, . . . , Ns. Also, for a fixed value of $\xi_{s_{\min}+2}$, we can determine for s = smin + 2 the homogeneous estimator,
$\tilde{B}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}, \xi_{s_{\min}+2}; j\right),$
and the log-likelihood ratio,
$LR\left(I_T(q), B^{T,q,(j)}, \tilde{B}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}, \xi_{s_{\min}+2}; j\right)\right),$
for every q = smin + 2, . . . , smax and j = 1, . . . , Ns, in a manner similar to that described in Steps 2 and 3 of the algorithm for s = smin + 1. Finally, the chosen value of $\xi_{s_{\min}+2}$ is the minimum value that satisfies all the following inequalities:
$E\left[LR\left(I_T(q), B^{T,q}, \tilde{B}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}, \xi_{s_{\min}+2}\right)\right)\right] = \frac{1}{N_s} \sum_{j=1}^{N_s} LR\left(I_T(q), B^{T,q,(j)}, \tilde{B}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}, \xi_{s_{\min}+2}; j\right)\right) \le \frac{2}{s_{\max} - s_{\min}} \Re, \quad q = s_{\min}+2, \ldots, s_{\max}.$

The subsequent critical values $\xi_{s_{\min}+3}, \ldots, \xi_{s_{\max}}$ are determined in a similar fashion.
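The search for each minimum ξ can exploit a monotonicity: raising ξ sends more pseudo-samples through Step 2 (their ratios become zero), so the sample average of the ratios is a nonincreasing step function of ξ. The following simplified sketch, with our own function name, finds the smallest ξ for a single q, whereas the article imposes the bound simultaneously for all q ≥ s:

```python
# A simplified sketch of the critical-value search for one s and one q:
# with critical value xi, a pseudo-sample's ratio becomes 0 if LR <= xi
# (Step 2) and stays at LR otherwise (Step 3), so the average ratio is a
# nonincreasing step function of xi; scanning the sorted ratios yields
# the smallest xi meeting the bound. (The article requires the bound to
# hold for every q >= s; this sketch handles a single q.)
import numpy as np

def smallest_xi(lr_values, bound):
    lr = np.sort(np.asarray(lr_values, dtype=float))
    n = lr.size
    # tail[i] = sum of the ratios that survive when lr[:i] are zeroed out
    tail = np.concatenate([np.cumsum(lr[::-1])[::-1], [0.0]])
    if tail[0] / n <= bound:
        return 0.0                      # even xi = 0 satisfies the bound
    for i in range(n):
        if tail[i + 1] / n <= bound:
            return lr[i]                # zeroing lr[: i + 1] is enough
    return np.inf                       # unreachable for bound >= 0
```

For example, with ratios {1, 2, 10} and bound 4, setting ξ = 1 zeroes the first ratio and leaves an average of (2 + 10)/3 = 4, which just meets the bound.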

## Application to Mortality Data

We now apply the sequential testing procedure to mortality data from 34 unisex populations. The data we use are crude central death rates and exposure counts by single year of age from 0 to 89 and are downloaded from the Human Mortality Database.4 For each population, we set smin to 10, smax to 70, and T to the ending year of the population’s data set.

### The Optimal Calibration Windows

Table 1 shows the beginning years of the optimal calibration windows for all populations under consideration. All optimal calibration windows begin after 1950, and most of the optimal calibration windows begin some time between 1960 and 1990. Our results echo the conclusion from previous findings that post-1950 data better represent current mortality patterns than do pre-1950 data (Lee and Miller 2001; Pitacco et al. 2009). Among the 34 populations, two have very short optimal calibration windows: Ukraine (16 years) and Iceland (12 years). This outcome may be attributed to the very small exposure and death counts for these two populations. During the year when their data samples end, Ukraine had around 4.57 million exposures and 67,000 deaths, whereas Iceland had 32,000 exposures and 1,787 deaths. The outcome also reveals a potential caveat of our proposed method: when the volume of data is low, the parameter estimates are subject to excessive sampling errors, and consequently the log-likelihood ratio test might incorrectly regard a change in b and/or k due to sampling errors as a structural change.

Next, we examine how the estimates of b and k might change if the Lee-Carter model is fitted to suboptimal calibration windows. Figure 2 displays the estimates of k for the U.S. and Japanese unisex populations, based on the optimal calibration window and the longest possible calibration window.5 For both populations, the estimated k over the optimal calibration window is highly linear, but the linearity is not preserved over the longer (suboptimal) calibration windows.6 For the U.S. population, we observe two structural breakpoints at around 1955 and 1970. For the Japanese population, we find no obvious structural breakpoint, but the reduction in kt is somewhat slower over the recent decades.

Figure 3 depicts the estimates of b for the U.S. and Japanese unisex populations, based on the optimal calibration window, the longest possible calibration window, and the calibration window that is 15 years shorter than the optimal one. Compared to the optimal calibration window, the longest possible calibration window yields higher values of bx for ages 20 to 50. This observation may be explained by the remarkable reduction in infectious disease mortality from the 1940s to the 1970s. Armstrong et al. (1999) demonstrated that for individuals aged 20–44 in the United States, mortality arising from nine infectious diseases (pneumonia and influenza, tuberculosis, diphtheria, pertussis, measles, typhoid fever, dysentery, syphilis, and AIDS) declined substantially over the period 1930 to 1965. As a specific example, let us consider tuberculosis mortality. For the United States, the percentage of total deaths caused by tuberculosis decreased from 4.33 % in 1940 to only 0.27 % in 1970.7 For Japan, this proportion declined from 12.24 % in 1947 to 1.54 % in 1975 (Japan International Cooperation Agency 2005: chapter 5). Because tuberculosis mostly affects working-age adults (WHO n.d.), the drastic reduction in tuberculosis deaths contributed significantly to the decline in mortality for ages 20–50 over the period 1940–1970. Therefore, if the data prior to 1970 are excluded, the decrease in working-age mortality due to the breakthrough in treating tuberculosis (and other infectious diseases) is no longer reflected in the model, leading to lower estimates of bx for ages 20–50. We also observe that the longest possible calibration window results in lower values of bx for ages 50–90 compared with the optimal calibration window. This observation is in line with the acceleration of the decline in old-age mortality in recent decades (see, e.g., Li et al. 2013).

Compared with the optimal calibration window, the calibration window that is 15 years shorter yields estimated bx that exhibit a similar pattern but more sampling variation. The increased sampling variation is expected given that the number of data points on which the parameter estimates are based is smaller.

For the U.S. population, the shorter calibration window results in higher estimated bx for ages 30–40 but lower estimated bx for ages 40–60, although these differences are not deemed as significant by the sequential testing procedure. The higher estimated bx for ages 30–40 may be attributed to the cohort effect of smoking prevalence. As reported by Anderson et al. (2012) and the U.S. Department of Health and Human Services (1991: chapter 3), in the United States, the cohorts born during 1921–1930 (males) and 1931–1940 (females) have the largest proportion of ever-smokers. In the first quinquennium (1967–1971) of the optimal sample period, these cohorts were aged 37–41, on average. The high smoking prevalence among these cohorts might have slowed the mortality decline for the 30–40 age group during the period 1967–1971, and therefore excluding data over this period would result in higher estimated bx for ages 30–40.8 On the other hand, the lower estimated bx for ages 40–60 might be due to the increase in mortality rates among non-Hispanic whites from 1999 to 2013. See Case and Deaton (2015) for a discussion of the possible reasons behind such an increase in mortality.

### Out-of-Sample Forecasting Performance

In this subsection, we examine how forecasting performance may vary if a longer than optimal calibration window is used. We consider two alternative calibration windows, which begin in (1) 1950 and (2) either 1900 or the year in which the population’s data set begins, whichever is later.9

We apply the sequential testing procedure and estimate the model parameters based on data up to 2000 only, regardless of when the calibration window begins. The estimated model is then used to forecast the post-2000 mortality, and the data beyond 2000 are used to evaluate the model’s forecasting performance. We measure forecasting performance by the mean absolute percentage error (MAPE), defined as
$\mathrm{MAPE}=\frac{1}{N_x}\frac{1}{\mu}\sum_{x\in\mathcal{X}}\sum_{t=2001}^{2000+\mu}\left|\frac{\log m_{x,t}-\log\hat{m}_{x,t}}{\log m_{x,t}}\right|,$

where μ is the forecast horizon, and $\log m_{x,t}$ and $\log \hat{m}_{x,t}$ are the observed and forecasted log central death rates, respectively. For each population, we determine μ in such a way that 2000 + μ is the ending year of the population’s entire data sample. A smaller value of the MAPE indicates better forecasting performance.
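For concreteness, the MAPE computation can be sketched as follows. This is a minimal illustration, not the authors' code; the array shapes and the numbers in the example are hypothetical, and only the error measure itself follows the definition above.

```python
import numpy as np

def mape(log_m_obs, log_m_fcst):
    """MAPE between observed and forecasted log central death rates.

    Both arguments are arrays of shape (N_x, mu): one row per age x in X,
    one column per forecast year t = 2001, ..., 2000 + mu.
    """
    rel_err = np.abs((log_m_obs - log_m_fcst) / log_m_obs)
    # The mean over all N_x * mu entries equals (1/N_x)(1/mu) times the
    # double sum in the definition.
    return rel_err.mean()

# Hypothetical example: one age, two forecast years.
obs = np.array([[-4.0, -5.0]])   # observed log m_{x,t}
fcst = np.array([[-4.4, -4.5]])  # forecasted log m_{x,t}
err = mape(obs, fcst)            # relative error of 10 % in each year
```

A perfect forecast gives a MAPE of zero, and the measure weights every age-year cell equally, so large relative errors at a few ages can dominate the average.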

Table 2 compares the MAPEs resulting from the three different calibration windows. It includes only 18 populations because populations whose data samples begin later than 1950 must be excluded from this evaluation. For 15 of the 18 populations, the optimal calibration window results in the best out-of-sample forecasting performance, supporting the use of the proposed sequential testing procedure over the two arbitrary beginning points.

The three exceptions (Portugal, Finland, and Spain) merit closer scrutiny. For these three populations, the calibration window leading to the smallest MAPE begins in either 1940 (Portugal) or 1950 (Finland and Spain). Figure 4 shows the actual log central death rates, averaged over the age range 0–89, from 1940/1950 to 2000 + μ. Also shown in Fig. 4 are the corresponding forecasted values from 2001 to 2000 + μ, based on the calibration window optimized by the sequential testing procedure and the calibration window yielding the smallest MAPE.10 For all three populations, the evolution of the average log central death rates is highly nonlinear over the longer calibration window, which is deemed suboptimal by the sequential testing procedure but yields a smaller MAPE. Specifically, the reduction in mortality was fast in the early years of the longer calibration window (1940 to mid-1950s for Portugal; 1950 to 1980 for Finland and Spain), slowed for certain decades (1955 to mid-1990s for Portugal; 1980 to mid-1990s for Finland and Spain), and then accelerated. With data up to 2000 only, the sequential testing procedure may not have enough information to detect the potential structural break in the mid-1990s; the resulting optimal calibration windows therefore include the period of slower mortality decline, leading to an underestimation of the latest mortality decline. On the other hand, because the longer calibration windows begin with a period of rapid mortality decline, they coincidentally produce more accurate out-of-sample forecasts.

When evaluating the out-of-sample forecasting performance, the sequential testing procedure is applied to data up to 2000 only. Thus, for some populations, the optimal calibration windows identified here are different from those found earlier. For example, the optimal calibration window for the Finnish unisex population begins in 1993 if all data are considered, and in 1975 if the post-2000 data are excluded. The exclusion of the post-2000 data may lead to an optimal calibration window that begins earlier, possibly because multiple structural changes may exist, and the sequential testing procedure may not have sufficient information to detect the latest possible structural change when the most recent data are excluded.

### A Comparison With Booth’s Method

Booth et al. (2002) proposed a method (hereafter termed Booth’s method) to determine the longest calibration window over which the linearity of k holds. In what follows, we briefly describe Booth’s method and compare it with our proposed sequential testing method.

Booth’s method is built on the deviance of the Lee-Carter model. For a Lee-Carter model fitted to data over IT(s), the deviance can be calculated as
$\mathrm{deviance}_{\mathrm{base}}(s)=2\sum_{t\in I_T(s)}\sum_{x\in\mathcal{X}}\left[d_{x,t}\log\left(\frac{d_{x,t}}{\lambda_{x,t}}\right)-\left(d_{x,t}-\lambda_{x,t}\right)\right],$
(7)

where λx,t is defined in Eq. (2), and ax, bx, and kt in λx,t are set to their respective ML estimates. This deviance measure assesses the model’s lack of fit: the higher the deviance, the greater the lack of fit. Booth et al. (2002) specifically referred to Eq. (7) as “the base lack of fit.”

To measure how much additional lack of fit is created when k is perfectly linear, Booth et al. (2002) defined another measure referred to as “the total lack of fit,” denoted by deviancetotal(s). The definition of deviancetotal(s) is the same as that of deviancebase(s), except that the ML estimates of kt are replaced with kt that lie perfectly on a straight line that has a gradient of c (the drift term in Eq. (4)) and passes through the average of the (revised) kt at the midpoint of IT(s).

In Booth’s method, the optimal calibration window is identified by making reference to the ratio of the total lack of fit to the base lack of fit (corrected for the degrees of freedom). For IT(s), such a ratio is calculated as follows:
$R(s)=\frac{\mathrm{deviance}_{\mathrm{total}}(s)/\left[N_x(s-2)\right]}{\mathrm{deviance}_{\mathrm{base}}(s)/\left[\left(N_x-1\right)(s-2)\right]}.$

The ratio is calculated for all possible values of s. If R(s′ + 1) is substantially larger than R(s′) for some s′, then IT(s′) is considered the optimal calibration window, because the linearity in k results in too much additional lack of fit as the calibration window is extended from IT(s′) to IT(s′ + 1).
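The two deviance measures and the ratio R(s) can be sketched numerically as below. This is a minimal illustration, not the authors' code: the Poisson deviance follows Eq. (7), while the degrees-of-freedom corrections, Nx(s − 2) for the total fit and (Nx − 1)(s − 2) for the base fit over a window of s years and Nx ages, are assumptions based on the parameter counts of the constrained and unconstrained fits and should be checked against Booth et al. (2002).

```python
import numpy as np

def deviance(d, lam):
    """Poisson deviance of Eq. (7): observed death counts d against
    fitted means lam (= lambda_{x,t}), both arrays of shape (N_x, s)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        # d * log(d / lam), with the convention 0 * log(0) = 0.
        log_term = np.where(d > 0, d * np.log(d / lam), 0.0)
    return 2.0 * np.sum(log_term - (d - lam))

def booth_ratio(dev_total, dev_base, n_ages, s):
    """R(s): total-to-base lack of fit, each corrected for degrees of
    freedom. The df terms N_x(s - 2) and (N_x - 1)(s - 2) are our
    assumption, not taken verbatim from Booth et al. (2002)."""
    df_total = n_ages * (s - 2)
    df_base = (n_ages - 1) * (s - 2)
    return (dev_total / df_total) / (dev_base / df_base)
```

In use, one would compute deviance twice per window, once with the ML estimates of kt and once with the perfectly linear kt, and then scan R(s) across s for an abrupt upward jump, which flags a window over which linearity in k becomes costly.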

Our proposed method has two distinct features that make it stand out from Booth’s. First, our sequential testing procedure is based on a joint test of the linearity of k and the time-invariance of b. Although Booth et al. (2002) found that the time-invariance of b holds better over the calibration window they selected, they did not explicitly include the time-invariance of b in their selection procedure. Second, our sequential testing procedure is based on a series of objectively determined critical values; by contrast, Booth’s method is based on a subjectively (and visually) chosen threshold for the values of R(s), and for this reason, it is not always easy to identify the optimal calibration window with Booth’s method. To illustrate this potential problem, in Fig. 5, we show two series of R(s) that are, respectively, calculated using data from the Belgian and UK unisex populations aged 0–89. Based on the R(s) series, it seems optimal to begin the calibration window for the Belgian unisex population in 1968. However, the conclusion for the UK unisex population is unclear because of the pattern and volatility of its R(s) series.

### Application to Narrower Age Ranges

In this subsection, we examine how the test results may change when the test is applied to age ranges narrower than 0–89. We consider two nonoverlapping age ranges: 0–49 and 50–89. To perform the additional tests, we calibrate two new sets of critical values with data from the U.S. unisex population over these two age ranges. The new results are shown in Table 3. Comparing the new results with the original results in Table 1, we notice that for many of the populations, the length of the optimal calibration window for the entire age range lies between the lengths of the optimal calibration windows for the two smaller age ranges. Also, for most of the populations, the optimal calibration window for ages 0–49 begins later than that for ages 50–89, suggesting that these populations might have been subject to a recent structural change that applies exclusively to mortality at younger ages.

Among all 34 populations, four populations (Spain, Austria, Switzerland, and New Zealand) have an optimal calibration window for the entire age range that is shorter than the optimal calibration windows for both of the smaller age ranges. For these populations, it is possible that b over each smaller age range is somewhat invariant; however, when the entire age range is considered, the variation in b becomes more obvious, and hence the homogeneity hypothesis is rejected earlier in the sequential testing procedure.

The results for the Swedish unisex population are particularly interesting. For this population, the optimal calibration windows for both smaller age ranges begin in 1993, whereas that for the entire age range begins in 1976. In other words, the optimal calibration window for the entire age range is almost twice as long as those for the two smaller age ranges. To explain why the optimal calibration window for ages 50–89 is much shorter, we compare, for each of the three age ranges considered, the estimates of the drift term c when the model is fitted to data over 1976–2011 and 1993–2011, as shown in Table 4. As the beginning point of the calibration window changes from 1976 to 1993, the magnitude of the estimated drift term remains almost constant for ages 0–49, increases significantly by 26 % for ages 50–89, and increases moderately by 13 % for ages 0–89. The sequential testing procedure can pick up the possible structural change in k happening around 1993 when it is applied to the smaller age range of 50–89, but not when it is applied to the entire age range, because the effect of the possible structural change in k (which applies exclusively to the older ages) is diluted when data for the younger ages are included. To explain why the optimal calibration window for ages 0–49 is much shorter, we turn to Fig. 6, which shows the estimates of b (for the entire age range) when the model is fitted to data over 1976–2011 and 1993–2011. As the calibration window widens, the estimates of bx for ages below 50 change rather significantly, but those for ages 50 and older do not. When the test is applied to the smaller age range of 0–49, the effect of the possible structural change in bx for ages below 50 is strong enough to trigger a rejection of the homogeneity hypothesis as the beginning point of the calibration window moves from 1993 to 1992. However, when the test is applied to the entire age range, this effect is diluted and is not strong enough to trigger a rejection in the same step.
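The drift comparisons above rest on the standard estimator of c for a random walk with drift: the mean of the first differences of the fitted kt, which telescopes to (kT − k1)/(T − 1). A minimal sketch with a purely hypothetical kt series shows how a steeper recent decline raises the drift magnitude of a short window relative to a long one:

```python
import numpy as np

def estimate_drift(k):
    """Estimate the drift c of a random walk with drift fitted to the
    index series k_t: the mean first difference, (k_T - k_1)/(T - 1)."""
    return np.diff(np.asarray(k, dtype=float)).mean()

# Hypothetical index: a decline of 1.0/year for 18 years, then 1.3/year.
k = np.concatenate([-1.0 * np.arange(18), -17.0 - 1.3 * np.arange(1, 19)])
c_long = estimate_drift(k)        # long window averages both regimes
c_short = estimate_drift(k[18:])  # short window captures only the steeper decline
```

Here |c_short| exceeds |c_long|, mirroring the pattern in Table 4: restricting the Swedish calibration window to 1993–2011 raises the drift magnitude for ages 50–89 because the earlier, slower regime is dropped.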

## Concluding Remarks

In this article, we introduce a sequential testing procedure that allows us to identify the optimal calibration window for the Lee-Carter model. Through a joint log-likelihood ratio test, our proposed procedure incorporates both of the typical Lee-Carter underlying assumptions (linearity in the time-varying index and time-invariance of the age pattern of mortality decline) into the calibration window selection process. Our proposed procedure is more objective than Booth’s method because it is based on objectively determined critical values instead of a subjectively (and visually) chosen threshold. In addition, our proposed procedure may be applied to a wider range of mortality data sets because the patterns of the R(s) series in Booth’s method are inconclusive for some data sets.

We apply the proposed testing procedure to mortality data from 34 populations and find that the starting years of the optimal calibration windows for most of the populations lie between 1960 and 1990. Using an out-of-sample analysis, we demonstrate that for most of the populations, the Lee-Carter model that is estimated over the optimized calibration window results in more accurate forecasts than one that is fitted to a typically used calibration window (e.g., one that begins in 1900 or 1950). We further apply the proposed testing procedure to data for different age ranges, and we demonstrate (with reasons) that the optimal calibration window may change as the age range changes. This point is useful to those who intend to produce Lee-Carter mortality projections for a restricted range of ages only.

The caveats noted previously are repeated here. First, when the volume of data is low, the parameter estimates may be subject to excessive sampling errors; thus, the log-likelihood ratio test might incorrectly regard a change in b and/or k due to sampling errors as a structural change. Second, the log-likelihood ratio test may not be able to identify a structural change that occurred near the end of the calibration window because of the lack of information on which the test can be based.

We also acknowledge that our proposed method may not be ideal if the change in parameter values is only temporary. Consider a hypothetical situation in which mortality data from 1900 to 2016 are available. Suppose that the data-generating process is the Lee-Carter model, whose parameters have remained constant since 1900 except during 1955–1960. In such a situation, our procedure would detect a breakpoint in 1960, and the resulting optimal calibration window (1961–2016) would exclude 1900–1954, a period over which the data are still relevant.

In future research, it would be interesting to develop an extension for use with multipopulation mortality models. In such an extension, the homogeneity hypothesis should contain more dimensions. For example, for the augmented common factor model proposed by Li and Lee (2005), the homogeneity hypothesis should, at a minimum, encompass (1) the time-invariance of the common age patterns of mortality decline, (2) the time-invariance of the population-specific age patterns of mortality decline, and (3) the linearity of the common time-varying index.

## Notes

1. In the Lee-Carter model, parameter bx measures the sensitivity of the log central death rate at age x to the time-varying index. The collection of bx parameters represents the age pattern of mortality decline.

2. Lee and Carter (1992) also ignored the error term in Eq. (1). In fact, they did not even impose any distributional assumption on the error term.

3. Note that kT,s can be written as kT,s(bT,s).

4. The Human Mortality Database (www.mortality.org) covers 39 countries/regions; we consider 34 of them in this article. We exclude Chile, Greece, Israel, Slovenia, and Taiwan because the starting years of the data sets for these countries/regions are later than 1960.

5. The longest calibration windows for the U.S. and Japanese unisex populations begin, respectively, in 1933 and 1947 (the years in which their mortality data sets begin).

6. The difference between the levels of the two kt series in each diagram is due to the constraint $\sum_{t\in I_T(s)}k_t=0$.

8. In the second quinquennium (1972–1976) of the optimal sample period, these cohorts were aged 42–46, on average. Admittedly, the high smoking prevalence among these cohorts does not explain why the bx estimates for this age group are lower when the shorter calibration window is used. The empirical fact that the shorter calibration window gives lower estimates of bx should be attributed to other reasons, such as that concerning non-Hispanic whites.

9. When evaluating their optimization method, Booth et al. (2002) considered similar alternative calibration windows; one begins in 1950 and the other begins in 1907 (the year in which the Australian mortality data set begins).

10. Note that the MAPE also reflects errors in the age dimension, which are not observed in the across-age average log rates shown in Fig. 4.

## References

Anderson, C. M., Burns, D. M., Dodd, K. W., & Feuer, E. J. (2012). Birth-cohort-specific estimates of smoking behaviors for the US population. Risk Analysis, 32(Suppl. 1), S14–S24. doi:10.1111/j.1539-6924.2011.01703.x

Armstrong, G. L., Conn, L. A., & Pinner, R. W. (1999). Trends in infectious disease mortality in the United States during the 20th century. JAMA, 281, 61–66. doi:10.1001/jama.281.1.61

Barnes, R. F., Moore, M. L., Garfein, R. S., Brodine, S., Strathdee, S. A., & Rodwell, T. C. (2011). Trends in mortality of tuberculosis patients in the United States: The long-term perspective. Annals of Epidemiology, 21, 791–795. doi:10.1016/j.annepidem.2011.07.002

Booth, H., Hyndman, R., Tickle, L., & De Jong, P. (2006). Lee-Carter mortality forecasting: A multi-country comparison of variants and extensions. Demographic Research, 15(article 9), 289–310. doi:10.4054/DemRes.2006.15.9

Booth, H., Maindonald, J., & Smith, L. (2002). Applying Lee-Carter under conditions of variable mortality decline. Population Studies, 56, 325–336. doi:10.1080/00324720215935

Booth, H., & Tickle, L. (2008). Mortality modelling and forecasting: A review of methods. Annals of Actuarial Science, 3, 3–43. doi:10.1017/S1748499500000440

Brouhns, N., Denuit, M., & Vermunt, J. K. (2002). A Poisson log-bilinear regression approach to the construction of projected lifetables. Insurance: Mathematics and Economics, 31, 373–393.

Case, A., & Deaton, A. (2015). Rising morbidity and mortality in midlife among white non-Hispanic Americans in the 21st century. Proceedings of the National Academy of Sciences, 112, 15078–15083.

Chen, Y., Härdle, W. K., & Pigorsch, U. (2010). Localized realized volatility modeling. Journal of the American Statistical Association, 105, 1376–1393. doi:10.1198/jasa.2010.ap09039

Chen, Y., & Niu, L. (2014). Adaptive dynamic Nelson–Siegel term structure model with applications. Journal of Econometrics, 180, 98–115. doi:10.1016/j.jeconom.2014.02.009

Coelho, E., & Nunes, L. C. (2011). Forecasting mortality in the event of a structural change. Journal of the Royal Statistical Society, Series A: Statistics in Society, 174, 713–736. doi:10.1111/j.1467-985X.2010.00687.x

Hatzopoulos, P., & Haberman, S. (2009). A parameterized approach to modeling and forecasting mortality. Insurance: Mathematics and Economics, 44, 103–123.

Horiuchi, S., & Wilmoth, J. (1995, April). Aging of mortality decline. Paper presented at the annual meeting of the Population Association of America, San Francisco, CA.

Hyndman, R. J., & Booth, H. (2008). Stochastic population forecasts using functional data models for mortality, fertility and migration. International Journal of Forecasting, 24, 323–342. doi:10.1016/j.ijforecast.2008.02.009

Hyndman, R. J., Booth, H., & Yasmeen, F. (2013). Coherent mortality forecasting: The product-ratio method with functional time series models. Demography, 50, 261–283. doi:10.1007/s13524-012-0145-5

Hyndman, R. J., & Ullah, M. S. (2007). Robust forecasting of mortality and fertility rates: A functional data approach. Computational Statistics & Data Analysis, 51, 4942–4956. doi:10.1016/j.csda.2006.07.028

Japan International Cooperation Agency. (2005). Japan’s experiences in public health and medical systems (Technical report). Tokyo, Japan: Institute for International Cooperation.

Kannisto, V., Lauritsen, J., Thatcher, A. R., & Vaupel, J. W. (1994). Reductions in mortality at advanced ages: Several decades of evidence from 27 countries. Population and Development Review, 20, 793–810. doi:10.2307/2137662

Lee, R., & Miller, T. (2001). Evaluating the performance of the Lee-Carter method for forecasting mortality. Demography, 38, 537–549. doi:10.1353/dem.2001.0036

Lee, R. D., & Carter, L. R. (1992). Modeling and forecasting US mortality. Journal of the American Statistical Association, 87, 659–671.

Li, H., De Waegenaere, A., & Melenberg, B. (2015). The choice of sample size for mortality forecasting: A Bayesian learning approach. Insurance: Mathematics and Economics, 63, 153–168.

Li, J. S.-H., Chan, W.-S., & Cheung, S.-H. (2011). Structural changes in the Lee-Carter mortality indexes: Detection and implications. North American Actuarial Journal, 15, 13–31. doi:10.1080/10920277.2011.10597607

Li, N., & Lee, R. (2005). Coherent mortality forecasts for a group of populations: An extension of the Lee-Carter method. Demography, 42, 575–594. doi:10.1353/dem.2005.0021

Li, N., Lee, R., & Gerland, P. (2013). Extending the Lee-Carter method to model the rotation of age patterns of mortality decline for long-term projections. Demography, 50, 2037–2051. doi:10.1007/s13524-013-0232-2

Lundström, H., & Qvist, J. (2004). Mortality forecasting and trend shifts: An application of the Lee–Carter model to Swedish mortality data. International Statistical Review, 72, 37–50. doi:10.1111/j.1751-5823.2004.tb00222.x

Mori, T. (2000). Recent trends in tuberculosis, Japan. Emerging Infectious Diseases, 6, 566–568. doi:10.3201/eid0606.000602

Pitacco, E., Denuit, M., Haberman, S., & Olivieri, A. (2009). Modelling longevity dynamics for pensions and annuity business. New York, NY: Oxford University Press.

Sweeting, P. (2011). A trend-change extension of the Cairns-Blake-Dowd model. Annals of Actuarial Science, 5, 143–162. doi:10.1017/S1748499511000017

Tuljapurkar, S., Li, N., & Boe, C. (2000). A universal pattern of mortality decline in the G7 countries. Nature, 405, 789–792. doi:10.1038/35015561

U.S. Department of Health and Human Services. (1991). Strategies to control tobacco use in the United States: A blueprint for public health action in the 1990’s (Smoking and Tobacco Control Monograph No. 1). Bethesda, MD: Division of Cancer Prevention and Control, National Cancer Institute.

Van Berkum, F., Antonio, K., & Vellekoop, M. (2016). The impact of multiple structural changes on mortality predictions. Scandinavian Actuarial Journal, 7, 581–603. doi:10.1080/03461238.2014.987807

Wilmoth, J. R. (1993). Computational methods for fitting and extrapolating the Lee-Carter model of mortality change (Technical report). Berkeley: Department of Demography, University of California, Berkeley.

World Health Organization (WHO). (n.d.). Tuberculosis (Fact sheet). Geneva, Switzerland: WHO.