## Abstract

Researchers using the Lee-Carter approach have often assumed that the time-varying index evolves linearly and that the parameters describing the age pattern of mortality decline are time-invariant. However, as several empirical studies suggest, the two assumptions do not seem to hold when the calibration window begins too early. This problem gives rise to the question of identifying the longest calibration window for which the two assumptions hold true. To address this question, we contribute a likelihood ratio–based sequential test to jointly test whether the two assumptions are satisfied. Consistent with the mortality structural changes observed in previous studies, our testing procedure indicates that the starting points of the optimal calibration windows for most populations fall between 1960 and 1990. Using an out-of-sample analysis, we demonstrate that in most cases, models that are estimated over the optimized calibration windows result in more accurate forecasts than models that are fitted to all available data or post-1950 data. We further apply the proposed testing procedure to data over different age ranges. We find that the optimal calibration windows for age group 0–49 are generally shorter than those for age group 50–89, indicating that mortality at younger ages might have undergone (another) structural change in recent years.

## Introduction

The Lee-Carter model (Lee and Carter 1992) and its various extensions (Booth et al. 2002; Hyndman and Booth 2008; Hyndman et al. 2013; Hyndman and Ullah 2007) have been widely used for producing projections of future mortality rates, which are required when analyzing demographic changes, pension plans, and social security policies. Unlike its predecessors, the Lee-Carter model is a stochastic model, producing not only a central mortality projection but also measures of uncertainty reflecting how the realized mortality may deviate from the central projection. (For a review of the Lee-Carter model and some of its extensions, see Booth and Tickle 2008.)

The key idea behind the Lee-Carter model is that mortality rates at all ages are driven by a single time-varying index, which is often assumed to evolve linearly over time. A collection of response parameters is devised to capture the sensitivities to the time-varying index at different ages so that mortality rates at different ages are permitted to decline at different speeds. The response parameters are assumed to be fixed, which in turn means that the model assumes a time-invariant age pattern of mortality decline. Therefore, the validity of the Lee-Carter model depends heavily on the linearity of the time-varying index and the time-invariance of the age pattern of mortality decline. Nevertheless, several empirical studies have suggested that neither assumption holds true when the Lee-Carter model is estimated to an overly long calibration window. For instance, Booth et al. (2006) and Pitacco et al. (2009) asserted that mortality decline in several developed countries has accelerated in recent decades, indicating a violation of the linearity assumption; Kannisto et al. (1994) and Horiuchi and Wilmoth (1995) noted that the difference between the paces of mortality decline at younger and older ages has become more remarkable, providing evidence against a constant age pattern of mortality decline.

Researchers have demonstrated that different calibration windows may lead to very different mortality forecasts (see, e.g., Hatzopoulos and Haberman 2009; Lee and Miller 2001; Lundström and Qvist 2004). Hence, when using the Lee-Carter model, it is important to choose a calibration window within which the two mentioned assumptions are not violated. A calibration window that is too long may lead to an overly pessimistic forecast if mortality decline has accelerated in the recent past. The opposite may be true if the data within the calibration window incorporate the mortality decline arising from, for example, the breakthrough in treating tuberculosis (Barnes et al. 2011; Mori 2000), which is not so likely to recur. On the other hand, although the two assumptions are more likely to hold within a shorter calibration window, a calibration window that is too short may not contain sufficient information about the historical random deviations. The resulting model may therefore give inadequate provision for uncertainty. The dilemma can be visualized in Fig. 1, which shows the Lee-Carter forecasts of the log central death rates for the Spanish unisex population at age 85, based on three different calibration windows: 1950–2000, 1983–2000, and 1990–2000. The forecast is the least accurate when the longest calibration window is used, possibly because it does not sufficiently reflect the acceleration of mortality decline in the recent past. The shortest calibration window yields the most accurate forecast, but it also results in confidence intervals that seem to be too narrow and fail to capture some of the realized log central death rates. The remaining calibration window seems to be a reasonable compromise.

The question of optimizing the calibration window for the Lee-Carter model has been discussed widely in the literature. Most of the studies on this topic have focused on the assumption of a linear time-varying index. Optimization methods that are based on linearity tests for the time-varying index have been proposed. Booth et al. (2002) constructed a ratio to measure the loss of fit arising from the assumption of a perfectly linear time-varying index and proposed using the longest calibration window for which the ratio is reasonably small; Li et al. (2015) treated the length of the calibration window as a parameter and detected structural changes in the time-varying index by examining the posterior distribution of this additional parameter. Other optimization methods that are based on linearity tests have also been considered (see, e.g., Coelho and Nunes 2011; Li et al. 2011; Sweeting 2011; Van Berkum et al. 2016).

Fewer studies have examined the assumption of a time-invariant age pattern of mortality decline. Lee and Miller (2001) claimed that the assumption of a time-invariant age pattern of mortality decline works well in the second half of the twentieth century for several developed countries, stating that “A simple and satisfactory solution, adopted by Tuljapurkar et al. (2000), is to base the forecast on data since 1950, and to assume fixed *b*_{x} over that range but not over the whole century.”^{1} Booth et al. (2002) demonstrated that the age pattern of mortality decline for Australians is roughly time-invariant over the calibration window identified using their proposed optimization method. More recently, Li et al. (2013) proposed an extension of the Lee-Carter model in which the age pattern of mortality decline begins to vary when the realized life expectancy reaches a predetermined threshold and ultimately becomes flat over the age range of 0–85. However, to our knowledge, no formal test for the assumption of a time-invariant age pattern of mortality decline has ever been proposed.

In this article, we attempt to fill the gap in the literature by proposing an optimization method that takes *both* underlying assumptions into account. To this end, we propose a sequential testing procedure, which is based on a log-likelihood ratio test for the joint hypothesis of a linear time-varying index and a time-invariant age pattern of mortality decline. In the sequential testing procedure, the log-likelihood ratio test is first applied to a very short calibration window, over which the joint hypothesis holds by default, and is then applied to successively longer calibration windows. The procedure continues until the longest calibration window satisfying the joint hypothesis is found. We apply the proposed testing procedure to mortality data from a large set of populations and find that the starting years of the optimal calibration windows for most populations lie between 1960 and 1990. Using an out-of-sample analysis, we demonstrate that in most cases, models that are estimated over the optimized calibration windows result in more accurate forecasts than models that are fitted to all available data or post-1950 data. We further apply the proposed testing procedure to data over different age ranges. We find that the optimal calibration windows for age group 0–49 are frequently shorter than those for age group 50–89, indicating that mortality at younger ages might have undergone (another) structural change in recent years.

## The Lee-Carter Model

Let us define the following notations, which are used throughout the rest of this article.

- *d*_{x,t} is the observed number of deaths at age *x* and in calendar year *t*;
- *E*_{x,t} is the number of exposures at age *x* and in calendar year *t*;
- *m*_{x,t} = *d*_{x,t}/*E*_{x,t} is the central death rate at age *x* and in calendar year *t*;
- *N*_{x} is the number of ages under consideration; and
- **X** is the collection of ages under consideration.

The Lee-Carter model is specified as

$$\ln m_{x,t} = a_x + b_x k_t + \varepsilon_{x,t}, \tag{1}$$

where *a*_{x} is a parameter measuring the average level of mortality at age *x*, *k*_{t} is a time-varying index capturing the overall level of mortality in year *t*, *b*_{x} is a parameter quantifying the sensitivity of log *m*_{x,t} to *k*_{t}, and ε_{x,t} is the error term. Following Lee and Carter (1992), we stipulate parameter uniqueness by setting *a*_{x} to the average of the log *m*_{x,t} over the calibration window, the average of *k*_{t} over the calibration window to 0, and the sum of *b*_{x} for all *x* ∈ **X** to 1.

The model is estimated by maximum likelihood under a Poisson death count assumption. Let *D*_{x,t} be the unrealized number of deaths at age *x* and in year *t*. The estimation method assumes that for each *x* and *t*, *D*_{x,t} follows a Poisson distribution with a mean of

$$E_{x,t}\, e^{a_x + b_x k_t}. \tag{2}$$

Suppose that the data sample ends in year *T*. We let *I*_{T}(*s*) be the calibration window running from year *T* − *s* + 1 to *T*. When the model is fitted to *I*_{T}(*s*), we have the following log-likelihood under the Poisson death count assumption:

$$\ell\left(\mathbf{b}, \mathbf{k}; \mathbf{d}_{[T-s+1:T]}, \mathbf{E}_{[T-s+1:T]}\right) = \sum_{x \in \mathbf{X}} \sum_{t \in I_T(s)} \left[ d_{x,t} \ln\left(E_{x,t}\, e^{a_x + b_x k_t}\right) - E_{x,t}\, e^{a_x + b_x k_t} - \ln\left(d_{x,t}!\right) \right]. \tag{3}$$

In Eq. (3), **d**_{[T − s + 1:T]} and **E**_{[T − s + 1:T]} represent the collection of *d*_{x,t} and *E*_{x,t} for all *x* ∈ **X** and *t* ∈ *I*_{T}(*s*), respectively, whereas **b** and **k** are the vectors of *b*_{x} and *k*_{t}, respectively. By substituting Eq. (2) into Eq. (3) and by maximizing Eq. (3) with respect to **b** and **k**, we obtain the maximum likelihood estimates of **b** and **k** given the data over *I*_{T}(*s*). We denote these ML estimates as **b**^{T,s} and **k**^{T,s}, respectively. To generate mortality forecasts, the evolution of *k*_{t} has to be modeled by a time-series process.
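To make the estimation concrete, the Poisson maximum likelihood fit can be sketched with elementwise Newton updates followed by the identifiability constraints described earlier. This is a minimal NumPy sketch under our own initialization and function naming, not the authors' code:

```python
import numpy as np

def fit_lee_carter_poisson(D, E, n_iter=500):
    """Sketch: ML estimation of a, b, k under the Poisson death-count
    assumption, via one-dimensional Newton updates per parameter."""
    X, T = D.shape
    logm = np.log(np.maximum(D, 0.5) / E)   # guard against zero counts
    a = logm.mean(axis=1)                   # start at average log rates
    b = np.full(X, 1.0 / X)                 # uniform initial age pattern
    k = (logm - logm.mean(axis=1, keepdims=True)).mean(axis=0) * X
    for _ in range(n_iter):
        mu = E * np.exp(a[:, None] + np.outer(b, k))   # Poisson means
        a += (D - mu).sum(axis=1) / mu.sum(axis=1)     # Newton step for each a_x
        mu = E * np.exp(a[:, None] + np.outer(b, k))
        k += ((D - mu) * b[:, None]).sum(axis=0) / (mu * (b**2)[:, None]).sum(axis=0)
        mu = E * np.exp(a[:, None] + np.outer(b, k))
        b += ((D - mu) * k[None, :]).sum(axis=1) / (mu * (k**2)[None, :]).sum(axis=1)
    # impose the identifiability constraints: average of k is 0, b sums to 1
    a, k = a + b * k.mean(), k - k.mean()
    b, k = b / b.sum(), k * b.sum()
    return a, b, k
```

With deaths and exposures arranged as age-by-year matrices over *I*_{T}(*s*), the returned `b` and `k` play the roles of **b**^{T,s} and **k**^{T,s}.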

We model the time-varying index as a random walk with drift:

$$k_t = c + k_{t-1} + \omega_t, \qquad \omega_t \overset{\text{i.i.d.}}{\sim} N\!\left(0, \sigma_\omega^2\right), \tag{4}$$

where *c* and $\sigma_\omega^2$ are the drift and (squared) volatility parameters, respectively. We let $\theta = (c, \sigma_\omega^2)'$ be the vector of the parameters in Eq. (4), and θ^{T,s} be the ML estimate of θ when Eq. (4) is fitted to $\mathbf{k}^{T,s} = \left(k_{T-s+1}^{T,s}, \ldots, k_T^{T,s}\right)'$. We can obtain θ^{T,s} easily by maximizing the following log-likelihood with respect to *c* and $\sigma_\omega^2$:

$$\ell\left(\mathbf{k}^{T,s}; \theta\right) = -\frac{s-1}{2} \ln\left(2\pi\sigma_\omega^2\right) - \frac{1}{2\sigma_\omega^2} \sum_{t=T-s+2}^{T} \left(k_t^{T,s} - k_{t-1}^{T,s} - c\right)^2. \tag{5}$$

For convenience, we let **B** = (**b**′, θ′)′ and use **B**^{T,s} to represent the ML estimate of **B** given the data over *I*_{T}(*s*). Using Eqs. (1) and (4) and ignoring the error term ε_{x,t} in Eq. (1),^{2} we can express the log central death rate in a future year *T* + μ as

$$\ln m_{x,T+\mu} = \ln m_{x,T} + b_x\left(c\mu + \sum_{i=1}^{\mu} \omega_{T+i}\right)$$

for μ = 1*,* 2, . . . , which depends on **B** and log *m*_{x,T} but no other model parameters. This means that **B**^{T,s} and log *m*_{x,T} contain all information needed for generating mortality forecasts. In other words, having obtained θ^{T,s} in **B**^{T,s}, **k**^{T,s} no longer plays a role in the forecasting procedure.
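Under the random walk with drift, the ML estimates of θ = (*c*, σ_ω²)′ have closed forms, and the forecast distribution of future log rates follows directly. A minimal sketch (function names are our own):

```python
import numpy as np

def fit_rw_drift(k):
    """Closed-form ML estimates of the drift c and squared volatility
    sigma_w^2 from a fitted path k^{T,s}; maximizes the Gaussian
    log-likelihood of Eq. (5)."""
    dk = np.diff(k)
    c = dk.mean()                      # ML drift: average one-step change
    sigma2 = ((dk - c) ** 2).mean()    # ML variance over the s - 1 increments
    return c, sigma2

def forecast_log_rates(log_m_T, b, c, sigma2, horizon):
    """Central forecast and variance of log m_{x,T+mu}, ignoring the
    error term in Eq. (1), as in the forecasting equation in the text."""
    mu = np.arange(1, horizon + 1)
    central = np.asarray(log_m_T)[:, None] + np.outer(b, c * mu)
    variance = np.outer(np.asarray(b) ** 2, sigma2 * mu)  # Var of b_x * (sum of mu shocks)
    return central, variance
```

Only `b`, the drift and volatility in θ, and the jump-off rates log *m*_{x,T} enter the forecast, mirroring the observation that **k**^{T,s} itself is no longer needed.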

## The Sequential Testing Procedure

### Defining Homogeneity

In this section, we introduce the sequential testing procedure that can be used to identify the optimal calibration window for the Lee-Carter model. The testing procedure is built on the concept of homogeneity. We say that the Lee-Carter model is *homogeneous* over a calibration window *I*_{T}(*s*) if for all shorter calibration windows {*I*_{T}(*u*)*, u* = *s, s −* 1, . . .}, **b** does not depend on *u* (i.e., **b** is time-invariant) and **k** is linear with a gradient and volatility that do not depend on *u*.

We define the optimal calibration window as the longest calibration window over which the Lee-Carter model is homogeneous. The sequential testing procedure identifies such a calibration window as follows. Let *s*_{min} and *s*_{max} be the shortest and longest possible lengths of the calibration window for the model, respectively. By default, *I*_{T}(*s*_{min}) is small enough that homogeneity must hold. The testing procedure iteratively tests whether the Lee-Carter model is homogeneous over calibration windows *I*_{T}(*s*_{min} + 1), *I*_{T}(*s*_{min} + 2), . *.* . We construct a joint log-likelihood ratio test for this purpose. The procedure stops when homogeneity no longer holds or when *I*_{T}(*s*_{max}) is reached, whichever comes first. In practice, *s*_{max} can be set to the length of the entire data sample or an arbitrarily large integer.

In what follows, we first introduce the test statistic for the joint log-likelihood ratio test. Then we explain in further detail how the sequential testing procedure is implemented. Finally, we conclude this section with a description of how the critical values of the joint log-likelihood ratio test are calibrated.

### The Test Statistic

Let us consider an arbitrary calibration window *I*_{T}(*s*), where *s*_{min}*< s ≤ s*_{max}. If the Lee-Carter model is homogeneous over *I*_{T}(*s*), then the following two properties should hold for any *u*, where *s*_{min}*≤ u < s*.

- **Property 1:** Let **k**^{T,s}(**b**^{T,u}) be the *constrained* estimate of **k**, obtained by maximizing Eq. (3) with respect to **k** over *I*_{T}(*s*) given the constraint **b** = **b**^{T,u}. If homogeneity holds over *I*_{T}(*s*), then **b** is time-invariant, and thus **k**^{T,s}(**b**^{T,u}) should be close to **k**^{T,s}.^{3}
- **Property 2:** If homogeneity holds over *I*_{T}(*s*), then **k** is linear with the same gradient and volatility for any *I*_{T}(*u*) with *s*_{min} ≤ *u* < *s*. Consequently, θ^{T,u} and θ^{T,s} should provide a similar fit to **k**^{T,s} because θ^{T,u} and θ^{T,s} are estimates of the same (unknown) parameter vector θ.

Combining the two properties, if homogeneity holds over *I*_{T}(*s*), then the fit of θ^{T,u} to **k**^{T,s}(**b**^{T,u}) (a constrained fit) should be similar to the fit of θ^{T,s} to **k**^{T,s} (an unconstrained fit). Mathematically, if homogeneity holds over *I*_{T}(*s*), then the log-likelihood ratio,

$$\mathrm{LR}\left(I_T(s), \mathbf{B}^{T,s}, \mathbf{B}^{T,u}\right) = \ell\left(\mathbf{k}^{T,s}; \theta^{T,s}\right) - \ell\left(\mathbf{k}^{T,s}(\mathbf{b}^{T,u}); \theta^{T,u}\right), \tag{6}$$

where the log-likelihoods are calculated using Eq. (5), should be small (but not necessarily zero given the randomness in the data sample) for all *s*_{min} ≤ *u* < *s*. We use the log-likelihood ratio as the test statistic for the homogeneity hypothesis. More specifically, if for some *s*_{min} ≤ *u* < *s* the log-likelihood ratio LR(*I*_{T}(*s*)*,* **B**^{T,s}*,* **B**^{T,u}) is too large compared with a precalibrated critical value, then we reject the null hypothesis that the Lee-Carter model is homogeneous over *I*_{T}(*s*).
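The two ingredients of the test statistic can be sketched directly: the Gaussian log-likelihood of Eq. (5) evaluated at a candidate parameter pair, and the difference of two such fits as the likelihood ratio. A hedged sketch (names our own):

```python
import numpy as np

def rw_loglik(k, c, sigma2):
    """Gaussian log-likelihood of Eq. (5) for a path k under a random
    walk with drift c and squared volatility sigma2."""
    dk = np.diff(k)
    return (-0.5 * len(dk) * np.log(2.0 * np.pi * sigma2)
            - ((dk - c) ** 2).sum() / (2.0 * sigma2))

def lr_statistic(k_unc, theta_unc, k_con, theta_con):
    """LR(I_T(s), B^{T,s}, B^{T,u}): unconstrained fit minus constrained fit.
    k_con would be the constrained path k^{T,s}(b^{T,u}) and theta_con the
    estimate theta^{T,u} from the shorter window."""
    return rw_loglik(k_unc, *theta_unc) - rw_loglik(k_con, *theta_con)
```

Because θ^{T,s} maximizes Eq. (5) for **k**^{T,s}, the unconstrained term is at least as large as any constrained fit to the same path, so the statistic is nonnegative in that case.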

It is important to recognize that the proposed log-likelihood ratio test is a joint test for the time-invariance of **b** and the linearity of **k**. This is because a violation of either one or both of these Lee-Carter assumptions would result in the absence of Properties 1 or 2, or both. Any one of these three situations would lead to a large log-likelihood ratio and consequently a better chance of rejecting the (joint) null hypothesis.

### Implementing the Test Procedure

Suppose that homogeneity has not been rejected over *I*_{T}(*s* − 1). We test whether homogeneity still holds over *I*_{T}(*s*) with the following log-likelihood ratio test statistic:

$$\mathrm{LR}\left(I_T(s), \mathbf{B}^{T,s}, \mathbf{B}^{T,s-1}\right).$$

If the test statistic exceeds the critical value ξ_{s}, then the homogeneity hypothesis is rejected. In this case, we regard *I*_{T}(*s −* 1) as the optimal calibration window (the longest calibration window over which the Lee-Carter model is homogeneous) and **B**^{T,s − 1} as the optimal estimate of **B**. Otherwise, we do not reject the homogeneity hypothesis and repeat the test for the next larger calibration window. Noting that homogeneity holds over the smallest calibration window *I*_{T}(*s*_{min}) by default, the procedure begins at *s* = *s*_{min} + 1. The procedure ends when the first rejection of the homogeneity hypothesis occurs or when *s* reaches *s*_{max}, whichever comes first.
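The stopping rule amounts to a simple loop. In this sketch, the statistic `lr_stat(s)` and the critical values `xi` are assumed to be supplied by the surrounding estimation machinery:

```python
def optimal_window_length(lr_stat, xi, s_min, s_max):
    """Sketch of the sequential testing loop. lr_stat(s) is assumed to
    return LR(I_T(s), B^{T,s}, B^{T,s-1}) for the population at hand,
    and xi[s] the precalibrated critical value for window length s."""
    for s in range(s_min + 1, s_max + 1):
        if lr_stat(s) > xi[s]:
            return s - 1   # first rejection: the previous window is optimal
    return s_max           # homogeneity never rejected up to s_max
```

For instance, with a statistic that grows with the window length and a flat critical value, the loop returns the last length at which the statistic was still below the threshold.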

### Calibrating the Critical Values

Calibrating the critical values is not a straightforward task. As Chen et al. (2010) and Chen and Niu (2014) noted, the sampling distribution of the log-likelihood ratio in Eq. (6) is not known, even asymptotically. To overcome this challenge, we use a Monte Carlo experiment, which entails three major components: (1) the generation of random pseudo-samples, (2) the calculation of the benchmark expected log-likelihood ratio, and (3) the iterative solution of the critical values from the benchmark expected log-likelihood ratio. These components are explained in the rest of this subsection. We provide only the information necessary for the application in question; see Chen et al. (2010) and Chen and Niu (2014) for the complete theory.

#### Generation of Random Pseudo-Samples

The calibration of critical values requires a large number (say, *N*_{s}) of random pseudo-samples of death counts. These random pseudo-samples are generated using the following procedure.

1. Simulate *N*_{s} realizations of $k_{T-s_{\max}+1}, \ldots, k_T$ from the following random walk with drift:

   $$k_t(j) = \begin{cases} c^* + k^*_{T-s_{\max}} + \omega_t(j), & t = T - s_{\max} + 1, \\ c^* + k_{t-1}(j) + \omega_t(j), & t = T - s_{\max} + 2, \ldots, T, \end{cases}$$

   where $\omega_t(j) \overset{\text{i.i.d.}}{\sim} N\!\left(0, \sigma_\omega^{*2}\right)$ for *t* = *T* − *s*_{max} + 1, . . . , *T* and *j* = 1, . . . , *N*_{s}. We use $\mathbf{k}(j) = \left(k_{T-s_{\max}+1}(j), \ldots, k_T(j)\right)'$ to represent the *j*th set of simulated *k*_{t}.

2. Given **k**(*j*), simulate the death count for each *x* ∈ **X** and *t* ∈ *I*_{T}(*s*_{max}) from a Poisson distribution with a mean of $E^*_{x,t}\, e^{a^*_x + b^*_x k_t(j)}$. We use *d*_{x,t}(*j*) to represent the *j*th simulated death count for age *x* and year *t*, and **d**_{[T − s + 1:T]}(*j*) to represent the collection of *d*_{x,t}(*j*) over *I*_{T}(*s*).

This procedure depends on a collection of hypothetical Lee-Carter parameters, $a^*_x$, $b^*_x$, *c*^{∗}, $\sigma_\omega^{*2}$, and $k^*_{T-s_{\max}}$, as well as some hypothetical (constant) exposure counts, $E^*_{x,t}$. We use **b**^{*} as shorthand to represent the vector of $b^*_x$; we use $\mathbf{E}^*_{[T-s+1:T]}$ to represent the collection of $E^*_{x,t}$ over *I*_{T}(*s*); and we let $\theta^* = \left(c^*, \sigma_\omega^{*2}\right)'$ and $\mathbf{B}^* = \left(\mathbf{b}^{*\prime}, \theta^{*\prime}\right)'$.
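The two simulation steps can be sketched as follows. This is a minimal sketch, not the authors' code; the arguments `a`, `b`, `c`, `sigma2`, `k_start`, and `E` stand for the hypothetical quantities $a^*_x$, $b^*_x$, *c*^{∗}, $\sigma_\omega^{*2}$, $k^*_{T-s_{\max}}$, and $E^*_{x,t}$:

```python
import numpy as np

def generate_pseudo_samples(a, b, c, sigma2, k_start, E, N_s, rng=None):
    """Sketch of the pseudo-sample generator: simulate N_s k-paths from
    the random walk with drift, then draw Poisson death counts.
    a, b are length-X arrays; E is an X-by-s_max exposure matrix."""
    rng = np.random.default_rng(rng)
    X, s_max = E.shape
    # k_t(j) = k_start + cumulative sum of (c + Gaussian shock)
    shocks = rng.normal(c, np.sqrt(sigma2), size=(N_s, s_max))
    k = k_start + shocks.cumsum(axis=1)
    # Poisson mean: E*_{x,t} exp(a*_x + b*_x k_t(j)), shape (N_s, X, s_max)
    lam = E[None] * np.exp(a[None, :, None] + b[None, :, None] * k[:, None, :])
    return k, rng.poisson(lam)
```

Each slice `d[j]` of the returned death counts is one pseudo-sample $\mathbf{d}_{[T-s_{\max}+1:T]}(j)$.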

We obtain the hypothetical parameters by fitting the Lee-Carter model to the mortality data from the U.S. unisex population aged 0–89 in 1976–2010. When generating the *N*_{s} = 3,000 random pseudo-samples, we assume *s*_{min} = 10, *s*_{max} = 70, and *T* = 2011. Accordingly, we use the exposure counts for the U.S. unisex population aged 0–89 for the period 1941–2010 as the collection of hypothetical exposure counts.

Note that the hypothetical parameters and exposure counts used in the simulation procedure do not necessarily need to come from the population to which the test for homogeneity is applied. In the upcoming sections on the optimal calibration windows and out-of-sample forecasting performance, the same collection of critical values (based on the same hypothetical parameters and exposure counts) is used to test the homogeneity hypothesis for every population under consideration. In Online Resource 1, we examine the sensitivity of the critical values to the hypothetical parameters and exposure counts, and the results indicate that the sensitivity is minimal.

#### Calculation of the Benchmark Expected Log-Likelihood Ratio

For each of the *N*_{s} pseudo–random samples, homogeneity holds over *I*_{T}(*s*_{min}), . . . , *I*_{T}(*s*_{max}) because the death counts are consistently generated from the same collection of hypothetical Lee-Carter parameters and exposure counts. Thus, we can define the benchmark expected log-likelihood ratio based on the largest calibration window *I*_{T}(*s*_{max}) as follows:

$$\Re = \frac{1}{N_s} \sum_{j=1}^{N_s} \left[ \ell\left(\mathbf{k}^{T,s_{\max}}(j); \theta^{T,s_{\max}}(j)\right) - \ell\left(\mathbf{k}^{T,s_{\max}}(\mathbf{b}^*)(j); \theta^*\right) \right],$$

where the log-likelihoods are calculated using Eq. (5). In the benchmark expected log-likelihood ratio,

- $\mathbf{k}^{T,s_{\max}}(j)$ is the (unconstrained) ML estimate of **k**(*j*) when Eq. (1) is fitted to the *j*th set of simulated death counts $\mathbf{d}_{[T-s_{\max}+1:T]}(j)$ and the hypothetical exposures $\mathbf{E}^*_{[T-s_{\max}+1:T]}$.
- $\theta^{T,s_{\max}}(j)$ is the (unconstrained) ML estimate of θ^{*} when Eq. (4) is fitted to $\mathbf{k}^{T,s_{\max}}(j)$.
- $\mathbf{k}^{T,s_{\max}}(\mathbf{b}^*)(j)$ is the constrained ML estimate of **k**(*j*) when Eq. (1) is fitted to $\mathbf{d}_{[T-s_{\max}+1:T]}(j)$ and $\mathbf{E}^*_{[T-s_{\max}+1:T]}$; it is obtained by maximizing the log-likelihood in Eq. (3) with respect to **k** given the constraint **b** = **b**^{*}.

It follows that $\ell\left(\mathbf{k}^{T,s_{\max}}(j); \theta^{T,s_{\max}}(j)\right)$ measures the fit of Eq. (4) when no constraint is applied (an unconstrained fit), whereas $\ell\left(\mathbf{k}^{T,s_{\max}}(\mathbf{b}^*)(j); \theta^*\right)$ measures the fit of Eq. (4) when the parameter set **B** is fixed to the true (hypothetical) parameter set **B**^{*} (a constrained fit). The benchmark expected log-likelihood ratio ℜ thus measures the expected difference between the constrained and unconstrained fits of Eq. (4) that is due entirely to sampling errors (rather than any violation of homogeneity). Of course, when homogeneity is violated, the expected difference between the constrained and unconstrained fits of Eq. (4) over *I*_{T}(*s*_{max}) would be greater than what is measured by ℜ. As such, we can set the critical values by making reference to ℜ.

#### Solving for the Critical Values From the Benchmark Expected Log-Likelihood Ratio

For each pseudo-sample *j*, each *s* = *s*_{min}, . . . , *s*_{max}, and each *q* = *s*, . . . , *s*_{max}, we define an estimator $\tilde{\mathbf{B}}^{T,q}\left(\xi_{s_{\min}}, \ldots, \xi_s\right)(j)$, which we call the homogeneous estimator for the *j*th pseudo-sample. The argument $\xi_{s_{\min}}, \ldots, \xi_s$ is to emphasize the dependence of the homogeneous estimator on the critical values. This dependence will become clearer as we explain how the values of the homogeneous estimators are calculated. Intuitively, $\tilde{\mathbf{B}}^{T,q}\left(\xi_{s_{\min}}, \ldots, \xi_s\right)(j)$ can be regarded as the best possible estimate of **B**^{*} given the data from the *j*th pseudo-sample over *I*_{T}(*q*) and the critical values $\xi_{s_{\min}}, \ldots, \xi_s$, in the sense that it is based on the longest calibration window over which homogeneity is deemed to hold according to $\xi_{s_{\min}}, \ldots, \xi_s$.

For *s* = *s*_{min}, we set $\tilde{\mathbf{B}}^{T,q}\left(\xi_{s_{\min}}\right)(j) = \mathbf{B}^{T,q}(j)$ for each *j* and *q* = *s*_{min}, . . . , *s*_{max}; that is, for each pseudo-sample, the homogeneous estimators are set equal to the corresponding ML estimates. Also, we set $\xi_{s_{\min}} = \infty$ because homogeneity holds on *I*_{T}(*s*_{min}) by default.

For *s* = *s*_{min} + 1, given a fixed value of $\xi_{s_{\min}+1}$, we calculate the values of the homogeneous estimators with the following algorithm.

1. For each *j* = 1, . . . , *N*_{s}, compute the value of

   $$\mathrm{LR}\left(I_T(s_{\min}+1), \mathbf{B}^{T,s_{\min}+1}(j), \tilde{\mathbf{B}}^{T,s_{\min}}\left(\xi_{s_{\min}}\right)(j)\right)$$

   using Eq. (6). This step is feasible because we know that $\tilde{\mathbf{B}}^{T,s_{\min}}\left(\xi_{s_{\min}}\right)(j) = \mathbf{B}^{T,s_{\min}}(j)$.

2. For the *j*th pseudo-sample, if

   $$\mathrm{LR}\left(I_T(s_{\min}+1), \mathbf{B}^{T,s_{\min}+1}(j), \tilde{\mathbf{B}}^{T,s_{\min}}\left(\xi_{s_{\min}}\right)(j)\right) \le \xi_{s_{\min}+1},$$

   then the unconstrained fit (based entirely on the data over *I*_{T}(*s*_{min} + 1)) can be regarded as similar to the constrained fit (with the homogeneous estimator for *s* = *s*_{min} being the constraint). In other words, based on the given critical value $\xi_{s_{\min}+1}$, homogeneity over the longer calibration window *I*_{T}(*s*_{min} + 1) is not violated. Hence, we update the homogeneous estimator for *s* = *q* = *s*_{min} + 1 by setting it to the ML estimate over *I*_{T}(*s*_{min} + 1); that is, we set $\tilde{\mathbf{B}}^{T,s_{\min}+1}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}\right)(j) = \mathbf{B}^{T,s_{\min}+1}(j)$. Moreover, because the critical values beyond *s* = *s*_{min} + 1 are not given in this step, we set $\tilde{\mathbf{B}}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}\right)(j) = \mathbf{B}^{T,q}(j)$ for every *q* = *s*_{min} + 2, . . . , *s*_{max} because this is the best we can do given $\xi_{s_{\min}}$ and $\xi_{s_{\min}+1}$ only. Overall, the update in this step results in a zero log-likelihood ratio:

   $$\mathrm{LR}\left(I_T(q), \mathbf{B}^{T,q}(j), \tilde{\mathbf{B}}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}\right)(j)\right) = 0,$$

   for *q* = *s*_{min} + 1, *s*_{min} + 2, . . . , *s*_{max}.

3. Contrarily, for the *j*th pseudo-sample, if

   $$\mathrm{LR}\left(I_T(s_{\min}+1), \mathbf{B}^{T,s_{\min}+1}(j), \tilde{\mathbf{B}}^{T,s_{\min}}\left(\xi_{s_{\min}}\right)(j)\right) > \xi_{s_{\min}+1},$$

   then the difference between the constrained and unconstrained fits can be regarded as too large; hence, based on $\xi_{s_{\min}+1}$, homogeneity no longer holds over *I*_{T}(*s*_{min} + 1). In this case, we do not update the homogeneous estimator; that is, we set $\tilde{\mathbf{B}}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}\right)(j) = \tilde{\mathbf{B}}^{T,s_{\min}}\left(\xi_{s_{\min}}\right)(j)$ for every *q* = *s*_{min} + 1, . . . , *s*_{max}. The log-likelihood ratio for *q* = *s*_{min} + 1, . . . , *s*_{max} then becomes

   $$\mathrm{LR}\left(I_T(q), \mathbf{B}^{T,q}(j), \tilde{\mathbf{B}}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}\right)(j)\right) = \mathrm{LR}\left(I_T(q), \mathbf{B}^{T,q}(j), \tilde{\mathbf{B}}^{T,s_{\min}}\left(\xi_{s_{\min}}\right)(j)\right).$$

In this algorithm, the critical value $\xi_{s_{\min}+1}$ is the only unknown. It can be neither too large nor too small. If $\xi_{s_{\min}+1}$ is too small, then the homogeneity hypothesis may be rejected even when homogeneity holds. On the other hand, if $\xi_{s_{\min}+1}$ is too large, then the homogeneity hypothesis may not be rejected even if homogeneity does not hold, making the power of the test too weak. In the extreme case of $\xi_{s_{\min}+1} = \infty$, the homogeneity hypothesis would never be rejected regardless of whether homogeneity holds. As discussed earlier, we choose $\xi_{s_{\min}+1}$ by making reference to the benchmark expected log-likelihood ratio ℜ. In particular, the chosen value of $\xi_{s_{\min}+1}$ is the minimum value that satisfies all the following inequalities:

$$\mathrm{E}\left[\mathrm{LR}\left(I_T(q), \mathbf{B}^{T,q}, \tilde{\mathbf{B}}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}\right)\right)\right] = \frac{1}{N_s} \sum_{j=1}^{N_s} \mathrm{LR}\left(I_T(q), \mathbf{B}^{T,q}(j), \tilde{\mathbf{B}}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}\right)(j)\right) \le \frac{1}{s_{\max}-s_{\min}} \Re, \qquad q = s_{\min}+1, \ldots, s_{\max}.$$

The coefficient 1/(*s*_{max} − *s*_{min}) is included to correct for the uncertainty in estimation caused by the sequential testing procedure (see Chen et al. 2010). With $\xi_{s_{\min}+1}$ having been determined, the values of the homogeneous estimators for *s* = *s*_{min} + 1, *q* = *s*_{min} + 1, . . . , *s*_{max}, and all *j* = 1, . . . , *N*_{s} can be finalized.

We are now ready to move on to *s* = *s*_{min} + 2. Given the values of the homogeneous estimators for *s* = *s*_{min} + 1, we can readily calculate the value of

$$\mathrm{LR}\left(I_T(s_{\min}+2), \mathbf{B}^{T,s_{\min}+2}(j), \tilde{\mathbf{B}}^{T,s_{\min}+1}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}\right)(j)\right)$$

for each *j* = 1, . . . , *N*_{s}. Also, for a fixed value of $\xi_{s_{\min}+2}$, we can determine for *s* = *s*_{min} + 2 the homogeneous estimator, $\tilde{\mathbf{B}}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}, \xi_{s_{\min}+2}\right)(j)$, and the log-likelihood ratio, $\mathrm{LR}\left(I_T(q), \mathbf{B}^{T,q}(j), \tilde{\mathbf{B}}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}, \xi_{s_{\min}+2}\right)(j)\right)$, for every *q* = *s*_{min} + 2, . . . , *s*_{max} and *j* = 1, . . . , *N*_{s}, in a manner similar to that described in Steps 2 and 3 in the algorithm for *s* = *s*_{min} + 1. Finally, the chosen value of $\xi_{s_{\min}+2}$ is the minimum value that satisfies all the following inequalities:

$$\frac{1}{N_s} \sum_{j=1}^{N_s} \mathrm{LR}\left(I_T(q), \mathbf{B}^{T,q}(j), \tilde{\mathbf{B}}^{T,q}\left(\xi_{s_{\min}}, \xi_{s_{\min}+1}, \xi_{s_{\min}+2}\right)(j)\right) \le \frac{2}{s_{\max}-s_{\min}} \Re, \qquad q = s_{\min}+2, \ldots, s_{\max}.$$

The subsequent critical values $\xi_{s_{\min}+3}, \ldots, \xi_{s_{\max}}$ are determined in a similar fashion.

## Application to Mortality Data

We now apply the sequential testing procedure to mortality data from 34 unisex populations. The data we use are crude central death rates and exposure counts by single year of age from 0 to 89 and are downloaded from the Human Mortality Database.^{4} For each population, we set *s*_{min} to 10, *s*_{max} to 70, and *T* to the ending year of the population’s data set.

### The Optimal Calibration Windows

Table 1 shows the beginning years of the optimal calibration windows for all populations under consideration. All optimal calibration windows begin after 1950, and most of the optimal calibration windows begin some time between 1960 and 1990. Our results echo the conclusion from previous findings that post-1950 data better represent current mortality patterns than do pre-1950 data (Lee and Miller 2001; Pitacco et al. 2009). Among the 34 populations, two have very short optimal calibration windows: Ukraine (16 years) and Iceland (12 years). This outcome may be attributed to the very small exposure and death counts for these two populations. During the year when their data samples end, Ukraine had around 4.57 million exposures and 67,000 deaths, whereas Iceland had 32,000 exposures and 1,787 deaths. The outcome also reveals a potential caveat of our proposed method: when the volume of data is low, the parameter estimates are subject to excessive sampling errors, and consequently the log-likelihood ratio test might incorrectly regard a change in **b** and/or **k** due to sampling errors as a structural change.

Next, we examine how the estimates of **b** and **k** might change if the Lee-Carter model is fitted to suboptimal calibration windows. Figure 2 displays the estimates of **k** for the U.S. and Japanese unisex populations, based on the optimal calibration window and the longest possible calibration window.^{5} For both populations, the estimated **k** over the optimal calibration window is highly linear, but the linearity is not preserved over the longer (suboptimal) calibration windows.^{6} For the U.S. population, we observe two structural breakpoints at around 1955 and 1970. For the Japanese population, we find no obvious structural breakpoint, but the reduction in *k*_{t} is somewhat slower over the recent decades.

Figure 3 depicts the estimates of **b** for the U.S. and Japanese unisex populations, based on the optimal calibration window, the longest possible calibration window, and the calibration window that is 15 years shorter than the optimal one. Compared with the optimal calibration window, the longest possible calibration window yields higher values of *b*_{x} for ages 20 to 50. This observation may be explained by the remarkable reduction in infectious disease mortality from the 1940s to the 1970s. Armstrong et al. (1999) demonstrated that for individuals aged 20–44 in the United States, mortality arising from nine infectious diseases (pneumonia and influenza, tuberculosis, diphtheria, pertussis, measles, typhoid fever, dysentery, syphilis, and AIDS) declined substantially over the period 1930 to 1965. As a specific example, consider tuberculosis mortality. For the United States, the percentage of total deaths caused by tuberculosis decreased from 4.33 % in 1940 to only 0.27 % in 1970.^{7} For Japan, this proportion declined from 12.24 % in 1947 to 1.54 % in 1975 (Japan International Cooperation Agency 2005: chapter 5). Because tuberculosis mostly affects working-age adults (WHO n.d.), the drastic reduction in tuberculosis deaths contributed significantly to the decline in mortality for ages 20–50 over the period 1940–1970. Therefore, if the data prior to 1970 are excluded, the decrease in working-age mortality due to the breakthrough in treating tuberculosis (and other infectious diseases) is no longer reflected in the model, leading to lower estimates of *b*_{x} for ages 20–50. We also observe that the longest possible calibration window results in lower values of *b*_{x} for ages 50–90 than the optimal calibration window. This observation is in line with the acceleration of the decline in old-age mortality in recent decades (see, e.g., Li et al. 2013).

Compared with the optimal calibration window, the calibration window that is 15 years shorter yields estimated *b*_{x} that exhibit a similar pattern but more sampling variation. The increased sampling variation is expected given that the number of data points on which the parameter estimates are based is smaller.

For the U.S. population, the shorter calibration window results in higher estimated *b*_{x} for ages 30–40 but lower estimated *b*_{x} for ages 40–60, although these differences are not deemed significant by the sequential testing procedure. The higher estimated *b*_{x} for ages 30–40 may be attributed to the cohort effect of smoking prevalence. As reported by Anderson et al. (2012) and the U.S. Department of Health and Human Services (1991: chapter 3), in the United States, the cohorts born during 1921–1930 (males) and 1931–1940 (females) have the largest proportion of ever-smokers. In the first quinquennium (1967–1971) of the optimal sample period, these cohorts were aged 37–41, on average. The high smoking prevalence among these cohorts might have slowed the mortality decline for the 30–40 age group during the period 1967–1971, and therefore excluding data over this period would result in higher estimated *b*_{x} for ages 30–40.^{8} On the other hand, the lower estimated *b*_{x} for ages 40–60 might be due to the increase in mortality rates among non-Hispanic whites from 1999 to 2013. See Case and Deaton (2015) for a discussion of the possible reasons behind such an increase in mortality.

### Out-of-Sample Forecasting Performance

In this subsection, we examine how forecasting performance may vary if a longer-than-optimal calibration window is used. We consider two alternative calibration windows, which begin in (1) 1950 and (2) either 1900 or the year in which the population’s data set begins, whichever is later.^{9}

To compare forecasting accuracy, for each population we compute the mean absolute percentage error (MAPE) of the forecasted log central death rates over the age range 0–89 and the forecast period 2001 to 2000 + μ:

$$\mathrm{MAPE}=\frac{1}{90\mu}\sum_{x=0}^{89}\sum_{t=2001}^{2000+\mu}\left|\frac{\log\hat{m}_{x,t}-\log m_{x,t}}{\log m_{x,t}}\right|,$$

where μ is the forecast horizon, and $\log m_{x,t}$ and $\log\hat{m}_{x,t}$ are the observed and forecasted log central death rates, respectively. For each population, we determine μ in such a way that 2000 + μ is the ending year of the population’s entire data sample. A smaller value of the MAPE indicates a better forecasting performance.
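
The MAPE computation can be sketched in a few lines. The array shapes, variable names, and toy data below are illustrative assumptions, not the paper's data:

```python
import numpy as np

def mape(log_m_obs, log_m_fcst):
    """Mean absolute percentage error of forecasted log central death rates.

    Both arguments are arrays of shape (n_ages, horizon), e.g. ages 0-89
    by forecast years 2001 .. 2000 + mu. A smaller MAPE means a better
    out-of-sample forecast.
    """
    return np.mean(np.abs((log_m_fcst - log_m_obs) / log_m_obs))

# Toy example: log death rates for 90 ages over an 11-year horizon.
rng = np.random.default_rng(0)
obs = np.log(np.linspace(1e-4, 0.2, 90))[:, None] + rng.normal(0, 0.01, (90, 11))
fcst = obs + rng.normal(0, 0.02, (90, 11))
print(round(mape(obs, fcst), 4))
```

In a real evaluation, `obs` would hold the observed log rates over the forecast period and `fcst` the Lee-Carter forecasts produced from each candidate calibration window.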

Table 2 compares the MAPEs resulting from the three different calibration windows. It includes only 18 populations because populations whose data samples begin later than 1950 must be excluded in this evaluation. For 15 of the 18 populations, the optimal calibration window results in the best out-of-sample forecast performance, supporting the use of the proposed sequential testing procedure over the two arbitrary beginning points.

The three exceptions (Portugal, Finland, and Spain) merit closer scrutiny. For these populations, the calibration window leading to the smallest MAPE begins in either 1940 (Portugal) or 1950 (Finland and Spain). Figure 4 shows the actual log central death rates, averaged over the age range 0–89, from 1940/1950 to 2000 + μ. Also shown in Fig. 4 are the corresponding forecasted values from 2001 to 2000 + μ, based on the calibration window optimized by the sequential testing procedure and the calibration window yielding the smallest MAPE.^{10} For all three populations, the evolution of the average log central death rates is highly nonlinear over the longer calibration window, which is deemed suboptimal by the sequential testing procedure but yields a smaller MAPE. Specifically, the reduction in mortality was fast in the early years of the longer calibration window (1940 to mid-1950s for Portugal; 1950 to 1980 for Finland and Spain), slowed for certain decades (1955 to mid-1990s for Portugal; 1980 to mid-1990s for Finland and Spain), and then accelerated. With data up to 2000 only, the sequential testing procedure may not have enough information to detect the potential structural break in the mid-1990s; thus, the resulting optimal calibration windows include the period of slower mortality decline, leading to an underestimation of the latest mortality decline. On the other hand, because the longer calibration windows begin with a period of rapid mortality decline, they coincidentally produce more accurate out-of-sample forecasts.

When evaluating the out-of-sample forecasting performance, the sequential testing procedure is applied to data up to 2000 only. Thus, for some populations, the optimal calibration windows identified here are different from those found earlier. For example, the optimal calibration window for the Finnish unisex population begins in 1993 if all data are considered, and in 1975 if the post-2000 data are excluded. The exclusion of the post-2000 data may lead to an optimal calibration window that begins earlier, possibly because multiple structural changes may exist, and the sequential testing procedure may not have sufficient information to detect the latest possible structural change when the most recent data are excluded.

### A Comparison With Booth’s Method

Booth et al. (2002) proposed a method (hereafter termed *Booth’s method*) to determine the longest calibration window over which the linearity of **k** holds. In what follows, we briefly describe Booth’s method and compare it with our proposed sequential testing method.

For each calibration window *I*_{T}(*s*), the deviance can be calculated as

$$\mathrm{deviance}_{\mathrm{base}}(s)=2\sum_{x}\sum_{t\in I_{T}(s)}\left[d_{x,t}\ln\left(\frac{d_{x,t}}{\lambda_{x,t}}\right)-\left(d_{x,t}-\lambda_{x,t}\right)\right],$$

where *d*_{x,t} is the observed death count at age *x* in year *t*, λ_{x,t} is defined in Eq. (2), and *a*_{x}, *b*_{x}, and *k*_{t} in λ_{x,t} are set to their respective ML estimates. This deviance measure assesses the model’s lack of fit: the higher the deviance, the greater the lack of fit. Booth et al. (2002) specifically referred to Eq. (7) as “the base lack of fit.”

To measure how much additional lack of fit is created when **k** is perfectly linear, Booth et al. (2002) defined another measure referred to as “the total lack of fit,” denoted by deviance_{total}(*s*). The definition of deviance_{total}(*s*) is the same as that of deviance_{base}(*s*), except that the ML estimates of *k*_{t} are replaced with *k*_{t} that lie perfectly on a straight line that has a gradient of *c* (the drift term in Eq. (4)) and passes through the average of the (revised) *k*_{t} at the midpoint of *I*_{T}(*s*).
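
The two lack-of-fit measures can be sketched as follows, assuming a Poisson death-count model with mean λ_{x,t} = E_{x,t} exp(*a*_{x} + *b*_{x}*k*_{t}) and the standard Poisson deviance; all function and variable names are illustrative, not the authors' implementation:

```python
import numpy as np

def poisson_deviance(deaths, lam):
    # Standard Poisson deviance; the d*ln(d/lam) term is taken as 0 where d == 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(deaths > 0, deaths * np.log(deaths / lam), 0.0)
    return 2.0 * np.sum(term - (deaths - lam))

def lack_of_fit(deaths, exposure, a, b, k):
    """Base lack of fit (ML k) vs. total lack of fit (k forced onto a line)."""
    # Base: fitted means with the ML estimates of a, b, and k.
    lam_base = exposure * np.exp(a[:, None] + np.outer(b, k))
    base = poisson_deviance(deaths, lam_base)

    # Total: replace k by the straight line with gradient c (the drift)
    # passing through the mean of k at the midpoint of the window.
    t = np.arange(len(k), dtype=float)
    c = np.mean(np.diff(k))                  # drift estimate
    k_lin = k.mean() + c * (t - t.mean())    # perfectly linear time index
    lam_total = exposure * np.exp(a[:, None] + np.outer(b, k_lin))
    total = poisson_deviance(deaths, lam_total)
    return base, total
```

When the ML estimate of **k** is already perfectly linear, the two measures coincide; otherwise the total lack of fit exceeds the base lack of fit, and the gap quantifies the cost of the linearity assumption.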

Booth et al. (2002) then considered the ratio of the total lack of fit to the base lack of fit. For each calibration window *I*_{T}(*s*), such a ratio is calculated as follows:

$$R(s)=\frac{\mathrm{deviance}_{\mathrm{total}}(s)}{\mathrm{deviance}_{\mathrm{base}}(s)}.$$

The ratio is calculated for all possible values of *s*. If *R*(*s*′ + 1) is substantially larger than *R*(*s*′) for some *s*′, then *I*_{T}(*s*′) is considered as the optimal calibration window because the linearity in **k** results in too much additional lack of fit as the calibration window is extended from *I*_{T}(*s*′) to *I*_{T}(*s*′ + 1).
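
Booth's selection rule can then be sketched as a scan over the *R*(*s*) series. The jump threshold below is precisely the subjective element of the method; its value, like the function name, is an illustrative assumption:

```python
def booth_optimal_window(R, jump_threshold=1.5):
    """Pick the window index s' at which R(s'+1) jumps well above R(s').

    R: lack-of-fit ratios R(1), R(2), ... for calibration windows of
    increasing length. jump_threshold encodes the analyst's subjective
    notion of "substantially larger"; a different choice may yield a
    different window. Returns the 1-based index s' of the last window
    before the jump, or len(R) if no jump is found.
    """
    for s in range(len(R) - 1):
        if R[s + 1] > jump_threshold * R[s]:
            return s + 1  # 1-based index of I_T(s')
    return len(R)

# A ratio series with a clear jump after the 4th window:
R = [1.01, 1.02, 1.03, 1.05, 2.40, 2.50]
print(booth_optimal_window(R))
```

For a well-behaved series such as Belgium's, the jump is unambiguous; for a volatile series such as the UK's, no threshold cleanly separates a jump from noise, which is the identification problem noted in the text.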

Our proposed method has two distinct features that make it stand out from Booth’s. First, our sequential testing procedure is based on a joint test of the linearity of **k** and the time-invariance of **b**. Although Booth et al. (2002) found that the time-invariance of **b** holds better over the calibration window they selected, they did not explicitly include the time-invariance of **b** in their selection procedure. Second, our sequential testing procedure is based on a series of objectively determined critical values; by contrast, Booth’s method is based on a subjectively (and visually) chosen threshold for the values of *R*(*s*), and for this reason, it is not always easy to identify the optimal calibration window with Booth’s method. To illustrate this potential problem, in Fig. 5, we show two series of *R*(*s*) that are, respectively, calculated using data from the Belgian and UK unisex populations aged 0–89. Based on the *R*(*s*) series, it seems optimal to begin the calibration window for the Belgian unisex population in 1968. However, the conclusion for the UK unisex population is unclear because of the pattern and volatility of its *R*(*s*) series.
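
For contrast, the sequential testing procedure can be sketched as a backward-extension loop in which each step compares a likelihood ratio statistic with a pre-computed critical value. The callables and their signatures are illustrative placeholders, not the paper's implementation:

```python
def optimal_window(n_windows, lr_statistic, critical_values):
    """Generic sketch of a sequential calibration-window search.

    Candidate windows are indexed 0 (shortest) to n_windows - 1 (longest);
    each extension adds earlier years of data. lr_statistic(s) is the joint
    log-likelihood ratio statistic testing that the extended window still
    satisfies the homogeneity hypothesis (linear k, time-invariant b) of
    window s; critical_values[s] is its pre-computed critical value. The
    search stops at the first rejection and keeps the current window.
    """
    s = 0
    while s + 1 < n_windows:
        if lr_statistic(s) > critical_values[s]:  # homogeneity rejected
            break                                 # keep the current window
        s += 1                                    # extend one step backward
    return s

# Toy illustration: homogeneity is rejected at the third extension step.
stats = [0.1, 0.2, 5.0, 0.1]
print(optimal_window(5, lambda s: stats[s], [1.0, 1.0, 1.0, 1.0]))
```

The objectivity claimed in the text lives in `critical_values`: each comparison uses a pre-computed cutoff rather than a visually chosen threshold.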

### Application to Narrower Age Ranges

In this subsection, we examine how the test results may change when the test is applied to age ranges narrower than 0–89. We consider two nonoverlapping age ranges: 0–49 and 50–89. To perform the additional tests, we calibrate two new sets of critical values with data from the U.S. unisex population over these two age ranges. The new results are shown in Table 3. Comparing the new results with the original results in Table 1, we notice that for many of the populations, the length of the optimal calibration window for the entire age range lies between the lengths of the optimal calibration windows for the two smaller age ranges. Also, for most of the populations, the optimal calibration window for ages 0–49 begins later than that for ages 50–89. This suggests that these populations might have been subject to a recent structural change that applies exclusively to mortality at younger ages.

Among all 34 populations, four populations (Spain, Austria, Switzerland, and New Zealand) have an optimal calibration window for the entire age range that is shorter than the optimal calibration windows for both of the smaller age ranges. For these populations, it is possible that **b** over each smaller age range is somewhat invariant; however, when the entire age range is considered, the variation in **b** becomes more obvious, and hence the homogeneity hypothesis is rejected earlier in the sequential testing procedure.

The results for the Swedish unisex population are particularly interesting. For this population, the optimal calibration windows for both smaller age ranges begin in 1993, whereas that for the entire age range begins in 1976. In other words, the optimal calibration window for the entire age range is almost twice as long as those for the two smaller age ranges. To explain why the optimal calibration window for ages 50–89 is much shorter, we compare, for each of the three age ranges considered, the estimates of the drift term *c* when the model is fitted to data over 1976–2011 and 1993–2011, as shown in Table 4. As the beginning point of the calibration window changes from 1976 to 1993, the magnitude of the estimated drift term remains almost constant for ages 0–49, increases significantly (by 26 %) for ages 50–89, and increases moderately (by 13 %) for ages 0–89. The sequential testing procedure can pick up the possible structural change in **k** happening around 1993 when it is applied to the smaller age range of 50–89, but it cannot when it is applied to the entire age range because the effect of the possible structural change in **k** (which applies exclusively to the older ages) is mitigated when data for the younger ages are included. To explain why the optimal calibration window for ages 0–49 is much shorter, we turn to Fig. 6, in which we show the estimates of **b** (for the entire age range) when the model is fitted to data for 1976–2011 and 1993–2011. As the calibration window widens, the estimates of *b*_{x} for ages below 50 change rather significantly, but those for ages 50 and older do not. When the test is applied to the smaller age range of 0–49, the effect of the possible structural change in *b*_{x} for ages below 50 is strong enough to trigger a rejection of the homogeneity hypothesis as the beginning point of the calibration window moves from 1993 to 1992.
However, when the test is applied to the entire age range, the effect of the possible structural change in *b*_{x} for ages below 50 is mitigated and is not strong enough to trigger a rejection in the same step.

## Concluding Remarks

In this article, we introduce a sequential testing procedure that allows us to identify the optimal calibration window for the Lee-Carter model. Through a joint log-likelihood ratio test, our proposed procedure incorporates both of the typical Lee-Carter underlying assumptions (linearity in the time-varying index and time-invariance of the age pattern of mortality decline) into the calibration window selection process. Our proposed procedure is more objective than Booth’s method because it is based on objectively determined critical values instead of a subjectively (and visually) chosen threshold. In addition, our proposed procedure may be applied to a wider range of mortality data sets because the patterns of the *R*(*s*) series in Booth’s method are inconclusive for some data sets.

We apply the proposed testing procedure to mortality data from 34 populations and find that the starting years of the optimal calibration windows for most of the populations lie between 1960 and 1990. Using an out-of-sample analysis, we demonstrate that for most of the populations, the Lee-Carter model that is estimated to the optimized calibration windows results in more accurate forecasts than one that is fitted to a typically used calibration window (e.g., one that begins in 1900 or 1950). We further apply the proposed testing procedure to data for different age ranges, and we demonstrate (with reasons) that the optimal calibration window may change as the age range changes. This point is useful to those who intend to produce Lee-Carter mortality projections for a restricted range of ages only.

The caveats noted previously are repeated here. First, when the volume of data is low, the parameter estimates may be subject to excessive sampling errors; thus, the log-likelihood ratio test might incorrectly regard a change in **b** and/or **k** due to sampling errors as a structural change. Second, the log-likelihood ratio test may not be able to identify a structural change that occurred near the end of the calibration window because of the lack of information on which the test can be based.

We also acknowledge that our proposed method may not be ideal if the change in parameter values is only temporary. Consider a hypothetical situation in which mortality data from 1900 to 2016 are available. Suppose that the data-generating process is the Lee-Carter model, whose parameters are constant from 1900 onward except during 1955–1960. In such a situation, our procedure would detect a breakpoint in 1960, and the resulting optimal calibration window (1961–2016) would exclude 1900–1954, over which the data are still relevant.

In future research, it would be interesting to develop an extension for use with multipopulation mortality models. In such an extension, the homogeneity hypothesis should contain more dimensions. For example, for the augmented common factor model proposed by Li and Lee (2005), the homogeneity hypothesis should, at a minimum, encompass (1) the time-invariance of the common age patterns of mortality decline, (2) the time-invariance of the population-specific age patterns of mortality decline, and (3) the linearity of the common time-varying index.

## Notes

^{1}

In the Lee-Carter model, parameter *b*_{x} measures the sensitivity of the log central death rate at age *x* to the time-varying index. The collection of *b*_{x} parameters represents the age pattern of mortality decline.

^{2}

Lee and Carter (1992) also ignored the error term in Eq. (1). In fact, they did not even impose any distributional assumption on the error term.

^{3}

Note that **k**^{T,s} can be written as **k**^{T,s}(**b**^{T,s}).

^{4}

The Human Mortality Database (www.mortality.org) covers 39 countries/regions; we consider 34 of them in this article. We exclude Chile, Greece, Israel, Slovenia, and Taiwan because the starting years of the data sets for these countries/regions are later than 1960.

^{5}

The longest calibration windows for the U.S. and Japanese unisex populations begin, respectively, in 1933 and 1947 (the years in which their mortality data sets begin).

^{6}

The difference between the levels of the two *k*_{t} series in each diagram is due to the constraint $\sum_{t \in I_T(s)} k_t = 0$.

^{7}

Data available online (http://www.infoplease.com/ipa/A0922292.html).

^{8}

In the second quinquennium (1972–1976) of the optimal sample period, these cohorts were aged 42–46, on average. Admittedly, the high smoking prevalence among these cohorts does not explain why the *b*_{x} estimates for this age group are lower when the shorter calibration window is used. The empirical fact that the shorter calibration window gives lower estimates of *b*_{x} should be attributed to other reasons, such as that concerning non-Hispanic whites.

^{9}

When evaluating their optimization method, Booth et al. (2002) consider similar alternative calibration windows; one begins in 1950 and the other begins in 1907 (the year in which the Australian mortality data set begins).

^{10}

Note that the MAPE also reflects errors in the age dimension, which are not observed in the across-age average log rates shown in Fig. 4.