## Abstract

The assessment of the impact that socioeconomic determinants have on the prevalence of certain chronic conditions reported by respondents in population surveys must confront two problems. First, the self-reports could be in error (false positives and false negatives). Second, those reporting are a selected sample of those who ever experience the problem, and this selection is heavily influenced by excess mortality attributable to the condition being reported. In this article, we use a combination of empirical data and microsimulation to (a) assess the magnitude of the bias attributable to the selection problem, and (b) suggest an adjustment procedure that corrects for this bias. We find that the proposed adjustment procedure considerably reduces the bias arising from differential mortality.

## Introduction

Accurate inferences about incidence of phenomena are generally made from data collection plans that follow observations over time and allow precise measurement of the timing of occurrence of relevant events. However, longitudinal designs are expensive enterprises, and sometimes researchers replace them with single-wave cross-sectional surveys with retrospective recall. For example, a significant number of phenomena—such as onset of illnesses, recovery from treatment, menopause, weaning, leaving home, and first marriage—rely on information collected retrospectively in cross-sectional surveys. But retrospective recall of events and their timing is often inaccurate, and statistical inferences from this information stand on shaky ground.^{1}

An alternative to retrospective recalls is current-status data, that is, information about the occurrence of a relevant event prior to a time marker, such as the date of a survey. This information is less sensitive to recall problems and can be retrieved easily in conventional interviews. Under some conditions that we examine in this article, this information and associated statistical tools are a good basis for inferences about the underlying incidence of a phenomenon and for the identification of the determinants of its intensity and duration profile (Diamond and McDonald 1991; Keiding 1991, 2006; Keiding et al. 1989, 1996; Sun and Kalbfleisch 1993). The information is sometimes aggregated and represented as prevalence data: namely, the fraction of the observations that experience the event by a time marker.

Current-status data have an important drawback: they rest on the assumption that attrition of individuals who experience the event of interest is the same as the attrition of those who do not. This assumption is violated when the event under study (an illness or disability, for example) is associated with at least one source of attrition (e.g., mortality). Although this weakness of current-status data is well known (Keiding 1991), it is normally trivialized, dismissed, or altogether ignored in empirical applications. Keiding (1991) considered the case of differential mortality in the analysis of current-status data but assumed that the mortality hazards (or their difference) are known. Little work, however, has studied current-status data when the mortality difference is unknown (Jewell and Van der Laan 2004), the problem that occupies us here.^{2}

In this article, we assess the magnitude of the bias that arises when the assumption of *homogeneous risks* (i.e., identical pre-survey attrition among those who do and those who do not experience the event of interest) is violated, and we develop an adjustment procedure that corrects the bias. The article is organized as follows: the next section reviews an example of the application of current-status techniques in population studies. Then we introduce a maximum-likelihood approach for analyzing current-status data in the context of differential pre-survey attrition and discuss how this situation relates to the broader literature on selection biases. This is followed by an assessment of our adjustment procedure with Monte Carlo simulations. We then apply the adjustment procedure to a concrete case, summarize the results, and conclude.

## An Example: Timing of Marriage and Proportions Single^{3}

Almost 60 years ago, John Hajnal proposed the use of the singulate mean age at marriage, better known as SMAM, to estimate the mean age at marriage (Hajnal 1953).^{4} This was the first demographic application of current status techniques. SMAM has been widely used to assess the timing of first marriage from census or survey information on the age-specific proportion single. Under some conditions, the age-specific prevalence of singlehood is a good indicator of the (single-decrement) probability of remaining single during the interval elapsed between the age at which the population begins to marry and the age of the individual at the time of a census or survey. The first condition is that first-marriage rates be time-invariant (stationarity). The second is that the risks of attrition, mostly induced by mortality or migration, be identical among single and married people (risk homogeneity).^{5} A very large literature and influential theories on historical demography rest on observed trends of SMAM or age-specific proportion single in some target age group, usually 45 or 50. In his landmark studies of the so-called Western European marriage pattern, Hajnal (1965) made extensive use of these quantities to characterize two different continental marriage regimes, a distinction that became influential on subsequent research on fertility and family formation.

While most analysts are aware of the need to invoke assumptions about these two conditions, we know of no study since Hajnal (1953) that has evaluated the pitfalls when the second assumption is inappropriate.^{6} To begin with, inferences from SMAM can be correct only when there are no time trends in first marriage. When the stationarity assumption is violated, current status procedures should be avoided. By the same token, we know that in violation of the assumption of risk homogeneity, mortality risks among single individuals are higher than among the married (Hu and Goldman 1990; Kisker and Goldman 1987; Livi-Bacci 1985). If stationarity holds but mortality risks of single individuals are higher than those of the married population, the observed proportions single will be too small and SMAM will be too large (Hajnal 1953).^{7} Simulations with a wide variety of marriage and mortality patterns indicate that the sensitivity of SMAM to mortality differentials is not trivial (see the appendix for details on the simulations): a 1 % mortality differential can produce proportionate errors in SMAM that are as small as 0.1 % and as large as 0.8 %, depending on the proportion that eventually marries. If the probability of ever marrying hovers round .75–.85, as was the case in Northern and Western Europe in the middle of the nineteenth century, a mortality differential equivalent to 10 % will bias SMAM upward by approximately 7 %. With a true mean age at marriage of 28 and a probability of ever marrying of about .80, fairly typical parameters in preindustrial Europe, and mortality differentials of the order of 10 %, the upward bias could be as high as 2.1 years. If the differential is 20 %, the bias will be 4 years.^{8} Errors of higher magnitude distort the proportion single at older ages, say 50 or 55, beyond which first marriage is negligible. 
When singles’ mortality is higher than overall mortality, the observed proportion single is an underestimate of the true probability of being single.
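The direction of this bias can be illustrated with a short Python sketch of Hajnal's SMAM calculation. The function names, the five-year age groups from 15–19 to 50–54, and the relative-survival values are illustrative assumptions, not quantities taken from this article or its appendix.

```python
def smam(prop_single, start_age=15, width=5):
    """Singulate mean age at marriage from proportions single by age group.

    prop_single holds one value per consecutive age group of equal width.
    The last two groups (45-49 and 50-54 with the defaults) are averaged
    to estimate the proportion that never marries.
    """
    *before_limit, last = prop_single
    never_marry = (before_limit[-1] + last) / 2.0
    limit_age = start_age + width * len(before_limit)
    years_single = start_age + width * sum(before_limit)
    return (years_single - limit_age * never_marry) / (1.0 - never_marry)


def observed_single(true_single, rel_survival):
    """Proportion single among survivors when singles survive at a
    fraction rel_survival of the married population's survival."""
    return [p * r / (p * r + (1.0 - p)) for p, r in zip(true_single, rel_survival)]


# Hypothetical population: everyone single until 25, 20 % never marry.
true_single = [1.0, 1.0, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
# Hypothetical relative survival of singles, declining with age.
rel_survival = [1.0, 0.99, 0.97, 0.95, 0.93, 0.91, 0.89, 0.87]

print(smam(true_single))                                  # true mean age, 25
print(smam(observed_single(true_single, rel_survival)))  # exceeds 25
```

In this toy case, excess mortality among singles lowers the observed proportions single at every age above 25 and pushes SMAM above the true mean age at marriage.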

But errors can be worse: if mortality differences between single and married decline over time, the observed trend of SMAM will be downward and will yield the appearance of a decline in the mean age at marriage even in the absence of any real time trend in first marriage rates. If, as happened in Europe, a decline in the mean age at first marriage is accompanied by a reduction of mortality differences, the observed trajectory of SMAM will exaggerate the magnitude of the rate of decline in the age at first marriage. Similar errors endanger inferences about regional or national patterns. Indeed, the divide between the so-called Western and Eastern marriage pattern was established on the observation of higher SMAM and higher proportion single in Northern and Western Europe. This can be interpreted as a result of different marriage patterns only in the unlikely situation that the magnitude of married-single mortality differences was identical in Eastern, Northern, and Western Europe.

## Estimation From Current Status Information

### Notation

Suppose we are interested in analyzing a random sample of current-status data collected in a (cross-sectional) survey of a population at time *t* with information on the presence or absence of a disease. The process under investigation involves the transition between the healthy and diseased states, and the mortality risk from each state, as illustrated in Fig. 1. Denote the presence or absence of the disease by *Y*_{i} = 1 or *Y*_{i} = 0 for sample member *i* at time *t*. Three quantities describe the disease process: the probability that a sample member aged *x*_{i} is observed with the disease, the hazard function for contracting the disease at age *x*_{i}, and the corresponding survival function. Let μ(*x*_{i}) and *S*(*x*_{i}) be the hazard of mortality and survival function, respectively, for individuals who are age *x*_{i} at time *t* and who do not have the disease. Those observed with the disease are subject to a separate hazard of mortality. Finally, we make the assumption that fertility, mortality, and the incidence of disease are all independent of time, and thus the timing of the survey is independent of all three processes.

### A Likelihood Approach^{9}

The likelihood for *N* observations of current status is given in Eq. (2). For those who do not contract the disease (*Y*_{i} = 0), the contribution to Eq. (2) is the probability of surviving to age *x*_{i} multiplied by the probability of not experiencing the disease by age *x*_{i}; otherwise (*Y*_{i} = 1), the contribution is the probability of surviving to age *x*_{i} multiplied by the probability of contracting the disease by age *x*_{i} (Keiding 1991, 2006).^{10}
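The verbal description above can be written compactly. The following is a sketch consistent with that description rather than a reproduction of the published equation: $F(x)$ is an assumed symbol for the probability of contracting the disease by age $x$ in the absence of mortality, and $S(x)$ is the survival function from the Notation section.

```latex
L \;=\; \prod_{i=1}^{N}
  \bigl[\, S(x_i)\, F(x_i) \,\bigr]^{Y_i}
  \bigl[\, S(x_i)\, \{1 - F(x_i)\} \,\bigr]^{1 - Y_i}
```

Because $S(x_i)$ multiplies both terms, it cancels once we condition on survival to the survey, leaving an ordinary binomial likelihood in $F(x_i)$. This cancellation is what fails when mortality differs by disease status.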

The standard approach models disease onset by age *x* as a function of covariates and associated parameters and estimates them by maximizing the likelihood in Eq. (3) (e.g., Diamond and McDonald 1991). We are concerned here, however, with the case of differential mortality between those who do and those who do not contract the disease by age *x*_{i}. In this case, Eq. (3) does not apply because the survival function carries information about the incidence of disease. Under this condition and assuming constant relative mortality (the mortality hazard of those with the disease is a constant positive multiple of the hazard of those without it), the likelihood is given by Eq. (4), in which *f*^{†} is the conditional density of the age of disease onset given that the disease is contracted by age *x*_{i} (see the appendix for further details).

We develop a new current-status technique to model the dependence between survival and the onset of disease, as it is expressed in Eq. (4). Begin by noting that the limits of integration span the possible ages for which the (conditional) density function, *f*^{†}, takes on positive values. Thus, the integral in Eq. (4) is equal to the expected value of the exponential function in the integrand, where the expectation is taken with respect to the random variable *k*, the age of disease onset. The exponential function is the relative survival probability for those with the disease compared with those without the disease (see Andersen and Væth 1989; Ederer et al. 1961). We approximate the outer integral in Eq. (4) using the delta method, which yields Eq. (5). There, the expectation is taken with respect to *k* ∼ *f*^{†}, and the approximation depends on the integrated hazard of mortality for those observed without the disease, evaluated from the expected age of disease onset for those who contract the disease by age *x*_{i}.

Eq. (6) specifies the dependence of disease onset on a vector of covariates **Z**. Furthermore, if we assume that the mortality hazard can be taken from a standard life table (as discussed later) and combine Eq. (6) with Eq. (5), then the conditional likelihood, *L*_{c} (given survival to time of survey), is approximately that in Eq. (7), or the likelihood of a conventional logistic model with the set of covariates expanded to include the integrated hazard.^{11} If we erroneously assume that the risk of mortality is the same for those with and without the disease, then our estimates of the covariate effects will be wrong because we are not accounting for the additional terms in the likelihood. However, we can adjust our estimate using the approximation in Eq. (7) with a suitable candidate for the integrated hazard.^{12}
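The chain of reasoning in this subsection can be sketched in LaTeX. All symbols other than $\mu$, $S$, $f^{\dagger}$, and $k$ are assumptions introduced for illustration: $F(x)$ is the probability of disease onset by age $x$ in the absence of mortality, $\theta$ is a proportionality constant with diseased mortality written as $\theta\mu(u)$, $\bar{k}(x)$ is the conditional mean age of onset, and $M(x)$ is the integrated hazard. The published equations may use a different parameterization of excess mortality, so the form of the coefficient on $M(x)$ below should be read as conditional on this assumed one.

```latex
\begin{align*}
\Pr(\text{alive at } x,\ Y = 1)
  &= S(x)\,F(x) \int_{0}^{x} f^{\dagger}(k \mid x)\,
     \exp\Bigl\{-(\theta - 1)\int_{k}^{x}\mu(u)\,du\Bigr\}\,dk \\
  &\approx S(x)\,F(x)\,\exp\bigl\{-(\theta - 1)\,M(x)\bigr\},
  \qquad M(x) = \int_{\bar{k}(x)}^{x}\mu(u)\,du,\quad
  \bar{k}(x) = E_{f^{\dagger}}[\,k \mid k \le x\,] \\[4pt]
\Pr(\text{alive at } x,\ Y = 0)
  &= S(x)\,\{1 - F(x)\} \\[4pt]
\operatorname{logit}\Pr(Y = 1 \mid \text{alive at } x)
  &\approx \operatorname{logit} F(x) \;-\; (\theta - 1)\,M(x)
\end{align*}
```

Under these assumptions, if $\operatorname{logit} F(x)$ is linear in the covariates, the observed log odds follow a logistic model with one extra covariate, $M(x)$, whose coefficient reflects the mortality differential.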

### What Should the Integrated Hazard Be?

The adjustment covariate is the integrated hazard of mortality for those who do not contract the disease, and its lower limit of integration is the *expected* age at disease onset for those who are aged *x*_{i} and who contract the disease by this age. Its value for individual *i* can be calculated exactly only if one knows (a) the incidence curve (of disease) and its determinants, (b) the force of mortality at age *x*_{i}, and (c) the parameter of excess mortality. First, the incidence curve can be approximated from retrospective (but noisy) information about the timing of the onset of disease or from known incidence curves in populations similar to the one under study. We show that the adjustment we propose is largely insensitive to even relatively large variability in the values of the expected age at onset. Second, throughout our discussion, μ(*x*) refers to the mortality risk at age *x* among those who do not have the disease. This quantity is unlikely to be known with any precision. However, to implement our adjustment procedure, it suffices to identify a standard (baseline) age pattern of mortality that applies to both those who experience and those who do not experience the disease.
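A minimal Python sketch of how this covariate might be computed is given below. It assumes a Gompertz standard hazard and a log-logistic onset distribution starting at age 30. The parameter values and all function names are hypothetical choices for illustration, not the ones used in this article.

```python
import numpy as np

START_AGE = 30.0  # assumed age at which exposure to the disease begins


def gompertz_cumulative_hazard(lower, upper, a=1e-4, b=0.09):
    """Closed-form integral of a * exp(b * u) from lower to upper."""
    return (a / b) * (np.exp(b * upper) - np.exp(b * lower))


def loglogistic_cdf(age, alpha=150.0, beta=2.0):
    """Assumed probability of disease onset by a given age (no mortality)."""
    t = np.clip(np.asarray(age, dtype=float) - START_AGE, 0.0, None)
    return t**beta / (alpha**beta + t**beta)


def conditional_mean_onset(x, cdf=loglogistic_cdf, n_grid=2001):
    """E[k | k <= x], using E[k | k <= x] = x - (integral of F) / F(x)."""
    grid = np.linspace(START_AGE, x, n_grid)
    F = cdf(grid)
    # Trapezoid rule written out by hand to avoid version-specific helpers.
    area = np.sum((F[1:] + F[:-1]) / 2.0 * np.diff(grid))
    return x - area / F[-1]


def integrated_hazard(x):
    """Standard hazard integrated from the expected onset age up to age x."""
    return gompertz_cumulative_hazard(conditional_mean_onset(x), x)


for age in (50.0, 70.0, 90.0):
    print(age, conditional_mean_onset(age), integrated_hazard(age))
```

Swapping `loglogistic_cdf` for another candidate incidence curve changes only the lower limit of integration, which is the sense in which the choice of onset distribution enters the adjustment.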

### Estimation in the Presence of Covariates

Assume two subgroups defined by a binary covariate *Z*. Assume also that each of them experiences risk heterogeneity that can be parameterized using a unique standard pattern of mortality, *μ*_{s}. This is equivalent to writing the mortality hazards of those without and those with the disease in each subgroup as multiples of *μ*_{s}, with positive constants of proportionality (θ_{z} among them) that may differ between the two subgroups. Thus, the adjustment factor can be computed using the same function *μ*_{s} for all observations. With this parameterization, we define a logistic model that includes the following independent variables: a term for age, a dummy variable *Z*, the integrated hazard of mortality, and the interaction between *Z* and the integrated hazard. The estimated coefficient of *Z* corresponds to the effect of subgroup *Z* = 1 on the incidence of disease, the coefficient of the integrated hazard estimates the mortality differential in the reference subgroup, and the coefficient of the interaction term estimates how that differential differs in subgroup *Z* = 1. Under these assumptions, we can always retrieve the effects of *Z* as well as estimates of the mortality differential between those who experience the disease and those who do not, but we cannot identify the parameters of mortality risks for each subgroup.

### Current Status, Unmeasured Heterogeneity, and Sample Selection Bias

The problem formalized above is a member of a more general class of problems characterized by two features: (1) the phenomenon of interest is only partially observed (e.g., observed only in a subset of individuals who experienced it); and (2) those who experience the event but are unobserved are removed from observation because of events whose risks have increased after the occurrence of the phenomenon of interest.

A well-known case belonging to this class is the classic sample selection problem, in which an outcome of interest (discrete or continuous) is observed only among a subset of the sample members who differ systematically (from the rest of the sample) on observed and unobserved characteristics (Berk 1983; Fligstein and Wolf 1978; Greene 1981; Heckman 1979; Little 1995; Wooldridge 1995). Some of these characteristics increase the risk of not observing the outcome of interest. Ignorance about the current status of a nonrandom subsample of observations is akin to ignorance about the nature of the outcome of interest in the classic sample selection case. However, in the classic case, the researcher can deploy adjustments using partial information available for all individuals, including those whose outcome the researcher knows nothing about. In contrast, in the current-status problem we address in this article, no such adjustments are possible because the researcher does not have any information about those individuals among whom we cannot observe the event of interest. The current-status problem can be, and indeed has been, formulated using modeling tools characteristic of the classic sample selection problem. But such tools invariably require the rigid formulation of unverified and unverifiable distributional assumptions (usually normality of latent traits), most of which are unsuitable for dealing with disease incidence and mortality (Bloom and Killingsworth 1985; Maddala 1983).

A second class of problems tightly related to current status is, unsurprisingly, the so-called unmeasured heterogeneity problem (Heckman and Singer 1984; Hougaard 2000; Manton and Stallard 1981; Trussell and Richards 1985; Trussell and Rodríguez 1990; Vaupel et al. 1979; Vaupel and Yashin 1985). This arises in the estimation of hazard models from longitudinal data whenever the occurrence of the event of interest is a function of variables that are unmeasured or ignored. The central problem of these models is the estimation of the rate of occurrence of a phenomenon within a sequence of time intervals. The estimated magnitude of the rate for any time interval depends on the composition of the sample of individuals who are exposed to the event at the beginning of the interval. Ordinarily, this subsample does not include individuals who experienced the event before the origin of the time interval. Like the case of current status, we observe the occurrence/non-occurrence of an event (within a time interval) partially (e.g., among only the subsample of individuals who did not experience the event before the time interval). Those who are excluded from the exposed subsample vanish from observation because they possess traits that increase the risk of experiencing the event throughout. But unlike the current-status problem and like the case of sample selection, the researcher may use available partial information about the unobserved individuals and/or invoke plausible assumptions about the makeup of the subset that is not observed and, armed with this information and/or assumptions, proceed to remove totally or partially the biases attributable to partial observations (Heckman and Singer 1984; Hougaard 2000; Trussell and Richards 1985). These adjustments, however, cannot be implemented in the current-status problem that occupies us here because the researcher knows nothing about individuals whose current status cannot be assessed. 
If the researcher is able to collect repeated current-status information over time on an initial sample of individuals, then the situation will resemble and indeed converge (as data collection times grow and become arbitrarily close to each other) to the standard unmeasured heterogeneity problem.

## Monte Carlo Simulation

In this section, we evaluate the proposed adjustment using Monte Carlo simulations of a population stratified by membership in either a low- or high-education group. Our primary interest is in the age-specific prevalence of diabetes and how it varies by education group. We consider a number of scenarios characterized by different levels of education-specific risk heterogeneity implemented via the force of mortality associated with diabetes. Each education group experiences its own mortality differential (between diabetics and nondiabetics), and the various combinations explored in the simulations are presented in Table 1. Four general scenarios are investigated: (a) no mortality differential in the high-education group and an increasing differential in the low-education group; (b) the mortality differential in the high-education group is equal to the differential in the low-education group; (c) a constant mortality differential in the high-education group and an increasing differential in the low-education group; and (d) mortality differentials increasing at the same rate for each education group, but with larger differentials in the low-education group. For each scenario, we simulate a sample of current status data and fit two logistic regression models to estimate the effect of education on the log odds of having diabetes. The first is a naïve model that ignores differential mortality between diabetics and nondiabetics, while the second model includes the adjustment factor proposed earlier.

### Simulated Populations

Consider the population at some time *t* when members range in age from 31 to 100 years and have been exposed to the risk of both dying and becoming diabetic. We choose to start exposure at age 30 so that the youngest cohort of survivors to time *t* (observed at age 31) has been exposed to both risks for one year, whereas the oldest cohort of survivors at time *t* and observed at age 100 has been exposed to both risks for 70 (completed) years. We assume that the sizes of the birth cohorts are unequal and that the initial size of each grows at rates between .001 and .005 per year. With a radix of 1,000, this yields a total of roughly 60,000 and 70,000 individuals in the high- and low-education groups, respectively.^{13}

We assume that the log of the waiting time to developing diabetes follows a logistic distribution with a constant variance and a mean that is higher among those with high education relative to those with low education. In the absence of mortality, the prevalence of diabetes at age 100 is expected to be roughly 20 % among the high-education group and just over 40 % among the low-education group. The resulting logistic regression of the log odds of being diabetic on the log of age and a dummy variable for the low-education group yields a (true) coefficient of 1.00 for the low-education covariate.

The force of mortality follows a Gompertz function with a level parameter that varies by education group. We simulate a number of scenarios to investigate different magnitudes of the mortality differentials between diabetics and nondiabetics, as well as variation in the size of the differentials across education groups (see Table 1). Each of the resulting scenarios is simulated 25 times.^{14} Under risk heterogeneity, once an individual develops diabetes, the risk of mortality increases relative to those without diabetes. As individuals who contract diabetes are exposed to higher mortality, the observed prevalence of diabetes in the group will increase with age less rapidly than it would in the absence of mortality. The age-specific probability of being diabetic by education group is shown in Fig. 2. The curves in this figure show the population prevalence of diabetes in the absence of mortality for the low- and high-education groups. Figure 2 also displays the prevalence of diabetes in the low-education group (depicted by the circles) for a simulated scenario in which the members are exposed to diabetes-specific mortality rates that are roughly 70 % higher than among nondiabetics with low education. This is reflected in departures from the expected values in the absence of mortality. Conversely, no mortality differentials between diabetics and nondiabetics are implemented in the simulation of the high-education group. This is reflected in simulated prevalence rates (depicted by the triangles) that fluctuate randomly around the expected probabilities of ever contracting diabetes (with increasing variance as the cohort sizes decrease with age).
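The mechanism described in this paragraph can be reproduced with a small simulation. The Python sketch below is a simplified stand-in for the article's design: a single cohort, one education group, and hypothetical Gompertz and log-logistic parameters. The function name and every default value are assumptions made for illustration.

```python
import numpy as np


def survivor_prevalence(theta, ages, n=200_000, a=1e-4, b=0.09,
                        alpha=150.0, beta=2.0, start=30.0, seed=0):
    """Prevalence of the disease among survivors at each age in `ages`.

    theta is the factor by which mortality rises after disease onset
    (theta = 1 means no differential mortality).
    """
    rng = np.random.default_rng(seed)
    # Onset age: log-logistic waiting time from `start`, by inverse CDF.
    u = rng.uniform(size=n)
    onset = start + alpha * (u / (1.0 - u)) ** (1.0 / beta)
    onset = np.minimum(onset, 200.0)  # cap extreme draws to avoid overflow
    # Death age: invert the cumulative Gompertz hazard at an Exp(1) draw.
    e = rng.exponential(size=n)
    h_to_onset = (a / b) * (np.exp(b * onset) - np.exp(b * start))
    death_healthy = np.log(np.exp(b * start) + b * e / a) / b
    # After onset the hazard is multiplied by theta.
    e_left = np.maximum(e - h_to_onset, 0.0)
    death_sick = np.log(np.exp(b * onset) + b * e_left / (a * theta)) / b
    death = np.where(e <= h_to_onset, death_healthy, death_sick)
    return np.array([np.mean(onset[death > x] <= x) for x in ages])


ages = [50.0, 70.0, 90.0]
print(survivor_prevalence(1.0, ages))  # tracks the no-mortality curve
print(survivor_prevalence(1.7, ages))  # rises more slowly with age
```

With `theta = 1.0` the prevalence among survivors fluctuates around the onset probability that would be observed without mortality. With `theta` above 1 it falls short of that curve, and the shortfall grows with age.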

### Estimates, Biases, and Adjustments

A common approach to assessing the size of education differentials in diabetes is to fit a logistic model to the log odds of being diabetic regressed on a constant, a dummy variable for the low-education group, some control variables, and the log of age.^{15} In the absence of risk heterogeneity (as manifested through differential mortality) in both education groups, the estimated coefficient of the education dummy variable will reflect (on average) the difference between the two solid lines shown in Fig. 2. If there are mortality differences only among the low-education group, then the estimated effect of the education dummy variable will reflect the difference between the two sets of symbols plotted in Fig. 2. These estimates lead to the erroneous inference that education has no effect on the incidence of diabetes. The bias depends on the size of the mortality differential in the low-education group.

Our primary focus is the estimated coefficient for the dummy variable identifying the low-education group and its variation with the magnitude of the mortality differential in this education group.^{16} The main results are presented in panel (a) of Fig. 3. The axis at the bottom of the plot shows the factor by which mortality among diabetics exceeds that of nondiabetics in the low-education group: it increases from 1 to 2. We assume no differences among the highly educated (as indicated by the axis on the top of the plot, which is fixed at a value of 1). When there are no mortality differences in either education group, the estimated coefficients for the (low) education dummy variable center on 1, the true value in the simulation. As the size of the mortality differential among the low-education group increases, the values of the estimated coefficients decrease in size toward zero, as expected.

Figure 3 also shows results for cases in which there are mortality differentials in *both* education groups. As pointed out earlier, if the magnitudes of the mortality differentials are the same, then there will be no bias in the estimated effect. This is shown in panel (b) of Fig. 3, where there are only small, random fluctuations in the coefficients for each level of the mortality differentials shown. In panel (c) of Fig. 3, the mortality differential is constant across the different scenarios for the high-education group while it increases for the low-education group. When the mortality differential is larger in the high-education group, the estimated effect of education contains an upward bias, and when the differential is larger in the low-education group, bias is downward. The final panel in this figure displays the bias as the sizes of the differentials increase at constant rates for each education group.
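The direction of these biases follows from a simple first-order argument, sketched here in Python. It assumes that diseased mortality is `theta` times nondiseased mortality, that differential mortality shifts the observed log odds by approximately `-(theta - 1)` times an integrated hazard, and that both groups share the same integrated hazard. These are simplifying assumptions for illustration. The function name and numbers are hypothetical.

```python
def naive_education_coefficient(true_coef, theta_low, theta_high, integrated_hazard):
    """Approximate log odds ratio recovered when differential mortality is ignored.

    Each group's observed log odds are shifted by -(theta - 1) * integrated_hazard,
    so the naive coefficient is the true one plus the difference in shifts.
    """
    shift_low = -(theta_low - 1.0) * integrated_hazard
    shift_high = -(theta_high - 1.0) * integrated_hazard
    return true_coef + shift_low - shift_high


# Equal differentials cancel: no bias.
print(naive_education_coefficient(1.0, 1.5, 1.5, 0.5))
# Larger differential in the low-education group: downward bias (about 0.65).
print(naive_education_coefficient(1.0, 1.7, 1.0, 0.5))
# Larger differential in the high-education group: upward bias.
print(naive_education_coefficient(1.0, 1.0, 1.7, 0.5))
```

The shifts cancel only when the two differentials coincide, which matches the pattern across the panels described above.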

The adjustment procedure requires that we use a logistic regression model to estimate the difference in diabetes prevalence across the two education groups. The new model should include an intercept, the log of age, a dummy variable for low education, the integrated hazard of mortality (from a suitable standard population), and its interaction with the education dummy variable. Results of fitting a logistic model to the simulated data are displayed in Fig. 4. The figure shows the bias (true minus estimated values) of the coefficient for the education dummy variable, for different combinations of mortality differentials in each education group, and in the unadjusted and adjusted models. The values plotted in the figure are the mean bias over the 25 simulated data sets.
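As a concrete illustration of the covariate set just listed, the Python sketch below assembles the design matrix for the adjusted model. The function name and the example values are hypothetical, and the integrated hazard is taken as given (computed beforehand from a standard mortality schedule).

```python
import numpy as np


def adjusted_design_matrix(age, low_education, integrated_hazard):
    """Columns: intercept, log(age), education dummy, integrated hazard,
    and the interaction of the dummy with the integrated hazard."""
    age = np.asarray(age, dtype=float)
    d = np.asarray(low_education, dtype=float)
    m = np.asarray(integrated_hazard, dtype=float)
    return np.column_stack([np.ones_like(age), np.log(age), d, m, d * m])


# Three hypothetical respondents.
X = adjusted_design_matrix(age=[40, 60, 80],
                           low_education=[0, 1, 1],
                           integrated_hazard=[0.02, 0.15, 1.2])
print(X.shape)  # (3, 5)
```

The matrix can be passed to any standard logistic regression routine. The naive model corresponds to using only the first three columns, and the intercept column should be dropped if the routine adds its own.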

The unadjusted estimates exhibit the same patterns described earlier: if the differential is the same in each education group, there is no bias, but if the differential is larger in the low-education group, the estimated coefficient for the education dummy variable is downwardly biased. Conversely, if the differential is smaller in the low-education group, then the estimated coefficient will be upwardly biased.

The adjusted estimates are obtained after controlling for the baseline (standard) integrated hazard. In all cases, the integrated hazard associated with individuals aged *x* is evaluated using the conditional mean of the age of onset of diabetes in the *true* incidence curve as the lower bound of integration. Figure 4 displays the bias from the adjusted logistic model under various scenarios defined by the magnitude of risk heterogeneity in each education group. The average bias from the adjusted model forms a flat plane close to zero, and the mean adjusted estimate is within a few percentage points of the true effect. Contrast this with the 40 % (downward) bias of the unadjusted estimate when diabetes-specific mortality in the low-education group is four times higher than nondiabetic-specific mortality and there is no mortality difference among the high-education group.

These simulation results suggest that the proposed adjustment procedure is effective under the conditions used to simulate the data. However, the adjusted estimates are obtained using the true incidence curve to calculate the lower bounds of the integrated hazards that enter as an adjustment factor in the logistic model. How sensitive is the adjustment procedure to misidentification of the distribution used to calculate the mean ages of onset?

To test the sensitivity of the adjustment procedure, we assume five log normal (LN) distribution functions, all shown in panel (a) of Fig. 5, and use these (instead of the true log logistic function) to calculate the integrated hazards. The LN distribution functions cover a wide range of age patterns, with the probability of being diabetic by age 100 (in the absence of mortality) ranging from a low of around .05 to a high of close to 1. The conditional mean ages of onset of diabetes for ages 30–100, the values for the resulting integrated hazard, and the frequency distribution of the 25 estimated coefficients from the adjusted model are shown in panels (b), (c), and (d), respectively. Note that despite large differences in the conditional mean ages of onset and the associated integrated hazards, estimates from the adjusted models are centered on values that are close to the true value of the coefficient (1). The figure reveals that the choice of distribution for the calculation of the integrated hazards matters when one chooses a distribution that is an extreme departure from the underlying one (compare the distribution functions in the first panel). But even an extreme choice does not generate mean errors exceeding 8 %.^{17} This pales when compared with the biases associated with the unadjusted estimates (as large as 40 %). Figure 6 is unequivocal on this point: the figure compares the biases when no adjustment is used, when using each of the five log normal distributions, and when using the mean estimate from these distributions. The biases are large and, in the absence of any prior knowledge, correcting for risk heterogeneity is always a better strategy than not correcting at all.

In summary, there is bias in the estimated effect of education on the log odds of being diabetic when the force of diabetes-specific mortality varies across the education groups. The bias can be reduced by assuming that the log odds of being diabetic are a linear function of the covariates and by including the integrated hazard of nondiabetic mortality as a covariate in the logistic regression model along with an interaction term between the integrated hazard and the dummy variable for education. For individual *i* aged *x*_{i}, the lower limit of integration for this additional covariate is the conditional mean age of diabetes onset given that the individual has diabetes by age *x*_{i}; the upper limit of integration is *x*_{i}. The adjustment procedure yields correct results when the conditional means that serve to calculate adjustment factors are drawn from a distribution function that is similar to the one that underlies the occurrence of the event of interest. And although the results of the adjustment are sensitive to the specification of this distribution, it still performs much better than a naïve approach that ignores mortality differentials and how they differ across education groups.

## Application

We evaluate the adjustment procedure using two data sets on elderly people: one for Mexico, MHAS, and the other for Puerto Rico, PREHCO. Both are panel surveys of elderly populations (aged 50 or older in MHAS and aged 60 or older in PREHCO), and they both consist of two waves, separated by two years (MHAS) and four years (PREHCO). The first waves were fielded in 2000 and 2002 in MHAS and PREHCO, respectively. Both surveys elicited self-reports on diabetes in the first and second waves, and in both cases there is information on interwave mortality. Within the limitations in population panel data of this kind, MHAS and PREHCO provide us with enough information to estimate mortality differences between diabetics and nondiabetics but not enough to estimate the true incidence of diabetes at adult ages.^{18}

In our application, we use observed prevalence data in the first wave (the current-status information on diabetes) to estimate the effect of the education covariate both with and without the adjustment procedure. In addition, we retrieve an estimate of the mortality differential between diabetics and nondiabetics. In the absence of information on the true incidence of diabetes, a comparison of the estimated mortality differential to the observed mortality differential in each education group is the only benchmark we have to judge the performance of the adjustment procedure.

### MHAS

The first column of Table 2 displays estimated effects of a dummy variable for education (*D*), distinguishing low education (*D* = 1) and high education (*D* = 0) on the observed prevalence of diabetes in the first wave.^{19} The second column shows estimates of a logistic model, including two controls for the integrated hazard, one for each education group. As would often happen when there is risk heterogeneity, the unadjusted effect of (log) age is negative and the one for education is close to zero. After the adjustment, the effect of (log) age changes sign and the one for education increases in magnitude (from .032 to .130) but is only marginally significant (at *p* < .05). Thus, even though the variable for education does not attain statistical significance, the estimates change in the direction one would expect. We do not know, of course, what the truth is: it may well be that education has no effect on the incidence of diabetes and that, contrary to our *a priori* expectations, the adjusted estimates simply reflect this. Previous studies have suggested a negative relationship between education and diabetes prevalence (Aguilar-Salinas et al. 2002; Dalstra et al. 2005; Robbins et al. 2001, 2005). The direction of this relationship is subject to change (from positive to negative) as a country develops and progresses through the nutritional transition (McLaren 2007; Monteiro et al. 2004; Popkin 1998). Given the level of development of Mexico and Puerto Rico, we expect a negative association. These results are extended below to generate a test of how well the adjustment procedure performs.

We apply two adjustment factors, one for each education group. Using the observed mortality from the interwave period, we fit Gompertz models for each education group separately and irrespective of diabetes status. We then calculate the integrated hazard using the parameters of the Gompertz model for each education group. Thus, the standard mortality pattern used to calculate integrated hazards is the same within each education group (for diabetics and nondiabetics) but different across education groups.^{20} Recall that the regression coefficients associated with the integrated hazard are estimates of the differences in mortality levels between diabetics and nondiabetics in each education group (e.g., ), where the parameters θ_{z} and are measures of the mortality levels for subgroup *Z*. In our case, these parameters correspond to the logs of the Gompertz constants. Because of the panel nature of MHAS, we can actually calculate these mortality levels directly and compare them with those obtained from the adjusted model. Although it is not a perfect test, this contrast will provide an indication of performance of the adjustment. Among those with low education, the *observed* difference in mortality levels between diabetics and nondiabetics is .59, whereas the *estimated* difference is .43 (minus the value of the regression coefficient). Among those with high education, the observed difference is .60, while the estimated difference is .36. The lack of concordance between estimated and observed values is probably due to departures from the assumption of identical mortality patterns and/or deviations from the Gompertz model. While not perfect, the rather close agreement between observed and estimated values of mortality differentials is reassuring, and we take it as an indication of the suitability of the adjustment.

In summary, the suggested adjustment leads to changes in the estimate of the covariate that go in the expected direction, and although this cannot be interpreted as a true effect, we find confirmatory evidence in the modest differences between estimated and observed mortality differences between diabetics and nondiabetics.
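Because the mortality-level parameters are the logs of the Gompertz constants, a difference between diabetics and nondiabetics can be exponentiated to give a mortality rate ratio. The sketch below does this for the MHAS values reported above. Reading the differences as log rate ratios is an assumption of this sketch, and the variable names are illustrative.

```python
import math

# Differences in log mortality levels (diabetics vs. nondiabetics) for MHAS,
# as reported in the text above.
mhas = {
    "low education": {"observed": 0.59, "estimated": 0.43},
    "high education": {"observed": 0.60, "estimated": 0.36},
}


def implied_rate_ratio(log_difference):
    """Rate ratio implied by a difference in log Gompertz constants (assumed reading)."""
    return math.exp(log_difference)


summary = {}
for group, values in mhas.items():
    summary[group] = {
        "observed_ratio": implied_rate_ratio(values["observed"]),
        "estimated_ratio": implied_rate_ratio(values["estimated"]),
        # Gap between observed and estimated differences on the log scale.
        "shortfall": values["observed"] - values["estimated"],
    }
    print(
        f"{group}: observed ratio {summary[group]['observed_ratio']:.2f}, "
        f"estimated ratio {summary[group]['estimated_ratio']:.2f}, "
        f"log-scale shortfall {summary[group]['shortfall']:.2f}"
    )
```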

### PREHCO

Table 3 displays analogous results for PREHCO. The first column reveals that, unlike the case for Mexico, the effect of education in Puerto Rico is statistically significant even before adjustment and that, as in Mexico, the effect of (log) age is negative. After adjustment (second column) and as expected, the effect of education nearly doubles in size and becomes strongly significant; and, as in Mexico, the effect of (log) age changes sign and becomes positive. The estimates of mortality differences between diabetics and nondiabetics within each education group, however, are more removed from the observed values than was the case with Mexico. Thus, the expected difference in mortality levels between diabetics and nondiabetics is estimated to be .14 among those with low education and .09 among those with high education (Table 3, second column). While their relative magnitudes are as expected (larger among those with low education), the values are too small compared with the observed quantities, .88 and .89, respectively. Thus, the adjusted estimates reinforce an inference that could have been made with unadjusted values but are more fragile than in the case of Mexico because the test comparing estimated and observed mortality differences is less reassuring.

## Summary and Conclusion

By and large, conventional current-status analysis in particular and analyses of prevalence data in general give short shrift to potential errors that arise under risk heterogeneity (for example, when the risk of attrition prior to the time at which individuals' status is assessed depends on the occurrence or nonoccurrence of the event of interest). Through suitable approximations and simulations, we show that even under mild conditions defining the regime of risk heterogeneity, the biases can be substantial and could lead to misleading inferences about the time profile of the underlying risks and/or about the effects of covariates. The adjustment procedure we propose is simple and can be deployed with little effort and with minimal knowledge about the age pattern of the risk of attrition. We show that the adjustment performs much better than the naïve estimate (which assumes no risk heterogeneity). The adjusted estimates are quite robust to the precise function governing the incidence of the event of interest: even large departures from it produce estimates that are much closer to the true values than naïve, unadjusted estimates.

Future research should proceed along three routes. The first is to investigate the asymptotic properties of the estimator suggested here. While these are well understood in the case of a logistic function, they are not so for other equally plausible functional forms. The second is to assess the robustness of the adjustment to an inaccurate rendition of the baseline hazard (e.g., mortality) that censors those who experience the event of interest more than those who do not. The integrated hazard on which the adjustment factor rests cannot be calculated without knowledge of this baseline hazard. This may be unproblematic in the case of adult mortality because what matters in these cases is to identify correctly the curvature of the hazard over the span of ages of interest, not its level. But in other applications, it may not be so clear what the baseline hazard should look like, let alone what its approximate curvature may be within a particular range of ages or durations. The third route of research is to assess the performance of the adjustment in a broader array of empirical cases and to determine the extent to which resulting estimates lead to correct inferences.

## Acknowledgments

An earlier version of this article was presented at the annual meeting of the Population Association of America, Dallas, TX, April 14–17, 2010. This study was supported by grants from the National Institute on Aging (R01 AG016209 and R37 AG025216) and the Fogarty International Center (FIC) training program (5D43TW001586) to the Center for Demography and Ecology (CDE) and the Center for Demography of Health and Aging (CDHA), University of Wisconsin–Madison. CDE is funded by the NICHD Center Grant 5R24HD04783; CDHA is funded by the NIA Center Grant 5P30AG017266.

### Appendix

#### Sensitivity of SMAM to Mortality Differentials by Marital Status

##### Multistate System for First Marriage

To assess the effects of differential mortality, we simulate a three-state system with one absorbing state and three transition rates. All members of a cohort start in the single state and can then move either to the married state or to the absorbing state of death. After an individual is married, she either stays in that state or moves to the absorbing state.
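The three-state system can be illustrated with a small deterministic cohort projection. This is a minimal sketch under simplifying assumptions that are not part of the article: all three transition rates are constant over age, time advances in short steps, and the rate values are hypothetical. It shows that the proportion single among survivors is unaffected by mortality when single and married people die at the same rate, and is lowered when single people face excess mortality.

```python
import math


def proportion_single_among_survivors(marriage_rate, mort_single, mort_married,
                                      duration, dt=0.01):
    """Project a cohort that starts single; return the share of survivors still single.

    States: single, married, dead (absorbing). Rates are constant (an assumption).
    """
    single, married = 1.0, 0.0
    exit_rate = marriage_rate + mort_single          # total exit rate from the single state
    stay_single = math.exp(-exit_rate * dt)          # probability of staying single over one step
    stay_alive_married = math.exp(-mort_married * dt)
    for _ in range(int(round(duration / dt))):
        leavers = single * (1.0 - stay_single)
        newly_married = leavers * marriage_rate / exit_rate  # share of leavers who marry
        married = married * stay_alive_married + newly_married
        single -= leavers
    return single / (single + married)


# Hypothetical rates: same marriage rate, with and without excess mortality of singles.
equal = proportion_single_among_survivors(0.10, 0.02, 0.02, duration=10.0)
excess = proportion_single_among_survivors(0.10, 0.04, 0.02, duration=10.0)
print(f"equal mortality: {equal:.3f}; excess single mortality: {excess:.3f}")
```

With equal mortality the result is close to exp(−0.10 × 10), the value that the marriage rate alone would produce.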

###### Transition Rates for the First-Marriage Process

To model the transition rate from single to married, we use the three-parameter Coale-McNeil first-marriage function (Coale and McNeil 1972). The first-marriage rates are defined by an accelerated failure time model of the following form: , where is the first-marriage rate at exact age *x*_{i}; *G*_{s} is the standard marriage function (Coale and McNeil 1972); *a*_{0} is the minimum age at which a significant number of first marriages take place; σ is a scale parameter reflecting the speed of first marriage once it begins; and *K* is the ultimate proportion of individuals who ever marry, so that 1 – *K* is the proportion who remain single. We use values of *a*_{0} ranging from 10 to 17 in intervals of 1, values of σ ranging from 0.5 to 2.5 in intervals of 0.5, and values of *K* ranging from 0.5 to 1.0 in intervals of 0.1. Altogether, we define 240 first-marriage functions and associated risks.
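The count of 240 first-marriage functions follows from crossing the three parameter ranges. The short sketch below enumerates the grid; the variable names are illustrative.

```python
from itertools import product

a0_values = list(range(10, 18))                          # 10, 11, ..., 17 (8 values)
sigma_values = [0.5 * k for k in range(1, 6)]            # 0.5, 1.0, ..., 2.5 (5 values)
k_values = [round(0.5 + 0.1 * k, 1) for k in range(6)]   # 0.5, 0.6, ..., 1.0 (6 values)

# Every combination of (a0, sigma, K) defines one first-marriage function.
parameter_grid = list(product(a0_values, sigma_values, k_values))
print(len(parameter_grid))  # 8 * 5 * 6 = 240
```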

To model mortality from age 10 to 60, we use Coale-Demeny’s female West mortality model with life expectancies in the range of 40–75 years (Coale et al. 1983). We then define different scenarios according to the size of the mortality difference between single and married people. First, we select a life table for married individuals with a life expectancy at age 10 equal to, say, *e*_{10}. Second, we define a set of life tables for single individuals so that their life expectancies at age 10 range from *e*_{10} to *e*_{10} + 15. For each combination of married and single life tables and each of the 240 first-marriage functions, we calculate SMAM using Hajnal’s standard expression and compare its value with the mean age at marriage associated with the first-marriage function. The difference between the two is due to mortality differences between single and married individuals.
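For reference, Hajnal's expression in its usual textbook form computes SMAM from the proportions single in five-year age groups. The sketch below assumes that form: no marriages before age 15, age 50 as the upper limit, and eight age groups from 15–19 to 50–54. How the calculation was adapted to the age range 10–60 used in the simulations is not specified here, so the function is an illustration rather than the exact computation used.

```python
def hajnal_smam(prop_single):
    """Singulate mean age at marriage from proportions single.

    prop_single: eight proportions never married (as fractions), for the
    age groups 15-19, 20-24, ..., 50-54.
    """
    if len(prop_single) != 8:
        raise ValueError("expected eight age groups, 15-19 through 50-54")
    # Person-years lived single up to age 50 (15 years before age 15, then 15-49).
    person_years_single = 15.0 + 5.0 * sum(prop_single[:7])
    # Proportion still single at exact age 50: average of 45-49 and 50-54.
    single_at_50 = (prop_single[6] + prop_single[7]) / 2.0
    ever_marrying = 1.0 - single_at_50
    if ever_marrying <= 0.0:
        raise ValueError("no one marries by age 50; SMAM is undefined")
    return (person_years_single - 50.0 * single_at_50) / ever_marrying


# Hypothetical cohort: everyone who marries does so at exact age 20,
# and 10% never marry.
print(hajnal_smam([1.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]))
```

In the hypothetical cohort above, the result should equal 20 up to rounding, because the never-married are removed from both the numerator and the denominator.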

##### Derivation of Eq. (4)

## Notes

^{1}

See Weinberg et al. (1993, 1994) for research on the consequences of erroneously assuming stationarity when analyzing retrospective data.

^{2}

Lin et al. (1998) studied the case of differential mortality and proposed a model for current-status data collected in an experimental setting in which all subjects are observed (and the monitoring time depends on the event of interest). We consider a different situation in which current-status data are randomly sampled from a population and differential mortality is more likely to prevent population members who experience the event of interest, relative to those who do not, from surviving to (and thus being observed at) the time of the survey (for individuals of a given age).

^{3}

See Palloni and Thomas (2011) for an additional example concerning trends in the prevalence of disability in the United States.

^{4}

^{5}

In what follows, *risk homogeneity* refers to a situation in which the risk of attrition before (and hence not being observed by) the time of a census or survey is independent of the event being studied. Conversely, risk heterogeneity is a situation in which precensus (survey) attrition occurs differentially among those who do and those who do not experience the event of interest.

^{6}

Again, see Weinberg et al. (1993, 1994) for research on the consequences of erroneously assuming stationarity when analyzing retrospective data.

^{7}

See Goldman (1993) for a simulation study of the roles of marital selection and marital protection in producing mortality differences between the married and single populations.

^{8}

Results of simulated values of SMAM under different conditions are available on request.

^{9}

See Palloni and Thomas (2011) for a derivation of these results from first principles.

^{10}

To move from Eq. (1) to Eq. (2), combine the terms in the second factor involving and note that the terms and combine to form the density function, which yields —that is, the distribution function, when integrated.

^{11}

If functional forms other than the logistic are deemed appropriate, the same conclusions about biases and inferential difficulties apply and only the functional form of the adjustment factor changes.

^{12}

It is also worth noting that , so 1 minus the estimated coefficient for the integrated hazard provides an estimate of the mortality difference between those with and without the disease.

^{13}

We simulate growing birth cohorts simply to mimic real populations. All results apply if all rates of growth are set equal to zero, or if there are no calendar time effects on fertility, mortality, or the incidence of diabetes.

^{14}

Although we started with a large number of simulations, a small number was enough to produce sufficient Monte Carlo variation. As a consequence, we settled on a total of 25 replications.

^{15}

Some researchers (e.g., Smith 2007) prefer to fit probit models to prevalence data to make inferences about incidence. We have not investigated the magnitude of the biases when the researcher estimates a probit rather than a logit model.

^{16}

Recall that in the simulated data, this coefficient has a true value equal to 1.0.

^{17}

In the sensitivity analysis, there is no mortality difference among the high-education group, but in the low-education group, mortality is 4.48 times higher among diabetics compared with nondiabetics. The mean errors are smaller than those presented in Fig. 5 when the mortality differential decreases.

^{18}

Although these panel data can be used to obtain (noisy or error-ridden) estimates of diabetes incidence for any subgroup, we cannot do so before ages 50 (Mexico) or 60 (Puerto Rico). Because diabetes in these countries has a relatively early onset, the set of observed incidence rates is too incomplete to retrieve reliable effects of covariates. Additional information can be obtained online (http://prehco.rcm.upr.edu for PREHCO, and http://www.mhas.pop.upenn.edu/english/home.htm for MHAS).

^{19}

Low education is defined as less than 6 years of schooling, and high education is defined as 6 years or more.

^{20}

This is a refinement that we can introduce only due to the panel nature of the data.

## References

*Handbook of statistics: Advances in survival analysis*