## Abstract

In Health Impact Assessment (HIA), or priority-setting for health policy, effects of risk factors (exposures) on health need to be modeled, such as with a Markov model, in which exposure influences mortality and disease incidence rates. Because many risk factors are related to a variety of chronic diseases, these Markov models potentially contain a large number of states (risk factor and disease combinations), providing a challenge both technically (keeping down execution time and memory use) and practically (estimating the model parameters and retaining transparency). To meet this challenge, we propose an approach that combines micro-simulation of the exposure information with macro-simulation of the diseases and survival. This approach allows users to simulate exposure in detail while avoiding the need for large simulated populations because of the relative rareness of chronic disease events. Further efficiency is gained by splitting the disease state space into smaller spaces, each of which contains a cluster of diseases that is independent of the other clusters. The challenge of feasible input data requirements is met by including parameter calculation routines, which use marginal population data to estimate the transitions between states. As an illustration, we present the recently developed model DYNAMO-HIA (DYNAMIC MODEL for Health Impact Assessment) that implements this approach.

## Introduction

In medical demography, demographic techniques are employed to model the consequences of morbidity, disability, and mortality for the size, composition, and structure of the population. Medical demographers focus in particular on how transitions between health states (e.g., presence or absence of disability or specific diseases) change the prevalence of these health states, total life expectancy, and duration in each health state (Barendregt et al. 1998; Crimmins et al. 1994; Manton et al. 2008). Models can be extended to include underlying determinants (risk factors) of incidence or mortality. Such models have found applications both in the field of Health Impact Assessment (HIA) (Cole et al. 2005; McCarthy et al. 2002; Mindell et al. 2008) and in priority-setting for health policy (van Baal et al. 2006; Kok et al. 2009) because they provide a methodology by which to model the effects of interventions or policies on population health. Such modeling proceeds by assuming that policies/interventions alter the prevalence of specific risk factor states in the population under study (e.g., the proportion of smokers) or change the transition rates between risk factor states (e.g., the rate of starting or stopping smoking). Demographic population health models can then be used to project the effects of the new risk factor situation by assuming that the risk-factor status influences disease incidence and mortality rates (see Fig. 1). The effects of this change in the risk factor (and thus of the policy/intervention) can subsequently be calculated by comparing the projected size, composition (especially in terms of health state), and structure of the population under these new conditions with that modeled under “business-as-usual” conditions.

For this type of demographic model, a Markov model is a natural choice. In such models, each person is characterized by a current state (presence or absence of modeled diseases and risk factor level). The model projects the future prevalence of states (state occupancy) by repeatedly applying a matrix of transition probabilities to the vector of current state occupancy. Such a matrix of transition probabilities contains the probabilities of changing from each current state to each possible next state.

In real-life applications, such Markov models often require a large number of states because risk factors, such as smoking and obesity, are related to multiple chronic diseases. States are formed by all possible combinations of disease and risk factor states, and thus increase exponentially with the number of diseases and risk factors included in the model. In addition, the central assumption of a Markov model (the “Markov property”) is that the transition probabilities depend only on the current state and not on previous states. In practice, however, the probability of dying from a disease, for example, can depend on the duration of exposure or time since diagnosis. This constraint to the use of the Markov model can be overcome by extending the number of states—in this example, let *state* include the aspect *exposure time* or *time elapsed since diagnosis*—resulting in a model that satisfies the Markov property and has more states. Furthermore, many risk factors (such as body mass index (BMI), or alcohol consumption) are measured on a continuous scale and most easily modeled in the form of a large number of distinct values, which also increases the number of states.

For use in applied settings, such as HIA or priority-setting, the choice of a particular simulation methodology is constrained by time and resources—in particular, modeling expertise and data (Lhachimi et al. 2010). Hence, designing a model with a large number of states is disadvantageous for a number of reasons. First, such a model may be too slow for routine use. Second, with many states, the number of transitions between states becomes extremely large, with the consequence that it is no longer feasible to obtain observational data on all transition rates individually. Therefore, a system is needed to derive transition rates from a model fitted on more general data. Lastly, despite increased complexity, transparency needs to be maximized.

Users of current modeling approaches either choose to apply in-house, computer-intensive solutions based on micro-simulation (Goldman et al. 2005; Van Meijgaard et al. 2009) or to circumvent these problems by modeling only population average exposure (Ford et al. 2007; Naidoo et al. 1997), using approximations (Hoogenveen et al. 2010) or limiting the scope of the model (Lhachimi et al. 2010).

In this article, we propose a set of solutions that can be used in demographic population health models when a relatively simple and fast running program is needed for use in applied settings.

The first aspect of our approach is that it combines the micro-simulation of risk factor (exposure) information with the macro-simulation of diseases and survival. This characteristic of the model allows simulation of sufficient risk factor detail (e.g., continuous risk factor, or time elapsed since exposure started or stopped), while the need for large simulated populations is avoided. Second, we present a procedure for splitting the disease state space into smaller state spaces, each representing an independent cluster of diseases that can then be updated and inspected separately. This step increases not only the speed of the model but also increases transparency because disease clusters can be inspected separately. Third, the introduction of random noise in the micro-simulation part of the model is restricted to a minimum by applying variance-reduction techniques.

Our solutions are illustrated here with a practical application—namely, the DYNAMO-HIA model—in which these solutions have been implemented. We also describe in detail how data requirements were kept to a minimum. Our solutions have made it possible to construct software that can include multiple diseases and detailed information on risk factor exposure, can be filled with available data, and runs at a reasonable speed during routine calculations.

## Methods for Computation

### Macro-Simulation

In a Markov model, a matrix of transition probabilities is repeatedly applied to a vector of state probabilities (state occupancies). A deterministic or “macro-simulation” approach simply carries out a repeated application of the transition matrix. For a model with N states, this process requires an *N* × *N* transition probability matrix. When there are a large number of states, because of either a large number of diseases or many risk factor states, this approach quickly becomes infeasible. For example, a model of the effects of smoking that includes 12 smoking states (never, current, and 10 classes of former smokers according to time elapsed since they stopped smoking), 15 smoking-related diseases, 90 age categories, and two gender categories has 12 × 2^{15} × 90 × 2 ≈ 70 million states and a transition matrix with 5 × 10^{15} entries. Approximate methods have been developed to handle such Markov models with many states (Hoogenveen et al. 2010), but they are constrained by a number of strict assumptions.

### Micro-Simulation

An alternative solution is to implement the Markov model as a stochastic or “micro-simulation” model. The main characteristic of this approach is that it does not keep track of the expected probability (occupancy) of every single state but instead acknowledges that many of these states are very rare (e.g., having four recognized diseases at the same time). Micro-simulation takes a representative sample of “individuals” from the state-space where the probability that an individual with a particular state is in the sample is proportional to the probability of that state. The sample may contain multiple individuals in the same state (when this state is common), while very rare states are often not included in the sample at all. The model then simulates the life courses of these individuals from the transition probabilities; that is, transition rates are used to draw (using a random number generator) a next state for each simulated person (fixed time-step simulation) or to draw a waiting time to the next change of state (discrete event simulation).

Such a model implements the same Markov model as mentioned earlier, but the manner of calculation is different: while the macro-simulation model calculates the probability of every possible state exactly (wasting time on extremely rare states that do not impact the overall result), the micro-simulation model simulates the distribution over the states in a sample of persons. When the simulation involves a sufficiently large sample, the proportion of simulated persons in a particular state will represent the probability of that state.

### The Partial Micro-Simulation Approach

The macro-simulation approach for a model with *N* states requires the complete *N* × *N* transition rate matrix, while the micro-simulation approach needs the states of the *M* simulated individuals and the *M* transition rate vectors of size *N* at each moment in time. Globally, therefore, micro-simulation will require less computational resources than macro-simulation when *M* < *N*. However, the random drawing of states in the micro-simulation approach will add random variation that needs to be averaged by using a sufficiently large **M**.

In our approach, we acknowledge that a micro-simulation approach seems to be the most suitable for modeling risk factors as risk factors are characterized by a large number of states, either because they are continuous or because of such aspects as time elapsed since starting or stopping the exposure. On the other hand, a macro-simulation approach is more suitable for modeling diseases because incidence rates of many diseases in the general population are low, thereby necessitating large samples for stable results. Therefore, we propose a combination of both approaches: namely, “partial micro-simulation.”

With partial micro-simulation, the risk-factor history for each simulated individual is generated in a manner similar to a fixed time-step micro-simulation: at each moment in time, each individual has a “risk-factor state,” and this state is updated by randomly drawing a new risk-factor state using the transition matrix between risk-factor states. In contrast, the disease part of the model uses macro-simulation: instead of assigning an individual to either the “with disease” state or the “without disease” state, the individual is assigned a probability of having the disease. Similarly, death is not simulated by assigning the “death” or “alive” state to each individual, but instead by assigning a probability of being alive.

In the case of multiple diseases, the “disease probability” aspect consists of a series of probabilities, one for each combination of all separate disease states. For example, in the case of two diseases, A and B, there are four disease probabilities: the probability of having neither disease, only disease A, only disease B, or both diseases.

For users unaccustomed to working with probabilities, this approach can also be viewed as taking a sample of risk-factor histories and then calculating a multi-state life table (Alho and Spencer 2005) separately for each of these risk-factor histories. An essential condition for using this approach is that the risk-factor transition rates are not influenced by the disease states of an individual. For example, the change in BMI cannot depend on the presence or absence of diseases. In our application, where the incidence of any one specific disease depends only on the risk-factor status before the individual actually gets the disease, this restriction is justified.

After the simulation is run, the survival, or the probability of a particular disease state, can be averaged over all simulated individuals, or over a subgroup of simulated individuals (e.g.. only 40-year-old men), yielding the average probability of the disease state or survival in the group under study. Also, at this stage, disability weights can be attached to the disease states, enabling the calculation of disability-adjusted life expectancy (DALE).

### Separate Handling of Disease Clusters

With a large number of diseases, and thus a large number of disease states, the calculation of multi-state life tables is time consuming when this calculation needs to be repeated many times, as in partial micro-simulation. However, as shown in Appendix A, if—conditional on the risk-factor history—transition rates for a disease (or for a cluster of diseases) can be taken to be independent of the transition rates for another disease (or cluster of diseases), then the multi-state life table can be split into a series of independent tables: namely, one per disease (or cluster of diseases). Each, now smaller, table can then be updated independently. This step reduces the simulation time and also increases transparency because instead of one large table with disease state probabilities, there is a series of smaller, more manageable tables. When a cluster contains only a single disease, this “table” contains only the probability of having that one disease. This approach has been proposed by Barendregt et al. (1998) for single diseases, but can be extended to clusters of diseases (see Appendix A).

## Variance-Reduction Methods

We employ three well-established variance-reduction methods to further mitigate the random noise of the micro-simulation approach.

First, the aim of models for HIA and priority-setting for health policy is to compare interventions or scenarios. The variance in this comparison—that is, between two or more scenarios—can be reduced by keeping the random components of the risk-factor exposure equal for the same individual under different scenarios or interventions. This condition ensures that the random component of the risk-factor simulation affects the differences between scenarios only in the second order (Shechter et al. 2006).

Second, variance is introduced when an initial population is generated by using a random generator to draw a risk-factor state for each individual. In our micro-simulation approach, this variance is minimized by assigning the initial risk-factor distribution in a deterministic manner—not by drawing it randomly—thus making the risk factor distribution as equal as possible to the targeted risk-factor distribution.

Third, for categorical risk factors, any discrepancies between the targeted risk-factor distribution and the realized distribution in the simulated population are dealt with by weighting the results. As such, results are obtained that have at the start of the simulation exactly the targeted risk-factor distribution.

### The DYNAMO-HIA Model

As an illustration, we present the recently developed model DYNAMO-HIA (DYNAmic MOdel for Health Impact Assessment), which was developed for HIA but can also be used for priority-setting in the health policy context. DYNAMO-HIA first projects the population under “business as usual” conditions, as well as alternatively under one or more scenarios in which the risk-factor distribution in the population has changed. In its current form, only a single (but generic) risk factor, which can be freely chosen by the user, can be included in the DYNAMO-HIA model. This “single” risk factor, however, may be a combination of risk factors (e.g., drinking smokers, nondrinking smokers, drinking nonsmokers, and nondrinking nonsmokers). As this example demonstrates, this limitation is a feature of the current implementation and can be modified in future applications.

DYNAMO-HIA applies an epidemiological model that defines transition rates by using a limited set of parameters, as well as a module that generates an initial population from marginal population data (Appendix B). After a simulation has been run on a simulated sample of the initial population, the simulation results are scaled up toward the population numbers of the real population in a post-processing step; and summary measures of population health, such as life expectancy or disease-free life expectancy, are calculated. Also in this step, disability weights can be attached to disease states to calculate DALEs (Appendix C).

## Epidemiological Model in DYNAMO-HIA

For running a risk factor–disease Markov model, the user needs a matrix of transition rates (e.g., how many smokers with diabetes die or become smokers with diabetes and heart disease).

In practice, it is neither feasible nor necessary for a user interested in performing a HIA to obtain all these data at this level of detail. Therefore, rather than the data comprising an unrestricted matrix of transition rates, an epidemiological model is applied that defines the transition rates based on a limited set of parameters.

This epidemiological model defines transition rates in continuous time: that is, transitions take place in an infinitely small period of time (Appendix D). The epidemiological model of DYNAMO-HIA parameterizes two transition rates: incidence rates and mortality rates. Remission rates are currently not included.

### Incidence Rates

*i*to having disease

*i*is called the (nonfatal) incidence rate of

*i*and is described by Eq. (1): , where

*r*is a risk factor state;

**C**is a vector of states of the causal diseases—that is, diseases that are a cause of another disease—with elements {

*c*

_{1}, . . . ,

*c*

_{m}} (e.g., for five causal diseases: {0,0,1,0,0});

*I*

_{0,i}is the baseline incidence of disease

*i*—that is, the incidence in the state where all relative risks are equal to 1; is the relative risk of risk factor state

*r*on disease

*i*; and is the relative risk of causal disease

*j*on disease

*i*.

In Eq. (1), the baseline incidence *I*_{0,i} is estimated from the input data (see Appendix B), while the relative risks are input to the program.

### Mortality Rate

*r*and disease state

**D**[the mortality rate

*M*(

*r*,

**D**)] is given by Eq. (2): , where

*M*(

*r*,

**D**) is the mortality rate given risk factor state r and disease state

**D**;

*r*is the risk factor state;

**D**is a vector of states of diseases with elements {

*d*

_{1}, . . . ,

*d*

_{n}};

*Am*

_{i}is a model parameter referred to as attributable mortality, with the restriction that the mortality rate is due exclusively to disease

*i*;

*RR*

_{r→OC}is the relative risk of risk state

*r*for other causes of mortality (i.e., other than from the diseases included in the model); and

*M*

_{0,OC}is the baseline for other-cause mortality.

The dependence of the other-cause mortality on risk-factor status is optional in the model. Without this dependence, the other-cause mortality is the same for all simulated subjects, and the risk factor has only an indirect effect on mortality: namely, only through its effect on disease incidence.

*M*(

*r*,

**D**) given by Eq. (2), as follows in Eq. (3): where where

*c*

_{j}is an element of the vector of causal disease states

**C**; is the baseline fatal incidence rate of disease

*i;*is the relative risk of risk factor state

*r*for disease

*i;*and is the relative risk of causal disease

*j*for disease

*i*.

This mortality rate is intended for use in modeling diseases for which the mortality in the first period after disease onset is much higher than that in later periods. The model assumes that the fatal cases die at or shortly after disease onset, while the nonfatal cases are subjected to a constant attributable mortality over time. Those with nonfatal disease are still at risk of getting a fatal event. This disease process was initially introduced for modeling cardiovascular diseases.

A disease with a cured fraction is basically split up at time of diagnosis into two diseases: a “cured disease” and a “not-cured disease.” Individuals with the latter have an increased mortality rate attributable to the disease (attributable mortality), which is constant over time, while those with the former have an attributable mortality of 0. This disease process was introduced with cancer in mind. It accommodates a mortality pattern where the attributable mortality slowly declines toward 0 with increasing time from diagnosis.

For both the cured and not-cured disease, *M*(*r*,**D**) is given by Eq. (2) setting *Am*_{i} to 0 for the cured disease.

#### Model Input Needed

Some of the parameters needed for this epidemiological model—namely, the relative risks for diseases—should be directly provided by the user of the model, while other parameters are estimated by the DYNAMO-HIA parameter-estimation module from epidemiological population data. These methods are largely adapted from similar procedures used in the RIVM-CDM (Hoogenveen et al. 2010) and are described in Appendix B, together with the method used to generate an initial population from these data.

#### Example Application: Impact of BMI Reduction

As an example, here we calculate the hypothetical scenario in which the current average BMI in the Dutch population (year 2004) is reduced to the values of 1990. As the distribution of BMI is slightly skewed (see Fig. 2), we modeled BMI with a lognormal distribution. After preliminary model selection, we fitted a third-grade polynomial model of age, with log(BMI) as the dependent variable, separately for both genders, including an interaction term with year and assuming constant residual error variance on the log(BMI) scale (Fig. 3).

Transition rates for BMI were chosen so that the average age-specific BMI remained constant over time (thus equal to the 2004 and 1990 distributions, respectively; see Appendix B for details). Diseases included in the model were ischemic heart disease, stroke, diabetes, colorectal cancer, esophageal cancer, and breast cancer. Diabetes was considered to be a causal disease for ischemic heart disease and stroke. Stroke and ischemic heart disease were modeled as diseases that were acutely fatal, and no relative risk for total mortality was used. We used disease data and relative risks from the DYNAMO-HIA tutorial data set.

Figures 4 and 5 illustrate the parameter calculation procedures used. Figure 4 shows the input “incidence of diabetes in the entire population” together with the parameter “baseline incidence of diabetes”: that is, the incidence of diabetes in those with a BMI of 22.5 kg/m^{2}. It can be seen that while at around age 70, the diabetes incidence in the input data is higher in women than in men, this is no longer the case for the baseline incidence. This shows that the higher population incidence of diabetes in women of this age results from their (on-average) higher BMI. The slight discontinuities, visible in the baseline incidence at around ages 60 and 75, are due to the discontinuity of the relative risks used, which were constant within broad age groups (e.g., a single relative risk was used for all ages 60–75).

Figure 5 shows the input excess mortality for diabetes, as well as the attributable mortality calculated from it. This calculation removes from the excess mortality the mortality that is attributable to the higher prevalence of ischemic heart disease and stroke in diabetes patients compared with the general population. At young ages, the effect of this exclusion is small; but at higher ages the attributable mortality is clearly lower than the excess mortality.

Figure 6 shows the DYNAMO-HIA output-window that displays a population pyramid at simulation year 20. In this pyramid, the number of people that have at least one of the six diseases in the model is shown in the darker area in the middle of the pyramid. The reduction in the number of diseased persons (i.e., individuals having at least one of the modeled diseases) relative to the reference (2004) scenario is given by the narrow lighter adjacent area. Similarly, the reduction in the number of people alive is given by (here hardly visible) darker areas on the outside of the pyramid. In this example, it is evident that a reduction in the BMI of the population to the 1990 levels at most ages leads to less mortality (larger population) and to a lower number of diseased persons in the population. A very close look at the figure would reveal that at the highest ages the number of diseased persons is larger under the “return to 1990 BMI-values” scenario (i.e., lower BMI). This result is due to the fact that more people stay alive under the alternative scenario and also that a part of this group has a disease. From the simulated data, one can calculate that the cohort life expectancy of a newborn changes from 77.66 under the “business as usual” scenario to 77.84 under the “return to 1990 values” scenario in men and from 82.04 to 82.27 years in women, while the disease-free life expectancy changes from 70.46 to 70.90 in men and from 73.79 to 74.40 in women. As indicated by Fig. 6, other plots can be made in the DYNAMO user interface to study different aspects of the simulated data. There is also an option to output the simulated numbers for further processing in other programs.

## Discussion

The approach described herein, which has been implemented in the DYNAMO-HIA model, combines the best of existing model approaches with the aim of constructing an efficient Markov model–based disease model. To this end, micro-simulation for risk-factor modeling is combined with macro-simulation for calculating disease prevalence to minimize the running time for a specific accuracy.

This approach, although still flexible, is more limited than a full micro-simulation approach.

In partial micro-simulation, a simulated individual is not assigned a disease state, but only a disease probability. Therefore, modeling situations in which the transition rates depend on time elapsed since disease onset or where risk factor changes depend on the disease state are problematic. In terms of mortality, the first restriction is alleviated in the DYNAMO-HIA model by providing two disease processes (“with cured fraction” and “acutely fatal”) that respectively model the case in which excess mortality attributable to a disease declines exponentially with time since diagnosis and that in which there is a short window of high mortality immediately after diagnosis. For other situations, the full micro-simulation approach might be more appropriate.

The risk-factor histories are simulated by using micro-simulation; and here, transitions can depend on the length of time in a particular risk-factor state.

In partial micro-simulation, each combination of diseases forms a different state. We have avoided the curse of dimensionality by splitting the disease-space into disease clusters that are mutually independent, given the specific risk-factor history. This step assumes that such independent clusters do actually exist. In theory, it is possible that many dependencies between diseases do exist; however, in practice, many of these are weak (e.g., cancer seems to be largely independent of cardiovascular diseases, apart from joint risk factors; see van Baal et al. 2011) and/or data to quantify the amount of dependence are lacking. Hence, modeling disease combinations as independent clusters is the most reasonable option. In order to explore the consequence of unjustly assuming independence, we also performed our example calculation without the correlation among diabetes, stroke, and ischemic heart disease. The years with disease changed by less than 4 %; the differences between scenarios changed by at most, 1 %. Because the dependency among diabetes, stroke, and ischemic heart disease is stronger than that between most other diseases, this result indicates that the assumption of independence will not unduly influence results.

A Markov model projects both future disease states (disease prevalence) and future transition rates (incidence and mortality). Projecting future prevalence rates of disease is relatively rare in chronic disease modeling: many models project only incidence and mortality. Projecting prevalence, however, is necessary to calculate summary measures of population health, such as health expectancy or DALEs. The reason for the scarcity of models that project prevalence is that such a projection requires mutually consistent prevalence, incidence, and excess mortality data to prevent unrealistic projections. Checking data quality and consistency is thus a prerequisite for models which project prevalence. The DISMOD software (Barendregt et al. 2003) has been developed to check such consistency. Although this requirement for consistency complicates the process of collecting input data for the model, it is not restricted to only our model: for example, methods for calculating disability-adjusted life years (DALYs) require disease duration as input, which is based on equivalent DISMOD calculations.

Most of the data used in modeling are subject to uncertainty. Probabilistic (or Monte Carlo) uncertainty analysis (Andronis et al. 2009; Boshuizen and van Baal 2009; Briggs et al. 1994) can be used to estimate the uncertainty in model outcomes from the uncertainty in the input data. To facilitate the use of such methods, the DYNAMO-HIA model can be run in batch mode, making it possible to automate rerunning it many times on data sets generated using such a Monte Carlo approach.

In summary, we believe that partial micro-simulation is a good compromise between flexibility and reasonably short running times. Consequently, the approach has been implemented in the DYNAMO-HIA model for use in health impact assessment.

### Appendix A: Partitioning the Calculation of the Transition Probability Matrix for Clusters of Independent Diseases

In the proposed model, every simulated individual has a transition state probability matrix Π(Δ*t*) that describes the transition rates between all disease states for a time-step Δ*t*.

**Q** is a matrix of dimension (2^{n} + 1) × (2^{n} + 1), where 2^{n} is the number of distinct disease states in the model (*n* being the number of diseases), and death is the last, absorbing, state.

To construct this matrix, only the 2^{n} × 2^{n} upper left matrix, further referred to as **T**, is needed, as the remaining entries can be derived using .

*t*) will be of the form with , as the last row of

**Q**

^{2}(and thus also other powers of

**Q**), will contain only zeros, and its other rows will sum to 0. Entries in last column of

**Q**will influence only the entries in the last column of

**Q**

^{2}(and other powers of

**Q)**, and thus the upper left 2

^{n}× 2

^{n}matrix of Π(Δ

*t*), subsequently referred to as S(Δ

*t*), is equal to the matrix exponential of

**T**Δ

*t*. Given S(Δ

*t*) and the constraints , Π(Δ

*t*) can be derived, and thus the problem of finding Π(Δ

*t*) from

**Q**can be simplified to finding S(Δ

*t*) from

**T**.

As a first step, the calculation of S(Δ*t*) is partitioned into two separate calculations of: (1) the effect of “other-cause mortality” (*OC*), which is mortality that does not depend on the particular disease state; and (2) disease-related transitions. As mortality rates cause transitions only out of the current state (into death, which is not part of the states in **T**), mortality rates will contribute only to the diagonal terms of **T**. As *OC* does not depend on the disease state, all diagonal terms of **T** contain the same term—namely, –*OC*—and the transition rate matrix **T** can thus be written as T^{*}– *OC***I**, where **I** is the 2^{n} × 2^{n} identity matrix, and **T*** is the transition rate matrix without the *OC* terms, containing only terms from incidence, recovery, and disease attributable mortality rates.

**A**and

**B**,

*e*

^{A + B}

**=**

*e*

^{A}

*e*

^{B}. It can easily be seen that

**T***and –

*OC*

**I**are commuting, and thus thereby partitioning the calculation of S(Δ

*t*) in calculating

*e*

^{T*}and

*e*

^{–OC}. The calculation of

*e*

^{T*}can further be partitioned into multiplying separate matrices for mutually independent clusters of diseases. We will show this herein for two clusters of diseases; given that this step is valid for two clusters of diseases, it is also valid for any arbitrary number of clusters.

**C**with elements

*c*

_{k}; and cluster 2, containing the set of disease states

**D**with elements

*d*

_{m}. The overall set of states comprises all combinations

*c*

_{k}

*d*

_{m}. Independence of the clusters of diseases implies the following:

(1) The transition rate between the state of

*c*_{k}*d*_{m}and*c*_{l}*d*_{m}is the same for all*m*. Similarly, the transition between*c*_{k}*d*_{m}and*c*_{k}*d*_{n}is the same for all*k*. In other words, incidence, recovery, and attributable mortality of diseases in a cluster do not depend on the presence or absence of a disease or disease combination in the other cluster.(2) The prevalences of the diseases are independent of those of the diseases in the other cluster: that is, conditional on being alive, the probability of being in state

*c*_{k}of cluster 1 and in state*d*_{m}of cluster 2 is given by

Furthermore, because we are dealing with rates, all transitions should take place in an infinitely small time interval. Only a single transition is possible within this period, and it is therefore not possible to acquire more than one disease; consequently, all transition rates between states that differ by the presence of two (or more) diseases are set at 0. This assumption implies that for both *k* ≠ l and *m* ≠ *n*, the transition rate between *c*_{k}*d*_{m} and *c*_{l}*d*_{n} is 0. We will illustrate our model with an example in which cluster 1 is arbitrary and cluster 2 contain two diseases. From this example, however, we will generalize to many diseases in cluster 2.

**A**

_{1}, is the matrix of the transition rates between the states

*c*

_{k}, where the number of diseases is

*N*, and their mutual dependence is not further specified. The second cluster contains two diseases, and the states

*d*

_{m}are 00 (both diseases absent), 01 (disease 1 present), 10 (disease 2 present), and 11 (both diseases present). The matrix

**A**

_{2}of the transition rates between the states within cluster 2 is then of the form: .

_{ij}is replaced with the corresponding epidemiological parameter: where

(1)

*inc*_{1}is the incidence rate of disease 1 in those without disease 2.(2)

*inc*_{2}is the incidence rate of disease 2 in those without disease 1.(3)

*inc*_{3}is the incidence rate of disease 1 in those with disease 2.(4)

*inc*_{4}is the incidence rate of disease 2 in those with disease 1.(5)

*rec*_{1}is the recovery rate of disease 1 in those without disease 2.(6)

*rec*_{2}is the recovery rate of disease 2 in those without disease 1.(7)

*rec*_{3}is the recovery rate of disease 1 in those with disease 2.(8)

*rec*_{4}is the recovery rate of disease 2 in those with disease 1.(9)

*am*_{1}is the attributable mortality in those with disease 1.(10)

*am*_{2}is the attributable mortality in those with disease 2.(11)

*am*_{3}is the attributable mortality in those with both disease 1 and disease 2.

The overall matrix of transition rates (between all disease states combining both clusters of diseases) is an *N* × 4 matrix, which we can arrange in blocks with similar states of cluster 2 diseases. As transition rates can only be nonzero when there is only a single change between states, cluster 1 transitions are 0 in all off-diagonal blocks (the presence of which already implies a change in cluster 2 diseases), and the block will consist of *λ*_{ij}**I** because *λ*_{ij} is the same for all states *c*_{k} in this block owing to the independence between the clusters. Similarly, the off-diagonal elements of the diagonal blocks will not contain any cluster 2 transitions, while the diagonals of these blocks will be equal to the corresponding entries of **A**_{1}. Transitions involving both clusters are possible only on the diagonal of the matrix, and these comprise the diagonal elements of **A**_{1} and *λ*_{ii}**I**. Therefore, the resulting matrix can be written as

This matrix can be written as the sum of two matrices, each containing only transition rates from one of the two clusters:

**A**

_{1}and

**A**

_{2}, with dimensions

*N*and

*M*, respectively: where ⊕ represents the Kronecker sum, and thus

*t*

_{0}) are mutually independent, the prevalence rates will remain independent during simulation. By “independence of prevalence,” we mean that where

*c*

_{k}and

*d*

_{m}are specific disease states within disease clusters 1 and 2, respectively.

The symbol **p**_{1}(*t*) is used here for the vector of probabilities Pr(*c*_{k} | *alive*) (the vector of prevalence rates of cluster 1 disease states); **p**(*t*) is similarly used for the vector Pr(*c*_{k} ∩ *d*_{m} | *alive*).

**P**(

*t*

_{0}) is equal to

**p**(

*t*

_{0}) and can thus be written as where

**P**

_{1}(

*t*

_{0}) is the vector of state c

_{k}probabilities for the first cluster of disease, and

**P**

_{2}(

*t*

_{0}) is the vector of state

*d*

_{m}probabilities for the second cluster of disease.

**P**(

*t*

_{1}) with

*t*

_{1}=

*t*

_{0}+ Δ

*t*is then given by where P

_{2}(

*t*

_{1}) is introduced as shorthand for and can be interpreted as the probabilities of states in the second cluster of diseases at

*t*=

*t*

_{1}in the hypothetical case that only mortality from this cluster would occur. To calculate

**p**(

*t*

_{1}), we need to divide

**P**(

*t*

_{1}) by the survival at time

*t*

_{1}, which is equal to where

*p*

_{x,k}(

*t*

_{1}) is the

*k*th element of vector

**P**

_{x}(

*t*

_{1}).

In calculating surv_{1} and surv_{2}, we do not have the state “death” as part of **P**, so that the columns of the transition matrixes **A**_{1} and **A**_{2} do not add up to 1, but to something smaller, implying that with every update, the size of **P**_{1}(*t*) and **P**_{2}(*t*) will diminish because of death related to the cluster diseases.

Dividing **P**(*t*_{1}) by the survival at time *t*_{1} yields

Therefore, to run the model, one does not need to store all *N* × *M* disease states in *P*(*t*); it is sufficient to store the disease states *p*_{1}(*t*) and *p*_{2}(*t*) (*N* + *M* states), as well as the overall survival. Similarly, one does not need to exponentiate the (*N* × *M*) × (*N* × *M*) disease state matrix **T**, but can update *p*_{1}(*t*) and *p*_{2}(*t*) independently with the smaller matrixes **A**_{1} (*N* × *N*) and **A**_{2} (*M* × *M*).

### Appendix B: Estimation of Initial State Occupancy and Model Parameters From Marginal Data

This appendix describes how the DYNAMO-HIA model uses marginal data on risk factor and disease prevalence, incidence, and mortality to estimate the initial occupancy of states, and the parameters for the transition rates between states. The methods used are largely taken from similar methods used in the RIVM-CDM (Hoogenveen et al. 2010). All calculations described herein are performed separately by age and gender.

Initial Occupancy of States

The conversion of marginal distributions of the risk factor and of the diseases into a joint distribution is based on the central assumption that the prevalence odds ratio of having a disease, conditional on all other aspects of the state (i.e., risk factor status or the presence/absence of other diseases) is equal to the relative risk of the incidence of that disease. This assumption is considered to be justified because under certain conditions, the ratio of the prevalence odds of a disease equals the ratio of incidences (Alho 1992; Neogi and Zhang 2006). The assumption is motivated by a similar assumption used in the RIVM-CDM (Hoogenveen et al. 2010) where, however, this assumption is made for the ratio of the prevalence.

*r*(we further refer to these as “independent diseases”), we solve the baseline odds of disease

*i*iteratively from where Pr(

*d*

_{i}= 1) is the (marginal) prevalence of disease

*i*(input to the model); Pr(

*R*=

*r*) is the (marginal) probability of the risk factor state

*r*(input to the model); and is the relative risk of risk factor state

*r*on the incidence of disease

*i*(input to the model).

With these baseline odds, we can then calculate the joint prevalence of risk factor states and independent diseases, using the relative risks. If we then define a new risk factor state, *r**, as the joint risk factor–independent disease state, we can repeat the procedure for diseases that depend on other diseases (dependent diseases) using *r**. Although in theory, the process could be repeated, adding extra layers of new dependent diseases in each step, the current implementation of the DYNAMO-HIA model is restricted to a single layer of dependent diseases.

The joint distribution of all risk factor and disease states is used to generate an initial population for the simulation. As such, the first step is to assign risk factor states until the number of simulated persons assigned almost (or completely) reaches the expected number under the intended distribution; subsequently, the remaining simulated persons are randomly assigned a risk factor value. Given the assigned risk factor, the disease state probability is then calculated based on Eq. (B1).

Transition Rates

An overview of the input needed in the DYNAMO-HIA model and of the parameters that govern the transition rates in the DYNAMO-HIA is given in Tables 1 and 2, respectively. The parameters and need to be provided directly by the user, while λ_{i→j}(Δ*t*) and ∂(Δ*t*) can either be provided directly by the user or can be calculated by the program such that the age-specific prevalence stays constant over time (so-called net transition rates). All other parameters need to be estimated from the input.

*Transitions Between Risk Factor States*

Transition probabilities for risk factors can either be provided directly by the user, or the program can estimate the minimal transitions needed to allow the age-specific prevalence to remain constant over time (net transitions). This method has been described elsewhere for categorical risk factors (Van de Kassteele et al. 2011).

*L*at age

*a*+ Δ

*t*is given by where ε is a random number with distribution

*N*(0,1).

For net transitions, ∂(Δ*t* = 1) is simply the difference of the average risk factor level *L* in the population at age *a* + 1 and that at age *a*: . σ(Δ*t* = 1) is similarly estimated from the increase in variance in the population with age. If the variance decreases with age, σ(Δ*t* = 1) is set at 0. σ(Δ*t* = 1) is always calculated by the program. Relative risks on all-cause mortality are used to adjust these parameters for selective mortality.

*Estimation of the Baseline Incidence* (*I*_{0,i})

*I*

_{0,i}from the marginal incidence rate as follows: where

*FF*

_{i}is the fatal fraction of disease

*i*(user supplied);

**C**is a vector of states of the causal diseases (e.g., for five causal diseases: [0,0,1,0,0]), with elements {

*c*

_{1}, . . . ,

*c*

_{m}}; χ is the set of all possible vectors

**C**(all possible combinations of presence/absence of causal diseases); and

*I*

_{i}is the (marginal) incidence rate of disease

*i*(user-supplied).

The overall incidence *I*_{i} and the fatal fraction *FF*_{i} are directly provided by the user. For diseases that are not directly fatal, the fatal fraction is 0.

*Estimation of the Attributable Mortality (Am _{i}), Baseline Other-Cause Mortality (M_{0,OC}), and Relative Risks for Other-Cause Mortality (RR_{r→OC})*

Here, there are multiple unknowns: namely, the attributable mortalities *Am*_{i} for each disease, the relative risks *RR*_{r→OC} for risk factor value *r* on other-cause mortality, and the baseline other-cause mortality *M*_{0,OC}. The marginal population data available for estimating these parameters are the all-cause mortality *M*, the prevalence of disease *i* [Pr(*d*_{i} = 1)], the relative risks of each risk factor state on all-cause mortality *RR*_{r→M}, and the excess mortality rates *E*_{i}*.*

The excess mortality for disease *i* is defined as the observed difference between mortality in the group with the disease and that in the group without the disease. This is not equal to the attributable mortality *Am*_{i} because excess mortality includes mortality that results from more comorbidity and a higher exposure to the risk factor in those with the disease, while attributable mortality does not include those effects.

*RR*

_{r→M}and

*M*are used to calculate

*M*(

*R*=

*r*), the all-cause mortality in risk factor state

*r*. Second, for each disease, the excess mortality

*E*

_{i}, the prevalence of disease

*i*and the total all-cause mortality

*M*are used to calculate the all-cause mortality in those with disease

*i*,

*M*(

*d*

_{i}= 1):

*M*(

*d*

_{i}= 1) and

*M*(

*R*=

*r*) in terms of the parameters that have to be estimated: where with

*CF*

_{j}(

*d*

_{i}

*=*1) as the incidence of an acutely fatal disease

*j*, given that disease

*i*is present;

*CF*

_{i}(

*R*=

*r*) as the incidence of the acutely fatal disease

*i*, given risk factor state

*r*;

**C**as the vector of causal disease states with elements

*c*

_{k}; and χ as the set of all possible vectors

**C**(all possible combinations of causal disease states). Using the joint distribution of all risk factors and disease states as estimated previously when defining the initial occupancy of states, all conditional probabilities (such as Pr(

*d*

_{j}= 1 |

*d*

_{i}= 1), Pr(

*d*

_{i}= 1 |

*R*=

*r*),

*P*(χ =

**C**|

*R*=

*r*) and

*P*(χ =

**C**,

*R*=

*r*|

*d*

_{i}= 1)) are known. Given these known terms, with

*L*diseases and

*N*risk factor states, we then have a linear system of

*L*+

*N*equations with

*L*+

*N*unknowns (

*L Am*-terms,

*N*– 1

*RR*

_{r→OC}–terms, as

*RR*

_{r→OC}is set to 1 for the reference group, and

*M*

_{0,OC}), which can be solved using standard linear algebra.

*RR*

_{r→OC}is optional; if this option is not used, risk factors influence only mortality through causing more disease. In that case, there are only

*L*+ 1 equations to solve:

*Estimation of the Baseline Fatal Incidence Rate ()*

Again, the joint distribution of all risk factors and disease states as estimated previously when defining the initial occupancy of states is used to provide *P*(χ = **C**, *R* = *r* | *d*_{i} = 1).

### Appendix C: Definition of Disability Weights in DYNAMO-HIA

*Q*attached to a particular disease state

**D**is calculated as where

*Q*

_{0}is the disability weight for a person without any of the diseases in the model, and

*Q*

_{i}is the disability weight attributable to the disease (supplied by the user).The disability weight

*Q*

_{0}for a person without any diseases is calculated from the overall disability weight for the entire population

*Q*as where is the set of all possible disease states

**D**, and

*d*

_{i}is the state of disease

*i*in

**D**.

As in Appendix B, all these calculations are carried out separately for age and gender, and is estimated in the same way as described in the section on defining the initial occupancy of states.

### Appendix D: Justification of the Numerical Methods Used

Fixed time-step simulation implies that a process in continuous time is simulated in discrete time, and thus transition probabilities (from the states at the beginning of the time step to the states at the end of the time step) have to be calculated from the continuous time transition rates. Several approaches are currently being used for this calculation in disease models.

Many models, such as the RIVM-CDM (Hoogenveen et al. 2010), use the Euler forward method, which basically ignores the difference between rates and probabilities. This approach is defendable if the time steps or rates are small or if the uncertainty associated with the rates (because of data uncertainty) is much larger than the error from this approximation. Other models (Lauer et al. 2003) use more sophisticated numerical methods of forward integration. Analytical solutions are available when transition rates **Q**(Δ*t*) are assumed to be constant within the time step Δ*t*, as the matrix of transition probabilities Π(Δ*t*) in this case is equal to the matrix exponential of Δ*t***Q**(Δ*t*) (see Appendix A). However, for all but the most simple cases, it is more convenient to compute the matrix exponential using numerical methods (Utley et al. 2003). Lastly, if transitions are in one direction only, constant within the time step, and they depend only on the disease state at the beginning of the time step, the probability of going from state A to state B sometime during the time step can be calculated as 1 – *e*^{−αt}, where α is the transition rate from A to B. For transitions to absorbing states, these probabilities are the probabilities that are needed. For other states, the probabilities can be calculated from these probabilities to absorbing states using probability calculus. This method is often used in multi-state life table models and, to the best of our knowledge, in the UKPDS model (Clarke et al. 2004).

The first two methods are general methods, and the last two use the structure of the problem at hand (a system of first-order linear differential equations) and the additional assumption of constant rates within the time step. Intuition suggests that tailored methods may be more efficient than the more general methods. However, solving a matrix exponential can also be time consuming because time increases with the square of the dimension of the matrix. Also, there is an abundance of numerical problems (Moler and Van Loan 1978). Gallivan et al. (2007) published an algorithm specifically tuned to the case of Markov disease models. Time requirements can be further reduced by splitting up the matrix into separate matrices for independent clusters of diseases (see Appendix A). DYNAMO-HIA uses the third method, using the algorithm of Gallivan et al. (2007) for calculating the matrix exponential for clusters of more than one disease, and the exact solution for single diseases (both “simple” and with a cured state).