Abstract
This study contributes to the literature on union dissolution by adopting a machine learning (ML) approach, specifically Random Survival Forests (RSF). We used RSF to analyze data on 2,038 married or cohabiting couples who participated in the German Socio-Economic Panel Survey, and found that RSF had considerably better predictive accuracy than conventional regression models. The man's and the woman's life satisfaction and the woman's percentage of housework were the most important predictors of union dissolution; several other variables (e.g., woman's working hours, being married) also showed substantial predictive power. RSF was able to detect complex patterns of association, and some predictors examined in previous studies showed marginal or null predictive power. Finally, while we found that some personality traits were strongly predictive of union dissolution, no interactions between those traits were evident, possibly reflecting assortative mating by personality traits. From a methodological point of view, the study demonstrates the potential benefits of ML techniques for the analysis of union dissolution and for demographic research in general. Key features of ML include the ability to handle a large number of predictors, the automatic detection of nonlinearities and nonadditivities between predictors and the outcome, generally superior predictive accuracy, and robustness against multicollinearity.
Introduction
The literature on union dissolution in Western countries is extensive, and reviews (e.g., Lyngstad and Jalovaara 2010; Mortelmans 2020) highlight how knowledge of the predictors of union dissolution has increased considerably in recent decades. This deepening understanding is consistent with the broader diffusion of the phenomenon itself, but is also due to the emergence of high-quality survey data.
This article offers three contributions to the literature. First, we identify predictors of union dissolution in Germany using a machine learning (ML) approach (Hastie et al. 2009; Molina and Garip 2019)—specifically, Random Survival Forests (RSF). This approach, which does not impose any sharp limits on the number of variables that can be included, allows us to account simultaneously for the numerous predictors of union dissolution highlighted by previous studies while, at the same time, permitting the exploration of nonlinearities and nonadditivities in the links between these predictors and union dissolution. Second, we focus on the role of both partners' personality traits (PTs) in union dissolution. Previous studies have not considered the PTs of both partners at the same time, nor the interactions of PTs with each other and with other predictors. Implementing this kind of analysis would be complicated with a standard regression-based approach; we show that ML offers a different approach that can handle a high number of PT items (five per partner) and potential interactions among them. More specifically, we examine the importance of PTs relative to other variables in predicting union dissolution and whether pairwise combinations of PTs (within or between partners) are associated with union survival probabilities. Third, we assess predictive accuracy, which typically has been overlooked in previous studies on union dissolution. ML is a useful approach here. For instance, conventional regression models appear to have lower out-of-sample predictive accuracy than RSF and other algorithmic approaches. This is an important advantage of ML approaches because, as Hofman et al. (2017) argue, prediction, like explanation, is crucial for theory building and validation. We also show that insights from ML algorithms can be used to better specify standard regression models.
Our study focuses on Germany and uses data from the German Socio-Economic Panel Study (SOEP) to examine union dissolution among individuals who married or began cohabiting between 1984 and 2015. SOEP is particularly useful for the purposes of our study because it includes most of the variables that previous studies employed to predict union dissolution.
In addition to offering abundant survey data, Germany provides an interesting setting for this analysis. Divorce has long been commonplace in Germany; the current duration-specific divorce rate suggests that more than 40% of marriages dissolve (Vignoli et al. 2018). However, the divorce rate, after decades of increases (Wagner et al. 2015), declined from 329 per 1,000 people in 2003 to about 218 per 1,000 in 2018, according to the German Statistical Office. The dissolution rate for cohabiting individuals has been estimated to be more than double that for married individuals (Andersson 2003; Hiekel and Wagner 2020).
From a methodological point of view, our analyses show the relevance of ML techniques in the study of union dissolution and in demographic research more generally. We show that RSF is able to automatically detect nonlinear links between union dissolution and its predictors. We also demonstrate that the method allows the examination of interactions among a large number of independent variables, thus permitting the exploration of complex patterns within relatively small, survey-based data sets. One of the primary advantages of ML methods is that they do not require the researcher to make assumptions about parametric distributions (Molina and Garip 2019). These advantages are critical, as regression-based results in social science research can be influenced by model specification choices (Young 2009) or the particular link function for the linear predictor (Young and Holsteen 2017). Moreover, some demographers may have been reluctant to use ML approaches because of the methods' black-box nature. Our analyses show how to “open the box”—that is, how to gain substantive insights from the RSF algorithm.
Previous Studies on Union Dissolution
Union dissolution occurs worldwide, but with great heterogeneity within and across countries (Emery 2013; Mortelmans 2020; Wagner and Weiß 2006). With the decline in the prevalence of legal marriage and the spread of cohabitation as a stable form of union, the term “union dissolution” has become an umbrella term for all uncoupling processes, irrespective of legal bond (Mortelmans 2020).
The European research tradition in union dissolution came long after U.S.-centered studies and found heterogeneity in dissolution within Europe, corresponding to variation in the pace of diffusion of the phenomenon. In particular, union dissolution has occurred along a strong north–south gradient; rates have been highest in Scandinavia, while unions tend to be more stable in southern Europe, though dissolution rates have increased there as well (Kalmijn 2007).
In reviewing the previous two decades of research on Europe and the United States, Lyngstad and Jalovaara (2010) offered a classification of the predictors of divorce. Here, in summarizing these predictors, we refer mainly to Europe and, when possible, specifically to Germany.
The first group of predictors comprises the personal characteristics of the members of the couple, the most important being age, age at union formation, union duration, education, personality, and subjective well-being. An individual's age is a better predictor of union dissolution than age at union formation (Lutz et al. 1991). In general, age is negatively correlated with the probability of dissolution (Lyngstad and Jalovaara 2010). The effect of education is rather complex (Härkönen and Dronkers 2006; Lyngstad and Jalovaara 2010; van Damme 2020) and appears to follow a pattern linked to the so-called Goode (1962) hypothesis, which states that education's impact varies with the prevalence of union dissolution in the country. Consistent with this hypothesis, the effect of education on union dissolution is negative in areas with high union dissolution, such as the United States and Scandinavia, but positive in countries such as Italy, where the prevalence of union dissolution, though growing, is relatively low (e.g., Härkönen and Dronkers 2006; Matysiak et al. 2014; Salvini and Vignoli 2011; Vignoli and Ferro 2009). Several studies have highlighted the importance of the interaction between partners' education levels. Homogamy tends to protect unions from breaking up (e.g., Kalmijn 1998; Kalmijn 2003; Mäenpää and Jalovaara 2014), although in Germany, Kraft and Neimann (2009) suggested that homogamy does not increase marital stability but rather higher education per se does, and Blossfeld (2014) found that homogamous marriages rank second in stability after women's educationally upward marriages. Moreover, the effect of education needs to be separated from the effects of partners' income and labor market status.
Regarding partners' personality, most of the literature on its influence on union dissolution relies on the “big five” traits (agreeableness, conscientiousness, extraversion, neuroticism, and openness). In recent studies, associations between PTs and union dissolution have been similar in the United Kingdom, Flanders, and Germany. In particular, a low level of conscientiousness and a high level of openness are significant risk factors, although the relationship between openness and dissolution has weakened as the latter becomes more common and less costly socially and economically (Boertien and Mortelmans 2018; Boertien et al. 2015).
Most of the literature on subjective well-being has investigated how happiness, or overall life satisfaction, differs by marital status (e.g., Oswald 1997) and is influenced by union dissolution (e.g., Gardner and Oswald 2006). To the best of our knowledge, no studies have explicitly assessed the role of partners' overall life satisfaction as a predictor of union dissolution. Past research, however, has found that dissatisfaction with the relationship is an important antecedent of union dissolution, especially among women (e.g., Røsand et al. 2014).
A second group of predictors of union dissolution consists of the partners' economic status and their division of paid and unpaid labor. Empirical evidence across the latter part of the past century indicates that, in many countries, employed married women were more likely to divorce than those who were not employed (e.g., Blossfeld and Müller 2002; Chan and Halpin 2002; Cooke 2004, 2006; De Rose 1992; Lyngstad and Jalovaara 2010; Ozcan and Breen 2012; Poortman 2005a, 2005b; see Vignoli et al. 2018 specifically for Germany). This gave rise to the idea of the so-called “independence effect” (Cooke et al. 2013), in which a married woman's working full-time leads to a higher likelihood of divorce, a phenomenon that appears to dominate the income effect—that is, the positive effect on union stability from the fact that the wife's resources also add to the total resources of the family. However, a new strand of studies suggests, instead, that women's employment does not have a negative effect per se, and that women's paid work becomes detrimental to union stability only if men's contribution to unpaid work within the household is limited (e.g., Mencarini and Vignoli 2017). In light of this, gender equality within couples in the sharing of domestic chores becomes an important positive predictor of union stability (e.g., Cooke et al. 2013; Frisco and Williams 2003; Oláh 2001; Oláh and Gahler 2014; Sigle-Rushton 2010; and, specifically for Germany, Bellani and Esping-Andersen 2020; Bellani et al. 2018). Similar effects are found in terms of earning equality, especially for cohabiting couples and for younger married couples (Ishizuka 2018; Jalovaara 2003; Kalmijn et al. 2007).
A third group of predictors includes the couple's characteristics, such as whether they are married (as opposed to cohabiting), their number of children, and union duration. Rates of union dissolution are generally higher for cohabiting than for married couples, a pattern that appears to be independent of the presence of children (e.g., Andersson 2002; Berrington and Diamond 1999; Liefbroer and Dourleijn 2006). However, the difference seems to be explained (at least in part) by self-selection into cohabitation or marriage (e.g., Svarer 2004). Furthermore, whether a couple cohabited prior to marriage seems to have mixed effects on union stability (Lyngstad and Jalovaara 2010). The risk of dissolution among couples who marry after cohabiting is generally lower among those with children, and is especially low during the period following the birth of the first child. However, this lower risk appears to be partially driven by selection, whereby partners who have little trust in the continuity of their union are less likely to have children (Lyngstad and Jalovaara 2010). Finally, union duration is negatively correlated with the probability of dissolution (Lyngstad and Jalovaara 2010).
A last important variable, which is specific to Germany and included in our analysis, refers to the East–West divide. East Germans, other factors being equal, are more likely than West Germans to divorce (e.g., Boertien and Mortelmans 2018).
The factors listed in the foregoing are the predictors most commonly considered in union dissolution studies. However, specific studies, using particular surveys, have found other important characteristics linked to union dissolution, such as biological and genetic characteristics, migration and minority status, values and religiosity, and whether the partners' own parents divorced.
Despite the many studies on union dissolution in Europe (including Germany), the literature certainly presents some gaps. First, the reasons behind the often mixed and heterogeneous results are not well settled and may have been driven by model specification issues (i.e., different parametric specifications). As an example of nonlinear effects, Cooke and Gash (2010) used a categorical measure of wives' employment hours and admitted that if they coded this measure as continuous they would have needed different model functional forms for each country considered: namely, linear for West Germany, quadratic for Great Britain, and third-order polynomial for the United States. Another example, quoted in Lyngstad and Jalovaara's (2010) review, regards the wife's income, which often displays a nonlinear relationship with union dissolution.
Second, studies have tested interactions between specific independent variables. For example, Cooke and Gash (2010) found evidence of nonadditive effects (i.e., interactions) between children's age and women's employment, and Mencarini and Vignoli (2017) reported an interaction between women's employment and men's contribution to housework. However, previous studies using parametric models could not consider all (not even many) interactions between independent variables simultaneously. A clear-cut example of the potential difficulties of considering all variables and their possible interactions concerns personality traits: to account for both partners' PTs (10 variables) and all their two-way interactions (45 variables), one would need to include 55 independent variables, which would be very problematic in a regression model. Thus, despite our increasing understanding of union dissolution, there is still a strong need for in-depth studies to disentangle interwoven and matted complexities.
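The number of terms involved can be enumerated directly. The following Python sketch counts the main effects, all pairwise interactions, and the subset of interactions that cross partners; the trait and partner labels are illustrative names only, not SOEP variable names.

```python
from itertools import combinations

# Illustrative labels only; these are not SOEP variable names.
TRAITS = ["agreeableness", "conscientiousness", "extraversion",
          "neuroticism", "openness"]
PARTNERS = ["woman", "man"]

def pt_terms():
    """Return the main-effect terms and all two-way interaction terms."""
    main = [f"{p}_{t}" for p in PARTNERS for t in TRAITS]
    pairs = list(combinations(main, 2))
    return main, pairs

main, pairs = pt_terms()
# Interactions that pair a trait of one partner with a trait of the other.
between = [(a, b) for a, b in pairs if a.split("_")[0] != b.split("_")[0]]

print(len(main), len(pairs), len(between), len(main) + len(pairs))
# 10 45 25 55
```

Of the 45 pairwise interactions, 25 are between partners and 20 are within the same partner.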
Third, previous studies have focused on explaining findings but have overlooked the importance of outcome predictive accuracy. Addressing these gaps demands that we account for a large set of union dissolution predictors and their interactions, and this may not be feasible with conventional event-history regression models. It is better accomplished using an ML-based approach.
Methods
Our study relies on an ML technique, RSF, because of the advantages it offers, which are outlined below. However, given that RSF is an unconventional approach in the union dissolution literature, we also present results from standard discrete-time event-history logit models (“DT models” henceforth). Our goal is not to directly compare the results from two fundamentally different approaches, but to bring out the different implementation and information content of the new approach. In this regard, it is worth stressing that the conventional DT model and the RSF are completely different approaches by nature. DT models and RSF belong to what Breiman (2001a) refers to, respectively, as data modeling culture and algorithmic modeling culture. Data modeling culture is based on a “representation of the mechanisms by which nature works,” while algorithmic modeling culture “is concerned solely with linking inputs to outputs” (Berk 2016:327). RSF, like similar algorithms, can be treated as a form of regression analysis (Berk 2016). However, there is a key difference between such algorithms and conventional parametric regression models: algorithms do not require a model of nature to be stated, and the estimation target can be considered the best approximation to the true response surface, from which the researcher can get useful information about how the outcome is related to its predictors.
The two statistical cultures have developed largely in parallel. Data modeling culture has focused almost exclusively on explanation and has overlooked the importance of predictive accuracy, while algorithmic approaches have tended to focus on prediction (for further discussion, see Berk 2016; Hofman et al. 2017). Recently, a rapidly growing movement hybridizing the two cultures has emerged, and has asserted the need to combine explanation and prediction (Hofman et al. 2017) and suggested methods for improving the interpretation of findings from algorithmic approaches (e.g., Zhao and Hastie 2021). We follow these developments by recognizing the importance of prediction in substantive studies. Thus, although our main goal is to examine the possibly complex way in which a large set of predictors may be related to union dissolution, we also use an ML technique to improve the predictive accuracy of union dissolution.
Statistical inference with RSF, as with most ML techniques, is an unsolved problem because of these methods' inductive nature and their use of data snooping (see Berk 2016 for a discussion of this issue and for references to recent approaches to statistical inference with random forests). This is another key difference between algorithmic and model-based research, as the latter typically relies on considerations of statistical significance.
In the following sections, we will first briefly discuss some issues concerning conventional DT models used to analyze union dissolution and then focus on RSF. Note again that the inclusion of both approaches is not done with the aim of directly comparing their results. Instead, reporting results from a conventional DT model serves to clarify the advantages offered by RSF in the study of union dissolution.
DT Models: Issues in the Analysis of Union Dissolution
Previous studies on union dissolution have often relied on DT models, which model the probability of experiencing union dissolution during a given interval as a function of a set of independent variables and of time spent in the risk set (duration). Applying parametric models, such as DT models, may prove difficult for our research goals for several reasons. First, as the preceding literature review suggests, our reanalysis of the predictors of union dissolution has to account for a large number of independent variables. For instance, PTs alone account for 10 independent variables (five for each partner). Second, we are interested in understanding whether characteristics of the same individual or of the partner interact in influencing union dissolution. Several studies have identified important nonlinear and nonadditive effects of predictors of union dissolution that should be considered when specifying the model. This implies that an intractable number of interactions should be included in a regression model. Third, several antecedents of union dissolution are correlated (e.g., partners' education, income, and working hours) and consequently their inclusion may give rise to multicollinearity.
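To make the structure of a DT model concrete, the following Python sketch expands couple-level records into one row per couple-year at risk and evaluates a logit hazard with a linear duration term. It is a minimal illustration under assumed inputs: the record layout, the function names, and the coefficient values are hypothetical and are not taken from the SOEP data or from the models estimated here.

```python
import math

def person_period(couples):
    """Expand one record per couple into one row per couple-year at risk.

    Each input record is (couple_id, years_observed, dissolved), where
    `dissolved` says whether the union ended in the last observed year.
    """
    rows = []
    for couple_id, years, dissolved in couples:
        for t in range(1, years + 1):
            event = int(dissolved and t == years)
            rows.append({"couple": couple_id, "duration": t, "event": event})
    return rows

def hazard(duration, x, b0, b_dur, b_x):
    """Discrete-time logit hazard with a linear duration term."""
    eta = b0 + b_dur * duration + b_x * x
    return 1.0 / (1.0 + math.exp(-eta))

def survival(hazards):
    """Probability that the union is still intact after all listed intervals."""
    s = 1.0
    for h in hazards:
        s *= 1.0 - h
    return s

rows = person_period([("a", 3, True), ("b", 2, False)])
print(len(rows), sum(r["event"] for r in rows))  # 5 1

# Hypothetical coefficients, chosen only to show the calculation.
hazards = [hazard(t, x=1.0, b0=-3.0, b_dur=-0.05, b_x=0.4) for t in (1, 2, 3)]
print(round(survival(hazards), 3))
```

Every additional interaction or nonlinear term enters the linear predictor `eta` as a further coefficient, which is why the number of parameters grows quickly in this framework.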
Another limitation of previous studies based on DT models concerns the way model fit is commonly assessed, especially with respect to different choices for the duration, which results in models that differ in their baseline hazard function. Demographers typically choose the best specification of the baseline hazard function according to measures of model fit, like the Akaike information criterion (AIC), for which lower values denote a better fit. Model fit is evaluated using the same sample (in-sample) from which the model is estimated, giving rise to possible issues of overfitting, meaning that prediction quality can be very good in the specific working sample, but not in others (out-of-sample). This, of course, limits external validity (i.e., generalizability to other data sets) (Hofman et al. 2017). We follow the common practice in ML of separating estimation from the evaluation of predictive accuracy (see Berk 2006); that is, DT models will be estimated on a randomly drawn subsample (training set), and predictive accuracy will be assessed on the remaining part of the initial sample (testing set).
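The separation of estimation from evaluation can be sketched as follows. The Python example assumes that the split is made at the couple level, so that all years observed for a couple fall on the same side; this is an assumption of the sketch, and the function name and the default seed are hypothetical.

```python
import random

def split_couples(couple_ids, train_share=0.5, seed=0):
    """Randomly assign couples (not couple-years) to training and test sets."""
    ids = sorted(set(couple_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    cut = int(len(ids) * train_share)
    return set(ids[:cut]), set(ids[cut:])

train, test = split_couples(range(10))
print(len(train), len(test), train & test)  # 5 5 set()
```

A model would then be estimated on rows whose couple identifier is in `train`, and its predictive accuracy computed only on rows whose identifier is in `test`.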
The predictive accuracy of DT models will also be assessed on the basis of the area under the receiver operating characteristic curve (AUC). In the context of survival analyses, time-dependent AUC measures can be calculated to assess how predictive accuracy varies at different time points. An AUC of .5 is no better than random guessing, while an AUC of 1 indicates perfect predictive accuracy. Because AUC does not distinguish between false positives and false negatives, we assume that the costs of incorrect predictions of union survival are the same as those for incorrect predictions of union dissolution.
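The interpretation of the AUC as a ranking measure can be illustrated with a short Python sketch. It computes the plain AUC at a single time horizon as the share of (dissolved, intact) pairs in which the dissolved union received the higher predicted risk. Time-dependent AUC estimators additionally account for censoring, which this simplified version does not; the function name and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Share of (event, non-event) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0 (perfect ranking)
print(auc([0.5, 0.5], [1, 0]))                  # 0.5 (uninformative)
```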
Random Survival Forests
RSF belongs to the category of supervised ML (SML) techniques.1 Generally speaking, ML algorithms can be defined as procedures that give computers the ability to learn without being explicitly programmed (Samuel 1959). SML techniques are a subgroup of ML techniques that consist of iterative algorithms and are focused primarily on prediction problems.
RSF is a modification of random forests for survival data; random forests, in turn, are ensembles of single classification trees (Breiman 1996; Breiman et al. 1984). A classification tree (usually known as a regression tree in the case of a continuous outcome) uses a recursive algorithm to describe the relationship between a set of independent variables and a given outcome. At each step, the algorithm splits the data into subsets. Splits can occur between any two subsets of the observed values of any of the predictors. Among all possible splits, the algorithm selects the one that minimizes the prediction error (for more details, see James et al. 2013; Shu 2020). A classification tree captures nonlinearities in covariates by automatically splitting them into different intervals. Another general feature of trees is the automatic detection of interactions. In this context, a two-way interaction between two variables is found if a split in one variable makes a split in the other either systematically less likely or more likely.
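A single splitting step can be sketched in Python as follows. The example searches the thresholds of one continuous predictor and keeps the one with the lowest weighted Gini impurity. Gini impurity is used here as one common error criterion for classification trees; it is an assumption of the sketch, as are the function names.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(x, y):
    """Return (threshold, weighted impurity) of the best binary split on x."""
    values = sorted(set(x))
    best = (None, gini(list(y)))  # no split: impurity of the parent node
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= thr]
        right = [yi for xi, yi in zip(x, y) if xi > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (thr, score)
    return best

print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))  # (2.5, 0.0)
```

Applying the same search recursively within each resulting subset, and over all predictors rather than one, yields a tree; a nonlinear association shows up as several thresholds on the same variable.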
Random forests are among the most popular and best-performing SML techniques (Breiman 1996, 2001b; Cannas and Arpino 2019; Glaeser et al. 2016). Random forests are “ensemble” algorithms based on many trees (Berk 2006). The prediction obtained with random forests is constructed on the basis of hundreds or thousands of distinct trees that differ from one another because each tree is fit on different data obtained as bootstrap samples of the original data. Additionally, each tree is grown using a random subset of variables at each split. A single prediction is obtained by majority vote across all trees: the class predicted by each tree is recorded, and the overall prediction is the most commonly occurring class (James et al. 2013). By relying on a multitude of trees, random forests reduce problems of overfitting and usually show better prediction quality than single trees (Athey and Imbens 2017). However, while some rare examples of demographic studies using single trees exist (Billari et al. 2006; De Rose and Pallara 1997), we are not aware of any implementation of ensemble SML techniques in demographic analyses, with the exception of studies on mortality (e.g., Bitew et al. 2020).
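The three ingredients just described (bootstrap samples, a random subset of variables at each split, and majority voting) can be sketched in Python. Tree fitting itself is omitted; the function names and the variable labels are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one tree's training data)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_variables(variables, mtry, rng):
    """Pick the random subset of variables considered at one split."""
    return rng.sample(variables, mtry)

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = bootstrap_sample(10, rng)
subset = candidate_variables(["age", "hours", "married", "satisfaction"], 2, rng)
print(len(rows), len(subset))
print(majority_vote([1, 0, 1, 1, 0]))  # 1
```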
We use a modification of random forests, the RSF, which is an ensemble tree–based method for the analysis of right-censored survival data (Hothorn et al. 2006; Hothorn et al. 2004; Ishwaran et al. 2008). More specifically, we use the algorithm implemented in the package randomForestSRC in the open-source environment R (Ishwaran and Kogalur 2014), which builds an ensemble of hazard functions using the log-rank statistic as the default splitting rule.
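For intuition about the splitting rule, the following Python sketch computes a standardized two-group log-rank statistic for one candidate split of a node; a split with a larger absolute value separates the survival experience of the two daughter nodes more strongly. This is a textbook version written for illustration and is not the randomForestSRC implementation, whose internal details may differ; the function and argument names are hypothetical.

```python
import math

def logrank_z(times, events, in_left):
    """Standardized log-rank statistic comparing left vs. right daughter node.

    times   : observed duration of each union
    events  : 1 if the union dissolved at that time, 0 if censored
    in_left : True if the observation falls in the left daughter node
    """
    num, var = 0.0, 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [(g, ti, e) for ti, e, g in zip(times, events, in_left)
                   if ti >= t]
        y = len(at_risk)                                  # at risk, both nodes
        y1 = sum(1 for g, _, _ in at_risk if g)           # at risk, left node
        d = sum(1 for _, ti, e in at_risk if e and ti == t)        # events at t
        d1 = sum(1 for g, ti, e in at_risk if g and e and ti == t)
        num += d1 - d * y1 / y                            # observed - expected
        if y > 1:
            var += d * (y1 / y) * (1 - y1 / y) * (y - d) / (y - 1)
    return num / math.sqrt(var) if var > 0 else 0.0

# Left node dissolves earlier than the right node: positive statistic.
z = logrank_z([1, 2, 3, 4], [1, 1, 1, 1], [True, True, False, False])
print(round(z, 2))
```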
Like other tree-based approaches, RSF has several advantages (James et al. 2013). It can handle automatically (i.e., without need for recoding, grouping, etc.) continuous, nominal, and ordinal independent variables, and can automatically capture nonlinear effects and interactions. Another important attribute of RSF is its ability to use a large number of predictors, even if most of them are correlated.
To assess the performance of the algorithm in predicting survival, we first look at the out-of-bag (OOB) error rate. RSF does not need an independent validation data set to get an unbiased estimate of the test set error, as this is estimated internally during the run of the algorithm. More specifically, each tree in the forest is constructed by bootstrapping a sample from the original data. Sampling with replacement implies that for each tree, a random subset of about a third of the observations (roughly 37% in large samples) is excluded from the calculations; these OOB observations provide an immediate source of test data (Berk 2016:192). Then, each OOB observation in the construction of the kth tree is dropped down the tree, and the algorithm estimates the percentage of times that the class assigned to OOB observations is not equal to the true class. Finally, the total OOB error is obtained as the average of these estimates across all the trees in the forest. As with the DT models, we assess the predictive accuracy of RSF by also estimating AUCs at different time points.
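The share of observations left out of each bootstrap sample follows from a short calculation: the probability that a given observation is never drawn in n draws with replacement is (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The Python sketch below checks this by simulation; the function names, the number of simulated trees, and the seed are arbitrary choices.

```python
import random

def expected_oob_share(n):
    """Probability that one observation is never drawn in n draws."""
    return (1 - 1 / n) ** n

def simulated_oob_share(n, n_trees=200, seed=0):
    """Average share of observations left out across bootstrap samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        drawn = {rng.randrange(n) for _ in range(n)}
        total += (n - len(drawn)) / n
    return total / n_trees

print(round(expected_oob_share(2038), 3))  # 0.368
print(round(simulated_oob_share(500), 2))
```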
A disadvantage of RSF (as with all ensemble methods) is that by combining multiple trees, it does not offer a single tree to visualize and interpret: that is, it is difficult to understand how inputs are related to the output. However, several measures can be calculated to ease interpretation. First, it is possible to obtain a measure of variable importance (Zhao and Hastie 2021). We focus on measuring the importance of a variable by its contribution to predictive accuracy. More specifically, the variable importance measure (VIMP) that we calculate for variable X is the difference between the prediction error obtained when noise is added to X by randomly permuting its values and the prediction error obtained with the original values of X (Breiman 2001b; Ishwaran 2007; Ishwaran et al. 2008). Large VIMP values are linked to variables with some predictive power (since inducing noise in these variables increases prediction errors), whereas zero or negative VIMP values identify variables that are not predictive of the outcome. Setting thresholds to separate highly predictive variables from others using VIMP is difficult (Ishwaran et al. 2011), and distinguishing variables according to their predictive ability is usually done rather arbitrarily by ranking variables based on their VIMP and considering thresholds where there is a large decrease in VIMP values. We complement the ranking of variables based on VIMP with the inspection of partial dependence plots.
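The permutation logic behind VIMP can be sketched in a few lines of Python. The example uses a trivial stand-in predictor and toy data, both hypothetical, in place of a fitted forest. In RSF the calculation is carried out on the OOB observations of each tree, which this simplified version does not reproduce.

```python
import random

def error_rate(predict, X, y):
    """Share of rows whose predicted class differs from the observed class."""
    return sum(predict(row) != yi for row, yi in zip(X, y)) / len(y)

def permutation_vimp(predict, X, y, col, seed=0):
    """Prediction error after permuting column `col` minus the original error."""
    rng = random.Random(seed)
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return error_rate(predict, X_perm, y) - error_rate(predict, X, y)

# Toy data: the outcome depends on column 0 only; column 1 is pure noise.
X = [[i % 2, (i * 7) % 3] for i in range(40)]
y = [row[0] for row in X]
predict = lambda row: row[0]  # stand-in for a fitted forest

print(permutation_vimp(predict, X, y, col=0))  # expected to be positive
print(permutation_vimp(predict, X, y, col=1))  # 0.0: the predictor ignores it
```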
Second, to reveal how each predictor is related to the outcome, one useful solution is to produce partial dependence plots (Friedman 2001; Hastie et al. 2009) that show the relationship between a given predictor and the response averaged over the joint values of the other predictors as they are represented in a tree structure. Similarly, partial dependence co-plots represent how the predicted outcomes vary as a function of the joint distribution of two or three predictors, thus making it possible to visualize potential interactions. Results from RSF can be used to automatically detect pairs of independent variables whose interaction may be worth investigating further using partial dependence co-plots. The approach involves calculating joint VIMPs for all pairs of predictors and has two steps. First, two variables X and Z are paired, and their paired VIMP is calculated in the same way as described earlier for a single variable. Second, the separate VIMPs for X and Z are calculated and their values are added up. Then, conditional on a relatively large univariate VIMP for both X and Z, the larger the absolute difference between the paired VIMP and the additive VIMP, the more potentially interesting the interaction between the two variables is.
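Both tools can be sketched in Python. The first function computes a partial dependence curve by fixing one predictor at each value of a grid and averaging the predictions over the observed values of the other predictors. The second ranks pairs of variables by the absolute difference between their paired VIMP and the sum of their separate VIMPs. The function names, the `min_vimp` cutoff, and the VIMP values in the example are invented for illustration and are not results of the analysis.

```python
def partial_dependence(predict, X, col, grid):
    """Average prediction when column `col` is fixed at each grid value."""
    curve = []
    for v in grid:
        preds = [predict(row[:col] + [v] + row[col + 1:]) for row in X]
        curve.append(sum(preds) / len(preds))
    return curve

def interaction_scores(single_vimp, paired_vimp, min_vimp=0.0):
    """Rank variable pairs by |paired VIMP - sum of single VIMPs|."""
    scores = []
    for (x, z), paired in paired_vimp.items():
        # Only consider pairs in which both variables are predictive alone.
        if single_vimp[x] <= min_vimp or single_vimp[z] <= min_vimp:
            continue
        additive = single_vimp[x] + single_vimp[z]
        scores.append(((x, z), abs(paired - additive)))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Invented VIMP values, for illustration only.
single = {"a": 0.02, "b": 0.03, "c": 0.0}
paired = {("a", "b"): 0.09, ("a", "c"): 0.02}
print(interaction_scores(single, paired))

# Toy predictor: the response is the sum of the two columns.
print(partial_dependence(lambda r: r[0] + r[1], [[0, 1], [0, 3]], 0, [0, 10]))
# [2.0, 12.0]
```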
Data and Variables
We used data from SOEP, a nationally representative ongoing longitudinal study started in 1984. SOEP is well suited for the study of union dissolution for two reasons. First, the length of the study allows us to follow individuals over a long period. Second, it includes information on union dissolution and the main predictors that were identified in past studies.
In order to have information on both partners, we selected, from the original database, women whose partners were also surveyed. In particular, we included women aged 65 years or younger who started their relationship during the observation period (1984 to 2015). The final sample consisted of 18,613 observations, corresponding to 2,038 couples, married or in a cohabiting union, who were observed, on average, for 12 years.
For the dependent variable, we constructed a dummy variable that measured union dissolution and that was equal to 1 when we observed a change in a woman's partner from year t – 1 to t, and to 0 otherwise. After union dissolution, we stopped following both members of the couple, which means that our sample included no more than one union dissolution per individual. During the observation period, 914 couples (45%) split up.
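The construction of the outcome can be sketched in Python as follows. The example assumes that a woman's partner in each survey year is recorded as an identifier, with `None` when no partner is present, and that any change in this identifier counts as a dissolution; the data layout and the function name are hypothetical and do not correspond to SOEP variable names.

```python
def dissolution_flags(partner_by_year):
    """Map {year: partner_id} to {year: 0/1}, stopping at the first dissolution."""
    flags = {}
    years = sorted(partner_by_year)
    for prev, cur in zip(years, years[1:]):
        changed = partner_by_year[cur] != partner_by_year[prev]
        flags[cur] = int(changed)
        if changed:
            break  # the couple is no longer followed after a dissolution
    return flags

history = {2000: "p1", 2001: "p1", 2002: None, 2003: "p2"}
print(dissolution_flags(history))  # {2001: 0, 2002: 1}
```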
On the basis of the literature review, we considered 28 predictors of union dissolution measured for both members of the couple or at the couple level. Table 1 lists all these variables and provides summary statistics. A description of the operationalization of each variable is given in the online Appendix A, while Appendix B provides the reader with the R code for replicating the results of the analysis introduced in the next section.
Results
Discrete-Time Event-History Models
We considered five different specifications for duration in the DT models: linear, quadratic, cubic, step function (dummy variables for each year), and b-spline. For each of these model specifications, Table 2 presents the AIC and AUC at different survival times (one, five, 15, and 25 years), both in-sample and out-of-sample. The former was obtained by using the entire sample of couples; for the latter, we randomly split the original sample into two parts of equal size, which were then separately used as training and test sets, and calculated the measures of predictive performance using the test set. Table 2 shows very clearly that while all models were fairly good at making in-sample predictions, they performed very poorly at out-of-sample predictions. In almost all cases, models were only slightly better than a simple random guess (i.e., AUC = .50).
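The duration specifications differ only in the terms that enter the linear predictor. The Python sketch below builds these terms for the linear, quadratic, cubic, and step-function cases; the b-spline basis is omitted for brevity. Treating the first year as the reference category in the step function is an assumption of the sketch, and the function name is hypothetical.

```python
def duration_terms(t, spec, max_t=None):
    """Return the duration terms for union duration t under a given spec."""
    if spec == "linear":
        return [t]
    if spec == "quadratic":
        return [t, t ** 2]
    if spec == "cubic":
        return [t, t ** 2, t ** 3]
    if spec == "step":
        # One dummy per year; year 1 is the (assumed) reference category.
        return [int(t == k) for k in range(2, max_t + 1)]
    raise ValueError(f"unknown specification: {spec}")

print(duration_terms(3, "cubic"))          # [3, 9, 27]
print(duration_terms(2, "step", max_t=4))  # [1, 0, 0]
```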
Estimates and the statistical significance of all coefficients were very similar across all the DT models. Thus, in Table 3, we focused on estimates from model DT3, which showed the best predictive accuracy. In this model, only six variables had coefficients that were statistically significant. This model indicates that higher life satisfaction for both partners, as well as older age for women, was significantly associated with a lower probability of union dissolution. On the other hand, older age among men, higher openness scores among men, and a higher number of working hours among women were associated with a greater risk of dissolution. That only six independent variables were associated with union dissolution is perhaps due to multicollinearity, which may inflate standard errors. We checked for multicollinearity in model DT3 using several diagnostics available in the R package mctest (see Imdadullah et al. 2016 for details on the diagnostics). Results are given in Tables A.2 and A.3 of the online Appendix C. We first considered six tests of overall multicollinearity. Four of these tests detected multicollinearity issues (Table A.2). Then we considered seven tests for individual multicollinearity for each of the independent variables (Table A.3); three of these tests detected multicollinearity issues for each of the independent variables.
The DT models presented here can be made more flexible by including specific interactions and nonlinear terms, usually introduced to test specific hypotheses. In the following analysis, we took a different approach by implementing an RSF. We asked the algorithm to automatically detect potentially relevant interactions and nonlinearities. This approach does not suffer from the multicollinearity issues that, in our data set, were likely to have affected the statistical significance of coefficients in model DT3.
RSF Algorithms
The RSF is based on a multitude of trees. Each tree is based on an iterative algorithm illustrated in Figure A.1 in the online Appendix C. This illustrative tree demonstrates how classification proceeds at each node.
Assessing the Algorithm's Performance
Before the RSF algorithm is run, three important parameters have to be set: the size of the terminal nodes, the number of variables to be randomly selected at each split, and the number of trees to be grown in the forest. For the first two parameters, we implemented the algorithm developed by Ishwaran and Kogalur (2014), which searches for the combination of parameters that minimizes the OOB error rate. The contour plot in Figure A.2 graphically represents the analysis generated through the algorithm. This shows that the OOB error rate was minimized when the size of terminal nodes was set to one and the number of variables randomly selected at each split was four (see note 2). After setting the values for the first two parameters, we fine-tuned the third (number of trees grown in the forest). Again, the number of trees that are grown can be determined by choosing the value that minimizes the OOB error rate. In our case, the OOB error rate stabilized around 35% when at least 500 trees were employed (see Figure A.3 in the online Appendix C). We chose to use 1,000 trees, which yielded a final OOB error rate of 34%. Note that although it increases computational time, choosing a high number of trees does not create problems in terms of overfitting (Breiman 2001b).
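To make the error rate concrete: in RSF software the error rate is commonly reported as one minus Harrell's concordance index, and the sketch below assumes that this is the measure meant here. It is a simplified version (pairs with tied survival times are ignored), and the function names are hypothetical. For the result to be an OOB error rate, the risk scores passed in must come from trees that did not see the corresponding couple during training.

```python
def concordance_index(times, events, risk):
    """Simplified Harrell's C: among usable pairs, the share in which the
    couple that dissolved earlier was assigned the higher predicted risk.
    events[i] is 1 for an observed dissolution and 0 for censoring."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when i's dissolution is observed
            # strictly before j's last observed time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable

def error_rate(times, events, risk):
    """Error rate as 1 - C (assumed definition; 0.5 is a random guess)."""
    return 1.0 - concordance_index(times, events, risk)
```

Under this assumed definition, an error rate of 34% would correspond to a concordance of about 0.66.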
In Table 4 we report the AUC calculated at one, five, 15, and 25 years. The predictive accuracy of the algorithm fell with duration: it was highest at one year and lowest at 25 years. We note that the out-of-sample predictive accuracy of RSF was considerably higher than that of the DT models, consistent with the usual finding that SML improves prediction. Nonetheless, the predictive accuracy of RSF remained limited, even though the input variables included all the most important predictors of union dissolution identified in the literature.
Variable Importance
Figure 1 shows the VIMP of each variable used in the RSF. Among the variables with the greatest predictive ability, we find the life satisfaction of both partners, woman's percentage of housework, marital status (i.e., married vs. cohabiting), woman's working hours, woman's level of openness, and man's level of extraversion. Variables with the lowest predictive ability include man's and woman's labor income, woman's percentage of labor income, whether the woman is richer than the man, and woman's level of extraversion. Note that the sign of VIMP is not informative of the sign of the association between the predictor and the outcome, and its value is not informative of the strength of the association between a predictor and the outcome.
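One common way to compute a VIMP of this kind is by permutation: shuffle the values of one predictor, recompute the prediction error, and take the increase over the unpermuted error. The sketch below assumes that definition and uses a plain misclassification error with a hypothetical `predict` function standing in for the forest; an actual RSF would apply the idea to OOB data with a survival-specific error measure.

```python
import random

def misclassification(predict, rows, outcomes):
    """Share of observations for which the prediction differs from the outcome."""
    wrong = sum(predict(r) != y for r, y in zip(rows, outcomes))
    return wrong / len(rows)

def permutation_vimp(predict, rows, outcomes, column, seed=0):
    """Increase in prediction error after shuffling one predictor.
    Values near zero (or negative) indicate little predictive contribution."""
    rng = random.Random(seed)
    baseline = misclassification(predict, rows, outcomes)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    # Copy each row, replacing only the permuted column.
    permuted = [dict(r, **{column: v}) for r, v in zip(rows, shuffled)]
    return misclassification(predict, permuted, outcomes) - baseline

# Toy stand-in for a fitted model (hypothetical): predicts dissolution
# when life satisfaction is below 5 and ignores the "noise" variable.
toy_predict = lambda r: int(r["life_sat"] < 5)
```

Because the measure is a difference in error rates, it says how much prediction deteriorates without a variable, not in which direction the variable is associated with the outcome. A variable the model never uses has a VIMP of zero, and small negative values can arise by chance.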
Interpreting the Relationship Between Independent Variables and Union Dissolution
To gain insight into how each independent variable is related to union dissolution, we created partial dependence plots calculated at one and five years for all continuous (Figure 2) and categorical (Figure 3) predictors, ordered by their VIMP. Each point in both figures represents the average percentage of votes for the “yes” class (union dissolution) across all observations, given a fixed level of the predictor. Conceptually, this type of plot is similar to a graphical representation of the predicted survival probabilities as a function of a given variable.
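The averaging behind a partial dependence plot can be sketched in a few lines. This is an illustration under stated assumptions: `predict` is a hypothetical stand-in for the fitted forest's predicted probability at a given horizon, and rows are represented as dictionaries.

```python
def partial_dependence(predict, rows, column, grid):
    """For each grid value, set `column` to that value for every observation,
    predict, and average the predictions across all observations."""
    curve = []
    for value in grid:
        preds = [predict(dict(r, **{column: value})) for r in rows]
        curve.append(sum(preds) / len(preds))
    return curve

# Hypothetical toy model: predicted survival probability rises with
# life satisfaction and falls slightly with weekly working hours.
toy_model = lambda r: 0.5 + 0.04 * r["life_sat"] - 0.002 * r["hours"]
toy_rows = [{"life_sat": 3, "hours": 20}, {"life_sat": 8, "hours": 40}]
curve = partial_dependence(toy_model, toy_rows, "life_sat", [0, 10])
```

All other predictors keep their observed values, so each point on the curve averages over the empirical distribution of the remaining variables rather than fixing them at a reference level.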
Figure 2 illustrates the type and strength of relationship existing between each predictor and union survival probabilities. We found that both partners' levels of overall life satisfaction positively predicted survival: the higher the satisfaction, the higher the survival probability. The variation in survival probabilities, as a function of life satisfaction, was much stronger for couples whose union remained intact after five years. We also note a slight nonlinearity, as survival probabilities peaked when life satisfaction was at eight points and then declined.
From Figure 2 we recognize an important feature of tree-based approaches: their ability to automatically detect complex relationships. For example, survival probabilities varied with both the man's level of extraversion and his age according to nonlinear patterns that might be difficult to model properly with a parametric specification; a quadratic term, for instance, may only approximate the patterns displayed.
Figure 2 is also informative about how strong the variation in predicted survival probabilities was for different values of a given independent variable. We note a rather strong variation for some variables, such as woman's life satisfaction. In this case, we observe a variation in survival probabilities at five years of about four percentage points if we compare couples in which the woman reported the lowest as opposed to a medium-high level of overall life satisfaction (i.e., eight points).
Figure 3 shows the predicted survival probabilities for each value of each categorical variable, highlighting that RSF does not require grouping categories for this type of variable. We note that survival probabilities varied only slightly for the categorical predictors, with the exception of marital status, education, and health of both partners. Married couples had substantively higher survival probabilities than cohabiting couples, at both one and five years. In contrast, after five years, couples in which the woman was more educated than the man were less likely to survive compared with other partners' educational pairings. Finally, survival probabilities, especially at five years' duration, seem to vary nonlinearly for partners' health, with the highest values reported for medium levels of health.
To investigate possible interactions among predictors, we calculated joint VIMP values starting from the seven most important predictors (according to the simple VIMP measure presented in Figure 1). The joint VIMP values are reported in Table A.1 in the online Appendix C. Among the interactions worth exploring was that between the woman's and the man's life satisfaction, which had both the highest difference between paired VIMP and additive VIMP and two of the highest univariate VIMP values.
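The comparison of paired and additive VIMP amounts to simple arithmetic, sketched below. The variable names and numbers used in the usage lines are hypothetical, not values from Table A.1, and the function name `interaction_table` is illustrative.

```python
def interaction_table(univariate, paired):
    """univariate: {variable: VIMP}; paired: {(var_a, var_b): joint VIMP}.
    Returns rows (var_a, var_b, paired, additive, difference), sorted so
    that the largest absolute difference comes first."""
    rows = []
    for (a, b), joint in paired.items():
        additive = univariate[a] + univariate[b]  # sum of the single VIMPs
        rows.append((a, b, joint, additive, joint - additive))
    rows.sort(key=lambda row: abs(row[4]), reverse=True)
    return rows

# Hypothetical values for illustration only.
single = {"sat_woman": 0.020, "sat_man": 0.010, "housework": 0.005}
joint = {("sat_woman", "sat_man"): 0.050, ("sat_woman", "housework"): 0.026}
table = interaction_table(single, joint)
```

A joint VIMP close to the sum of the two single VIMPs suggests that the variables contribute roughly additively; a large gap flags a pair worth inspecting with a co-plot.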
We investigated the pairwise interactions between variables using partial dependence co-plots. In particular, Figures 4, 5, and 6 show the partial dependence co-plots with respect to the couple's survival probability at five years. Figure 4 shows how couples' survival probabilities varied as a function of life satisfaction of both the female and male partners. A complex nonlinear interaction pattern emerged. When man's life satisfaction was high, higher woman's life satisfaction (almost) monotonically increased the union's chances of surviving. But when man's life satisfaction was low, the association between woman's life satisfaction and union survival was negative after a given threshold (about 7.5).
According to the last column of Table A.1 in Appendix C, the interaction between woman's percentage of housework and her working hours was also of potential interest. Figure 5 shows how woman's percentage of housework and her working hours interacted in predicting union survival. The relationship between a woman's percentage of housework and the couple's survival probability was almost linear and positive for low levels of working hours, while it was nonlinear for higher levels of the latter. Unlike in Figure 4, here the distance between the curves varies more along the value of the predictor on the x-axis (i.e., woman's percentage of housework). This suggests the presence of a substantial combined effect (interaction) of the two variables on union survival.
Figure 6 shows that the man's level of life satisfaction interacted slightly with woman's working hours in predicting the likelihood of union dissolution. The survival probability for both women who worked up to 36 hours and those who worked more than 36 hours was nonlinearly related to man's life satisfaction. The distance between the two curves tended to widen slightly as the man's life satisfaction increased, signaling that the latter may strengthen the negative association between number of hours women worked and the union's survival probability. Finally, we notice from Table A.1, as well as from additional partial dependence co-plots (available upon request), that none of the interactions between PTs of the same or different members of the couple appeared to be of substantial importance.
Revising the DT Model Using Insights From the RSF
The DT models estimated in the “Assessing the Algorithm's Performance” section could be modified in several ways. The models were not necessarily similar to those that would be commonly used in the framework of a traditional demographic study, in which the goal usually is to estimate the effects of specific independent variables to test a theory, net of a set of controls. Thus, we could select a smaller number of independent variables, which might reduce multicollinearity, though this is not an issue if prediction is the goal of the study. Also, the models might be made more flexible by allowing, for example, for specific nonlinearities. Finally, to improve accuracy, some variable selection approaches could be adopted. In the previous section we showed that RSF is able to pursue the goal of examining the complex pattern of links between a large set of predictors and union dissolution, while at the same time attaining a higher accuracy.
Here we consider a possible approach to improve the reference DT model shown in Table 3 (i.e., DT3) by leveraging insights from the RSF analysis. First, we excluded the variables showing a negative VIMP in Figure 1 (i.e., labor income of both partners, woman's percentage of labor income, labor income homogamy, woman's unemployment status, woman's extraversion and agreeableness, man's openness, number of children, and the dummy variables for age homogamy and for the woman being either older or richer than the man).
Second, we added a nonlinear term. Specifically, we added a squared term for man's extraversion, which was among the seven most important variables, and whose association with couple survival showed a general nonmonotonic nonlinearity (Figure 2). In particular, when man's level of extraversion was low but increasing, the probability that the couple broke up increased too, while the opposite was true at a medium-to-high level of man's extraversion.
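How a squared term produces this rise-then-fall pattern can be illustrated with a logit hazard. The sketch assumes a logistic link for the discrete-time model and uses made-up coefficients; it is not an estimate from the revised DT3 model.

```python
import math

def hazard(x, intercept, b1, b2):
    """Discrete-time hazard of dissolution under an assumed logit link,
    with a linear and a squared term for the predictor x."""
    eta = intercept + b1 * x + b2 * x * x
    return 1.0 / (1.0 + math.exp(-eta))

def turning_point(b1, b2):
    """Value of x at which the quadratic changes direction: -b1 / (2 * b2)."""
    if b2 == 0:
        raise ValueError("no turning point without a squared term")
    return -b1 / (2.0 * b2)

# Hypothetical coefficients: a positive linear term and a negative squared
# term give a hazard that first rises with extraversion and then falls.
peak = turning_point(0.8, -0.1)
```

With a negative coefficient on the squared term the hazard peaks at the turning point, which matches the pattern described for man's extraversion; a positive coefficient would instead give a U shape.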
The results of this revised version of model DT3 are presented in Tables A.4 and A.5 of the online appendix. We notice a general improvement of the model in terms of AIC (3,519), which is lower than in the original model (3,533). Additionally, the coefficient of the new quadratic term for man's extraversion was statistically significant, signaling the importance of considering nonlinearities for such a predictor. Of course, other nonlinear terms suggested by the partial dependence co-plot could also be added. Finally, the predictive power (out-of-sample) of the model slightly improved for all possible survival times, although it remained smaller than that of the RSF.
Discussion and Conclusion
The aim of this article was to contribute to the literature on union dissolution using an ML technique, and in particular RSF applied to longitudinal SOEP data from Germany.
Results from RSF indicate that the most important predictors of union dissolution are both partners' life satisfaction. Previous studies on union dissolution have considered specific dimensions of life satisfaction, such as satisfaction with relationship quality. Our study points to the potential role of overall life satisfaction as a proximate determinant of union dissolution. Although RSF cannot generally identify causal effects, our findings suggest that future studies on union dissolution could benefit from devoting more attention to general life satisfaction, or to satisfaction with more domains.
Relatedly, RSF also detected an interesting and complex nonlinear interaction between partners' life satisfaction, such that woman's life satisfaction was positively associated with union survival when the male partner's well-being was high. However, when the male partner's life satisfaction was low, a high level of woman's life satisfaction was negatively associated with union survival. As with findings on fertility behaviors (Aassve et al. 2016), this result points to the need to consider both partners' subjective well-being in studying union survival.
Our analyses based on RSF were able to account for both partners' PTs and to explore their interactions. Interestingly, we found that, despite the importance of some specific PTs, none of the interactions between partners' PTs were relevant for predicting union survival. This finding should be analyzed in more depth in future studies, to assess, for example, the extent to which individuals form unions by matching specific combinations of PTs and whether assortative mating based on PTs might explain the lack of interactions.
As for the role of specific PTs, we found that the male partner's extraversion had an impact on predictive accuracy and substantial associations with union survival probabilities. This is in line with the few studies on union dissolution that accounted for PTs (Boertien et al. 2015; Boertien and Mortelmans 2018). Additionally, we found that the impact of the man's extraversion on union survival probability showed a nonmonotonic nonlinearity, which substantially affected the predicted probability of union dissolution.
From a methodological point of view, our analyses also showed that traditional discrete-time event-history models have very poor out-of-sample predictive accuracy, not substantially different from random guessing, while the conventional in-sample values were considerably higher. This indicates that previous studies may have suffered from overfitting. ML may suffer from overfitting as well; however, apart from separating estimation from the evaluation of predictive accuracy, our analyses used RSF, which, being based on a multitude of trees, reduce risks of overfitting.
The RSF offered better predictive accuracy than DT models, confirming past research indicating that algorithmic approaches typically show better performance in this task. This is an important advantage of ML approaches because, as Hofman et al. (2017) argue, predictive accuracy is crucial for theory building and validation. However, the predictive accuracy of RSF remained limited, a finding that is in line with the results of a mass scientific collaboration that found low predictability in life outcomes (Salganik et al. 2020). In other words, union dissolution may be, in part, intrinsically “random” (Hofman et al. 2017), and thus, as in other social science contexts, we have to accept that a portion of phenomena might not be predictable. Additionally, studies have found that selection of variables based on theory or expert judgments does not necessarily improve predictive performance (Filippova et al. 2019). We used a large set of predictors identified in previous studies, but other factors that have not yet been studied might contribute to predicting union dissolution. Future applications of ML techniques could help identify these new predictors.
In conclusion, our results using RSF point to a complex interaction among partners' life satisfaction, and to the absence of interactions among partners' PTs, in predicting union dissolution. They also show which individuals' and couples' characteristics are highly predictive of union dissolution and which, instead, have weak or null predictive power.
A methodological contribution of this study has been to demonstrate the potential of ML techniques for the analysis of union dissolution and for demographic research more generally. We have shown that RSF are able to handle a large set of predictors, which proved very useful in the study of union dissolution. We also illustrated how RSF permitted exploring nonlinearities and nonadditivities in the links between predictors and union dissolution. Moreover, we found that RSF outperformed standard DT models in terms of out-of-sample predictive accuracy. Finally, we have demonstrated how insights from ML techniques can be integrated into more standard regression analyses.
Our implementation of RSF can be easily applied to different data and topics in demography by adapting the R code we provide. Cesare et al. (2018) noted that demographers have used some ML techniques in the analysis of digital trace data, such as for the (semi)automatic coding of tweets (Karamshuk et al. 2017). We showed how demographers might find ML techniques useful in the context of more “traditional” survey data too. Despite the promise of ML, however, our analyses demonstrate that artificial intelligence cannot replace human judgment entirely (Berk 2006; Lichtenthaler 2018). Machine learning is not a purely data-driven approach, just as regression-based research is not purely theory-driven (for further discussion, see Shu 2020). The use of ML techniques is not, as some have suggested, the “end of theory” (Anderson 2008). Key aspects of the definition, the measurement, and the selection of the variables to be analyzed remain with the researcher. Similarly, the interpretation of empirical results, and their contextualization within the broader literature and with respect to the study setting and the historical period, necessitate decisive human inputs. By combining subject-matter expertise with automatic and semiautomatic computational methods (see, e.g., Blei and Smyth 2017), demographers will be able to leverage the benefits of both the human and the machine.
Acknowledgments
The research leading to these results received funding from the European Research Council, under the European ERC Grant Agreement no. StG-313617 (SWELL-FER: Subjective Well-being and Fertility; principal investigator, Letizia Mencarini).
Notes
1. More formal and detailed discussions of ML techniques are available in the books by Berk (2016), Hastie et al. (2009), and James et al. (2013). These offer comprehensive and accessible accounts of the main ML techniques.
2. The algorithm grows a forest of 50 trees for each pair of parameters, and then calculates the OOB error rate associated with that forest. The contour plot graphically shows the OOB error rate obtained from each of these calculations.