Responses to survey questions about abortion are affected by a wide range of factors, including stigma, fear, and cultural norms. However, we know little about how interviewers might affect responses to survey questions on abortion. The aim of this study is to assess how interviewers affect the probability of women reporting abortions in nationally representative household surveys, specifically the Demographic and Health Surveys (DHS). We use cross-classified random intercepts at the level of the interviewer and the sampling cluster in a Bayesian framework to analyze the impact of interviewers on the probability of reporting abortions in 22 DHS conducted worldwide. Household surveys are the only available data we can use to study the determinants and pathways of abortion in detail and in a representative manner. Our analyses are motivated by the need to improve our understanding of the reliability of these data. Results show an interviewer effect accounting for between 0.2% and 50% of the variance in the odds of a woman reporting ever having had an abortion, after women's demographic characteristics are controlled for. In contrast, sampling cluster effects are much lower in magnitude. Our findings suggest the need for additional effort in assessing the causes of abortion underreporting in household surveys, including interviewers' skills and characteristics. This study also has important implications for improving the collection of other sensitive demographic data (e.g., gender-based violence and sexual health). Data quality of responses to sensitive questions could be improved with more attention to interviewers, including their recruitment, training, and characteristics. Future analyses will need to account for the role of the interviewer to more fully understand possible data biases.
Abortion data, particularly in low- and middle-income countries (LMICs), are of variable quality and coverage (Chae et al. 2017; Sedgh and Keogh 2019; Sedgh et al. 2012) and often contain little information about the characteristics of women who have induced abortions. Induced abortion is consistently underreported in nationally representative surveys, including in contexts where abortion is legally available (Jones and Forrest 1992; Jones and Kost 2007). Nationally representative household surveys have been used to describe the characteristics of women who have abortions, but abortion self-reports are unreliable. This results in the use of alternative, indirect estimation methods (abortion incidence complications method, best friend approach, confidant method, list experiment) (Sedgh and Keogh 2019), but these are not always based on nationally representative data or able to provide direct and reliable information on the individual- and household-level characteristics of women having abortions (Jones and Kost 2007; Sedgh et al. 2012). Although self-reported abortion data are known to be unreliable, they are at present our main source of nationally representative data on the characteristics of individuals who have sought abortion in Africa, Latin America, and Eastern Europe.
Survey item response on any topic is affected by a range of factors, including mode of interview (e.g., face-to-face interview vs. online interface), question wording and order, language used, presence of others during the interview, and interviewer's attitudes. Survey questions might elicit either a nonresponse or a response that may or may not be valid or reliable. Survey questions on sensitive topics are likely to elicit higher nonresponse rates or larger measurement error than nonsensitive questions (Tourangeau and Yan 2007). There is no standardized definition of a sensitive question, and Tourangeau et al. (2000) suggested that the concept of a sensitive question has three distinct meanings in the survey literature: intrusiveness, the threat of disclosure, and social undesirability. Typical sensitive topics, depending on context, might include drug use, sexual behaviors, voting, and income. Survey questions on abortion—particularly those relating to personal experience or behaviors (as opposed to opinions)—are considered sensitive in nearly every context (Tourangeau et al. 2000). Reluctance to respond may be due to either the stigma of reporting an abortion or the fear of repercussions for the respondent or someone close to them (Tourangeau and Yan 2007), particularly in settings where abortion is highly restricted or stigmatized. The interviewer effect refers to the influence that an interviewer's skills, beliefs, or personal characteristics have on respondents' responses to survey questions. Sensitive topics, including but not limited to questions on abortion, are likely to be particularly prone to interviewer effects.
Although the literature on stigma and abortion is growing (Levandowski et al. 2012; Lindberg and Scott 2018), we know very little about interviewer effects in the context of nationally representative household survey data in LMICs and nothing about the impact of interviewer effects on abortion survey data in LMICs. Current global evidence around abortion questions concentrates on interview mode, finding that it has a considerable impact on response quality. Face-to-face (FTF) interviews are more prone to underreporting of sensitive issues (Mensch et al. 2008). Evidence on abortion-related data collection from both high-income countries (HICs) and LMICs shows drawbacks and benefits for a wide range of methods (Rossier 2003). Self-administered survey modes, such as audio computer-assisted self-interviewing (ACASI), are usually more reliable in reporting confidential information (Jones and Forrest 1992; Lindberg and Scott 2018; Rossier 2003; Sedgh and Keogh 2019; Sedgh et al. 2012). However, they are less suitable when respondents have no education or a low level of education. Although ACASI might make people feel more at ease in answering questions on abortion, thereby lowering the chance of underreporting (Hewitt 2002; Mensch et al. 2008; Sedgh and Keogh 2019), there is limited and context-specific evidence on the impact of ACASI with respect to abortion. The limited evidence shows that the interviewer effect remains whether the questionnaire is filled out by the interviewer or by the interviewee and that it is the fear of disclosure that matters (Tourangeau and Yan 2007). Research from Malawi showed mixed results when comparing FTF with ACASI (Mensch et al. 2008). Particularly in low-resource settings with low levels of education, FTF surveys remain a crucial survey mode, meaning that the potential impact of the interviewer on responses needs to be better understood.
Qualitative data have shown that interviewers can influence responses in many ways, including making decisions about who meets inclusion criteria for the survey, their power of persuasion in eliciting a response, and their relative socioeconomic position compared with the respondent (Biruk 2018; Randall et al. 2013). Quantitative evidence on interviewer effects relates to survey questions other than abortion. Weinreb and Sana (2009) used the 1998 Kenya DHS to analyze the effect of the interviewer's translation of the questionnaire, included random effects for the interviewer and district, and showed a clear interviewer effect in relation to questions on HIV and pregnancy. However, a recent study showed very little interviewer impact on nonresponse related to contraceptive use questions in the Philippines and Indonesia (Amos 2018). Quantitative studies have described interviewer effects in both HICs and LMICs (Becker et al. 1995; Bignami-Van Assche et al. 2003; Couper and Groves 1992; Flores-Macias and Lawson 2008). A study from the United States showed that survey respondents interviewed by more experienced interviewers were more likely to agree or strongly agree with attitude questions, regardless of the question (Olson and Bilgen 2011). In a study in Kenya and Malawi, Bignami-Van Assche et al. (2003:60) concluded that questionnaire translation is less important than “the selection, training, and supervision of interviewers,” underlining the importance of interviewers for collecting high-quality household survey data.
Interviewer effects are often deemed to be due to interviewers' different propensity for reproducing the influence of community-level stigma and cultural norms within the interview interaction (Couper and Groves 1992; Hox and De Leeuw 2002; Randall et al. 2013). Community effects are substantively but also methodologically important. Because interviewers operate within a given geographical area, including interviewer effects on their own could be picking up community effects if these were not separately included. Community factors must be accounted for to ensure that the interviewer effect is not simply a confounding factor for community effects. In addition, community effects may vary because of geographic differences in both underreporting and the true incidence of abortion.
There is little evidence on how community effects impact responses to sensitive questions. Studies on access to maternal and child healthcare that included community random effects or community-level variables showed that community effects are important (Gabrysch and Campbell 2009; Mahmud Khan et al. 2005). Most studies have used survey sampling clusters as a proxy for community effects and often use the terms “cluster” and “community” interchangeably (Koenig et al. 2003). In this article, we use “sampling cluster” to refer to our methods and describe results, and we use “community” to situate our findings against the literature and interpret results substantively.
Community effects partly reflect the clustering of norms around service use or around who makes decisions about healthcare access in the household. In addition, the community level is important to understand the availability, accessibility, and quality of services and health workers (Gabrysch and Campbell 2009; Mahmud Khan et al. 2005; Stephenson et al. 2006). In the case of abortion, the availability of services can differ widely within a given country, and access to abortion depends partially on health workers' attitudes toward abortion (Haaland et al. 2020; Nandagiri 2019). Understanding the relative impact of community effects and interviewer effects on abortion reporting allows us to identify the extent to which quality issues with survey reporting of abortion experiences are explained by factors beyond the survey's control, such as community-level stigma, and by factors that can be addressed through improvements in methodological quality.
Separation of community effects from interviewer effects can also help to explain whether reporting of abortion is related to issues with survey implementation or to actual differences in abortion rates, given that we would expect there to be a high level of variance in abortion rates between communities. Accounting for both community and interviewer effects can inform decisions about the value of including questions about abortion in household surveys as well as the measures needed to improve the quality of these data.
The aim of our study is to test the impact of interviewer effects on the probability of reporting having ever had an abortion in a DHS. This study also contributes to research aiming to improve data collection on sensitive issues (e.g., domestic violence, sexual practices) within a nationally representative survey. To our knowledge, no other study to date has included both the interviewer and community effects in the analysis of responses to a sensitive question in a nationally representative survey.
The DHS is a key nationally representative and internationally comparable household survey for LMICs that includes abortion questions in several countries. In the DHS, individual interviewers work as part of a team, both for training and fieldwork. They report to a team leader and usually attend the same training at national or regional level that follows standard guidelines (ICF 2020; ICF Macro 2009).
With an estimated 25 million unsafe abortions taking place annually, causing 4.7% to 13.2% of maternal deaths, accurate abortion estimates are vital (Ganatra et al. 2017). Abortion rates are often estimated indirectly. The widely validated abortion incidence complications method (AICM) relies on collecting data from women receiving post-abortion care, including health facility data, to indirectly estimate the incidence of abortion in countries where abortion is restricted or highly stigmatized (Sedgh and Keogh 2019). These are the most reliable estimates, and they bear little resemblance to rates calculated from DHS data. For example, Malawi had an indirectly estimated abortion ratio of 22 per 100 live births in 2015, compared with 0.6 per 100 live births calculated from DHS data collected in the same year. Pakistan's AICM study in 2012 estimated 17.5–21.6 abortions per 100 live births, compared with a DHS estimate of 1.7 per 100 live births. In Brazil, the AICM estimated 26.7–44.4 abortions per 100 live births in 1991 versus 2.5 per 100 from DHS data in the same year (Singh and Wulf 1994; Singh et al. 2017). Although the AICM is an effective method for estimating abortion rates, it is able to estimate only national or regional abortion prevalence and is silent on individual-, household-, or community-level determinants of seeking an abortion. Other approaches that use nationally representative community-based surveys (e.g., best friend approach, confidant method) rely on respondents' reporting of others, not themselves (Sedgh and Keogh 2019). Such indirect reports can be used to collect information on demographic and abortion-related characteristics, but they carry an inherent risk of bias and offer a more limited scope for collecting a wide range of characteristics. Therefore, understanding how to improve the collection of abortion survey data that includes individual- and household-level characteristics is particularly important.
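The size of the gap between the two sources can be made concrete with simple arithmetic on the figures quoted above. The following Python sketch divides each DHS ratio by the corresponding AICM ratio (or range). The function name and the "share captured" framing are illustrative assumptions rather than part of the study, and because the two sources differ in method, the result is only a rough indication of scale.

```python
# Abortions per 100 live births, as quoted in the paragraph above.
# Each entry: ((AICM low, AICM high), DHS self-report estimate).
ESTIMATES = {
    "Malawi": ((22.0, 22.0), 0.6),
    "Pakistan": ((17.5, 21.6), 1.7),
    "Brazil": ((26.7, 44.4), 2.5),
}


def dhs_share_of_aicm(aicm_range, dhs_ratio):
    """Return (low, high) percent of the AICM ratio matched by the DHS ratio."""
    aicm_low, aicm_high = aicm_range
    # Dividing by the larger AICM value gives the smaller share, and vice versa.
    return (100 * dhs_ratio / aicm_high, 100 * dhs_ratio / aicm_low)


for country, (aicm_range, dhs_ratio) in ESTIMATES.items():
    low, high = dhs_share_of_aicm(aicm_range, dhs_ratio)
    print(f"{country}: DHS ratio is {low:.1f}% to {high:.1f}% of the AICM ratio")
```

In each of the three examples the DHS-based ratio is less than one tenth of the indirect estimate.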
DHS evidence on abortion has often been disregarded because of its low quality, leaving a persistent gap in abortion information from LMICs (MacQuarrie et al. 2018; Polis et al. 2017; Sedgh and Keogh 2019) and resulting in a wealth of unused data, with only a few studies attempting to analyze them (Bradley et al. 2019; Chae et al. 2017) or using them to benchmark a minimum abortion rate (Sedgh et al. 2016). The low quality has been attributed to abortion being legally restricted in many settings, combined with high levels of abortion-related stigma, leading to low levels of disclosure about ever having had an abortion. The literature has suggested modifications of the abortion questions, such as using the confidant method (asking the woman to report about a friend); asking about miscarriages and stillbirths separately as well as about total terminations in order to indirectly estimate induced abortions; or using the list experiment approach, with abortions included within a wider list of events that a woman might have experienced (Rossier 2003; Sedgh and Keogh 2019; Sedgh et al. 2012).
The DHS program has little metadata available on the level and quality of interviewer training. A review of available evidence from the DHS along with personal communications with experts at the DHS and the Guttmacher Institute demonstrates a general lack of information about the content and quality of training given to interviewers on abortion data collection beyond what is reported in the general DHS trainers' manuals. The DHS manuals include in-depth detail only on the mechanics of data collection, mainly related to calendar data, and do not focus on the social aspects of interviewing. We also know that one day of the two to three weeks of training is used to train interviewers on collecting birth histories, which include pregnancy terminations (ICF 2020; ICF Macro 2009).
Starting from the premise that DHS data on abortion are of poor quality, we hypothesize that the interviewer influences the quality of responses. The ultimate objective is to establish whether in addition to the efforts being made to modify sensitive questions, more effort should be invested into understanding and mitigating interviewer effects.
Our study sample comprises all DHS surveys that included a question on either ever having had an induced abortion (online appendix, Table A1) or the number of lifetime induced abortions. See Table 1 for the reproductive health indicators for the 22 countries included in this study. The selected countries had a range of legal frameworks in place at the time of the DHS survey: legal only to save a woman's life (Haiti, Malawi, Congo, Côte d'Ivoire, and Madagascar); legal to save a woman's life and in cases of rape (Mali); and legal to save a woman's life, in cases of rape, to preserve the woman's physical and/or mental health (Cameroon), or in cases of fetal impairment (Colombia, Ghana, and Gabon). Less restrictive settings also include provisions for socioeconomic issues (India), whereas other countries have legalized abortion on request (Albania, Azerbaijan, Armenia, Cambodia, Moldova, Kyrgyz Republic, Kazakhstan, Tajikistan, Turkey, Ukraine, and Vietnam) (Centre for Reproductive Rights 2020; U.N. Population Division 2014).
Data and Methods
To select the countries for analysis, we first review all DHS data sets that have ever asked any questions on induced abortion (n = 101).1 Of these, we analyze the most recent survey for those that included a question about ever having had an abortion and/or the lifetime number of abortions, had data available, and had no comparability issues with the data set (e.g., data for Liberia were collected among only young women, and India’s latest NFHS asked about abortions in the last five years only, and thus the data were not comparable). We also exclude countries that asked about abortions only in the last five years because this is not comparable with questions collecting abortion data over the lifetime and was asked by only two countries (Bangladesh and Indonesia). We include 22 countries: 8 from sub-Saharan Africa (SSA), 9 from Europe, 3 from Asia, and 2 from Latin America (LA).2
Our review reveals a high level of heterogeneity in abortion-related question wording and sequencing within DHS surveys, highlighting the challenge of cross-country comparative analysis. In contrast to the homogeneity of typical DHS data (e.g., V201 is widely recognized as the code for the parity variable), we also find substantial diversity in DHS abortion question codes.
The surveys included in this study used four key sets of questions with various phrasing (Table 2 and the online appendix, Table A1). A group of countries asked a question about any termination (i.e., miscarriage, stillbirth, or induced abortion) followed by a question about number of induced abortions (Cameroon, Congo, and Malawi). A second group asked about any termination, and then asked about any induced abortion (Côte d'Ivoire, Gabon, Mali, and Vietnam). A third group asked about any induced abortion, and then asked about the number of induced abortions (Armenia, Azerbaijan, Cambodia, Kazakhstan, Kyrgyz Republic, Moldova, Tajikistan, Turkey, and Ukraine). A fourth group asked about any termination followed by any induced abortion, and then asked about the number of induced abortions (Cambodia, Haiti, and India). Ghana is the only country that asked about the number of induced abortions directly, and then recorded the timing of abortions using the reproductive calendar. Finally, one country (Madagascar) asked only about ever having had an abortion. In each of the 22 surveys, we use multilevel multivariable logistic (or ordinary least squares [OLS]) regressions in a Bayesian framework to test the impact of the interviewer on women's responses to three questions: ever having had a termination; ever having had an induced abortion; and, conditional on ever having had an induced abortion, the lifetime number of induced abortions. We model each country separately because a pooled analysis would have been computationally intractable.
We test the interviewer effect across different questions within the same survey to understand whether the interviewer effect tends to be stronger for more sensitive questions. For example, one would expect that the question about any termination would be less sensitive than the induced abortion question. However, when a woman answered the question about any induced abortion or provided a nonzero response to the number of abortions over her lifetime, it is unclear whether the number of abortions is a reliable measure. Given the diverse formulation of questions, we cannot produce a definitive comparison between different question sequences (e.g., by asking directly about abortion vs. by asking about all terminations first). We do, however, suggest possible reasons for the results we obtain.
We test the interviewer effect using (primarily) logistic models, estimated according to Eq. (1). Here, y is the outcome of interest for the ith respondent, interviewed by the jth interviewer in the kth sampling cluster. θ is a vector of individual-level demographic characteristics, and ς represents the cross-classified interviewer-level and cluster-level random intercepts. The random intercepts are assumed to be normally distributed around mean β0 with variances σj and σk, respectively (conditional on covariates), and are assumed to be independent of each other and across interviewers and clusters.
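Written out from the definitions in this paragraph, one plausible form of the logistic model is sketched below. This is a reconstruction under stated assumptions, not a reproduction of the authors' Eq. (1): the coefficient vector β, the splitting of ς into an interviewer term ς_j and a cluster term ς_k, and the centering of the random intercepts at zero around the overall intercept β0 are notational choices made here for illustration. σj and σk denote variances, as in the text.

```latex
\begin{align}
\operatorname{logit}\,\Pr(y_{ijk} = 1)
  &= \beta_0 + \theta_{i}^{\top}\beta + \varsigma_j + \varsigma_k, \\
\varsigma_j &\sim \mathcal{N}(0, \sigma_j), \qquad
\varsigma_k \sim \mathcal{N}(0, \sigma_k), \qquad
\varsigma_j \perp \varsigma_k .
\end{align}
```

Under this reading, the interviewer-specific intercept β0 + ς_j is normally distributed around β0 with variance σj, and likewise for clusters with variance σk.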
An equivalent equation is estimated using OLS for the continuous outcome of number of abortions. OLS models are preferred over Poisson models for the continuous outcome because it is not possible to estimate the intraclass correlation coefficient (ICC) for Poisson models in a comparable way with OLS models (Rabe-Hesketh and Skrondal 2008). Regardless, point estimates of the variance of interviewer and cluster-level random effects in a sensitivity analysis employing Poisson models are within credible intervals of the OLS models3 (see the online appendix).
Because of the binomial logistic cross-classified random-effects specification, frequentist estimation of the variance components using Laplacian approximation with one integration point is highly biased (Rabe-Hesketh and Skrondal 2008). We therefore use Markov chain Monte Carlo estimation in a Bayesian framework, specifically a Gibbs sampler (JAGS, implemented using the R package rjags), to generate our results. We use noninformative priors, a 5,000-iteration burn-in, and 5,000 saved posterior samples for the logistic models. The OLS models are run for 80,000 samples. Chains with different starting points give similar results, and the Raftery-Lewis diagnostic indicates an appropriate number of burn-in and saved samples to obtain results with a 0.05 margin of error and 95% accuracy (Lunn et al. 2012).
In total, we analyze data from 344,623 women aged 15–49. The data sets report information on sampling clusters/primary sampling units (used as a proxy for community) and interviewers, but they do not usually report on interviewers' characteristics. The inclusion of interviewers' characteristics is an innovation included only toward the end of the most recent DHS phase (Phase 7, 2015–2018).4 A description of the sampling clusters and number of interviewers is reported in Table 2.
The multilevel approach has cross-classified random intercepts at the level of the sampling cluster and the level of the interviewer. In this way, we can simultaneously consider the amount of variance in the outcome (e.g., ever had an abortion) associated with different interviewers and the amount of variance associated with different communities, while controlling for the respondent's demographic characteristics. Cross-classified random effects are used because interviewers and clusters are not nested within one another. Assuming that interviewers were equally likely to be assigned to respondents with a low or high probability of reporting an abortion, the variance in the average odds of reporting an abortion by interviewer should be close to zero if all respondents reported a valid response or if all interviewers had a similar capacity to elicit valid responses. Under the same assumption, a large variance indicates that interviewers are differentially likely to elicit valid responses. The ICC calculates the share of the total variance made up by the different components. For example, the interviewer ICC is the interviewer variance as a share of the sum of the interviewer-level, cluster-level, and respondent-level variances, conditional on covariates. The ICC allows us to compare the magnitude of the interviewers' random intercept variance across questions and settings. In the Results section, we refer to the ICC for the interviewer random effect as simply the “interviewer effect.” In logistic models, the respondent-level variance is fixed at 3.29 (π²/3, the variance of the standard logistic distribution), whereas it is directly estimated in the OLS models.
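The ICC calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the denominator is the total of the interviewer-level, cluster-level, and respondent-level variances (the usual latent-variable formulation); the function name and the input variances are hypothetical and are not taken from Table 3.

```python
import math

# Respondent-level variance of the standard logistic distribution (pi^2 / 3, about 3.29).
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3


def variance_shares(interviewer_var, cluster_var,
                    respondent_var=LOGISTIC_RESIDUAL_VARIANCE):
    """Split the total variance into interviewer, cluster, and respondent shares."""
    total = interviewer_var + cluster_var + respondent_var
    return {
        "interviewer": interviewer_var / total,
        "cluster": cluster_var / total,
        "respondent": respondent_var / total,
    }


# Hypothetical variance estimates on the logit scale (illustration only).
shares = variance_shares(interviewer_var=0.8, cluster_var=0.2)
print({level: round(share, 3) for level, share in shares.items()})
```

For an OLS model, the estimated residual variance would be passed as `respondent_var` in place of the fixed logistic value. In a Bayesian fit, the same function could be applied to each posterior draw of the variances to obtain a credible interval for the ICC.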
The inclusion of a cross-classified random intercept at the cluster level allows us to control for the fact that some interviewers might be assigned to communities with a different probability of reporting an abortion (Vassallo et al. 2017). Community effects could capture factors such as culture, religion, and stigma. We also control for region in the main effects such that the cluster-level variance is estimated conditional on regional factors. Previous research in Bangladesh has shown the importance of accounting for location when looking at sensitive questions in survey data (Koenig et al. 2003). The sampling cluster is also an indication of where abortion-related services might be available and of the extent to which the community knows about them.
To implement our method, we must be able to accurately distinguish between cluster-level variance and interviewer-level variance, which requires sufficient interpenetration of the two levels. Vassallo et al. (2017) suggested that having three areas per interviewer provides sufficient interpenetration. A description of the sampling clusters is reported in Table 2. In all countries, the majority of interviewers worked in three clusters or more, providing a robust level of interpenetration.
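The interpenetration check described above amounts to counting the distinct sampling clusters in which each interviewer worked. The Python sketch below shows one way to do this; the record layout (pairs of interviewer ID and cluster ID), the function names, and the example data are hypothetical, and the three-cluster threshold follows the suggestion attributed to Vassallo et al. (2017).

```python
from collections import defaultdict


def clusters_per_interviewer(records):
    """Map each interviewer ID to the set of sampling clusters they worked in."""
    clusters = defaultdict(set)
    for interviewer_id, cluster_id in records:
        clusters[interviewer_id].add(cluster_id)
    return dict(clusters)


def share_meeting_threshold(records, min_clusters=3):
    """Share of interviewers who worked in at least `min_clusters` clusters."""
    clusters = clusters_per_interviewer(records)
    if not clusters:
        raise ValueError("no interview records supplied")
    meeting = sum(1 for worked in clusters.values() if len(worked) >= min_clusters)
    return meeting / len(clusters)


# Hypothetical (interviewer ID, cluster ID) pairs, one per completed interview.
interviews = [
    ("A", 1), ("A", 2), ("A", 3), ("A", 3),
    ("B", 1), ("B", 4), ("B", 5),
    ("C", 2), ("C", 2),
]
print(share_meeting_threshold(interviews))
```

In this toy example, interviewers A and B each worked in three clusters and C in only one, so two of the three interviewers meet the threshold.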
In addition to testing for cluster and interviewer effects, we control for respondents' characteristics to account for the possibility that some interviewers may have been systematically assigned to women more (or less) likely to report an abortion. We would expect younger, nonmarried, poorer, and less-educated women to be more reticent in answering abortion questions because abortion stigma is more evident across groups of women of low socioeconomic status (Jones and Forrest 1992; Lindberg and Scott 2018; Tourangeau and Yan 2007). However, previous studies have tested only the likelihood of responding given women's characteristics; less clear is how women's characteristics have an impact on the outcome once interviewer effects are accounted for. Amos (2018) modeled nonresponse to a question about reasons for not using contraception by including interviewer random effects and respondent-level covariates. Although less-educated and poorer women were less likely to respond, the study did not find a strong interviewer effect (Amos 2018).
The respondent characteristics included in our study are age, marital status (Turkey, India, and Vietnam interviewed only married women), rural-urban residence, geographic region, education, and wealth. Wealth was calculated using principal component analysis of asset indicators at the household level weighted separately by rural and urban areas (Filmer and Pritchett 2001). All variables are included in the model to ensure that the models are comparable across countries.
Individual Interviewers or Communities?
We estimate the models with interviewer and sampling cluster random effects with respect to three different questions (Table 3). In the first set of models, the outcome is “ever having had a termination” (i.e., any abortion, miscarriage, or stillbirth). In these models, the interviewer effect is stronger than the sampling cluster effect, with two exceptions: (1) Armenia and Vietnam, where the sampling cluster effect is stronger, and (2) Cameroon, where the variance is the same (3%). Mali has the strongest interviewer effect (26%), and Turkey has the lowest (0.8%). Countries in SSA have stronger and more similar levels of interviewer effects between countries (average 19.75%) compared with countries in Eastern Europe and Central Asia (average 9.1%), which are former communist countries with a history of less restrictive abortion laws.
In the second set of models, for the question “ever having had an abortion,”5 the interviewer effect is stronger than in the “ever having had a termination” models in the majority of countries. This suggests that the interviewer effect is stronger for more sensitive questions. Again, the interviewer effect is generally stronger than the sampling cluster effect; exceptions are for Armenia, Azerbaijan, Moldova, Turkey, Cambodia, Vietnam, and Colombia. The interviewer effect for “ever having had an abortion” ranges from 50% in Mali to 0.15% in Turkey (see Figure 1 and Table 3). On average, the interviewer effect is strongest in SSA (19%). However, the 95% credible intervals for the variance of the interviewer random effects are quite wide and are wider the larger the variance (Figure 1).
The last set of models analyzes the question on the number of abortions over a woman's lifetime. For this question, which we recode as a number greater than 0 (and missing if equal to 0), the interviewer effect is much smaller. This aligns with our expectations given that reporting ever having had an abortion (the first question) is more sensitive and therefore more prone to interviewer effects (i.e., initial disclosure) compared with subsequent reporting of the number of abortions (although we would also expect this to be underreported). Across all countries, both the interviewer and sampling cluster effects for this question are very small, suggesting that there may be less variability across interviewers in the number of abortions a woman reports, conditional on her reporting an abortion. In this set of models, more countries reported a higher level of variance for the interviewer effect than for the sampling cluster effect. This could be due to the smaller size of the interviewer effect and/or to communities' differential availability of abortion-related services, which is more likely to affect the number of abortions accessed.
Patterns According to Question Sequencing and Prevalence and Legal Status of Abortion
We consider possible patterns according to the sequencing of the questions (e.g., any terminations followed by any abortion versus number of abortions). We conclude that the interviewer effect tends to be stronger than the sampling cluster effect regardless of the sequencing of the questions. This finding holds even for countries such as Ghana, where a question about the number of abortions was asked directly after asking about any terminations.
In Ghana, where the data come from a special DHS maternal health survey with a more complex set of questions, the interviewer effect is lowest in Africa, and the percentage of women reporting having had an abortion is so high (14.8% in Table 2) that it would appear that underreporting could be lower. A recent study reported an AICM estimate of the abortion rate of 26.8 per 1,000 women aged 15–49 in comparison (Polis et al. 2020), perhaps because including abortion questions within a maternal health survey reduces stigma by signaling that abortion services are part of reproductive healthcare. Alternatively, higher-quality interviewer training might have reduced the bias.
We find no discernible difference in the interviewer and sampling cluster effects across different levels of legal status of abortion at the time of the survey (Table 1). For example, across all the European/central Asian countries with the most liberal laws on abortion, there is a wide range of variance at both the interviewer (from 0.2% in Turkey to 19.4% in Albania) and the sampling cluster level (from 0.3% in Turkey to 5.64% in Tajikistan) within the models on the question, “Have you ever had an abortion?” This finding is in line with evidence that abortion stigma is present irrespective of legality (Coast et al. 2018).
We report the coefficients on respondents' characteristics for ever having had an abortion in the appendix (Table A2, online appendix), although we include these control variables for all three questions. We find a positive impact of age on the probability of reporting having had an abortion, except in Colombia, Cambodia, Moldova, and Albania, where the relationship is negative. This finding is to be expected given that older women are more likely to feel confident in reporting an abortion and would also have been more exposed to the need for an abortion over their lifetime (Jones and Forrest 1992; Jones and Kost 2007). For each country where the coefficient for urban is positive and the Bayesian credible interval around the coefficient does not include 0, our model estimates a greater than 97.5% probability that women in rural areas are less likely to report ever having had an abortion. Wealth shows a positive gradient, with wealthier women in 13 of our 22 study countries (Armenia, Kyrgyz Republic, Tajikistan, Ukraine, Cameroon, Madagascar, Ghana, India, Malawi, Congo, Kazakhstan, Moldova, and Haiti) being the most likely to report ever having had an abortion. For the remaining countries, the credible intervals of the estimate include 0 for most of the coefficients. This pattern is to be expected given the substantial administrative, cognitive, and financial barriers to procuring an abortion, especially, but not only, in places where it is legally restricted (Sedgh et al. 2012).
In all SSA countries except for Malawi, having at least a primary education is associated with a higher probability of reporting ever having had an abortion. However, this relationship is less clear in other world regions. Marital status also shows mixed results, although ever having been in a union and currently being in a union are typically associated with a higher probability of reporting. Across all models (where available), having never been in a union shows a lower reporting of ever having had an abortion.
Overall, we find a higher probability of reporting ever having had an abortion among urban, wealthier, older, and more-educated women.
To inform strategies, policies, and programs to reduce unsafe abortion, it is critical to improve the availability of representative data on abortion incidence by subgroup, including abortion methods, sources, safety, and experiences. Similar data about contraceptive use have been crucial for programs aiming to reduce unmet need for contraception. However, household survey data on abortion have been lacking or underused because of concerns about quality and underreporting, and insufficient efforts have gone into assessing the quality of abortion survey data. The results of this study show a clear interviewer effect on responses to abortion questions. The effect is stronger for the question on ever having had an abortion than for the question on the number of abortions. Interviewer effects are typically stronger than community effects, which proxy for potentially different community levels of stigma and context as well as service availability. The weaker community effects could also arise because, conditional on region and place of residence (rural vs. urban), which are controlled for in the main effects, communities are more homogeneous.
However, this study cannot estimate the magnitude of the interviewer effect in absolute terms, nor can it estimate how much better the data would be had the interviewer effect not existed. This study cannot determine whether a low interviewer ICC is evidence of the abortion question having high validity for those surveys because it is possible that all interviewers homogeneously yet negatively affected the validity of the response. However, we believe that this is unlikely, given evidence from our study and others showing that less sensitive questions have lower interviewer variance, and we interpret low variance across interviewers as a sign of high validity.
The findings further show that the magnitude of the interviewer effect is sensitive to the question asked, with questions on any termination and number of abortions showing a smaller interviewer effect. In a further analysis, we choose a random subsample of countries (n = 5) from our sample of 22 and run the same model on less sensitive questions, such as current use of contraception and number of children. The results show a considerably lower variance at interviewer level and a more prominent effect at community level (results not shown here).
Previous studies looking at interviewer effects highlighted that a sense of being judged is a key barrier to answering sensitive questions (Durrant et al. 2010; Randall et al. 2013). Gender, social status, and age are also key characteristics that are sensitive in context-specific ways and could lead to a bias in responses (Becker et al. 1995; Singer et al. 1983).
Running the models with a cross-classified random effect at the community level gives us a greater sense of the significance of the interviewer effect. To further test this, we first run separate models with random effects at the sampling cluster level only for the “ever having had an abortion” outcome to gauge the impact that geographically specific culture, stigma, and/or abortion services availability might have on the willingness to report an abortion. The community effect is much larger when interviewer effects are not included, which implies that including only sampling cluster-level random effects incorrectly picks up interviewer effects. The share of variance accounted for by the community varied from 32.4% (Mali) to 0.5% (Turkey) (results not shown). We cannot ascertain any relationship between the level of liberalization of the abortion law and the interviewer effect or between different question sequencing patterns (e.g., simply asking the number of abortions vs. three separate questions). This would need to be further analyzed in future research to attempt to dampen the influence of legal status of abortion on the likelihood of a valid response to abortion questions. However, we would expect that even in fairly liberal contexts, stigma and lack of knowledge on legislation would still be a barrier to a valid answer (Rossier 2003).
In accordance with previous literature that did not account for interviewer effects (Chae et al. 2017), our study shows that women from a poorer background, with a lower level of education, and from rural areas are less likely to report ever having had an abortion. This, in addition to the positive correlation with age, could be a combination of both being less likely to have accessed an abortion and being more afraid or ashamed to report one.
Notwithstanding the robustness of the analysis, this study has limitations. First, question wording and sequencing varied across surveys. We cannot test the impact that the phrasing, the order, and type (e.g., ever having had an induced abortion vs. asking directly the number) of the questions might have on the probability of responding to the question. We descriptively address this issue by looking at overall patterns of variance given a set of questions. We cannot assess whether larger interviewer effects are due to the context, the phrasing of questions, how sensitive the questions are in that context, or to the quality of interviewer training and supervision across countries, which is unobserved.
In addition, by considering the three different sets of questions, we include all possible variations around the sequencing of the questioning, and we exclude those countries or questions that would not yield a comparable estimate of abortion incidence.
This study highlights the importance of providing better training and supervision to interviewers when collecting abortion data and more generally sensitive data in household surveys to improve data quality. It also demonstrates that despite an attempt at using standardized training tools, the impact of individual interviewers is credibly greater than that of communities after region and rural/urban residence are controlled for. Because abortion stigma and shame are locally (re-)produced, one of the key issues might yet be the standardization of questionnaires and training. However, we are not able to show differences in the scale of interviewer effects across different levels of legalization of abortion, nor can we tease out which wording or typology of questions might yield more validity. Ghana's abortion module, which is the most detailed of the group of countries analyzed, shows some improvement in the quality of the data, but the module is lengthy and costly to administer. The Ghana results might have been due to the questions being asked within a wider maternal health survey, which could have made the reporting of induced abortion less stigmatizing as well as possibly reflecting greater emphasis on training and selecting interviewers. A recent estimation of the abortion rate in Ghana made using AICM sets the national level at 26.8 abortions per 1,000 women versus 14.8 in the DHS (Polis et al. 2020). This estimate is possibly one of the DHS estimates that is closest to the real value, although it is still likely to be underestimated. Our design cannot assess the counterfactual: what the reporting of abortions would be without an interviewer effect.
Given widespread lack of trust in DHS abortion data, should abortion questions be excluded from interviewer-led household sample surveys, such as the DHS? Or should greater efforts be made to recruit and train interviewers? Our analyses cannot answer all these questions, but they point to the need for a more careful understanding of the value of asking sensitive questions in general and abortion more specifically, as well as the impact that improved questions and/or interviewer training might have on the quality of abortion data from household sample surveys. The DHS is always under pressure to include additional questions, both for existing and new topics (Kishor 2015); these pressures must be balanced against survey length and costs. Questions that yield low-quality data and/or are not fully exploited might become vulnerable to future exclusion. DHS data provide crucial evidence about abortion, and better understanding of the interaction between demographic and behavioral determinants will improve our understanding of abortion. Better understanding of the social interaction between interviewers and respondents could be key to improving abortion reporting as well as reporting on other sensitive questions.
The DHS remains a valuable source of information on reproductive histories. It is the most complete household survey capable of showing linkages between reproductive histories, socioeconomic status, and abortion. Removing abortion questions would present a partial understanding of sexual and reproductive health realities and could signal that abortion is not worthy of measurement and understanding. Countries have invested considerable resources running DHS surveys, and we suggest that additional attention would leverage these investments to generate better abortion data.
Interviewer effects have been neglected in the quest to improve abortion data, with most attention given to question wording and survey mode. Even where they have been acknowledged to have an impact (MacQuarrie et al. 2018), they have not been analyzed in the depth offered by this study. Studies of interviewer effects should be further extended by including interviewer characteristics in the analysis, across other types of surveys asking sensitive questions in both high-income and low- and middle-income countries. This would enable us to test whether interviewers' attitudes, demographics, and social status impact the validity of responses and would allow differentiation between the interviewer's demographic characteristics hypothesis and the interviewer's skills hypothesis. The DHS has included information on interviewer characteristics in its latest rounds, making this the right time to better understand interviewer effects.
We also need more qualitative information on how to improve survey responses and reduce potential interviewer effects. Given the high cost of collecting household survey data such as the DHS, there is an urgent need to further investigate the suitability of current modules. Cognitive interviewing could elicit better understanding of how women interpret and answer terms (e.g., abortion, miscarriage, stillbirth) and the suitability of existing question wording and sequencing. Much research has explored indirect techniques to gather abortion information from surveys, such as the Guttmacher Institute's work on ACASI (Lindberg and Scott 2018). Although no consensus has been reached on the best mode of interviewing, national surveys could provide an excellent testing ground. More localized efforts to test abortion questions led by initiatives such as the Performance Monitoring and Accountability 2020 project (PMA2020) could also inform the way forward for larger data collection exercises like the DHS (Bell et al. 2019). Interviewer effects should also be tested in the context of these alternative data collection exercises.
Interviewer effects on abortion survey data have not been previously identified and need to be included in quantitative studies as a further quality check. Although we do not identify patterns in relation to macro factors such as abortion legislation and type of question, our study indicates the need for more methodological work to identify such possible influences. Our findings suggest a substantial interviewer impact on the probability of reporting an abortion, highlighting the need for greater awareness of the impact of interviewers on data outcomes, in particular—but not only—when questions involve sensitive or stigmatized topics. If the interviewer effect holds for other sensitive questions, there is an opportunity for broader improvements in data quality from interviewer-administered household survey data.
We would like to acknowledge the funding of the LSE Department of International Development's School Research Incentive Fund. We thank the LSE Global Health Initiative reading group, and in particular Dr. Rishita Nandagiri, for their incredibly useful feedback.
We started from the DHS user forum discussing abortion questions, where a complete list had been posted (https://userforum.dhsprogram.com/index.php?t=msg&goto=17456&S=Google).
For a complete list, see Table A1 in the online appendix.
Credible interval is a term used in Bayesian inference to define an interval within which a parameter falls with a certain probability, conditional on the data and model (http://mc-stan.org/rstanarm/reference/posterior_interval.stanreg.html). Credible intervals are reported in Tables 2 and 3, and in Table A2 of the online appendix. The equivalent in frequentist statistics is a confidence interval, but although they are both used for statistical inference, the interpretation differs.
This information was obtained from an informal conversation with DHS staff in December 2018.
In some countries (Congo, Cameroon, Gabon, Malawi, Cambodia, Kazakhstan, Moldova, Tajikistan, and Ukraine), this question is asked directly as the number of abortions; here the variable is recoded as 1 if the woman had one or more abortions and 0 otherwise.