## Abstract

Demographers and other social scientists often study effect heterogeneity (defined here as differences in outcome–predictor associations across groups defined by the values of a third variable) to understand how inequalities evolve between groups or how groups differentially benefit from treatments. Yet answering the question “Is the effect larger in group A or group B?” is surprisingly difficult. In fact, the answer sometimes reverses across scales. For example, researchers might conclude that the effect of education on mortality is larger among women than among men if they quantify education's effect on an odds-ratio scale, but their conclusion might flip (to indicate a larger effect among men) if they instead quantify education's effect on a percentage-point scale. We illuminate this *flipped-signs phenomenon* in the context of nonlinear probability models, which were used in about one third of articles published in *Demography* in 2018–2019. Although methodologists are aware that flipped signs can occur, applied researchers have not integrated this insight into their work. We provide formal inequalities that researchers can use to easily determine if flipped signs are a problem in their own applications. We also share practical tips to help researchers handle flipped signs and, thus, generate clear and substantively correct descriptions of effect heterogeneity. Our findings advance researchers' ability to accurately characterize population variation.

## Introduction

Demographers and other social scientists often study how the association between an outcome variable and a predictor variable differs across demographic groups, geographic areas, time points, or values of any other third variable (e.g., Brand and Davis 2011; Cui et al. 2019; Song 2016). They typically capture differential associations by including product terms in their regression models or by running separate models for each group. Differential associations, often called *effect heterogeneity* (particularly in causal analyses, although in this article we use the term *effect* as shorthand even for noncausal associations),^{1} are useful to understand because they shed light on group inequalities and how they change. For example, researchers may wonder whether differences in mortality (an outcome) between education groups (a predictor) are increasing over time (a third variable, across which the outcome–predictor association may differ) (Montez et al. 2019). Understanding effect heterogeneity is also useful for targeting interventions toward groups that are most likely to benefit (Manski 2007). For example, public health officials might wonder whether vaccine promotion campaigns will increase vaccination rates more if they target parents of young children or older adults (Regan et al. 2017). Unfortunately, answering the question “Is the effect larger in group A or group B?” is surprisingly difficult. In fact, the answer sometimes depends on the scale used to summarize the effect: an outcome–predictor association may appear larger in group A than group B on one scale but smaller on another (Brumback and Berg 2008; VanderWeele 2019). For example, group differences in effects may look positive in odds ratios (larger in group A than B) but negative in average marginal effects or risk ratios (smaller in group A than B). We call this the *flipped-signs phenomenon*.

The flipped-signs phenomenon is known to methodologists, but this knowledge has failed to influence the practice of applied researchers. Many make unqualified statements like “the effect is larger in group A than group B” without acknowledging that this group difference may depend on the scale used to summarize the effect, without investigating whether it flips signs across scales in their applications, and without providing the information necessary for readers to complete this investigation (Knol et al. 2009; VanderWeele 2019). These practices are especially problematic in the context of nonlinear probability models (NLPMs), in which the scales of the outcome and the coefficients differ (and, thus, coefficient differences across groups do not equal differences in marginal effects). NLPMs include models for discrete dependent variables that relate the outcome's mean to predictors via nonlinear transformation, such as binary and multinomial logit and probit models, fractional logit models, ordinal logit models, and Poisson models (Breen et al. 2018:40).^{2} Almost one third of *Demography*'s original research articles published in 2018–2019 employ NLPMs, and about half of these investigate effect heterogeneity (Figure 1).^{3} Demographers are often at risk of encountering flipped signs. However, many will be surprised to learn that their conclusions about whether their effects are larger in one group than another can *reverse* across scales.

Applied researchers may anticipate magnitude differences across scales without sign reversals. Frequently used textbooks do not even mention the flipped-signs phenomenon. For example, Agresti (2002:183) states that “significant interaction can occur on one scale when there is none on another scale” but provides examples of only magnitude differences across scales, not sign differences (e.g., Agresti 2002:463). Wooldridge (2002:465) states that the coefficients “$\beta_j$ [from probit or logit models] give the signs of the partial effects of each $X_j$ on the response probability” even though the $\beta_j$ do not equal the partial effects. Thus, readers may assume (incorrectly) that probit or logit coefficients on product terms between predictor variable $X_j$ and group variable $G$ always give the signs of the group differences in partial effects.^{4} Methodologists' knowledge of the flipped-signs phenomenon has not sufficiently influenced applied research, and researchers remain in danger of making inaccurate claims about effect heterogeneity.

We make two contributions to enhance applied researchers' ability to describe effect heterogeneity in the context of NLPMs. First, we provide inequalities for researchers to check if flipped signs occur in their studies. Although these inequalities encapsulate simple mathematical facts, they have not been previously published (to our knowledge). Researchers use a variety of scales to quantify effect heterogeneity in NLPMs.^{5} Our inequalities reveal when the sign of effect heterogeneity flips across (1) the odds-ratio scale versus the probability scale and (2) the odds-ratio scale versus the risk-ratio scale.^{6} The odds-ratio versus risk-ratio flip may particularly surprise applied researchers, given that they often misinterpret odds ratios as risk ratios (Grimes and Schulz 2008).

Second, we provide concrete advice for researchers investigating effect heterogeneity. We reveal when flipped signs are most likely to occur and we discuss how researchers can handle them. Flipped signs occur frequently enough in real applications that researchers need this knowledge—particularly demographers employing discrete-time event-history models. In such models, outcome probabilities within each time unit are typically very small, and flipped signs often occur in low-probability circumstances.

We expand prior work by specifying when flipped signs occur in the context of NLPMs and by outlining best practices for applied researchers. The two studies closest to ours consider the necessary and sufficient conditions for flipped signs in a stylized case that consists of four probabilities: the probabilities of experiencing the outcome event with and without a binary treatment, compared across two demographic groups (Brumback and Berg 2008; VanderWeele 2019). We provide results in the more general framework of NLPMs, which are workhorses for demographers. We also allow for multivalued predictors and we illustrate the importance of additional covariates (rather than considering only a single binary treatment predictor). We use our analytic results to equip researchers with simple sets of inequalities (among NLPM coefficients and groups' outcome probabilities) that they can use to determine whether flipped signs are an issue in their applications. We further employ empirical and numeric analyses to provide researchers with practical tips for quantifying and interpreting effect heterogeneity on different scales.^{7} Together, our results advance researchers' ability to accurately characterize population variation.

## Inequalities for Determining When Flipped Signs Occur

We begin by introducing the simple inequalities that reveal when flipped signs occur between the odds-ratio scale versus the probability scale. We consider the context of a binary logistic regression model that includes a product term to capture effect heterogeneity on the odds-ratio scale. We defer step-by-step derivations to the online appendix, where we also discuss several extensions (including group-stratified logit models, probit models, and multinomial models). After providing the inequalities for odds ratios versus probabilities, we provide them for odds ratios versus risk ratios.

### Odds Ratios Versus Probabilities

Let $Y_i = 1$ if the event occurred for individual $i$ and 0 otherwise; denote $\Pr(Y_i = 1) = p_i$. The odds of the event for person $i$ are $\frac{p_i}{1-p_i}$. The logistic regression models the log of these odds. $X_i = x_i$ is the key predictor whose effect we would like to assess. $G_i = g_i$ is an indicator variable distinguishing the groups whose effect heterogeneity we would like to assess. We consider a binary $G$, but the logic extends to comparisons across more than two groups. $Z_i = z_i$ is a vector of additional predictors (including the intercept, for notational ease). The model is

$$\ln\left(\frac{p_i}{1-p_i}\right) = \beta_1 g_i + \beta_2 x_i + \beta_3 g_i x_i + \boldsymbol{\beta}_4^{\top} z_i. \tag{1}$$

$\beta_3$ captures effect heterogeneity on the log-odds scale (i.e., the difference measured in log odds between people in group $G=1$ versus $G=0$ in $X$'s predictive effect on $Y$); $e^{\beta_3}$ captures it on the odds-ratio scale. This effect heterogeneity's sign is the same in log odds and odds ratios; when $\beta_3$ is negative (positive) [zero], $e^{\beta_3}$ is less than (greater than) [equal to] one, which indicates that $X$'s effect is more negative (more positive) [the same] in group $G=1$ than in group $G=0$. A more negative (more positive) effect is a larger effect when the effect in group $G=0$, $\beta_2$, is negative (positive). Thus, to determine when effect heterogeneity flips signs across the odds-ratio scale versus the probability scale, it is sufficient to determine when it flips across the log-odds scale versus the probability scale (i.e., when the sign of $\beta_3$ does not match the sign of the group difference in effects on the probability scale).^{8}

On the probability scale, the average marginal effect (AME) of $X$ in group $G=g$ is

$$AME_g = (\beta_2 + \beta_3 g)\,\tilde{p}_g, \tag{2}$$

where $\tilde{p}_g$ is the average of $p_i(1-p_i)$ among people in group $G=g$. The group difference in $X$'s effect on the probability scale is

$$AME_1 - AME_0 = (\beta_2 + \beta_3)\,\tilde{p}_1 - \beta_2\,\tilde{p}_0 = \beta_3\,\tilde{p}_1 + \beta_2\,(\tilde{p}_1 - \tilde{p}_0). \tag{3}$$

Clearly, then, the effect heterogeneity on the probability scale, $AME_1 - AME_0$, is a function of both the effect heterogeneity on the log-odds scale, $\beta_3$, and the group difference in typical event probabilities. Indeed, even if $\beta_3 = 0$, $AME_1$ and $AME_0$ will differ if the typical event probabilities differ across groups. Effect heterogeneity flips signs across the odds-ratio scale versus the probability scale when

$$\beta_3 > 0 \ \text{and} \ \tilde{p}_0 > \tilde{p}_1\left[1 + \frac{\beta_3}{\beta_2}\right], \quad \text{or} \quad \beta_3 < 0 \ \text{and} \ \tilde{p}_0 < \tilde{p}_1\left[1 + \frac{\beta_3}{\beta_2}\right]. \tag{4}$$

Working through the logic in greater detail:

1. If $\beta_3 > 0$ (indicating a more positive effect of $X$ on the odds-ratio scale in group 1 than in group 0),^{9} then:

(a) We will not observe flipped signs on the probability scale (i.e., we will observe a more positive effect of $X$ on the probability scale in group 1 than in group 0; the odds-ratio scale and the probability scale will tell the same story of effect heterogeneity) if $(\beta_2 + \beta_3)\,\tilde{p}_1 - \beta_2\,\tilde{p}_0 > 0$, which happens when $\tilde{p}_0 < \tilde{p}_1\left[1 + \frac{\beta_3}{\beta_2}\right]$.

(b) We will observe flipped signs on the probability scale (i.e., we will observe a more positive effect of $X$ on the probability scale in group 0 than in group 1) if $\tilde{p}_0 > \tilde{p}_1\left[1 + \frac{\beta_3}{\beta_2}\right]$.

(c) We will observe no effect heterogeneity on the probability scale if $\tilde{p}_0 = \tilde{p}_1\left[1 + \frac{\beta_3}{\beta_2}\right]$.

2. If $\beta_3 < 0$ (indicating a less positive effect of $X$ on the odds-ratio scale in group 1 than in group 0),^{10} then:

(a) We will not observe flipped signs on the probability scale if $\tilde{p}_0 > \tilde{p}_1\left[1 + \frac{\beta_3}{\beta_2}\right]$.

(b) We will observe flipped signs on the probability scale if $\tilde{p}_0 < \tilde{p}_1\left[1 + \frac{\beta_3}{\beta_2}\right]$.

(c) We will observe no effect heterogeneity on the probability scale if $\tilde{p}_0 = \tilde{p}_1\left[1 + \frac{\beta_3}{\beta_2}\right]$.

3. If $\beta_3 = 0$ (indicating an identical effect of $X$ on the odds-ratio scale in group 1 and group 0), then:

(a) We will not observe effect heterogeneity on the probability scale if $\tilde{p}_0 = \tilde{p}_1$.

(b) We will observe effect heterogeneity on the probability scale if $\tilde{p}_0 \neq \tilde{p}_1$.

In cases 1(a), 2(a), and 3(a), we observe the same qualitative story of effect heterogeneity on the odds-ratio scale and probability scale. In cases 1(b) and 2(b), in contrast, we observe flipped signs (i.e., the opposite answers to the question “Is the effect larger in group 1 or group 0?” on the odds-ratio scale versus the probability scale). These are the cases highlighted in Eq. (4). In cases 1(c), 2(c), and 3(b), we observe different but not opposite-sign qualitative stories of effect heterogeneity across scales: one scale indicates effect heterogeneity but the other does not. Researchers can investigate whether the inequalities in 1(b) and 2(b) (summarized in Eq. (4)) hold in their applications, using the coefficient values and outcome probabilities that they estimate from their own data. If the inequalities hold, then the researchers face the problem of flipped signs: the answer to the question “Is the effect larger in group 1 than group 0?” flips signs in their application across odds ratios versus probabilities.
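The check described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the article: the function names and all numeric inputs are hypothetical, and each group's typical probability term is assumed to be the average of $p_i(1-p_i)$ over that group's predicted probabilities.

```python
from statistics import mean


def p_tilde(probs):
    """Average of p_i * (1 - p_i) over one group's predicted probabilities."""
    return mean(p * (1 - p) for p in probs)


def ame_difference(beta2, beta3, p_tilde0, p_tilde1):
    """Group difference in AMEs: (beta2 + beta3) * p_tilde1 - beta2 * p_tilde0."""
    return (beta2 + beta3) * p_tilde1 - beta2 * p_tilde0


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def compare_scales(beta3, other_scale_difference):
    """Compare the sign of beta3 with the sign of heterogeneity on another scale."""
    s_logit, s_other = sign(beta3), sign(other_scale_difference)
    if s_logit == s_other:
        return "same story"
    if s_logit == 0 or s_other == 0:
        return "heterogeneity on one scale only"
    return "flipped signs"


# Hypothetical inputs (not estimates from the article).
beta2, beta3 = 0.5, 0.2
probs_group0 = [0.45, 0.50, 0.55]  # moderate event probabilities
probs_group1 = [0.08, 0.10, 0.12]  # low event probabilities

pt0, pt1 = p_tilde(probs_group0), p_tilde(probs_group1)
diff = ame_difference(beta2, beta3, pt0, pt1)
verdict = compare_scales(beta3, diff)
print(f"p_tilde0={pt0:.4f}, p_tilde1={pt1:.4f}, AME1-AME0={diff:.4f}: {verdict}")
```

With these illustrative inputs, the product coefficient is positive but the AME difference is negative, which corresponds to case 1(b). In practice, the inputs would come from a fitted model's coefficients and predicted probabilities.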

### Odds Ratios Versus Risk Ratios

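In the logit model, the odds ratio for a one-unit increase in $X$ in group $G=g$ is $OR_g = e^{\beta_2 + \beta_3 g}$. It relates to the corresponding risk ratio as

$$OR_g = RR_g \times p_g^*, \tag{5}$$

where $RR_g$ is the risk ratio in group $G=g$ among people with background covariates $Z=z$ when the key predictor $X=x$ versus when $X=x+1$; $p_{g,x|z}$ ($p_{g,x+1|z}$) is the probability of experiencing the outcome event in group $G=g$ among people with background covariates $Z=z$ when the key predictor $X=x$ ($X=x+1$). The OR-to-RR *conversion factor*, $p_g^* = \frac{1-p_{g,x|z}}{1-p_{g,x+1|z}}$, quantifies how much the association between $X$ and the outcome differs across scales.^{11} The conversion factor can be interpreted as the ratio of the probabilities that the outcome event does not occur in group $G=g$ among people with background covariates $Z=z$ when $X=x$ versus when $X=x+1$. When the conversion factor is greater than (less than) [equal to] one, the outcome event is less (more) [equally] likely in group $G=g$ among people with background covariates $Z=z$ when $X=x$ versus when $X=x+1$.^{12}

Take the log of Eq. (5) and then take the difference between groups $G=1$ and $G=0$:

$$\ln(RR_1) - \ln(RR_0) = \beta_3 - \left[\ln(p_1^*) - \ln(p_0^*)\right]. \tag{6}$$

Effect heterogeneity flips signs across the odds-ratio scale versus the risk-ratio scale when

$$\beta_3 > 0 \ \text{and} \ p_0^* < p_1^* e^{-\beta_3}, \quad \text{or} \quad \beta_3 < 0 \ \text{and} \ p_0^* > p_1^* e^{-\beta_3}. \tag{7}$$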

Working through the logic in greater detail:

1′. If $\beta_3 > 0$, then:

(a) We will not observe flipped signs on the risk-ratio scale if $\beta_3 - [\ln(p_1^*) - \ln(p_0^*)] > 0$, which happens when $e^{\beta_3}\frac{p_0^*}{p_1^*} > 1$ and, thus, $p_0^* > p_1^* e^{-\beta_3}$.

(b) We will observe flipped signs on the risk-ratio scale if $p_0^* < p_1^* e^{-\beta_3}$.

(c) We will observe no effect heterogeneity on the risk-ratio scale if $p_0^* = p_1^* e^{-\beta_3}$.

2′. If $\beta_3 < 0$, then:

(a) We will not observe flipped signs on the risk-ratio scale if $p_0^* < p_1^* e^{-\beta_3}$.

(b) We will observe flipped signs on the risk-ratio scale if $p_0^* > p_1^* e^{-\beta_3}$.

(c) We will observe no effect heterogeneity on the risk-ratio scale if $p_0^* = p_1^* e^{-\beta_3}$.

3′. If $\beta_3 = 0$, then:

(a) We will not observe effect heterogeneity on the risk-ratio scale if $p_0^* = p_1^*$.

(b) We will observe effect heterogeneity on the risk-ratio scale if $p_0^* \neq p_1^*$.

In cases 1′(a), 2′(a), and 3′(a), we observe the same qualitative story of effect heterogeneity on the odds-ratio scale and risk-ratio scale. In cases 1′(b) and 2′(b), we observe flipped signs; these cases are highlighted in Eq. (7). In cases 1′(c), 2′(c), and 3′(b), only one scale indicates effect heterogeneity; this happens when $p_{1,x|z} \neq p_{0,x|z}$, that is, when the probability of the outcome event differs across group 1 versus group 0 when background covariates $Z=z$ and $X=x$ (the baseline level of the key predictor).^{13} Researchers can investigate whether the inequalities summarized in Eq. (7) hold in their applications. If so, they face the problem of flipped signs across odds ratios versus risk ratios.
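The odds-ratio versus risk-ratio comparison can be sketched the same way. The snippet below is illustrative only: the function names, coefficients, and baseline probabilities are hypothetical. It computes each group's risk ratio for a one-unit increase in the key predictor directly from a logit model, and also through the conversion factor (one minus the probability at the baseline value of the predictor, divided by one minus the probability one unit higher), so the two routes can be compared.

```python
import math


def inv_logit(eta):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """Map a probability to log odds."""
    return math.log(p / (1.0 - p))


def rr_heterogeneity(beta2, beta3, baseline_p0, baseline_p1):
    """Group difference in log risk ratios for a one-unit increase in X.

    baseline_p0 and baseline_p1 are the event probabilities in groups 0 and 1
    at X = x and Z = z (hypothetical inputs supplied by the user).
    """
    log_rr, log_p_star = {}, {}
    for g, p_x in ((0, baseline_p0), (1, baseline_p1)):
        # Probability at X = x + 1 implied by the logit model.
        p_x1 = inv_logit(logit(p_x) + beta2 + beta3 * g)
        log_rr[g] = math.log(p_x1 / p_x)
        # OR-to-RR conversion factor: (1 - p at x) / (1 - p at x + 1).
        log_p_star[g] = math.log((1 - p_x) / (1 - p_x1))
    direct = log_rr[1] - log_rr[0]
    via_conversion = beta3 - (log_p_star[1] - log_p_star[0])
    return direct, via_conversion


# Hypothetical inputs: a rare event in group 0, a common event in group 1.
beta2, beta3 = 0.5, 0.2
direct, via_conversion = rr_heterogeneity(beta2, beta3, baseline_p0=0.01, baseline_p1=0.50)
flipped = beta3 * direct < 0
print(f"log RR difference: {direct:.4f} (via conversion factors: {via_conversion:.4f})")
print("flipped signs" if flipped else "no flip")
```

In this illustration the product coefficient is positive, yet the risk ratio is larger in group 0 because group 1's high baseline probability leaves little room for a proportional increase.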

In the online appendix section A1, we extend the results presented in this section to additional NLPMs for binary outcomes (moving beyond binary logit models to probit models and fractional logit models), and we provide some step-by-step details.^{14} In appendix section A1 we also present results for a different probability scale (moving beyond AMEs to MERs). In appendix section A2, we provide the trivial extension to group-stratified models (moving beyond the pooled model including a product term to capture effect heterogeneity, as in Eq. (1)). In appendix section A3, we provide results for multinomial models.

## Practical Tips

Using the simple sets of inequalities in Eqs. (4) and (7), researchers can determine whether flipped signs are an issue in their applications (see the online appendix for similar inequalities for additional NLPMs). Here, we provide some practical tips for researchers concerned about flipped signs to consider when quantifying and interpreting effect heterogeneity from NLPMs.

### Tip #1: Know Your Outcome Event Probabilities

Equations (4) and (7) show that effect heterogeneity can flip signs across scales when the outcome event probability differs across groups. This result holds because on scales other than the odds ratio, effect heterogeneity is a function of not only $\beta_3$ (the coefficient on the product term between key predictor $X$ and group indicator $G$) but also the conditional probabilities of the outcome event within each group. Next, we illustrate that flipped signs are particularly likely across the odds-ratio scale versus the probability scale (1) when these outcome probabilities differ substantially across groups and (2) when they are extreme (i.e., close to zero or one). On the basis of these results, we suggest that researchers get to know their outcome event probabilities for insight into whether their effect heterogeneity may flip across scales.

#### Illustrating the Problem

Figure 2 depicts how the likelihood of flipped signs varies with the groups' outcome probabilities in the case of binary logistic regression.^{15} We explore the analogous case of multinomial logistic regression in online appendix Figure A1 (see discussion in appendix section A3). There are two main findings.

First, flipped signs are more likely when the groups' outcome probabilities, $p_1$ and $p_0$, are either very low or very high (i.e., close to zero or one) than when they are moderate (Figure 2, panel a; darker colors reflect higher chances of flipped signs).^{16} Results are symmetric across the horizontal line (vertical line) at $p_1=.5$ ($p_0=.5$), with flipped signs becoming increasingly likely as $p_1$ ($p_0$) becomes more extreme.^{17} However, flipped signs are less likely when $p_0$ is extreme than when $p_1$ is equally extreme. This asymmetry is evident in Eq. (3), which includes one function of $p_0$ but two functions of $p_1$ (because $\beta_3$ is the coefficient on the product between the indicator for group $G=1$ and the key covariate $X$).^{18} Demographers using discrete-time event-history analyses should take note, because event probabilities within each time period are typically low; thus, flipped signs are likely. Demographers using logistic regression because their outcome event probabilities are close to zero or one (and, thus, linear probability models might predict probabilities outside the [0,1] range) should also take note: their effect heterogeneity is likely to flip signs across the odds-ratio versus probability scales.

Second, the larger the absolute difference in the groups' outcome probabilities, $|p_0 - p_1|$, the higher the chance of flipped signs (Figure 2, panel b).^{19} For example, flipped signs occur in just under 10% of the scenarios that we explore when the difference $|p_0 - p_1| = .10$ versus around 24% of scenarios when the difference $|p_0 - p_1| = .90$. This difference is more likely to be large when at least one of the groups' probabilities is extreme (e.g., for the difference to be .95, one group's probability must be very high and the other group's must be very low). Thus, panels a and b are consistent: flipped signs are more likely when the group difference in probabilities is larger and when each group's probability is more extreme.

#### Addressing the Problem

Always calculate the outcome event probabilities in the comparison groups. Particularly if these probabilities are extreme or if they differ substantially between groups, use them in Eqs. (4) and (7) to determine whether flipped signs are a concern in your application. Because the outcome event probabilities differ across the values of all predictor variables (not only the group indicators), evaluate whether flipped signs are a concern at a variety of predictor values.
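One way to evaluate flipped signs at a variety of predictor values is to compute each group's marginal effect on the probability scale at several values of the key predictor and compare the sign of the group difference with the sign of the product coefficient. The sketch below assumes a logit model with an intercept, a group indicator, a key predictor, and their product; the coefficient values and the grid of predictor values are hypothetical.

```python
import math


def inv_logit(eta):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients: intercept, G, X, and the G x X product.
B0, B1, B2, B3 = -4.0, 1.5, 0.3, -0.1


def marginal_effect(g, x):
    """Derivative of the event probability with respect to X for group g at X = x."""
    p = inv_logit(B0 + B1 * g + (B2 + B3 * g) * x)
    return (B2 + B3 * g) * p * (1 - p)


def flipped_at(x):
    """True when the probability-scale group difference has the opposite sign to B3."""
    diff = marginal_effect(1, x) - marginal_effect(0, x)
    return diff * B3 < 0


x_values = [0, 5, 10, 20]
flipped_xs = [x for x in x_values if flipped_at(x)]
for x in x_values:
    status = "flipped" if flipped_at(x) else "same sign"
    print(f"X = {x:>2}: ME1 - ME0 = {marginal_effect(1, x) - marginal_effect(0, x):+.4f} ({status})")
```

In this illustration, the sign flips at low values of the predictor, where both groups' event probabilities are small, and not at higher values, which is why a single evaluation point can be misleading.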

### Tip #2: Quantify Effect Heterogeneity on Multiple Scales

Given the possibility of flipped signs, researchers who quantify effect heterogeneity on only one scale risk making overly general statements (such as “the effect is larger in group A than B,” which implies that this statement is true regardless of scale) when they should make conditional statements (such as “the effect is larger in group A than B on the odds-ratio scale, but smaller on the probability scale”). Presenting odds ratios alone can mislead readers; quantifying effect heterogeneity on multiple scales will help researchers communicate accurate information and avoid making unconditional statements without proper justification. It will also force them to think deeply about the substantive meaning and importance of different scales. Below we elaborate on the problems with using one scale; then we discuss solutions.

#### Illustrating the Problem

Regardless of the direction or magnitude of effect heterogeneity on the odds-ratio scale, flipped signs might occur; logistic regression coefficients do not reliably reveal the sign of effect heterogeneity on the probability scale (Figure 3). Points are plotted in Figure 3 whenever a flipped sign occurs from a given combination of $\beta_2$ and $\beta_3$ (the coefficients on the main effect of $X$ and its product with the indicator that $G=1$) and $p_0$ and $p_1$ (the outcome event probabilities in groups 0 and 1). Looking across the panels of Figure 3,^{20} there are almost no combinations of $\beta_2$ and $\beta_3$ that have a zero risk of flipped signs across the three combinations of $p_0$ and $p_1$ that we explore.^{21} Flipped signs are most common when the main-effect coefficient $\beta_2$ is large and the product coefficient $\beta_3$ is small. However, flipped signs are not limited to these situations. For example, when there is a large group difference in outcome probabilities, flipped signs may occur when both $\beta_2$ and $\beta_3$ are extreme (Figure 3, “high-danger” scenario).^{22} In short, examining effect heterogeneity on the odds-ratio scale alone leaves researchers vulnerable to telling an unconditional story of effect heterogeneity, when in fact the story should be conditional on the scale. To avoid telling the wrong story, examining multiple scales is crucial.
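A small grid search gives a feel for which coefficient combinations flip for a given pair of outcome probabilities. This sketch does not reproduce Figure 3: the probabilities, the coefficient grid, and the simplification that everyone in a group shares one probability are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical outcome probabilities for groups 0 and 1.
p0, p1 = 0.02, 0.30
w0, w1 = p0 * (1 - p0), p1 * (1 - p1)  # p * (1 - p) in each group

# Hypothetical coefficient grid: -2.0, -1.5, ..., 2.0.
grid = [k / 2 for k in range(-4, 5)]


def flips(beta2, beta3):
    """True when the probability-scale difference and beta3 have opposite signs."""
    diff = (beta2 + beta3) * w1 - beta2 * w0
    return beta3 * diff < 0


flipped = [(b2, b3) for b2, b3 in product(grid, grid) if flips(b2, b3)]
share = len(flipped) / len(grid) ** 2
print(f"{len(flipped)} of {len(grid) ** 2} grid points flip (share = {share:.2f})")
```

With these particular probabilities, every flipping combination pairs a main-effect coefficient and a product coefficient of opposite signs, with the main effect large relative to the product term. Other probability pairs produce different patterns.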

#### Addressing the Problem

Researchers can create tables or figures that allow for easy comparison of effect heterogeneity on multiple scales. We provide an example in Table 1, where we also step through why the heterogeneity flips signs across scales. Such a didactic presentation may be unnecessary in many applications; but it may be helpful when researchers have a theoretically motivated preference for a scale that is unfamiliar to their readers (and their effect heterogeneity flips signs across the preferred versus familiar scales).

Table 1 reports the (associational) effects of cohort (our key predictor in this case study) on marriage (our outcome) among White and Black people in the United States (our comparison groups).^{23} Panels A and B show results for women and men, respectively. Within each panel, the first two rows report effects for each of our groups and the third row shows the group difference in effects; the columns report the effects on different scales.

On the log-odds scale, cohort declines in marriage are faster among Black people than White people; the same is true on the odds-ratio scale, of course (Table 1, columns 1 and 2): $e^{-.0193} = .9809$, indicating that the monthly odds of first marrying decline about 2 percentage points ($1 - .9809 = .0191$) more per year among Black than White people.^{24}^{,}^{25} However, the reverse is true on the probability scale, as shown via AMEs.

Even if $\beta_3 = 0$, $AME_B$ and $AME_W$ would differ because Black and White people's typical marriage probabilities differ. On average, Black women's monthly chance of first marrying is about .14%, which is about .21 percentage points lower than White women's monthly chance of .35% (Table 1, column 3).^{26} To convert effect differences on the log-odds scale to the probability scale, we must multiply $\beta_3$ by the product of these probabilities and their complements (column 4 contains these products; see Eqs. (2) and (3)).^{27} Monthly first marriage probabilities decline about .009 percentage points per year among Black women, which is about .006 less than the .015 decline among White women (Table 1, column 5). This sign flip (to slower declines among Black than White women on the AME scale, despite faster declines on the odds-ratio scale) has a simple interpretation: because the outcome event probabilities are much smaller among Black than White women, smaller absolute declines among Black women (captured in AMEs) are larger relative declines (captured in odds ratios, relative to their low baseline event probabilities).^{28}
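The arithmetic behind this interpretation can be checked with the rounded figures quoted in the text. The snippet below is only a rough illustration: dividing each AME by the group's average monthly probability is a crude stand-in for a relative change, not a calculation taken from Table 1, and the variable names are hypothetical.

```python
import math

# Log-odds difference in cohort effects quoted in the text.
beta3 = -0.0193
odds_ratio = math.exp(beta3)

# Rounded values for women quoted in the text, all in percentage points.
p_black, p_white = 0.14, 0.35          # average monthly chance of first marrying
ame_black, ame_white = -0.009, -0.015  # change per year of birth cohort

# Absolute scale: the decline is smaller among Black women.
ame_gap = ame_black - ame_white

# Crude relative scale: decline as a share of each group's own baseline.
relative_black = ame_black / p_black
relative_white = ame_white / p_white

print(f"odds ratio for the group difference: {odds_ratio:.4f}")
print(f"AME gap (Black minus White): {ame_gap:+.3f} percentage points")
print(f"relative decline: Black {relative_black:.1%}, White {relative_white:.1%}")
```

The absolute gap is positive (a slower decline among Black women), while the relative decline is steeper among Black women, which is the same reversal that appears between the AME and odds-ratio columns.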

Notably, these AME patterns (calculated from the NLPM in Eq. (1)) are very similar to the AME patterns calculated from a simple linear probability model (LPM; which removes the logit link from Eq. (1)). The AME of cohort among White (Black) people from this LPM is simply the coefficient on cohort (plus the coefficient on the cohort × racialized group product). Both the NLPM and LPM (Table 1, columns 5 and 6, respectively) indicate that the AME of cohort is more negative among White than Black people. The AMEs from the LPM and NLPM are not identical because these models weight observations differently, but they tell the same story (Holm et al. 2015; see also Mare 1981).^{29} In short, cohort effects on the probability scale (whether calculated from LPMs or NLPMs) suggest faster marriage declines among White than Black people, while cohort effects on the odds-ratio scale suggest the reverse (slower declines among White people). Table 1 provides a template for researchers to consider when deciding how to present their own effect heterogeneity results on multiple scales. Note that Table 1 contains rows devoted to the difference in cohort effects between racialized groups (not only rows devoted to the effect's magnitude within each group). Quantifying this difference is important for understanding heterogeneity; providing only stratum-specific effects is insufficient (Knol et al. 2009:165).

Examining effect heterogeneity on multiple scales is beneficial because it helps researchers make appropriately qualified statements about whether the effect is larger in one group than another (conditioning their statements on the effect's scale, as need be). However, two caveats are in order. First, researchers should not examine different scales in search of statistically significant differences. This type of “*p*-hacking” not only invalidates confidence levels but also risks mistaking differences that are statistically significant (insignificant) for differences that are substantively meaningful (meaningless). Second, researchers should think carefully about how the scales that they examine speak to their research question and scientific theory. Rotely reporting multiple scales can help researchers avoid inappropriately unconditional statements about the group in which effects are larger but it also risks substituting clutter for meaningful scientific insight. We elaborate on these two points next.

### Tip #3: Look at the Size of Effect Heterogeneity, Not the Stars

Although all researchers are aware of the difference between statistical significance and practical significance, we often fall into the trap of “star gazing.” We must work hard to avoid this trap because effect heterogeneity may be substantively large but statistically insignificant, or substantively small but statistically significant.^{30} Researchers should focus on the magnitudes of effect heterogeneity on different scales, not (only) their statistical significance. Doing so is especially important when flipped signs may occur, because effect heterogeneity can be statistically significant on multiple scales (or statistically insignificant on multiple scales) and yet flip signs across scales. Researchers exclusively focused on testing for statistical significance would miss the substantively important difference across scales. Below we discuss an example in which statistical tests align across scales but the direction of effect heterogeneity flips signs. We then discuss how to address this problem by thinking carefully about the substantive meaning of different scales.

#### Illustrating the Problem

Returning to our example of marriage trends across birth cohorts, Figure 4 illustrates that across racialized groups, genders, and effect scales, the marriage–cohort association is negative: more recently born people marry later in life than their earlier-born peers. Yet the association is *less* negative among White than Black people in odds ratios (Figure 4, panel a) but *more* negative in average marginal effects (panel b). The direction of effect heterogeneity flips signs across scales. However, the effect heterogeneity is not statistically significant at standard levels ($\alpha =.05$) on either scale for men (the 95% confidence intervals highlighted in gray contain zero, although they are mostly negative on the odds-ratio scale and mostly positive on the AME scale); for women, the effect heterogeneity is only significant on the AME scale. Thus, researchers solely focused on testing might conclude that we cannot determine whether Black and White men's marriage trends differed and they might, in a “*p*-hacking” tradition, highlight only the women's results on the AME scale and conclude that White women's marriage probabilities are converging toward Black women's low marriage probabilities. This conclusion might not be incorrect, but it is certainly incomplete in at least two ways. First, there is little indication that the racial difference in trends varied by gender (despite the “stars” aligning only for women). Second, and more importantly from an estimation perspective, although the AME results suggest racial convergence (because White people's marriage declines were faster than Black people's on the probability scale), the odds-ratio results suggest racial divergence (because White people's marriage declines were slower than Black people's on the odds-ratio scale). Facing this fact should force researchers to reckon with the substantive meaning of change on each scale and how these different changes speak to demographic theory. 
In short, focusing on the size of effect heterogeneity rather than the stars pushes researchers to take their own estimates seriously.

#### Addressing the Problem

The solution to the problem of flipped signs lies in not only acknowledging that they exist (if they do, in a specific application) but also providing a clear and compelling scientific rationale for which scale is substantively important in one's application. In the preceding application, we determined that neither the odds-ratio scale nor the AME scale was the most demographically relevant one. The reason was simple: both scales quantified the (associational) effects of cohort on individuals' monthly marriage hazards. But what we cared about was not what happened in a given month, but instead what happened over a life course. Thus, we selected a third scale for our primary analysis: the cumulative probability of ever marrying by age 40. We found Black–White divergence in this cumulative probability despite Black–White convergence in absolute marriage hazards.^{31}

The most important scales to consider are application-specific. Take another example: Cui and collaborators (2019) examine gender differences in mortality trends. They select life expectancy at birth ($e_0$) as their scale of key interest because of “public concern” about this metric and, thus, about how the gender gap in $e_0$ has changed (Cui et al. 2019:2308). Yet even in applications tightly focused on one specific scale, examining effect heterogeneity on multiple scales can provide insight into underlying mechanisms. Cui et al. (2019) note that even when there is no gender difference in the (associational) effect of cohort on mortality rates (indicating equal speed of change among men and women), there can be a gender difference in the effect of cohort on $e_0$ (reflecting a changing gender gap in life expectancy).

The space of potential scales for demographic analyses is very large; thus, selecting the best scale in each application requires serious theoretical consideration. One key question to consider is whether the scale should be *absolute* or *relative*.^{32} We recommend absolute scales in most cases. Absolute (additive) changes have several benefits over relative (multiplicative) changes. They avoid exaggerating small differences, particularly in applications when baseline probabilities are low (Citrome 2010). For example, an AME indicating a 1-percentage-point increase in the chance of experiencing the outcome event could, on the relative scale, be a doubling in that chance, if the baseline chance is only 1%. Experiments suggest that people find absolute changes more intuitive than relative changes, which impose higher cognitive burden and require more “cognitive warming” to interpret correctly (Prevodnik et al. 2014; Prevodnik and Vehovar forthcoming:30).

Absolute changes are also more appropriate than relative changes to compare across groups if the goal is to determine which group to target with treatment, particularly when resources are constrained (Greenland 2009). For example, if we had 100 doses of a vaccine that decreased the probability of an illness by 5 percentage points in group A but 10 percentage points in group B, then we would prevent five more illnesses by targeting the doses at group B (10 illnesses prevented) than by targeting them at group A (5 illnesses prevented). Only effects on the absolute scale provide this information consistently; relative scales can lead to mistargeting, depending on the baseline probabilities. Similarly, group differences in time trends on the absolute scale provide information about whether the groups are converging or diverging in their event probabilities, while relative trends may not. Further, tests for interactions on the absolute scale can sometimes be more powerful than tests on the relative scale (VanderWeele and Knol 2014). In fact, a lack of effect heterogeneity on the relative, odds-ratio scale strongly suggests the presence of effect heterogeneity on the absolute, probability scale (Knol et al. 2009).^{33}
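The targeting arithmetic in this example can be checked with a short sketch. The dose count and the absolute reductions come from the example in the text; the baseline illness probabilities are hypothetical values that we introduce only to show how a relative scale can rank the two groups differently.

```python
# Vaccine-targeting arithmetic from the example in the text.
doses = 100
abs_reduction = {"A": 0.05, "B": 0.10}  # 5 vs. 10 percentage points (from the text)

# Hypothetical baseline illness probabilities (an assumption, not from the text).
baseline = {"A": 0.10, "B": 0.40}

for group in ("A", "B"):
    prevented = doses * abs_reduction[group]
    rel_reduction = abs_reduction[group] / baseline[group]
    print(f"group {group}: {prevented:.0f} illnesses prevented, "
          f"{rel_reduction:.0%} relative risk reduction")

# Extra illnesses prevented by targeting group B rather than group A.
extra_prevented = doses * (abs_reduction["B"] - abs_reduction["A"])

# Relative reductions under the assumed baselines.
rel_a = abs_reduction["A"] / baseline["A"]
rel_b = abs_reduction["B"] / baseline["B"]
```

Under these assumed baselines, the relative reduction is larger in group A even though targeting group B prevents more illnesses, which is the mistargeting risk described above.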

The size of effect heterogeneity often differs across scales; it also often differs across units with different covariate values (because on many scales it depends on outcome event probabilities, which themselves differ across units with different covariate values; see Eqs. (2) and (3)). Thus, researchers considering the size of effect heterogeneity should also explore multiple covariate values. Visual approaches are particularly useful for this task (King et al. 2000; Mize 2019; Ruhe 2018). For example, a researcher could plot group differences in marginal effects on the *y*-axis of a graph against a covariate on the *x*-axis, separately for different values of a second covariate using different line types.

In sum, the flipped-signs phenomenon highlights the importance of taking our own estimates seriously by interpreting them on scales that are meaningful for our particular applications. While fetishizing *p* values is never advisable, it is particularly risky when flipped signs may occur because it can lead researchers to ignore scale dependence (and the information that it carries about group differences in outcome event probabilities).^{34}

### Tip #4: Avoid Odds Ratios in Most Circumstances

While there are many possible scales on which to quantify effect heterogeneity, we recommend against odds ratios in most circumstances. Instead, use alternative probability-based scales, including average marginal effects, cumulative probabilities, and relative risks.

#### Illustrating the Problem

Methodologists have long argued that odds ratios should not be compared across groups (Allison 1999; Mood 2010; Wooldridge 2002). This argument rests on the fact that group differences in odds ratios reflect group differences in residual heterogeneity, not only group differences in effects. Whether this argument applies only when considering categorical outcome variables to be coarse measures of continuous latent variables has been debated recently. Only continuous variables have residual variances that may differ across groups (Buis 2017; Kuha and Mills 2020). However, differences in residual variances can be conceptualized, alternatively, as differences across groups in the effects or variances of omitted predictors—and, in logit and probit models (unlike in linear models), coefficients on key predictors are sensitive to the inclusion/exclusion of omitted predictors, even if they are independent of the key predictors (Breen et al. 2018). The implication of this argument is that group differences in logistic regression coefficients (i.e., log-odds ratios) can be interpreted as effect heterogeneity only if all outcome predictors are included in the model (which is infeasible in most applications); otherwise, these differences might reflect group differences in omitted predictors' coefficients or distributions.

In addition to these technical concerns, people often misinterpret odds ratios as risk ratios (Grimes and Schulz 2008); odds ratios are fairly difficult to interpret for people not immersed in betting or biostatistics. Because odds ratios are more extreme than risk ratios (further above [below] one if the effect is positive [negative]), odds ratios may produce inaccurate impressions of exaggerated effects (among people mistaking odds ratios for risk ratios).
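A minimal sketch can make the gap between the two ratios concrete. The outcome probabilities below are hypothetical and the function names are ours; the point is only that odds ratios sit further from one than risk ratios unless the outcome is rare.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def odds_ratio(p_base, p_new):
    return odds(p_new) / odds(p_base)

def risk_ratio(p_base, p_new):
    return p_new / p_base

# Hypothetical (baseline, comparison) outcome probabilities.
scenarios = {
    "positive effect, common outcome": (0.40, 0.60),
    "negative effect, common outcome": (0.60, 0.40),
    "positive effect, rare outcome": (0.010, 0.015),
}
for label, (p_base, p_new) in scenarios.items():
    print(f"{label}: OR = {odds_ratio(p_base, p_new):.3f}, "
          f"RR = {risk_ratio(p_base, p_new):.3f}")
```

A reader who mistakes the odds ratio for a risk ratio in the first two scenarios would overstate the effect; in the rare-outcome scenario the two ratios nearly coincide.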

Mistaking odds ratios for risk ratios is particularly problematic because effect heterogeneity can flip signs across these two scales (see Eqs. (6) and (7)). Figure 5 describes scenarios in which we observe such flips.^{35} Flipped signs across odds ratios versus risk ratios are particularly likely when the outcome probability at the baseline level of $X=x$ is high in both groups, and (less so) when it is low in both groups (Figure 5, panel a).^{36} In contrast to flipped signs across AMEs versus odds ratios—which become increasingly likely as the group difference in outcome probabilities increases (Figure 2, panel b)—flipped signs across odds ratios versus risk ratios become less likely as the difference increases past .25 (Figure 5, panel b). However, the trend is nonlinear; as the difference increases between zero and .25, flipped signs become more likely.^{37} In sum, flipped signs across odds ratios versus risk ratios appear most likely when the group difference in baseline outcome probabilities is around .25.

#### Addressing the Problem

Many scholars recommend that effects from NLPMs always be presented on the probability scale (e.g., Long 2009; Mize 2019). Several probability-scale effects can be quantified, depending on the size of the change in the key predictor variable $X$ and the values of all predictor variables (including $X$ and any other predictors).^{38} Regarding the size of the change, researchers can assess marginal probability changes associated with infinitely small changes in $X$ (based on derivatives of the cumulative distribution function relating the outcome probabilities to the predictor variables) or discrete probability changes associated with larger finite changes in $X$ (based on first differences in probabilities at different $X$ values). Although marginal probability changes do not equal discrete probability changes associated with one-unit changes in $X$, they are often similar (Petersen 1985).
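The distinction between marginal and discrete probability changes can be sketched for a binary logit model. The intercept, slope, and starting value of $X$ below are hypothetical; the marginal change uses the logistic derivative $\beta p(1-p)$, and the discrete change is a first difference for a one-unit increase in $X$.

```python
import math

def inv_logit(eta):
    """Logistic cumulative distribution function."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical logit coefficients and evaluation point (assumptions).
intercept, slope = -1.0, 0.5
x = 0.0

p_at_x = inv_logit(intercept + slope * x)

# Marginal change: derivative of the probability with respect to X.
marginal_change = slope * p_at_x * (1 - p_at_x)

# Discrete change: first difference for a one-unit increase in X.
discrete_change = inv_logit(intercept + slope * (x + 1)) - p_at_x

print(f"marginal change: {marginal_change:.4f}")
print(f"discrete change: {discrete_change:.4f}")
```

With these values the two quantities are close but not identical, consistent with the point attributed to Petersen (1985).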

Regarding predictor values, researchers can assess changes at a variety of values of $X$ and all other predictors. Because the function relating probabilities and predictors is nonlinear, identically sized changes in $X$ associate with different-sized probability changes at different predictor values. Researchers may use the observed predictor values; each observation then has its own estimated effect based on its own predictor values, and researchers may average across the observations to obtain a summary measure (like the AME, discussed earlier). Alternatively, researchers may select the means of the predictor variables (or modes, for categorical predictors). More generally, researchers can select “representative values” (Long and Mustillo 2021:13), which may differ from the means or modes. Presenting multiple marginal effects at representative values reveals how effect heterogeneity varies across units with different predictor values (because their baseline event probabilities differ).
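The three summaries described in this paragraph can be compared in a small sketch. The coefficients and the five-observation sample are hypothetical, and the names `ame`, `mem`, and `mer` are ours: `ame` averages observation-specific marginal effects, `mem` evaluates the marginal effect at the predictor means, and `mer` evaluates it at chosen representative values.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical logit coefficients and a tiny illustrative sample (assumptions).
b0, b_x, b_z = -2.0, 0.8, 0.5
sample = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 1)]  # (x, z) pairs

def marginal_effect(x, z):
    """Marginal effect of x on the probability scale at the given values."""
    p = inv_logit(b0 + b_x * x + b_z * z)
    return b_x * p * (1 - p)

# Average marginal effect: mean of observation-specific effects.
ame = sum(marginal_effect(x, z) for x, z in sample) / len(sample)

# Marginal effect at the means of the predictors.
mean_x = sum(x for x, _ in sample) / len(sample)
mean_z = sum(z for _, z in sample) / len(sample)
mem = marginal_effect(mean_x, mean_z)

# Marginal effects at representative values chosen by the analyst.
mer = {(x, z): marginal_effect(x, z) for x in (0, 4) for z in (0, 1)}

print(f"AME = {ame:.4f}, MEM = {mem:.4f}")
for values, effect in mer.items():
    print(f"MER at (x, z) = {values}: {effect:.4f}")
```

Because the link function is nonlinear, the three summaries generally differ, which is why the choice of predictor values needs to be reported.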

For demographers using discrete-time event-history models, all of the foregoing probability-scale effects (e.g., AMEs, MERs, first differences) quantify differences within each time unit (e.g., differences in monthly first marriage probabilities). These may not be quantities of great scientific interest. Often, cumulative probability differences will be more informative than instantaneous probability differences, particularly when whether something ever happens is more important than how *quickly* or *slowly* it happens. For example, it may be more important to understand whether older people from high-income countries are less likely to *ever* die from COVID-19 than older people from low-income countries, versus whether they die from COVID-19 more *slowly* but ultimately die from this cause at similar rates.
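To illustrate how per-period and cumulative scales can tell different stories, the following sketch converts monthly hazards into the cumulative probability of ever experiencing the event, $1 - \prod_t (1 - h_t)$. All hazards, the 264-month window, and the group and cohort labels are hypothetical; they are not estimates from any data.

```python
def cumulative_probability(monthly_hazards):
    """Probability that the event occurs at least once, given per-month hazards."""
    surviving = 1.0
    for hazard in monthly_hazards:
        surviving *= 1.0 - hazard
    return 1.0 - surviving

months = 264  # hypothetical exposure window (22 years of monthly observations)

# Hypothetical constant monthly hazards for two groups in two cohorts.
hazards = {
    ("early", "group 1"): 0.010,
    ("early", "group 2"): 0.006,
    ("late", "group 1"): 0.004,
    ("late", "group 2"): 0.002,
}
cumulative = {key: cumulative_probability([h] * months)
              for key, h in hazards.items()}

# Group gaps on the monthly-hazard scale and on the cumulative scale.
hazard_gap = {c: hazards[(c, "group 1")] - hazards[(c, "group 2")]
              for c in ("early", "late")}
cumulative_gap = {c: cumulative[(c, "group 1")] - cumulative[(c, "group 2")]
                  for c in ("early", "late")}

for cohort in ("early", "late"):
    print(f"{cohort}: hazard gap = {hazard_gap[cohort]:.3f}, "
          f"cumulative gap = {cumulative_gap[cohort]:.3f}")
```

In this constructed example, the group gap in monthly hazards narrows across cohorts while the gap in cumulative probabilities widens, so a conclusion about convergence depends on which of the two scales is used.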

Risk ratios also have benefits compared with odds ratios. While odds ratios are sensitive to the inclusion of predictors that are independent of the key predictors of interest, risk ratios are not (Breen et al. 2018). Risk ratios are also collapsible, unlike odds ratios; the weighted average of subpopulation-specific odds ratios will not equal the full-population odds ratio, even in the absence of confounding (Pang et al. 2013).
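Noncollapsibility can be seen in a small numerical sketch. The two strata below are hypothetical, equally sized, and constructed so that the predictor is independent of stratum (no confounding); the stratum-specific odds ratios are both 4, yet the pooled odds ratio is smaller, whereas the pooled risk ratio equals a baseline-risk-weighted average of the stratum-specific risk ratios.

```python
def odds(p):
    return p / (1 - p)

# Hypothetical outcome probabilities by predictor value in two equally sized
# strata; the predictor is assumed independent of stratum.
strata = [
    {"p_x0": 0.20, "p_x1": 0.50},
    {"p_x0": 0.50, "p_x1": 0.80},
]

stratum_or = [odds(s["p_x1"]) / odds(s["p_x0"]) for s in strata]
stratum_rr = [s["p_x1"] / s["p_x0"] for s in strata]

# Pooled probabilities (simple averages because the strata are equally sized).
pooled_p_x0 = sum(s["p_x0"] for s in strata) / len(strata)
pooled_p_x1 = sum(s["p_x1"] for s in strata) / len(strata)
pooled_or = odds(pooled_p_x1) / odds(pooled_p_x0)
pooled_rr = pooled_p_x1 / pooled_p_x0

# Weighted average of stratum risk ratios, weighting by baseline risk.
total_baseline = sum(s["p_x0"] for s in strata)
weights = [s["p_x0"] / total_baseline for s in strata]
weighted_rr = sum(w * rr for w, rr in zip(weights, stratum_rr))

print("stratum ORs:", [round(v, 3) for v in stratum_or], "pooled OR:", round(pooled_or, 3))
print("weighted RR:", round(weighted_rr, 3), "pooled RR:", round(pooled_rr, 3))
```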

#### Caveat

Given the methodological and interpretive challenges associated with odds ratios, we recommend against their use in general. However, odds ratios are appropriate in specific circumstances. In retrospective case–control studies, absolute risks cannot be estimated and, thus, neither can risk ratios or probability differences; odds ratios are necessary in these circumstances (Andrade 2015). Odds ratios also have two other benefits. First, they are symmetric, unlike risk ratios: they do not depend on whether an event is coded as death or survival (e.g., marriage or nonmarriage) (Cummings 2009). Second, they are invariant across predictor values; they do not require researchers to select predictor values in order to summarize their effects of interest (Long and Mustillo 2021). In our view, these benefits do not outweigh the costs of employing odds ratios; at the least, we suggest reporting odds ratios alongside effects on other scales (except in retrospective case–control settings).

### Tip #5: Publish Sufficient Information for Readers to Assess Heterogeneity on Alternative Scales

We advise researchers to report effect estimates on the scales that are most theoretically meaningful in their specific substantive applications. Admittedly, this advice is easier to give than to follow because arguments about which scales are most meaningful will be somewhat subjective. The simple act of formulating an argument should advance science because it will require careful thought; nevertheless, subject-matter experts may disagree about the most appropriate scale in a given application. Consequently, we suggest that authors not only report effect heterogeneity on multiple scales (per Tip #2) but also provide sufficient information for interested readers to assess effect heterogeneity on alternative scales. In particular, to enable readers to assess effect heterogeneity on the absolute probability scale, researchers who report this heterogeneity on the odds-ratio scale (or, more generally, in terms of NLPM coefficients) should supplement their reports of key coefficients with all coefficient values (including the intercept), outcome event probabilities in each group, and descriptive statistics for all predictors. These quantities are needed to calculate MERs.^{39}

#### Illustrating the Problem

A little less than a third of research articles published in *Demography* in 2018–2019 used NLPMs (for reasons beyond creating weights); 47% of these explored effect heterogeneity (either by including a product term or by reporting stratified results) (see Figure 1).^{40} NLPMs were slightly less popular in the *American Sociological Review* and the *American Journal of Sociology* (appearing in 22% and 23%, respectively, of their 2018–2019 research articles), but conditional on appearing, NLPMs were highly likely to be used to assess effect heterogeneity (65% and 80% of the time). Clearly, then, researchers publishing in the top demography and sociology journals are likely to consider effect heterogeneity using NLPMs and, thus, to encounter situations when flipped signs could occur.^{41}

We endeavored to assess how often they occurred in practice. However, we were able to complete this assessment for only a small subset of our population of interest (i.e., original research articles published in 2018–2019 in the above-mentioned journals that used NLPMs to explore effect heterogeneity). Most of these articles did not provide sufficient information to complete this assessment. In particular, most did not report coefficients (or exponentiated coefficients) for all predictor variables, including the intercept. This information is necessary for calculating effects on the absolute, MER scale.^{42} Only 14 articles total provided sufficient information. The authors of these articles set the standard for results reporting. We note 33 cases of potential flipped signs from these 14 articles because we consider multiple relevant models from each article (all NLPMs that consider variation in outcome–predictor associations and that report all coefficients, including the intercept). To avoid highlighting effect heterogeneity that the original articles' authors did not consider important, we consider only the heterogeneity that the authors either formally tested and reported to be statistically significant or interpreted substantively.

Of the 33 cases that fit our selection criteria, 15% (5 / 33) evidence flipped signs (online appendix Table A1).^{43} For example, Yavorsky and collaborators (2019) provide fascinating information on gender differences in the predictors of being in the top 1% of the personal and household income distributions in the United States, 1995–2016. Focusing on odds ratios, they report that “the positive association between self-employment and personal one percent status is stronger for women than for men . . . compared to men, self-employment is more important for women to earn exceptionally high income” (Yavorsky et al. 2019:68). Indeed, the odds ratio is 30.1 for women versus 9.3 for men ($e^{3.4050}$ vs. $e^{3.4050 - 1.1770} = e^{2.2280}$). However, the MER is .0687 for women versus .1729 for men ($.0687 - (-.1042) = .1729$); in other words, being self-employed is associated with about a .07-percentage-point increase in the chance of being in the top 1% of income earners for women versus a .17-percentage-point increase for men. On the absolute, percentage-point scale, then, self-employment appears to be more important for men than for women. None of the other four types of effect heterogeneity from Yavorsky et al. (2019) reported in appendix Table A1 evidence flipped signs. Moreover, they correctly interpret the odds ratios in their article. However, the conclusion about the self-employment effect heterogeneity is scale-dependent, flipping across the absolute versus relative scales. Thinking counterfactually, the absolute effect difference suggests that similar increases in self-employment among women and men would be expected to increase gender inequality in top 1% status. Mistakenly using the relative effects in this counterfactual logic would incorrectly suggest the reverse (that similar self-employment increases among women and men would reduce gender inequality).
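The arithmetic in this example can be reproduced from the quantities quoted above. The sketch below uses only the coefficients and MERs reported in this paragraph; the variable names are ours, and the MERs are taken as given rather than recomputed (recomputing them would require the full coefficient table).

```python
import math

# Log-odds coefficients quoted in the paragraph above.
log_or_women = 3.4050
log_or_men = 3.4050 - 1.1770

or_women = math.exp(log_or_women)
or_men = math.exp(log_or_men)

# MERs quoted in the paragraph above (taken as given, not recomputed).
mer_women = 0.0687
mer_men = 0.0687 - (-0.1042)

# Compare the direction of the gender difference on each scale.
larger_for_women_relative = or_women > or_men
larger_for_women_absolute = mer_women > mer_men
flipped = larger_for_women_relative != larger_for_women_absolute

print(f"odds ratios: women {or_women:.1f}, men {or_men:.1f}")
print(f"MERs: women {mer_women:.4f}, men {mer_men:.4f}")
print("flipped sign across scales:", flipped)
```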

Their study of top 1% income may be particularly vulnerable to flipped signs because not only are outcome probabilities low by definition, but “there are stark gender differences in personal one percent status” (Yavorsky et al. 2019:65). Flipped signs are particularly likely when group differences in outcome probabilities are large (as discussed earlier).

#### Addressing the Problem

To enable readers to assess effect heterogeneity on multiple scales, researchers using NLPMs should include all coefficient values (including the intercept) in published tables (following the practice of the authors of the articles included in appendix Table A1). We suggest using appendix tables for this task (and focusing main tables on the key contrasts of interest). Researchers who personally wish to highlight effect heterogeneity on the odds-ratio scale may find that a few NLPM coefficients suffice (although we advise against this practice, per Tip #4). But if they nevertheless provide all coefficients, they allow other researchers to calculate effect heterogeneity on alternative scales. In addition to all NLPM coefficients, researchers should report outcome event probabilities in each group and descriptive statistics for all predictors to help readers select reasonable representative values.
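As a sketch of what a complete coefficient table makes possible, the following code takes a hypothetical set of published logit coefficients (intercept, group indicator, binary predictor $X$, and their product) and computes both the group-specific odds ratios and the group-specific discrete probability changes at representative values. The coefficient values and dictionary keys are invented for illustration and do not come from any article.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical "published" logit coefficients (assumptions for illustration).
coef = {"intercept": -6.0, "group": 2.5, "x": 3.4, "x_by_group": -1.2}

def probability(x, group):
    """Predicted outcome probability at the given values of X and group."""
    eta = (coef["intercept"] + coef["group"] * group
           + coef["x"] * x + coef["x_by_group"] * x * group)
    return inv_logit(eta)

results = {}
for group in (0, 1):
    # Odds ratio for X needs only the X coefficient and the product term.
    odds_ratio = math.exp(coef["x"] + coef["x_by_group"] * group)
    # Discrete probability change for X also needs the intercept and the
    # group coefficient.
    first_difference = probability(1, group) - probability(0, group)
    results[group] = {"odds_ratio": odds_ratio,
                      "first_difference": first_difference}
    print(f"group {group}: OR = {odds_ratio:.2f}, "
          f"probability change = {first_difference:.4f}")
```

In this constructed case the odds ratio is larger in group 0 while the probability change is larger in group 1. A reader could detect that flip only because the intercept and the group coefficient are available.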

## Discussion

Researchers often study effect heterogeneity—defined here as variability in the association between an outcome variable and a predictor variable across groups defined by the values of another predictor variable—in order to understand how inequalities between groups evolve or which groups benefit most from certain treatments (e.g., Kalil et al. 2012; Kuo and Raley 2016; Manski 2007). However, interpreting effect heterogeneity is tricky because whether an outcome–predictor association is larger in one group or another group is scale dependent. This fact is known to methodologists (Brumback and Berg 2008; VanderWeele 2019), but it has not been fully integrated into applied researchers' work. Indeed, researchers might anticipate *magnitude* differences across scales; yet they might be surprised that effect heterogeneity can *reverse directions* across scales, flipping signs. For example, a group difference may be positive (indicating a larger effect in group A than B) on an absolute, percentage-point scale but negative on a relative, odds-ratio scale. Similar to how Simpson's paradox highlights that an outcome–predictor association can reverse when estimated in the full population versus in subpopulations, the *flipped-signs phenomenon* highlights that group differences in an outcome–predictor association can reverse across scales.^{44}

We make two contributions to help applied researchers understand and present effect heterogeneity. First, we specify when flipped signs occur in the context of nonlinear probability models. Researchers can use the inequalities in Eqs. (4) and (7) to check whether flipped signs occur in their applications by plugging in their estimated NLPM coefficients and group outcome probabilities. Second, we provide practical tips to help researchers comprehend and communicate their effect heterogeneity results (specifically, regarding the groups in which their key effects are largest). For example, we advise researchers to know their outcome event probabilities (Tip #1), because flipped signs are particularly likely when group differences in outcome probabilities are large. As another example, we advise researchers to calculate effect heterogeneity on multiple scales (Tip #2). Doing so will help avoid unwarranted sweeping statements about effects being larger in one group than another. Instead, researchers should qualify their statements by specifying the scales on which the effects are larger. This practice should push researchers to think deeply when determining which scales to employ in their particular application. Our discussions about the benefits of absolute scales over relative scales may prove useful in this determination.

Researchers who find it onerous to transform the output from NLPMs into effects on the absolute, probability scale may consider using linear probability models.^{45} The misspecification bias in marginal effects from LPMs (when logit or probit models would have been more appropriate) is typically small (Holm et al. 2015), and the marginal effects themselves are invariant across covariate values, easing interpretation (Long and Mustillo 2021). But LPMs are not panaceas; flipped signs can occur in linear models as well. This fact is particularly useful to recognize in applications with logged outcome variables. In these applications, linear-model coefficients are marginal effects on the absolute scale with respect to the logged outcome; but these marginal effects are often interpreted as percentage changes (i.e., relative changes with respect to the unlogged outcome). A larger percentage increase in group A than group B could be a smaller absolute increase in group A than B (in unlogged terms, if the baseline, unlogged outcome level is lower in group A than B). Thus, researchers using linear models with log specifications should take care to assess whether flipped signs occur in their applications (and to make appropriately qualified statements about whether an effect is larger in group A than B, conditioning on the scale).
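The logged-outcome case can be illustrated with hypothetical numbers. In the sketch below, the baseline outcome levels and the log-scale coefficients are assumptions chosen for illustration; the percentage change is $e^{\beta} - 1$, and the absolute change multiplies that percentage by the group's baseline level.

```python
import math

# Hypothetical baseline outcome levels and log-scale coefficients (assumptions).
baseline = {"A": 20_000.0, "B": 50_000.0}
log_coef = {"A": math.log(1.10), "B": math.log(1.06)}  # effect on log(outcome)

# Relative (percentage) change implied by each log-scale coefficient.
percent_change = {g: math.exp(b) - 1 for g, b in log_coef.items()}

# Absolute change in unlogged units at each group's baseline level.
absolute_change = {g: baseline[g] * percent_change[g] for g in baseline}

for g in ("A", "B"):
    print(f"group {g}: {percent_change[g]:.1%} increase, "
          f"{absolute_change[g]:,.0f} units in absolute terms")
```

Group A has the larger percentage increase but the smaller absolute increase, because its baseline level is lower.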

A caveat regarding terminology is important, however: as stated in the Introduction, we use the term *effect heterogeneity* to include all types of differences in outcome–predictor associations across values of a third variable, not only differences in causal associations (nor only differences in causal associations generated by a third variable that is itself causal). For demographers in particular, understanding population heterogeneity is crucial even when the heterogeneity is not causal in nature (Duncan 2008; Xie 2007). Yet it is important to recognize that effect heterogeneity may appear because, from a causal perspective, models are misspecified or measurement errors are dominating. Observational studies aimed at estimating causal effects are subject to multiple biases (Morgan and Winship 2015; Xie et al. 2012; for discussion of confounding versus effect heterogeneity, see also VanderWeele 2012). Yet, regardless of whether effect heterogeneity is causal, we have provided novel information on the circumstances in which it flips signs across scales. We hope that demographers will emerge better equipped to describe the population variation that we all aim to understand.

## Acknowledgments

We gratefully acknowledge support from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (research grant P01HD087155 and center grant P2CHD041028). We benefited from the insightful discussions and comments of the *Demography* editors and reviewers, as well as William Axinn and Anders Holm.

## Notes

^{1}

Following common practice, we also use the terms *effect heterogeneity* and *interaction* interchangeably. However, some researchers use effect heterogeneity (or, equivalently, effect modification) to capture circumstances in which an outcome–treatment association differs across the values of a pretreatment variable, while *interaction* captures circumstances in which an outcome–treatment association differs across the values of a second treatment variable (Knol et al. 2009; VanderWeele and Knol 2014).

^{2}

For example, in a binary logistic regression, the logit transformation links the outcome’s mean to a linear combination of predictors and coefficients. Poisson models are not always classified as NLPMs, but they function similarly because their coefficients’ scales differ from their outcomes’ scales.

^{3}

We exclude from these proportions’ numerators all articles that use NLPMs solely to construct weights.

^{4}

Flipped signs are also obscured in the substantial literature that considers how to specify NLPMs to capture effect heterogeneity, including whether and when to include linear and nonlinear product terms (e.g., Beiser-McGrath and Beiser-McGrath forthcoming; Berry et al. 2010; Brambor et al. 2006; Hainmueller et al. 2019; Rainey 2016).

^{5}

Some advocate comparing marginal effects and predicted differences on the probability or percentage-point scale (Ai and Norton 2003; Landerman et al. 2011; Long and Mustillo 2021; Mize 2019). Others advocate comparing adjusted predictor–outcome coefficients or correlations on the scale of the latent continuous variable presumed to underlie the categorical outcome (Allison 1999; Breen et al. 2014; Williams 2009). It also remains common to compare odds ratios across groups, despite controversy over the validity of these comparisons (Kuha and Mills 2020; Mood 2010).

^{6}

We also provide inequalities for the relative risk-ratio scale versus the probability scale in the online appendix; relative risk ratios are exponentiated coefficients from multinomial logit models, like odds ratios are exponentiated coefficients from binary logit models.

^{7}

When researchers fail to understand the scale dependence of the effects that they report, the consequences can be profound. For example, in 2020 the United States issued an emergency-use authorization for convalescent plasma to treat coronavirus partly because the FDA commissioner claimed that “35 out of 100 people” might “survive as a result of it.” In fact, the number 35 was a preliminary estimate of an effect on the relative, risk-ratio scale, not the absolute, percentage-point scale (as the commissioner later noted); the absolute risk-reduction estimate was around 3 percentage points, not 35 (Blake 2020).

^{8}

Note that when estimated from Eq. (1), $p_i$ depends on $X$, $G$, and $Z$. Researchers interested in capturing effect heterogeneity on the probability scale without incorporating group differences in $Z$ may prefer MERs over AMEs; see appendix.

^{9}

This more positive effect on the odds-ratio scale will be a larger (smaller) effect in group 1 than in group 0 if $\beta_2 \geq 0$ ($\beta_2 < 0$).

^{10}

This less positive effect on the odds-ratio scale will be a smaller (larger) effect in group 1 than in group 0 if $\beta_2 > 0$ ($\beta_2 \leq 0$).

^{11}

If $p_{g,x+1|z}$ and $p_{g,x|z}$ are both close to zero (sometimes called the *rare disease assumption* in epidemiology), then $OR_g \approx RR_g$; otherwise, $OR_g$ will appear more extreme (further from one) than $RR_g$.

^{12}

As noted earlier, the sign of group heterogeneity will always be the same on the odds-ratio scale and the log-odds scale; when $OR_1/OR_0 = e^{\beta_3}$ is less than (greater than) [equal to] one, then $\ln(OR_1) - \ln(OR_0) = \beta_3$ is negative (positive) [zero], which indicates that $X$’s predictive effect is more negative (more positive) [the same] in group $G=1$ than group $G=0$.

^{13}

For another way to verify this fact, start from Eq. (6) and note that the denominator of $p_g^*$ can be rewritten as $1 - RR_g p_{g,x|z}$ (see also Shrier and Pang 2015). Then, take the derivative with respect to $p_{g,x|z}$. Next, rearrange Eq. (6) to express $RR_g$ in terms of $OR_g$ and take the derivative:

Equation (F1) shows that when $RR_g > 1$ ($RR_g < 1$), $\partial OR_g / \partial p_{g,x|z}$ is positive (negative), meaning that if $RR_1 = RR_0$ and $p_{1,x|z} < p_{0,x|z}$, then $OR_1 < OR_0$ ($OR_1 > OR_0$). That is, even though there is no group difference in effect on the RR scale, there is a group difference in effect on the OR scale. Likewise, Eq. (F2) shows that when $OR_g > 1$ ($OR_g < 1$), $\partial RR_g / \partial p_{g,x|z}$ is negative (positive), because, holding the odds ratio fixed, the risk ratio moves toward one as the baseline probability rises. This means that if $OR_1 = OR_0$ and $p_{1,x|z} < p_{0,x|z}$, then $RR_1 > RR_0$ ($RR_1 < RR_0$). That is, even though there is no group difference in effect on the OR scale, there is a group difference in effect on the RR scale.

^{14}

In fractional logit models, the outcome variable $Y$ is a proportion modeled with a logit link (Papke and Wooldridge 1996).

^{15}

We simulate thousands of scenarios by creating a grid of parameter space to explore, including all values of $p_0$ and $p_1$ (the average probability of the outcome in each of our two groups) between 0 and 1 in steps of .05 and all values of $\beta_2$ and $\beta_3$ (the coefficients on the main effect of our arbitrary covariate of interest $X$ and its product with our group indicator for $G=1$) between −2 and 2 in steps of .20. We substitute these values into Eq. (3) and observe when the direction of effect heterogeneity flips (such that the effect of $X$ is larger in group 0 than group 1 on the AME scale but smaller on the OR scale, or vice versa). We approximate the mean of $p_{gi}(1 - p_{gi})$, denoted $\tilde{p}_g$ in Eq. (3), with $p_g(1 - p_g)$.
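A simplified version of this grid search can be sketched as follows. Two assumptions should be noted. First, the code assumes that the group difference in logit AMEs takes the form $\beta_2(\tilde{p}_1 - \tilde{p}_0) + \beta_3\tilde{p}_1$, which is our reading of Eq. (3). Second, it uses a purely sign-based definition of a flip (the AME difference and $\beta_3$ have opposite signs), which is simpler than the "larger versus smaller effect" comparison described in this note and therefore will not reproduce the figures exactly. The function names are ours.

```python
def p_tilde(p):
    """Approximation of the mean of p_i * (1 - p_i) by p * (1 - p)."""
    return p * (1 - p)

def ame_difference(p0, p1, beta2, beta3):
    """Assumed form of AME_1 - AME_0 for a logit model with a product term."""
    return beta2 * (p_tilde(p1) - p_tilde(p0)) + beta3 * p_tilde(p1)

# Coefficient grid: -2.0 to 2.0 in steps of 0.2.
BETAS = [j / 5 for j in range(-10, 11)]

def flip_share(p0, p1):
    """Share of (beta2, beta3) pairs whose AME difference and log-odds
    difference (beta3) have opposite signs; beta3 = 0 is skipped."""
    flips = total = 0
    for beta2 in BETAS:
        for beta3 in BETAS:
            if beta3 == 0:
                continue
            total += 1
            if ame_difference(p0, p1, beta2, beta3) * beta3 < 0:
                flips += 1
    return flips / total

share_equal = flip_share(0.50, 0.50)
share_near = flip_share(0.50, 0.45)
share_far = flip_share(0.50, 0.05)

print(f"p0 = .50, p1 = .50: {share_equal:.3f}")
print(f"p0 = .50, p1 = .45: {share_near:.3f}")
print(f"p0 = .50, p1 = .05: {share_far:.3f}")
```

Under this simplified definition, flips become more common as the gap between the two groups' outcome probabilities grows.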

^{16}

These low and high scenarios are mirror images; they simply reflect the outcome coding (e.g., if $p_1$ is very low when coding $Y=1$ if the event occurs and 0 otherwise, then $p_1$ will be very high when coding $Y=1$ if the event does not occur and 0 otherwise).

^{17}

When both $p_1$ and $p_0$ equal .5 (and, more generally, whenever they are exactly equal), we observe about a 10% chance of a flipped sign (evidenced by light-gray boxes along the diagonals of Figure 2, panel a). These scenarios are not interesting because exact equality in groups’ probabilities is unlikely in real data.

^{18}

Switching the group coding (to include an indicator for $G=0$) would, of course, reverse the asymmetry.

^{19}

The one exception to this trend appears across $|p_0 - p_1| = .95$ versus 1, when the chance of flipped signs declines. However, the only way for $|p_0 - p_1|$ to equal 1 is for the event to always occur in one group but never in the other group. This scenario is not relevant for empirical demographic research.

^{20}

Figure 3, panel a (panel b), explores combinations of $p_0$ and $p_1$ when $p_1$ assumes the larger (smaller) value. As discussed earlier, these scenarios are asymmetric because Eq. (3) includes two functions of $p_1$.

^{21}

We explore a “high-danger” combination (of .05 and .50, wherein the difference in probabilities is large and thus the chance of flipped signs is high, plotted with large light-gray points), a “medium-danger” combination (of .15 and .50, plotted with medium-sized dark-gray points), and a “low-danger” combination (of .30 and .50, plotted with small black points). The low-danger points lie inside some of the medium-danger points, which lie inside some of the high-danger points (because whenever we observe flipped signs in the low-danger scenarios, we also observe them in the medium-danger and high-danger scenarios, but the reverse is not true, i.e., some combinations of $\beta_2$ and $\beta_3$ generate flipped signs in high-danger scenarios only).

^{22}

Whether a given set of coefficients results in flipped signs depends on the groups’ outcome probabilities. When the outcome probability is larger in group 1 than group 0 (Figure 3, panel a), flipped signs occur only when the effect of $X$ is smaller in group 1 than group 0 in the odds-ratio scale (which happens when $\beta_2$ is negative and $\beta_3$ is positive, or when $\beta_2$ is positive and $\beta_3$ is negative). Conversely, when the outcome probability is smaller in group 1 than group 0 (Figure 3, panel b), flipped signs occur only when the effect of $X$ is larger in group 1 than group 0 in the odds-ratio scale (which happens when $\beta_2$ and $\beta_3$ are both negative or when they are both positive).

^{23}

This case study uses data from the Panel Study of Income Dynamics; for details, see Bloome and Ang (2020).

^{24}

For example, the odds decline about 6% per year among Black women ($1 - .9403 = .0597$) versus about 4% among White women ($1 - .9586 = .0414$; Table 1, column 2).

^{25}

The log-odds differences and odds ratios vary across genders (e.g., compare row 1 vs. row 4 in Table 1, column 1), but their racial differences do not (e.g., compare row 3 vs. row 6 in Table 1, column 1) because we include a gender indicator (and product with racialized group) but no gender × cohort products.

^{26}

Monthly marriage probabilities are quite low because they accumulate with age; for example, our model predicts that by age 40, about 89% of White women born in 1970 had ever married.

^{27}

Here, $\tilde{p}_B < \tilde{p}_W$. Let the average monthly marriage hazard in racialized group $R$ be $\bar{p}_R = \frac{1}{n_R} \sum_{it \in R} p_{it}$. Then, $\tilde{p}_B$ will be less than (greater than) $\tilde{p}_W$ when $\bar{p}_B$ is less than $\bar{p}_W$ and less than (greater than) .5. If $\beta_3 = 0$, then following Eqs. (2) and (3), it is clear that we would see a larger (more negative, given that $\beta_2 < 0$) cohort AME among White people than among Black people. In our application, $\hat{\beta}_3 < 0$ (indicating that the cohort effect is more negative among Black people than among White people on the log-odds and odds-ratio scales). But even so, the racial difference in typical marriage probabilities is large enough that $\widehat{AME}_B - \widehat{AME}_W > 0$ (indicating that the cohort effect is less negative among Black than White people on the probability scale).

^{28}

We return to the discussion of absolute versus relative declines below; note that flipped signs are also possible across two relative scales such as odds ratios and risk ratios (see Eq. (7)).

^{29}

AMEs from LPMs and NLPMs are identical in fully saturated models (including the simplest fully saturated model, which incorporates a single binary predictor variable).
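The single-binary-predictor case can be checked by hand. The sketch below is illustrative only: the data are hypothetical, and rather than fitting models with a statistics library it relies on two standard closed-form results (the OLS slope on a binary regressor equals the difference in group means, and the saturated logit reproduces the group proportions). It also assumes the AME for a binary predictor is computed as an average discrete change.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical binary outcomes, split by a single binary predictor d.
y_d0 = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # 2 of 10 events when d = 0
y_d1 = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # 6 of 10 events when d = 1

p0 = sum(y_d0) / len(y_d0)
p1 = sum(y_d1) / len(y_d1)

# LPM: the OLS slope on a binary regressor is the difference in group means.
lpm_ame = p1 - p0

# Logit: in the saturated case the maximum likelihood estimates are
# intercept = logit(p0) and slope = logit(p1) - logit(p0).
alpha = logit(p0)
beta = logit(p1) - logit(p0)

# AME as the average discrete change in predicted probability (d: 0 -> 1).
# With no other covariates, the change is the same for every observation.
n = len(y_d0) + len(y_d1)
logit_ame = sum(inv_logit(alpha + beta) - inv_logit(alpha) for _ in range(n)) / n
```

Under these assumptions, `lpm_ame` and `logit_ame` agree up to floating-point error.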

^{30}

Indeed, widespread confusion between substantive and statistical significance partly motivates several prominent statisticians’ claim “that it is time to stop using the term ‘statistically significant’ entirely” (Wasserstein et al. 2019:2).

^{31}

The absolute declines in marriage hazards among Black people (though smaller than the absolute declines among White people) had *increasing returns* in terms of their impact on cumulative marriage probabilities because they declined from lower baseline levels; see also Preston and Guillot (1997).

^{32}

Group differences often flip signs across absolute versus relative scales (e.g., across AMEs versus odds ratios). Suppose that at time 1, person A has 10 apples and person B has 1,000 apples; by time 2, person A has 12 apples and person B has 1,100 apples. The absolute increase in apples over time is greater for person B ($1{,}100 - 1{,}000 = 100$) than person A ($12 - 10 = 2$). Yet the relative increase is greater for person A ($2/10 = 20\%$) than person B ($100/1{,}000 = 10\%$). These facts are not contradictory; they simply reflect the fact that persons A and B differed in their baseline number of apples. Thus, the heterogeneity in the (associational) effect of time on apples reverses across the absolute versus relative scales.
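The apples arithmetic can be written out as a short sketch. The function and variable names are ours, chosen for illustration; only the four apple counts come from the example above.

```python
def absolute_change(start, end):
    """Change measured in units of the outcome."""
    return end - start

def relative_change(start, end):
    """Change measured as a proportion of the baseline."""
    return (end - start) / start

person_a = (10, 12)      # apples at time 1 and time 2
person_b = (1000, 1100)

# Person A minus person B, on each scale.
abs_gap = absolute_change(*person_a) - absolute_change(*person_b)
rel_gap = relative_change(*person_a) - relative_change(*person_b)

# The A-versus-B comparison flips sign when the two gaps disagree in sign.
flipped = (abs_gap > 0) != (rel_gap > 0)
```

Here `abs_gap` is negative (person B gains more apples) while `rel_gap` is positive (person A gains proportionally more), so `flipped` is true.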

^{33}

Following Eq. (3), the AME in group *G* is $\frac{1}{n_G} \sum_{i \in G} \beta_G \, p_i (1 - p_i).$ If $\beta_{G=0} = \beta_{G=1},$ then effects are homogeneous on the odds-ratio scale, but they are unlikely to be homogeneous on the probability scale because it is unlikely that $\frac{1}{n_{G=0}} \sum_{i \in G=0} p_i (1 - p_i) = \frac{1}{n_{G=1}} \sum_{i \in G=1} p_i (1 - p_i)$.
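A small numerical sketch may make this concrete. All values below are hypothetical (a shared coefficient and two made-up sets of predicted probabilities); the function simply averages the coefficient times $p_i(1 - p_i)$ within a group, as in the expression above.

```python
def group_ame(beta, probs):
    """AME in a group: mean of beta * p_i * (1 - p_i) over its members."""
    return sum(beta * p * (1.0 - p) for p in probs) / len(probs)

# Hypothetical logit coefficient shared by both groups, so the effect is
# homogeneous on the log-odds and odds-ratio scales.
beta_common = -0.5

# Hypothetical predicted probabilities for the members of each group.
probs_g0 = [0.40, 0.50, 0.60]  # outcome is common in group 0
probs_g1 = [0.05, 0.10, 0.15]  # outcome is rare in group 1

ame_g0 = group_ame(beta_common, probs_g0)
ame_g1 = group_ame(beta_common, probs_g1)
```

With these illustrative inputs, `ame_g0` is more negative than `ame_g1` even though the coefficient is identical, because $p_i(1 - p_i)$ is larger when probabilities are near .5.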

^{34}

For researchers interested in a testing perspective, Brumback and Berg (2008:3462) suggest testing for effect heterogeneity on multiple scales jointly by specifying a joint alternative hypothesis.

^{35}

We simulate thousands of scenarios by creating a grid of parameter space to explore, including all values between .05 and .95 in steps of .05 for parameters $p_{gx}$ and $p_{g,x+1}$ for $g \in \{0, 1\}$. We exclude values of 0 and 1 to avoid dividing by zero when calculating ratios. We use these values in Eq. (6) to determine when we observe flipped signs (larger effects of arbitrary covariate $X$ in group 1 than group 0 on the odds-ratio scale but smaller effects on the risk-ratio scale, or vice versa). The overall percentage of flips depends on the parameters that we explore (e.g., when we search between .05 and .95 in steps of .05, the overall percentage is 7.4%, versus 6.1% when we search between .001 and .999 in steps of .001; both percentages are lower than the percentages that we observe when considering flipped signs in AMEs versus ORs or RRRs). However, the pattern is the same regardless of the parameters that we explore.
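A grid search of this kind might be sketched as follows. This is not the authors' code and does not use Eq. (6): as a stand-in, it compares each group's odds ratio and risk ratio directly, and it assumes that ties on either scale do not count as flips. It may therefore not reproduce the percentages reported in this note. Exact fractions are used so that ties are not affected by floating-point rounding.

```python
from fractions import Fraction
from itertools import product

# Probabilities .05, .10, ..., .95, stored exactly.
grid = [Fraction(k, 100) for k in range(5, 100, 5)]

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)

# For one group, every (p_gx, p_g,x+1) pair implies an odds ratio and a
# risk ratio for a one-unit change in X.
effects = [
    (odds(p_next) / odds(p_base), p_next / p_base)
    for p_base, p_next in product(grid, repeat=2)
]

def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

flips = 0
total = 0
# Pair every group-0 scenario with every group-1 scenario.
for (or0, rr0), (or1, rr1) in product(effects, repeat=2):
    total += 1
    # A flip: group 1's effect is larger on one scale but smaller on the other.
    if sign(or1 - or0) * sign(rr1 - rr0) < 0:
        flips += 1

share_flipped = flips / total
```

For instance, the scenario with group 0 moving from .05 to .10 and group 1 moving from .50 to .80 counts as a flip here: group 1 has the larger odds ratio (4 versus about 2.1) but the smaller risk ratio (1.6 versus 2).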

^{36}

Figure 5, panel a, is symmetric across the diagonal (rather than the vertical or horizontal axes as in Figure 2, panel a). We ignore the diagonal, which represents the case when the groups’ outcome probabilities are exactly the same; this case is very unlikely in real data.

^{37}

This nonlinearity (and the asymmetry across the vertical and horizontal axes in Figure 5, panel a) reflects the multiplicative relationship between the odds ratio and risk ratio (see Eq. (5)).

^{38}

We use the phrase *change* in *X*, but in applications where researchers do not observe change in $X$, the term *difference* is preferable.

^{39}

Specifically, they are needed to estimate the probabilities $\tilde{p}_{gr}$ for $g \in \{0, 1\}$; see online appendix Eq. (A1.6) (the MER analog to the AME-focused Eq. (3)).

^{40}

We do not limit our target population of models to those estimating causal effects. Rather, we consider all NLPM-based explorations of heterogeneous predictor–outcome associations.

^{41}

We selected the journals *Demography*, the *American Sociological Review*, and the *American Journal of Sociology* because of their status as top-ranked journals in their fields. We did not explore top-ranked economics journals because in our initial research, we found that articles published in those journals tended to use LPMs rather than NLPMs to model categorical outcomes. When discussed, NLPMs were generally reported as consistent with LPMs in a footnote; see our earlier related discussion about LPMs. For a discussion about publications in epidemiology, see Knol et al. (2009).

^{42}

For representative values, we typically use means (for continuous covariates), modes (for categorical covariates), and reference category values for fixed effects not listed. See online appendix section A4 for details.

^{43}

Table A1 in the online appendix reports, for each of our 33 cases, effect heterogeneity on the relative scale (columns 1 and 2) and the absolute scale (columns 3 and 4) and whether the sign flips across scales (column 5). We also describe the two groups (columns 6 and 7) and the outcome and the predictor whose heterogeneous effect is assessed (column 8). Moreover, we explored two additional sets of covariate values (beyond those used in Table A1; see appendix section A4) for four of the five cases in which we identified flipped signs in Table A1 (the fifth case stems from a model without covariates). We found that across all three sets of covariate values, three of the four cases evidenced sign flips twice (once using our original representative values plus once using one of our two alternative sets of values), while one case evidenced flipped signs once (using our original representative values).

^{44}

Outcome probabilities serve as weights when translating effects from the relative, odds-ratio scale to the absolute, probability scale (see Eqs. (2) and (3)); thus, when outcome probabilities differ substantially across groups, they can reverse relative group differences. Analogously, subpopulation shares serve as weights when combining subpopulation outcome–predictor associations into a full-population association; when these shares differ substantially, Simpson’s paradox can result. (Depending on the association’s metric, other statistics may also be involved, such as subpopulation means and variances; further, some associations are not collapsible, as mentioned earlier for odds ratios.)

^{45}

In linear models (unlike in NLPMs), coefficients (or simple combinations of coefficients, such as sums of coefficients on main-effect and product terms) are marginal effects. These marginal effects (or group differences therein) capture effect heterogeneity on the absolute scale of the outcome variable.

## References
