In the recent Demography article titled “The Effect of Same-Sex Marriage Laws on Different-Sex Marriage: Evidence From the Netherlands,” Trandafir attempted to answer the question, Are rates of opposite-sex marriage affected by legal recognition of same-sex marriages? His approach to statistical inference, which looked for evidence of a difference in rates of opposite-sex marriage, yielded an absence of evidence of such effects. However, the validity of his conclusion of no causal relationship between same-sex marriage laws and rates of opposite-sex marriage is threatened by the fact that Trandafir did not also look for equivalence in rates of opposite-sex marriage in order to provide evidence of an absence of such an effect. Equivalence tests in combination with difference tests are introduced and presented in this article as a more valid inferential approach to the substantive question Trandafir attempted to answer.
In a recent Demography article, Trandafir (2014) made an important contribution to a growing nascent literature (Badgett 2004; Dillender 2014; Dinno and Whitney 2013; Langbein and Yost 2009) on the effects of extending legal recognition of same-sex marriages on opposite-sex marriages. Such research is important because it clarifies that legally recognized same-sex marriage—which is a material social good, legal right, and determinant of health—is not denied to the population of same-sex couples and their children based on a potentially spurious argument that such a denial will protect marriages of opposite-sex couples. The fundamental question here is, Are rates of opposite-sex marriage affected by legal recognition of same-sex marriages? Trandafir attempted to answer this question by examining (1) population marriage rates during an 18-year period in the Netherlands versus a synthetic control while adjusting for a variety of determinants of marriage rates, and (2) the behavior of individuals in a discrete-time event history model of marriage during a 10-year period while adjusting for a variety of determinants of the individual discrete hazard probability for marriage.
Unfortunately, Trandafir only half answered the statistical question of whether rates of opposite-sex marriage are affected by legal recognition of same-sex marriages, because he posed hypothesis tests for only difference in marriage rates and difference in marriage hazards but did not pose hypothesis tests for equivalence in marriage rates and marriage hazards. For example, in his analysis of aggregate rates, Trandafir actually tested whether differences between Netherlands marriage rates and synthetic control marriage rates were different from zero. However, he could have also tested whether the size of these differences was greater than some a priori level of relevance. As Altman and Bland (1995) noted, absence of evidence of an effect—such as the effect of same-sex marriage laws on rates or hazards of opposite-sex marriage—is not the same thing as evidence of the absence (or equivalence) of an effect. Trandafir looked for evidence of difference in marriage rates and did not find it. However, one cannot conclude there is equivalence without looking for evidence of such, and interpreting lack of significance of a difference in marriage rates or in marriage hazards as evidence of the absence of “causal estimates” veers toward the fallacy of “accepting the null hypothesis.”
Why is looking for “evidence of absence” important for policy makers? If policy makers are concerned about whether legalizing same-sex marriage affects rates of opposite-sex marriage, then to be confident that there is no effect, they need sufficiently strong evidence that opposite-sex marriage rates with same-sex marriage laws and without same-sex marriage laws are equivalent. Tests for difference do not provide evidence of equivalence. Equivalence tests do.
The null hypotheses of equivalence tests (H0–), sometimes termed “negativist” null hypotheses (Reagle and Vinod 2003), take the general form H0– : θ ≥ Δ1 or θ ≤ Δ2, which is tested as two one-sided null hypotheses, H01– : θ ≥ Δ1 and H02– : θ ≤ Δ2. Here Δ1 and Δ2 are upper and lower researcher-defined a priori levels of tolerance, with Δ1 > 0 and Δ2 < 0. (It is possible that Δ1 = |Δ2|.) By contrast, the more common null hypotheses of difference tests (H0+), termed “positivist” null hypotheses, take the general form H0+ : θ = 0. Equivalence tests, originally motivated by demonstrating equivalence in therapeutic pharmacological effects, were developed in clinical epidemiology using a “two one-sided tests” framework (Anderson and Hauck 1983; Hauck and Anderson 1984; Schuirmann 1987). However, evaluating evidence of equivalence is generally useful to the sciences because it allows the burden of evidence to be shared evenly between demonstrating the existence of a relationship and demonstrating the absence of a relationship.
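To make the “two one-sided tests” logic concrete, the following is a minimal sketch in Python. It assumes, purely for illustration, that the estimate of θ is approximately normally distributed with a known standard error; the function name `tost_z`, its arguments, and the numbers in the usage lines are hypothetical and are not taken from Trandafir (2014) or from any of the cited equivalence-testing papers. Each one-sided test is run at α, and the negativist null hypothesis is rejected only if both one-sided nulls are rejected.

```python
from statistics import NormalDist


def tost_z(estimate, std_error, lower, upper, alpha=0.05):
    """Two one-sided z tests for equivalence (illustrative sketch).

    Assumes the estimate is approximately normal with the given standard error.
    One-sided null hypotheses:
      theta >= upper  (rejected when the estimate lies far enough below upper)
      theta <= lower  (rejected when the estimate lies far enough above lower)
    Returns (p_upper, p_lower, reject_equivalence_null).
    """
    if not (lower < 0 < upper):
        raise ValueError("tolerances must satisfy lower < 0 < upper")
    if std_error <= 0:
        raise ValueError("std_error must be positive")

    norm = NormalDist()  # standard normal distribution
    z_upper = (upper - estimate) / std_error
    z_lower = (estimate - lower) / std_error

    # One-sided p-values: small when the estimate is well inside the tolerances.
    p_upper = 1.0 - norm.cdf(z_upper)
    p_lower = 1.0 - norm.cdf(z_lower)

    # Equivalence is concluded only if BOTH one-sided nulls are rejected,
    # so the larger of the two p-values decides.
    reject = max(p_upper, p_lower) <= alpha
    return p_upper, p_lower, reject


# Hypothetical usage: an estimated difference of 0.0 with tolerances of +/-0.5.
precise = tost_z(0.0, 0.1, -0.5, 0.5)    # small standard error
imprecise = tost_z(0.0, 1.0, -0.5, 0.5)  # large standard error
print(precise[2], imprecise[2])
```

With the same point estimate, only the precisely estimated difference supports a conclusion of equivalence, which is why a nonsignificant difference test alone cannot stand in for evidence of absence.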
Combining tests of H0+ (no difference) with tests of H0– (no equivalence) gives four possible conclusions, as shown in Fig. 1. These possible conclusions are as follows:
Rejecting neither H0+ nor H0– indicates indeterminacy (i.e., an underpowered test). For example, we are unable to draw conclusions about the effects of same-sex marriage laws on rates of opposite-sex marriage.
Rejecting both H0+ and H0– indicates trivial difference (i.e., the difference is ignorable at the given tolerance). For example, differences in rates of opposite-sex marriage under same-sex marriage laws are found to be ignorable because they are smaller than we have said we care about.
Rejecting H0+ but not rejecting H0– indicates a relevant difference (i.e., a difference that is large enough to fall outside the interval [Δ2, Δ1]). For example, differences in rates of opposite-sex marriage under same-sex marriage laws are found to be relevant because they are large enough to matter.
Not rejecting H0+ but rejecting H0– indicates equivalence (i.e., no difference within the interval [Δ2, Δ1]), as shown in Fig. 2. For example, rates of opposite-sex marriage under same-sex marriage laws are found to be equivalent; no evidence of a difference in rates was found.
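The four combined conclusions listed above amount to a small decision table over two test outcomes. A minimal sketch in Python follows; the function name `combined_conclusion` and its boolean arguments are hypothetical labels introduced here for illustration, and the two inputs are assumed to come from a difference test and an equivalence test run at the same α.

```python
def combined_conclusion(reject_difference_null, reject_equivalence_null):
    """Map the outcomes of a difference test and an equivalence test
    to one of the four combined conclusions (illustrative sketch).

    reject_difference_null:  True if H0+ (theta = 0) was rejected.
    reject_equivalence_null: True if H0- (theta outside the tolerances)
                             was rejected.
    """
    if reject_difference_null and reject_equivalence_null:
        # A difference exists but lies within the tolerance interval.
        return "trivial difference"
    if reject_difference_null:
        # A difference exists and cannot be shown to be within tolerance.
        return "relevant difference"
    if reject_equivalence_null:
        # No detectable difference, and the effect is within tolerance.
        return "equivalence"
    # Neither null rejected: the data cannot settle the question.
    return "indeterminate"


# Hypothetical usage: a nonsignificant difference test paired with a
# nonsignificant equivalence test does not license "no effect".
print(combined_conclusion(False, False))
```

The last case is the one at issue here: a difference test that fails to reject H0+ yields “equivalence” only when the equivalence test also rejects H0–.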
Trandafir’s research, and the policy implications drawn from it, would benefit from taking an equivalence testing approach to providing evidence of whether there is an absence of an effect of same-sex marriage laws. Because Trandafir tested only for differences in his analysis of individual rates, we do not know if the differences he found (legalized same-sex marriage increasing opposite-sex marriage in the Bible Belt but decreasing opposite-sex marriage in the four largest cities) are relevant or trivial. When conclusions are instead based on combined inference from tests for difference and tests for equivalence, the burden of evidence is divided between evidence for the existence of and evidence for the absence of an effect of same-sex marriage laws on opposite-sex marriage rates. Such an approach has previously been published in the literature on the effects of same-sex marriage legalization on rates of opposite-sex marriage (Dinno and Whitney 2013), albeit using a different form of equivalence test, as described in Wellek (2010). The population sciences in general, and the policies informed by them, would benefit from this kind of division of the burden of evidence between evidence of effect and evidence of absence of effect.
The overall Type I error is controlled with α, rather than with α/2, because the intervals defined by H01– and H02– do not overlap.