What We Want Our Field to Prioritize

Eli J. Finkel and Paul W. Eastwick

TL;DR: Replicability is one of the essential features of a high-quality science, alongside features like discovery, internal validity, external validity, construct validity, consequentiality, and cumulativeness. Yet a given study cannot prioritize all of them at once. The extent to which it is crucial to address a given feature (say, construct validity rather than external validity) in a research area depends on the current strengths and weaknesses of that research area. Consequently, decisions about the relative priority of these features in a given research area at a given point in time should be made by researchers (and editors, reviewers, and readers) thinking critically about that particular research context.

In a blog post titled “What I Want Our Field to Prioritize,” Joe Simmons provides a lucid, forceful case that replicability is the preeminent feature of a high-quality science: “I want to be in a field that prioritizes replicability over everything else.” Consequently, he argues, “the #1 job of the peer review process is to assess whether a finding is replicable.” He concludes as follows: “It would be helpful for those who are resistant to change to articulate their position. What do you want our field to prioritize, and why?”

Joe’s clarity in terms of epistemology and editorial policy is helpful and productive, and although we ourselves are pro-change (i.e., we are not the resistant-to-change folks who are the target of his query; see below for elaboration), we would like to seize the opportunity to engage with him. In our view, there is no question that reform is needed, and yet our epistemological and policy views do not align with Joe’s “Replicability First” perspective.

We begin with a figure from a forthcoming (but already widely circulated) article that we wrote with Harry Reis.

The boxes under “Proximal Means” represent (some of) the features that a quality psychological science should exhibit:

  • Discovery: Do the findings document support for novel hypotheses?
  • Replicability: Do the findings emerge in other samples using a design that retains the key features of the original design?
  • Internal validity: Do the findings permit inferences about causal relationships?
  • External validity: Do the findings generalize across populations of persons, settings, and times?
  • Construct validity: Do the findings enable researchers to correctly link theoretical constructs to operationalizations?
  • Consequentiality: Do the findings have implications or consequences for other sciences and the real world?
  • Cumulativeness: Do the findings cohere in a manner that affords conceptual integration across studies?

We want to be in a field that prioritizes ALL of these features. Nevertheless, doing so is challenging. The following excerpt from our article outlines the inherent tradeoffs:

On Tradeoffs: No Study Can Accomplish Everything, and Resources Are Finite
When considering large collections of studies, it is important to pursue all of the features of a high-quality science. Depending on the context, some features might be prized more than others, but the collection of studies must achieve a reasonably high level of all features to be considered a mature research space. However, as we narrow the focus from a discipline to a topic area to a research program to an individual study, tradeoffs among the features loom ever larger. These tradeoffs emerge for two reasons. First, no single study can accomplish everything. In the wake of a given study, for example, there will always be alternative explanations for the effectiveness of a manipulation (i.e., doubts about internal validity), real-world contexts to which the finding may not generalize (i.e., doubts about external validity), and the possibility that the results capitalized on chance (i.e., doubts about replicability). Second, resources are finite. Each resource (time, money, research participants, etc.) that a scholar invests in a study oriented toward bolstering replicability is a resource that she does not invest in a study oriented toward, say, bolstering internal validity.

In the publication process, editors, reviewers, and readers typically evaluate single studies or small sets of studies, not topic areas. In a given article, for every well-powered conceptual replication designed to bolster construct validity and connect findings to theory, a well-powered direct replication designed to bolster the replicability of particular operationalizations goes unconducted, and vice versa. In this way, scholars in a given research space can collectively work to bolster all features of a high-quality science, but single studies and articles will always have to prioritize some features over others.

In his blog post, Joe provides a compelling example of the perils of caring too little about replicability. “Imagine I claim that eating Funyuns® cures cancer. This hypothesis is novel and interesting and important, but those facts don’t matter if it is untrue.” We completely agree: a field without replicability is useless. But logic does not require the conclusion in his next sentence: “Concerns about replicability must trump all other concerns.” A replicable discipline filled with nothing but hopelessly confounded or meaningless results—a state of affairs that would itself litter the field with untrue concepts, ideas, and theories—is also useless.

Of course, replicability concerns are substantial, even severe, in many literatures. We need to work to address these concerns head on. Nevertheless, replicability is not the only—or even the preeminent—problem in all literatures.

Here’s an example that hits close to home for us: the ideal standards literature, which tests the hypothesis that the match between a person’s ideals for a romantic partner (e.g., reporting a desire for a partner who has high earning potential) and the traits of his or her current partner (e.g., the partner’s actual earning potential) predicts relationship well-being. To be sure, there are a couple of highly cited recent findings in this domain that need to be replicated. But the most pressing issue in this literature is actually construct validity.

Myriad studies cited as supporting this ideal standards hypothesis use items like, “To what degree does your current romantic partner match your ideal partner for the characteristic ‘sexy’?” The problem with these items is that participants can’t separate these judgments from the extent to which the partner actually possesses the trait in question (e.g., “To what degree is your current romantic partner ‘sexy’?”). Indeed, these two types of judgments tend to correlate at approximately r = .90 (Rodriguez, Hadden, & Knee, 2015). Thus, this analysis strategy cannot rule out the (theoretically irrelevant) alternative explanation that people report greater relationship well-being when they think their partner has positive traits.
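
The construct validity problem can be made concrete with a toy simulation (invented numbers, not real data; the variable names and effect sizes are our illustrative assumptions). If ideal-match judgments correlate at r ≈ .90 with plain trait judgments, then even when well-being is driven entirely by the trait judgment, a naive analysis will “confirm” the match hypothesis—and the effect vanishes once the trait judgment is controlled:

```python
import math
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def residualize(ys, xs):
    """Residuals of ys after simple linear regression on xs (i.e., 'controlling for' xs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return [y - (my + b * (x - mx)) for x, y in zip(xs, ys)]

random.seed(0)
n = 20_000

# "To what degree is your partner sexy?" -- the plain trait judgment
trait = [random.gauss(0, 1) for _ in range(n)]
# "Does your partner match your ideal?" -- constructed to correlate ~.90 with it
match = [0.9 * t + math.sqrt(1 - 0.9 ** 2) * random.gauss(0, 1) for t in trait]
# Suppose well-being is driven ONLY by the trait judgment, not by ideal-match
wellbeing = [0.5 * t + random.gauss(0, 1) for t in trait]

r_measures = corr(trait, match)    # ~.90 by construction
r_naive = corr(match, wellbeing)   # a sizable, "theory-confirming" correlation...
r_partial = corr(residualize(match, trait), residualize(wellbeing, trait))  # ...that vanishes
print(round(r_measures, 2), round(r_naive, 2), round(r_partial, 2))
```

A direct replication of the naive analysis would reproduce `r_naive` every time; it simply cannot tell us whether the match construct is doing any causal work.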

In the real world—where resources are limited and every investigation prioritizes some features of high-quality science over others—we learn much less from direct replications of these particular studies than we learn from tests of the hypothesis using proper operationalizations of the key constructs. That is, every resource invested in directly replicating these ambiguous findings is a resource that was not invested in testing the theoretical ideas with psychometrically valid approaches (for a terrific example of how to examine this research question, see Lam, Cross, Wu, Yeh, Wang, & Su, 2016, Study 4).

This example is one among many. Closely related to the name-grade effects that Joe describes, there is Uri Simonsohn’s reanalysis of the name-letter effect, which acknowledges the replicability of this finding but points out problems with its construct and internal validity. Just recently, at a SESP symposium on moral psychology, researchers voiced concerns about (a) cumulativeness, with Linda Skitka noting how moral psychology is an enormous umbrella that includes diverse, unrelated phenomena; and (b) internal validity, with Bertram Gawronski observing that no one had thought to manipulate deontological and utilitarian reasoning (and that doing so reveals radically different conclusions).

Replicability might be a problem in these literatures, too, but we do not believe that these literatures will be well served by the epistemological perspective that they should value replicability over everything else or by the policy perspective that replicability should be the imperial editorial criterion for all empirical contributions. In some research contexts, other desiderata will correctly command higher priority.

We don’t know much about Funyuns®, but we know something about Vosges fair trade dark chocolate bacon bars: Eight dollars, and worth every penny. And it is easy to imagine that there is a correlation between eating Vosges dark chocolate bacon bars and being less likely to die from cancer. But if we were to see a study reporting this correlation, the first thing we would want to know is not whether it directly replicates, but rather whether it holds up in a study that appropriately controls for socioeconomic status.
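
A toy simulation makes the third-variable worry concrete (again, invented numbers purely for illustration; we are not claiming anything about actual chocolate buyers). If socioeconomic status drives both chocolate purchases and health outcomes, a robustly replicable raw correlation disappears once SES is partialled out:

```python
import math
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

def residualize(ys, xs):
    """Residuals of ys after simple linear regression on xs (i.e., 'controlling for' xs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return [y - (my + b * (x - mx)) for x, y in zip(xs, ys)]

random.seed(1)
n = 20_000
ses = [random.gauss(0, 1) for _ in range(n)]        # socioeconomic status
chocolate = [s + random.gauss(0, 1) for s in ses]   # wealthier people buy the $8 bars
health = [s + random.gauss(0, 1) for s in ses]      # wealthier people also fare better

r_raw = corr(chocolate, health)                                      # sizable and replicable
r_adj = corr(residualize(chocolate, ses), residualize(health, ses))  # ~0 once SES is controlled
print(round(r_raw, 2), round(r_adj, 2))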

We haven’t said much about the changes we’re excited about, so we’ll be explicit here: We are excited to see journals deemphasizing flashy, phenomenon-focused, atheoretical findings (a practice that arguably prioritized only discovery while neglecting the other features). We are excited that it is now easier to publish direct replications and theory-relevant, methodologically solid null findings. We are excited that editors are becoming more likely to publish articles with all their flaws on display (e.g., multi-study packages with a mix of significant and nonsignificant findings) rather than expecting perfection. And we would be excited to see the field rewarding papers that use non-WEIRD samples (to bolster external validity) or that develop real-world applications (to bolster consequentiality).

More broadly, we are in favor of changes that help scholars to identify the weak spot in their particular literature and address it. That weak spot will often be replicability, but just as often it will be something else. We want to be in a field that allocates resources toward improving multiple features of quality science, not just one.

 

Author Feedback
We shared a draft of this post with Joe to make sure it didn’t have any factual inaccuracies or mischaracterizations. He raised no concerns, but he did write a response (see here).