
Most meta-analyses use the ‘standardised mean difference’ (effect size) to summarise the outcomes of studies. However, the effect size has important limitations that need to be considered.

After a brief explanation of the standardised mean difference, limitations are discussed and possible solutions in the context of meta-analyses are suggested.

When using the effect size, three major limitations have to be considered. First, the effect size is still a statistical concept, and small effect sizes may have considerable clinical meaning while large effect sizes may not. Second, specific assumptions of the effect size may not be correct. Third, and most importantly, it is very difficult to explain the meaning of the effect size to non-researchers. As possible solutions, the use of the ‘binomial effect size display’ and the number-needed-to-treat are discussed. Furthermore, I suggest the use of binary outcomes, which are often easier to understand. However, it is not clear what the best binary outcome is for continuous outcomes.

The effect size is still useful, as long as the limitations are understood and binary outcomes are also given.

It was a historical event for the field of clinical psychology. In his presidential address to the American Educational Research Association in 1976 in San Francisco, Gene Glass not only coined the term “meta-analysis” but also introduced the basic ideas of modern meta-analyses.

Glass brought forward two basic ideas that are at the core of modern meta-analyses. The first was the ‘standardised mean difference’, or what is often called the ‘effect size’. The effect size indicates the difference between two conditions after the intervention in terms of standard deviations instead of actual scores on an outcome instrument. This makes the outcomes ‘standardised’, so they can be compared across studies. The second idea was that these standardised outcomes can be pooled across studies, weighted by the size of the samples. This pooling of the standardised outcomes results in one overall estimate of the true effect size across multiple studies.
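As a minimal sketch (in Python, with hypothetical numbers), the two ideas look like this; note that modern meta-analyses usually weight studies by the inverse of their variance rather than by raw sample size:

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardised mean difference: group difference in pooled-SD units."""
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / sd_pooled

def pooled_effect(effects, weights):
    """Weighted average of study-level effect sizes (Glass's second idea)."""
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)

# One hypothetical trial: treated patients score 5 points lower (= better)
d = cohens_d(mean_t=10.0, mean_c=15.0, sd_t=10.0, sd_c=10.0, n_t=50, n_c=50)
print(round(d, 2))  # -0.5: half a standard deviation below the control group

# Three hypothetical studies pooled with sample-size weights
print(round(pooled_effect([0.3, 0.5, 0.7], [100, 200, 100]), 2))  # 0.5
```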

It is now 45 years since these two basic ideas were introduced. The second idea, pooling outcomes according to the size of the study, has hardly been disputed since Glass introduced it. But the idea of the standardised mean difference has been more controversial over the years. In this paper, I will focus on the standardised mean difference. I will discuss whether this is still the best way of indicating the outcomes of interventions or associations between variables, or whether it is better to start using binary outcomes instead. I will call the standardised mean difference the ‘effect size’, although strictly speaking that is not correct.

It was a brilliant idea to indicate the difference between two groups in terms of the standard deviation of the outcome measure, instead of the actual difference in scores between the groups. This not only makes it possible to compare outcomes across different studies regardless of the outcome instrument used, but also gives an indication of the size of the effect. Previous research often only indicated whether the difference between two groups was significant or not. However, that is not very informative and says nothing about the size of the difference. Whether or not a difference is significant depends on the size of the sample, and even a tiny difference becomes significant when the sample size is large enough. The effect size solved this problem, because it goes beyond significance levels and indicates how large the difference is. Cohen suggested that an effect size of 0.2 should be considered small, 0.5 moderate and 0.8 large.

However, the use of effect sizes also has several important limitations. One important limitation is that it is still a statistical concept. It may indicate the strength of an outcome, but it cannot say anything about the clinical relevance of the outcome.

One solution to the problem that the effect size is a statistical concept could be the use of the ‘minimal clinically important difference’ (MCID), the smallest change in an outcome that patients perceive as meaningful.

The effect size has other problems. For example, it assumes that different outcome scales are linear transformations of each other and that the standard deviation units are indeed the same across all studies.

The most important problem of the effect size is, however, that it is so difficult to explain to non-scientists what it exactly means. Imagine a patient who is considering a treatment and asks the clinician what the chances are of getting better after treatment. The clinician would have to say something like: “if you get the treatment, you will score 0.5 standard deviations lower on the outcome measure than if you do not receive the intervention”. Of course, the patient has no idea what this actually means, and many clinicians also find it hard to understand.

There are some solutions to this problem. One older solution is to transform the effect size into the ‘binomial effect size display’ (BESD), which presents the effect as the difference in ‘success rates’ between two groups of equal size.
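A minimal sketch of the BESD, assuming equal group sizes (under that assumption the effect size d converts to a correlation via r = d/√(d² + 4), and the BESD success rates are 0.5 ± r/2; all numbers hypothetical):

```python
import math

def besd(d):
    """Binomial effect size display: convert effect size d to an equivalent
    correlation r, then to 'success rates' in two equal-sized groups."""
    r = d / math.sqrt(d**2 + 4)   # d-to-r conversion, equal group sizes assumed
    return 0.5 + r / 2, 0.5 - r / 2

treat, control = besd(0.5)
print(round(treat, 2), round(control, 2))  # 0.62 0.38: 62% vs 38% 'success'
```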

Another way to make the effect size easier to interpret is to transform it into the number-needed-to-treat (NNT). The NNT indicates the number of patients who have to be treated in order to obtain one more positive outcome than with no treatment (or an alternative treatment).
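One widely used conversion from the effect size d to the NNT is Kraemer and Kupfer's formula, which assumes normally distributed outcomes; other conversions exist and give somewhat different values. A sketch:

```python
import math
from statistics import NormalDist

def nnt_from_d(d):
    """Kraemer & Kupfer conversion from effect size d to NNT,
    assuming normally distributed outcomes in both groups."""
    # Probability that a randomly chosen treated patient has a better
    # outcome than a randomly chosen control patient (the AUC)
    auc = NormalDist().cdf(d / math.sqrt(2))
    return 1 / (2 * auc - 1)

print(round(nnt_from_d(0.5), 1))  # 3.6: treat about 4 patients for one extra success
```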

Binary outcomes are easier to understand than effect sizes. For example, in a trial the researchers can calculate the proportion of participants who respond (for example, defined as a 50% reduction in symptoms from baseline to post-test) in the treatment and control groups. They can also calculate the proportion of participants who recover completely (for example, by scoring below a cut-off on a symptom measure), who reliably improve or deteriorate, or who drop out of treatment. These binary outcomes can answer the question of the imaginary patient presented earlier very well: the patient hears an exact chance of getting better with the treatment compared with no treatment.
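The arithmetic behind such an answer is simple; with hypothetical trial numbers:

```python
# Hypothetical trial arms: number of responders / number randomised
resp_treat, n_treat = 30, 60
resp_ctrl, n_ctrl = 15, 60

p_treat = resp_treat / n_treat    # chance of response with treatment
p_ctrl = resp_ctrl / n_ctrl       # chance of response without treatment
risk_difference = p_treat - p_ctrl

print(f"Response: {p_treat:.0%} with treatment vs {p_ctrl:.0%} without")
print(f"Absolute benefit (risk difference): {risk_difference:.0%}")
```

This is the kind of statement a patient can act on: an extra 25 in 100 chance of responding, rather than an abstract number of standard deviations.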

For example, we recently conducted a meta-analysis of psychotherapies for depression.

So does this solve the problem? Should we all move away from the effect size and use binary outcomes instead? Unfortunately, binary outcomes also have problems. Perhaps the most important problem is that outcomes may be best considered a continuous phenomenon rather than a binary one. One can use binary outcomes that are informative, such as response or remission, but that does not solve the problem that, in principle, outcomes are still continuous. Another problem is that in individual trials binary outcomes have less statistical power to detect significant differences between treatment and comparison conditions. Furthermore, there is no way to decide what the best binary outcome is. In many trials of psychological treatments, the Reliable Change Index (RCI) is used.
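As an illustration, the Jacobson and Truax formulation of the RCI can be sketched as follows (all numbers hypothetical): a change is considered reliable when it exceeds what measurement error alone could plausibly produce.

```python
import math

def reliable_change_index(pre, post, sd_pre, reliability):
    """Jacobson & Truax Reliable Change Index: change score divided by the
    standard error of the difference; |RCI| > 1.96 indicates reliable change."""
    se_measurement = sd_pre * math.sqrt(1 - reliability)
    s_diff = math.sqrt(2) * se_measurement
    return (post - pre) / s_diff

# Hypothetical patient: symptom score drops from 30 to 18 on a scale with
# baseline SD 8 and test-retest reliability 0.85
rci = reliable_change_index(pre=30, post=18, sd_pre=8, reliability=0.85)
print(round(rci, 2), abs(rci) > 1.96)  # -2.74 True: reliably improved
```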

Another problem with reporting the chance of getting better in the treatment and control conditions is that these chances can easily be reported for individual trials, but pooling them in meta-analyses may be problematic. The problem with exact percentages is that when they are pooled, the heterogeneity of the outcome is often very high. Heterogeneity indicates the variability in the outcomes of the studies included in a meta-analysis. If heterogeneity is too high, the outcomes are too different from each other to allow pooling, and that is typically the case when proportions are pooled. On the other hand, these outcomes are so important to patients and clinicians that one could make the case for pooling them anyway, provided it is always noted that the outcomes can vary considerably.
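Heterogeneity is usually quantified with Cochran's Q and the I² statistic; a sketch with hypothetical response proportions (logit-transformed, as is common when pooling proportions):

```python
import math

def heterogeneity(effects, variances):
    """Cochran's Q and I² around a fixed-effect (inverse-variance) pooled estimate."""
    w = [1 / v for v in variances]
    pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - pooled) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2

# Hypothetical response proportions and sample sizes from five trials
props = [0.30, 0.55, 0.42, 0.70, 0.25]
ns = [60, 120, 80, 90, 70]
effects = [math.log(p / (1 - p)) for p in props]          # logit transform
variances = [1 / (n * p) + 1 / (n * (1 - p)) for p, n in zip(props, ns)]

q, i2 = heterogeneity(effects, variances)
print(f"Q = {q:.1f}, I2 = {i2:.0f}%")  # very high I2: naive pooling is dubious
```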

Usually, binary outcomes in meta-analyses are not reported as absolute percentages, because of the high heterogeneity. In most cases binary outcomes are reported in terms of relative outcomes, such as the Relative Risk (RR).
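A sketch of why relative outcomes alone can mislead: the same relative risk is compatible with very different absolute benefits (numbers hypothetical):

```python
def relative_risk(p_treat, p_control):
    """Relative risk: ratio of the event probability in the two arms."""
    return p_treat / p_control

# The same RR of 2.0 with very different absolute benefits:
print(relative_risk(0.50, 0.25))  # 2.0 -> 25 extra responders per 100 treated
print(relative_risk(0.02, 0.01))  # 2.0 -> only 1 extra responder per 100 treated
```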

However, relative outcomes do not answer the patient’s question about the chances of getting better after treatment. To answer that question, the actual, absolute chances have to be given.

So should we stop using the effect size and instead move to reporting the proportions of participants who improve in the treatment and control groups? I do not think that is necessary. Many studies already give the effect size and one or more binary outcomes, and that is probably the best solution.

But we should avoid obscuring outcomes by simply saying that a treatment is effective and that the effect size is large, moderate or small. Such a statement can mean many different things. A large effect size can still mean that many people do not get better, and a small effect size can represent a major breakthrough. It is important to state, in trials as well as in meta-analyses, what the effect sizes exactly mean in terms of relative and absolute binary outcomes.

The author has no funding to report.

The author has declared that no competing interests exist.

The author has no additional (i.e., non-financial) support to report.