This systematic review (2021) argues that de-blinding (breaking the blind) in randomized controlled trials (RCTs) of psychedelic therapies leads to an over-estimation, of as yet unmeasured size, of the outcomes that can be expected outside clinical trials. The authors propose measures to address this and urge caution when interpreting existing RCTs.
“There is increasing interest in the potential for psychedelic drugs such as psilocybin, LSD and ketamine to treat a number of mental health disorders. To gain evidence for the therapeutic effectiveness of psychedelics, a number of randomised controlled trials (RCTs) have been conducted using the traditional RCT framework and these trials have generally shown promising results, with large effect sizes reported. However, in this paper we argue that estimation of treatment effect sizes in psychedelic clinical trials are likely over-estimated due to de-blinding of participants and high levels of response expectancy generated by RCT trial contingencies. The degree of over-estimation is at present difficult to estimate. We conduct systematic reviews of psychedelic RCTs and show that currently reported RCTs have failed to measure and report expectancy and malicious de-blinding. In order to overcome these confounds we argue that RCTs should routinely measure de-blinding and expectancy and that careful attention should be paid to the clinical trial design used and the instructions given to participants to allow these confounds to be estimated and removed from effect size estimates. We urge caution in interpreting effect size estimates from extant psychedelic RCTs.”
Authors: Suresh Muthukumaraswamy, Anna Forsyth & Thomas Lumley
There is increasing interest in the potential for psychedelic drugs to treat a number of mental health disorders. However, most psychedelic RCTs have failed to measure de-blinding and response expectancy, and therefore effect sizes may be over-estimated.
This text discusses the history of the RCT, the use of placebos, masking, expectancy, alliance, reporting, the patient trajectory through an RCT, the use of parallel groups design, dose-response parallel groups design, pre-treatment parallel groups design, balanced placebo factorial design, and conclusions.
The last twenty years have seen a surge of interest in the therapeutic use of psychedelic drugs for a range of psychiatric/mental health conditions. However, the evidence for efficacy obtained from therapeutic RCTs of psychedelic drugs where masking has clearly failed falls short of the evidence obtained in other areas of medicine.
This paper argues that the standard clinical trial designs for gathering evidence for the efficacy of psychedelic drugs should be re-considered, and discusses the nature of expectancy effects, masking, and several statistical models that could be used to estimate treatment effects under different assumptions.
- Causal Inference and the Randomised Controlled Trial
Clinical trials consist of experimental units of observation (participants), treatments and the evaluation of outcomes. It is worth considering the logic of blinding and double-masking.
Making causal inferences from clinical trials within the formalism of Rubin’s causal model requires separating the response variable Y(u) from the cause variable S(u). This is done by measuring the response of an individual unit Y(u) at some time after exposure to S(u).
The fundamental problem of causal inference is that it is impossible to observe both potential outcomes for any individual at any point in time. The statistical approach taken in clinical trials is to quantify the average treatment effect (ATE) over a population of interest U.
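This approach does not assume that the individual treatment effect (ITE) is constant across U. Instead, a single response is observed for each unit alongside its assigned intervention, giving the variable pair (S, Ys).
In general, the conditional expected values of the observed response in each arm need not be the same as E(Yt) and E(Yc). If randomisation makes S statistically independent of the potential outcomes Yt and Yc, they are equal, and the difference in observed group means estimates the ATE.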
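The potential-outcomes logic above can be sketched in a short simulation. This is only an illustration under stated assumptions: the `severity` covariate, the true ATE of 2.0 and the self-selection rule are all hypothetical and do not come from the paper. It compares the difference in observed means when assignment is randomised with the same difference when sicker units tend to select treatment.

```python
import random
from statistics import mean


def simulate_trial(n=20000, true_ate=2.0, seed=0):
    """Compare difference-in-means under randomised and self-selected assignment."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        severity = rng.gauss(0.0, 1.0)            # hypothetical baseline covariate
        y_c = severity + rng.gauss(0.0, 1.0)      # potential outcome under control
        y_t = y_c + rng.gauss(true_ate, 0.5)      # ITE varies; its mean is the ATE
        units.append((severity, y_c, y_t))

    def diff_in_means(assignment):
        # Only one potential outcome per unit is ever observed.
        treated = [y_t for (_, _, y_t), a in zip(units, assignment) if a]
        control = [y_c for (_, y_c, _), a in zip(units, assignment) if not a]
        return mean(treated) - mean(control)

    # Randomised: S is independent of the potential outcomes.
    randomised = [rng.random() < 0.5 for _ in units]
    # Self-selected (assumed rule): more severe units are more likely treated.
    selected = [s + rng.gauss(0.0, 1.0) > 0 for (s, _, _) in units]
    return diff_in_means(randomised), diff_in_means(selected)


rand_est, sel_est = simulate_trial()
print(f"randomised estimate: {rand_est:.2f}, self-selected estimate: {sel_est:.2f}")
```

In this sketch the randomised estimate lands close to the assumed ATE, while the self-selected estimate is inflated because assignment is correlated with the outcome.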
The Rubin causal model assumes that sampling of one individual is unaffected by the assignment of other individuals, but this assumption might not hold in group therapy.
Deaton and Cartwright point out that the ATE can be easily misinterpreted because it does not apply to every member of the trial sample, and the sample ATE is more likely to differ from that of the population the sample is drawn from when small samples are used. Even after meeting internal validity criteria, it is an open question as to how the effect size may apply to the general population. This is particularly true for populations studied in psychedelic drug trials. Sir Austin Bradford-Hill noted that trials show results under careful observation and certain restricted conditions, but not necessarily in general use.
- Brief History of the RCT and Psychedelic RCTs
The gold-standard RCT was developed through cumulative progress over the course of the 20th century to reduce potential biases and strengthen causal inference in clinical research. The RCT was successively adopted into legislative frameworks, and by 1970 it was a requirement of the FDA that new drug applications submit RCT results. In the 1950s, investigations into the medical use of psychedelic drugs exploded, but the last NIMH project using LSD on human participants ended in 1968. The studies rarely met modern standards for RCT design and/or reporting.
- Placebo Responses, Expectancy and Alliance
The placebo response is the therapeutic effect of receiving a treatment that is not caused by any inherent properties of the treatment or due to natural progress of the disease. In trials with objective and/or binary outcomes, Hróbjartsson and Gøtzsche found that placebo groups showed no significant effects over and above those seen in the no-treatment arm. However, their approach underestimates the placebo effect in clinical practice, and their meta-analytical techniques may be flawed because they do not distinguish between participants who believed they had received the treatment and those who did not.
To understand the placebo effect, one must understand four types of healing: active-treatment induced healing, placebo-induced healing, healing induced by clinician-patient interaction and spontaneous natural healing. The placebo effect is often defined as a combination of placebo-induced healing and healing induced by clinician-patient interaction.
The Rubin Causal Model estimates the placebo response, but not the causal placebo effect, so a no-treatment control group is required to estimate the causal placebo effect.
While the inclusion of an additional no-treatment group might be useful in order to measure an average placebo effect (APE), adequate masking must be maintained. The platinum standard trial is an RCT where clinicians and patients are unaware that they are taking part in an RCT. This type of trial has been used in the past to compare placebo to active treatment. In a crossover trial for cancer pain, patients who gave informed consent to participate in the study had a greater placebo response than those who did not.
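The role of the no-treatment arm described above can be sketched with a small three-arm example. It reads APE as the average placebo effect and assumes that effects combine additively, so each effect is a simple difference between arm means. The function name and the scores are hypothetical and purely illustrative.

```python
from statistics import mean


def placebo_and_treatment_effects(no_treatment, placebo, active):
    """Return (APE, ATE) as differences between arm means (additive assumption)."""
    ape = mean(placebo) - mean(no_treatment)  # placebo arm vs no-treatment arm
    ate = mean(active) - mean(placebo)        # active arm vs placebo arm
    return ape, ate


# Hypothetical symptom-score improvements, for illustration only.
no_treatment = [1.0, 2.0, 3.0]
placebo = [4.0, 5.0, 6.0]
active = [7.0, 9.0, 11.0]

ape, ate = placebo_and_treatment_effects(no_treatment, placebo, active)
print(f"APE = {ape}, ATE = {ate}")
```

Without the no-treatment arm only the placebo response (the placebo arm mean) is available, and natural recovery cannot be separated from the causal placebo effect.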
APE and ATE are defined by row subtraction, which treats the effects as additive, but this assumption is uncertain under randomisation conditions, and psychedelic trials may even be unmasked. Interactive models for combining the effects of AT and PE have previously been proposed. However, even within Strassman’s framework, demonstrating efficacy still requires showing that ee for a psychedelic intervention is significantly different from ee for a (successfully masked) placebo.
The placebo effect in an RCT can be explained by the patient’s response to observation and assessment, the therapeutic ritual, and the patient-clinician interaction. Contributing factors include the therapeutic relationship between patient and therapist, the healing setting, the rationale, conceptual scheme, or myth underlying the treatment, and the ritual of the treatment itself.
Expectancy is a key contributor to the placebo response in RCTs. In RCTs of standard antidepressants, expectancy is higher in active-comparator trials than in placebo-controlled trials, and in RCTs of psychedelics, expectancy is higher at baseline than during the course of the trial.
A tool called the Credibility/Expectancy Questionnaire (CEQ) can be used to measure expectancy. The CEQ consists of two items: how successful do you think the treatment will be in reducing your depressive symptoms, and how much improvement do you think will occur. In a prospective RCT designed to manipulate expectancy, Rutherford et al found that expectancy improved in the open-label group post-randomisation but not in the parallel-group. This indicates that expectancy can be used to alter treatment outcomes.
Although most studies on ketamine in depression have not included psychotherapy, the model of psychedelic medicine being adopted for classical psychedelics such as LSD/psilocybin incorporates a strong element of psychotherapy.
A meta-analysis of 30,000 patients from 295 studies demonstrated that a warm, friendly manner of the practitioner was associated with an enhanced placebo response, and that increased follow-up visit numbers were associated with an increased placebo response in those allocated to placebo arms.
- Masking in Trials of Psychedelic Drugs
The International Council for Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) guidelines state that masking in clinical trials is intended to limit conscious and unconscious bias.
In our recent crossover-RCT assessing antidepressant responses to ketamine (0.43 mg/kg) which included the active placebo remifentanil (1.7 ng/ml), participants guessed correctly 88% of the time and scored their confidence with an average of 7.67/10 (SD=2.12). The other 12% guessed incorrectly with an average 9/10 confidence.
We contend that the estimation of ATE in psychedelic clinical trials is overestimated due to unmasking effects.
In medical RCTs, effect sizes can be exaggerated by 0.56 standard deviations in unmasked trials, and in meta-epidemiological analyses, effect sizes are significantly reduced in masked versus non-masked trials. In psychedelic trials, there is likely to be at least moderate risk of bias.
One potential counterargument to the importance of maintaining masking in trials is that the clinical benefits of psychedelics are so large that RCTs are not needed. This is not a new idea in medical practice: Philip’s paradox states that the more potent a therapeutic treatment, the less likely it can be shown in a double-masked trial. However, the large placebo effects seen with psychedelic trials make it difficult to rely on this ‘large-enough’ effect size argument.
To solve Philip’s paradox, Howick considers the nature of what a confounder is and distinguishes between benign and malicious unmasking. Psychedelic drugs create a rich array of psychological and physiological effects.
In ketamine trials, participants primarily identified the drugs for their psychoactive effects, not their beliefs about antidepressant efficacy. If double-masking cannot be successful, open-label trials are more useful for accurately characterising treatment response.
- Assessment of Masking Success in Clinical Trials
The CONSORT 2010 statement recommends that investigators report on any explicit breaks in masking, but not the success of masking. Even with the earlier CONSORT recommendation, masking success was evaluated and reported in clinical trials infrequently. Several arguments were made to remove masking assessment from the CONSORT 2010 statement, but these arguments are less relevant to psychedelic drugs with their clear psychoactive effects. Sackett does not suggest ignoring de-masking effects, but rather that specific confounders be measured and evaluated in a controlled fashion. However, this approach is not realistic for psychiatry.
There is no standard approach for assessing masking efficacy, but a categorical response variable, a five-point Likert scale including a “don’t know” option, and a contingency table can be used.
Two proposed masking indexes are the James Blinding Index (BIJames) and the Bang Blinding Index (BIBang). These indexes can be calculated for Likert-scale data using the conditional probabilities (Pj|i) method.
Weights are assigned such that 0 is assigned for correct guesses, 0.75 for incorrect guesses and 1 for “don’t know” responses. The Blinding Index (BIJames) can range between 0 and 1, with 1 indicating completely successful masking.
In psychedelic trials, masking may not be symmetrical between the trial arms, making BIBang a more appropriate index to use. However, a simple VAS may be useful as a covariate in the analysis of individual outcomes.
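As a minimal sketch of a per-arm masking index, the code below computes Bang's index in its simple three-category form (correct guess, incorrect guess, "don't know"), where the index for an arm is the proportion of correct guesses minus the proportion of incorrect guesses among all respondents in that arm. The five-point Likert version and the James index need weighting schemes that are not reproduced here. The function name and the counts are hypothetical.

```python
def bang_blinding_index(n_correct, n_incorrect, n_dont_know):
    """Bang's blinding index for one trial arm, three-category guesses.

    Ranges from -1 (all guesses wrong) through 0 (consistent with random
    guessing) to 1 (all guesses correct, i.e. complete unmasking).
    """
    total = n_correct + n_incorrect + n_dont_know
    if total == 0:
        raise ValueError("arm has no responses")
    return (n_correct - n_incorrect) / total


# Hypothetical guess counts per arm, for illustration only.
arms = {
    "active": {"n_correct": 18, "n_incorrect": 1, "n_dont_know": 1},
    "placebo": {"n_correct": 9, "n_incorrect": 7, "n_dont_know": 4},
}

indexes = {arm: bang_blinding_index(**counts) for arm, counts in arms.items()}
for arm, value in indexes.items():
    print(f"{arm}: BI = {value:.2f}")
```

Reporting the index separately for each arm is what allows asymmetric unmasking to show up, for example an active arm that is almost fully unmasked next to a placebo arm that is close to random guessing.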
When assessing de-masking, it is important to consider when after the intervention to perform this assessment. If some unmasking reflects speculations about efficacy, then earlier assessment may be more appropriate.
- Reporting of Masking and Expectancy in Psychedelic RCTs
We conducted systematic reviews of ketamine and classical serotonergic psychedelic trials to understand the measurement of expectancy, de-masking and alliance in psychedelic trials.
Of the 43 included studies, none measured pre-treatment expectancy of participants, five measured masking, three provided quantitative data, but none used existing masking indexes, and none reported maintenance of masking. Furthermore, only a minority of studies reported a masking assessment, with none reporting successful masking.
- The Patient Trajectory Through an RCT
Before considering potential solutions to masking/expectancy/alliance issues in psychedelic RCTs, it is important to reflect on the fact that participants do not live in an information vacuum and that the popular media has extensively covered the potential of these drugs to treat mental health conditions. Participants are provided with neutral terms in the RCT advertisement, and may be at their most unwell in their disease time-course when they enroll in the trial.
A triage step is usually involved in an RCT to assess eligibility. The provision of Participant Information Sheets and Consent Forms provides participants with extensive information regarding the interventions and trial contingencies, and may reduce masking even in an active-placebo-controlled trial. Participants doing their own research regarding drug interventions is common, and their knowledge of experimental design can affect their treatment outcomes.
Ballou et al described the environment and cultural context of their trial, which was useful for understanding placebo responses. This practice should be adopted into psychedelic clinical trial reporting.
In-person screening and treatment phase biases can occur due to rater biases, participant biases, post-randomisation attrition bias and carryover effects in crossover trials. Researchers should publish registered reports with peer review including statistical analysis plans prior to the commencement of the psychedelic RCT.
Figure 1 illustrates the biases that can occur in psychedelic RCTs and how potential trial designs might address these concerns.
- Trial Designs for Psychedelic RCTs
ATE = E(Yt) − E(Yc) is only true when participants are unable to guess their allocation, and the effect of treatment should be conditioned on the belief that participants have about their allocation.
In psychedelic RCTs, the key experimental challenge is to create contingencies so that participants in the control group believe they are in the treatment group. This is possible by weakening the condition in Equation 14, for example.
Active placebos can help achieve masking in RCTs, and midazolam has been used most commonly. However, it is unclear how effective these active placebos are at maintaining masking, and a meta-analysis showed that the effect size was reduced when using midazolam.
The effect of ketamine may be overestimated in trials for MDD because midazolam is not a perfect active comparator, and midazolam’s anxiolytic effect may be relatively short compared to the potential antidepressant effects of ketamine.
Several active placebos have been used in serotonergic psychedelic research, but none appear to have been successful in maintaining masking. Niacin, amphetamines, and low doses of the drug being studied have also been tried.
Active placebos may be insufficient to successfully maintain masking in psychedelic RCTs, and alternative trial designs may be needed alongside mild deception/vagueness in the information about trial contingencies provided to participants.
9.1. Placebo Lead-in Periods – Not recommended
In trials of standard antidepressants, placebo responders are removed from the trial, while non-responders are randomised into the main phase of the RCT. In psychedelic RCTs, the use of placebo lead-in periods might be entirely counter-productive.
9.2. Sequential Parallel Comparison Design – Not recommended
The sequential parallel comparison design (SPCD) RCT consists of two intervention phases and is somewhat similar to using a placebo lead-in period. However, it has the same issue with unequal belief attribution across groups as the placebo lead-in trial.
9.3. Crossover trials – Not recommended
Crossover trials are not ideal for psychedelic RCTs because of the potential for carryover effects and for interactions among belief about allocation, efficacy, arm and sequence, which can counteract any statistical efficiencies derived from the repeated-measures nature of crossover trials.
9.4. Delayed Treatment Design – Not recommended
In a delayed treatment design, all participants are told they will receive either drug or placebo, but the timing of the active treatment is not disclosed. After a delay, the active treatment is introduced into the placebo group.
9.5. Parallel Groups Design with Active Comparator
Standard parallel groups design with an active comparator may still be appropriate, provided that the active comparator has convincing enough psychedelic properties that a reasonable patient given placebo would believe they had been allocated to the treatment.
9.6. Dose-Response Parallel Groups Design
In some implementations, a dose-response RCT can be conceptually similar to the parallel groups design described above, where the active comparator is a low dose of the treatment. However, a dose-response RCT depends on the validity of the assumption that the low dose is not in fact therapeutic.
It may be possible to not inform participants that the trial in fact contains two or more dose level groups by deception or omission. However, a parallel group design with a different drug comparator would generally require that participants be informed of the different drugs that they may receive.
9.7. Pre-Treatment Parallel Groups Design
Similar to the parallel groups design, here active treatment is preceded by a pharmacological antagonist. For classical psychedelics, the 5-HT2A receptor antagonist ketanserin blocks the psychological effects of both LSD and psilocybin in a dose-dependent manner.
A potential therapeutic RCT using pre-treatment could be designed similar to the dose-response design, but not disclose its experimental purpose. This would allow for the examination of specific receptors on ATE generation.
9.8. Balanced Placebo Factorial Design
The balanced placebo design is a 2×2 factorial design with one factor of intervention and one factor of instructions. It involves four conditions: given drug and told drug, given drug and told placebo, given placebo and told drug, and given placebo and told placebo. In psychedelic RCTs, deceptive (or ambiguous) instructions may help to manipulate the participants’ beliefs around which intervention they have received, when asked after the intervention and before outcome assessment. However, the design itself is somewhat inefficient, and a 1:1:1:1 allocation ratio is not necessary.
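A 2×2 design of this kind crosses what participants are given (drug or placebo) with what they are told (drug or placebo). The sketch below shows one conventional way to summarise such a design from its four cell means: a main effect of the intervention, a main effect of the instructions, and their interaction. The function name, cell labels and data are hypothetical, and this is not presented as the paper's analysis.

```python
from statistics import mean


def balanced_placebo_effects(cells):
    """Main effects and interaction from a 2x2 (given, told) design.

    `cells` maps (given, told) pairs to lists of outcomes, where each of
    `given` and `told` is either "drug" or "placebo".
    """
    m = {key: mean(values) for key, values in cells.items()}
    dd, dp = m[("drug", "drug")], m[("drug", "placebo")]
    pd, pp = m[("placebo", "drug")], m[("placebo", "placebo")]
    return {
        "drug": ((dd + dp) - (pd + pp)) / 2,         # effect of what was given
        "instruction": ((dd + pd) - (dp + pp)) / 2,  # effect of what was told
        "interaction": (dd - dp) - (pd - pp),        # non-additivity of the two
    }


# Hypothetical outcome scores, for illustration only.
cells = {
    ("drug", "drug"): [10.0, 12.0],
    ("drug", "placebo"): [7.0, 9.0],
    ("placebo", "drug"): [5.0, 7.0],
    ("placebo", "placebo"): [2.0, 4.0],
}

effects = balanced_placebo_effects(cells)
print(effects)
```

A non-zero interaction term would suggest that the pharmacological effect depends on what participants believe they received, which is the situation an additive model cannot capture.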
9.9. Enrichment Factorial Design
A 2 x 2 enrichment design is similar to the balanced placebo design in that it has one factor for the intervention and one factor for the environment. If combined with an active comparator, this design may achieve greater results.
- Statistical Models and Assumptions
In this section we consider a sequence of progressively more realistic scenarios and the analyses they imply.
If blinding is achieved by experimental design, the simple difference in means estimates an average treatment effect, and the standard analysis is valid.
Confounding can be introduced by conditioning on the observed value of B (the participant's belief about their allocation), which is affected by randomisation. The same statistical model can be used to estimate ATE(b), but a causal interpretation is no longer guaranteed simply by randomisation.
We have not considered measurement error in B and in expectancy, E. If possible, consider measurement error models based on separate studies of test-retest accuracy.
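One simple reading of ATE(b) is a treatment-minus-control difference in mean outcome computed separately within each level of B, taking B to be the participant's guessed allocation. The sketch below implements that reading on hypothetical records; the function name and data layout are assumptions for illustration. Because B is measured after randomisation, these within-stratum contrasts are descriptive and are not guaranteed to be causal.

```python
from collections import defaultdict
from statistics import mean


def ate_by_belief(records):
    """Treatment-minus-control mean difference within each belief stratum.

    `records` is an iterable of (arm, belief, outcome) tuples, where arm is
    "t" (treatment) or "c" (control) and belief is the guessed allocation.
    """
    groups = defaultdict(list)
    for arm, belief, outcome in records:
        groups[(belief, arm)].append(outcome)

    estimates = {}
    for belief in sorted({b for b, _ in groups}):
        treated = groups.get((belief, "t"))
        control = groups.get((belief, "c"))
        if treated and control:  # both arms are needed to form a contrast
            estimates[belief] = mean(treated) - mean(control)
    return estimates


# Hypothetical (arm, belief, outcome) records, for illustration only.
records = [
    ("t", "t", 10.0), ("t", "t", 12.0),
    ("c", "t", 8.0), ("c", "t", 8.0),
    ("t", "c", 5.0),
    ("c", "c", 2.0), ("c", "c", 4.0),
]

estimates = ate_by_belief(records)
print(estimates)
```

In a heavily unmasked trial some strata will be nearly empty (few controls who believe they received the drug), so these estimates can be very imprecise or undefined.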
- Conclusions
This paper has attempted to outline the fundamental logic for how RCTs are used to establish causal relationships between treatments and outcomes. It has suggested several approaches that could improve the evidence for causation in psychedelic RCTs, including measurement of expectancy, assessment of masking, and full pre-publication of trial protocols.
This paper has focused on the role that RCTs might play in establishing a causal role for psychedelics in treating psychiatric diseases. However, it is important to acknowledge that other forms of evidence are used in medicine to establish causation, such as the Bradford-Hill criteria or surrogate biomarkers.
Linked Research Papers
Notable research papers that build on or are influenced by this paper
Overcoming blinding confounds in psychedelic randomized controlled trials using biomarker driven causal mediation analysis
This commentary (2023) suggests that causal mediation analysis using objective biomarkers could help establish causal pathways between treatment and outcome, providing greater confidence in the efficacy of psychedelic therapies before they are approved as regular medicines. This cautious approach is recommended to avoid potential drawbacks such as expanding indications based on low-quality evidence and unstable efficacy over time.
Control Conditions in Randomized Trials of Psychedelics: An ACTTION Systematic Review
This systematic review (2023, s=86) of psychedelic RCTs (up to May 2020) finds that for placebo, 61% used an inert control, 20% active comparators (e.g. niacin), 15% both, and 4% a lower psychedelic dose. Of the 21 therapeutic trials, only 3 (14%) compared different amounts of therapy. Most studies were blinded, but less than 20% tested blinding (generally poor).