How to rate Risk of bias in Observational studies

See this topic in the GRADE handbook: Study limitations (Risk of Bias)

The content below is provided by Gordon Guyatt, co-chair of the GRADE working group

Supplemental reading: GRADE guidelines: 4. Rating the quality of evidence—study limitations (risk of bias)

Both randomized controlled trials (RCTs) and observational studies may yield misleading results if they are flawed in their design or conduct. Other publications refer to such flaws as problems with “validity”, “internal validity”, or “study limitations”; we will refer to them as “risk of bias”.

What method-issues to consider when assessing Risk of Bias in Observational studies

  1. Failure to develop and apply appropriate eligibility criteria  (inclusion of control population)
       Under- or over-matching in case-control studies 
       Selection of exposed and unexposed in cohort studies from different populations

  2. Flawed measurement of both exposure and outcome 
       Differences in measurement of exposure (e.g., recall bias in case-control studies)
       Differential surveillance for outcome in exposed and unexposed in cohort studies

  3. Failure to adequately control confounding
       Failure of accurate measurement of all known prognostic factors
       Failure to match for prognostic factors and/or lack of adjustment in statistical analysis

  4. Incomplete follow-up

How to do the assessment, practical aspects

  1. Summarizing risk of bias must be outcome specific
  2. Summarizing risk of bias requires consideration of all relevant evidence
  3. Existing systematic reviews are often limited in summarizing study limitations across studies
  4. What to do when there is only one RCT
  5. Moving from risk of bias in individual studies to rating confidence in estimates across studies
  6. Application of principles


Systematic reviews of tools to assess the methodologic quality of non-randomized studies have identified over 200 checklists and instruments (15-18).  Table 4 summarizes key criteria for observational studies that reflect the contents of these checklists. Judgments associated with assessing study limitations in observational studies are often complex; here, we address two key issues that arise in assessing risk of bias.

What method-issues to consider when assessing Risk of Bias in Observational studies


1) Case series: the problem of missing internal controls

Ideally, observational studies will choose contemporaneous comparison groups that, as far as possible, differ from intervention groups only in the decision (typically by patient or clinician) not to use the intervention.  Researchers will enroll and observe intervention and comparison group patients in identical ways.  This is the prototypical design using what might be called “internal controls” – internal, that is, to the study under conduct.
An alternative approach is to study only patients exposed to the intervention – a design we refer to as a case series (others may use “single group cohort”).  To make inferences regarding intervention effects, case series must still refer to results in a comparison group.  In many case series, however, the source of comparison group results is implicit or unclear.  Such vagueness raises serious questions about the prognostic similarity of intervention and comparison groups, and will usually warrant rating down from low to very low quality evidence.  For instance, in considering the relative impact of low molecular weight heparin versus unfractionated heparin in pregnant women, we find systematic reviews of the incidence of bleeding in women receiving the former agent (19, 20), but no direct comparisons with the latter.

Thus, case series typically yield very low quality evidence.  There are, however, exceptions.  Consider the question of the impact of routine colonoscopy versus no screening for colon cancer on the rate of perforation associated with colonoscopy.  Here, a large series of representative patients undergoing colonoscopy will provide high quality evidence.  When control rates are near zero, case series of representative patients (one might call these cohort studies) can provide high quality evidence of adverse effects associated with an intervention.  One should not confuse these with isolated case reports of associations between exposures and rare adverse outcomes (as have, for instance, been reported with vaccine exposure).
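
As a rough illustration of why a large, representative series can be informative when the control rate is near zero, the sketch below computes an observed event rate with a Wilson 95% confidence interval. The counts are hypothetical and chosen only for illustration; they are not taken from any colonoscopy study, and the function name is our own.

```python
import math


def wilson_interval(events: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p = events / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical case series: 30 perforations in 50,000 colonoscopies.
events, n = 30, 50_000
rate = events / n
low, high = wilson_interval(events, n)

# With a control rate assumed to be essentially zero (no colonoscopy, no
# colonoscopy-related perforation), the observed rate approximates the
# absolute risk attributable to the procedure.
print(f"rate per 10,000: {rate * 10_000:.1f}")
print(f"95% CI per 10,000: {low * 10_000:.1f} to {high * 10_000:.1f}")
```

The narrow interval reflects only precision; whether the series is representative remains a separate judgment.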

2) Dealing with prognostic imbalance

Observational studies are at risk of bias due to differences in prognosis in exposed and unexposed populations; to the extent that the two groups come from the same time, place, and population, this risk of bias is diminished.  Nevertheless, prognostic imbalance threatens the validity of all observational studies.  If the available studies have failed to measure known important prognostic factors, or have measured them badly, or have failed to take these factors into account in their analysis (by matching or statistical adjustment), review authors and guideline developers should consider rating down the quality of the evidence from low to very low.  

For example, a cohort study using a large administrative database demonstrated an increased risk of cancer-related mortality in diabetic patients using sulfonylureas or insulin relative to metformin(21).  The investigators did not have data available, and could therefore not adjust for key prognostic variables including smoking, family history of cancer, occupational exposure, dietary history and exposure to pollutants.  Thus, the study – and others like it that fail to adjust for key prognostic variables - provides only very low quality evidence of a causal relation between the hypoglycemic agent and cancer deaths.
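
A minimal sketch of the bookkeeping behind this example follows. The list of key prognostic factors comes from the text above; the set of variables the study adjusted for (age and sex) is a hypothetical assumption, and the function names are our own. The rule is deliberately simplistic: in practice the decision to rate down is a judgment, not a mechanical check.

```python
# Key prognostic variables named in the example above.
KEY_PROGNOSTIC_FACTORS = {
    "smoking",
    "family history of cancer",
    "occupational exposure",
    "dietary history",
    "exposure to pollutants",
}


def unadjusted_factors(adjusted_for):
    """Return key prognostic factors the analysis did not account for."""
    return sorted(KEY_PROGNOSTIC_FACTORS - set(adjusted_for))


def suggested_quality(adjusted_for):
    """Observational evidence starts at low; flag very low if key factors are missing."""
    return "very low" if unadjusted_factors(adjusted_for) else "low"


# Hypothetical: the database study could adjust only for age and sex.
adjusted = {"age", "sex"}
print("Not adjusted for:", ", ".join(unadjusted_factors(adjusted)))
print("Suggested starting judgment:", suggested_quality(adjusted))
```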

How to do the assessment, practical aspects


1. Summarizing risk of bias must be outcome specific

Sources of bias may vary in importance across outcomes.  Thus, within a single study, one may have higher quality evidence for one outcome than for another.  For instance, RCTs of steroids for acute spinal cord injury measured both all-cause mortality and, based on a detailed physical examination, motor function (24-26).  Blinding of outcome assessors is irrelevant for mortality, but crucial for motor function.  Thus, as in this example, if the outcome assessors in the primary studies were not blinded, evidence might be categorized for all-cause mortality as having no serious risk of bias, and rated down for motor function by one level on the basis of serious risk of bias.
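
The outcome-specific logic of this example can be sketched as a small record per outcome. This is only an illustration under a simplifying assumption (blinding of assessors is the sole criterion considered); the class and field names are hypothetical and not part of GRADE or any GRADE software.

```python
from dataclasses import dataclass


@dataclass
class OutcomeJudgment:
    outcome: str
    needs_assessor_judgment: bool  # does measuring the outcome involve judgment?
    assessors_blinded: bool

    def risk_of_bias(self) -> str:
        # Lack of blinding matters only when measurement involves judgment.
        if self.needs_assessor_judgment and not self.assessors_blinded:
            return "serious"
        return "no serious"


# Same trials, same lack of blinding, different outcomes.
judgments = [
    OutcomeJudgment("all-cause mortality", needs_assessor_judgment=False, assessors_blinded=False),
    OutcomeJudgment("motor function", needs_assessor_judgment=True, assessors_blinded=False),
]

ratings = {j.outcome: j.risk_of_bias() for j in judgments}
for outcome, rating in ratings.items():
    print(f"{outcome}: {rating} risk of bias")
```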

2. Summarizing risk of bias requires consideration of all relevant evidence

Every study addressing a particular outcome will differ, to some degree, in risk of bias.  Review authors and guideline developers must make an overall judgment, considering all the evidence, whether quality of evidence for an outcome warrants rating down on the basis of risk of bias.  

Individual trials achieve a low risk of bias when most or all key criteria are met, and any violations are not crucial.  Studies that suffer from one crucial violation – a violation of crucial importance with regard to a point estimate (in the context of a systematic review) or decision (in the context of a guideline) – provide limited quality evidence.  When one or more crucial limitations substantially lower confidence in a point estimate, a body of evidence provides only weak support for inferences regarding the magnitude of a treatment effect.

High quality evidence is available when most studies from a body of evidence meet bias-minimizing criteria. For example, of the 22 trials addressing the impact of beta blockers on mortality in patients with heart failure, most probably or certainly used concealed allocation, all blinded at least some key groups, and follow-up of randomized patients was almost complete(27).

GRADE considers a body of evidence of moderate quality when the best evidence comes from individual studies of moderate quality.  For instance, we cannot be confident that, in patients with falciparum malaria, amodiaquine and sulfadoxine-pyrimethamine together reduce treatment failures compared to sulfadoxine-pyrimethamine alone because the apparent advantage of the combination was sensitive to assumptions regarding the event rate in those lost to follow-up in two of three studies(28).
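
To show what "sensitive to assumptions regarding the event rate in those lost to follow-up" can mean, here is a sketch of a simple worst-case sensitivity analysis. All counts are hypothetical and are not taken from the cited malaria trials; the worst-case rule used here (every patient lost from the combination arm failed, none lost from the comparison arm failed) is one of several possible assumptions.

```python
def risk_ratio(events_a: int, n_a: int, events_b: int, n_b: int) -> float:
    """Risk in group A divided by risk in group B."""
    return (events_a / n_a) / (events_b / n_b)


# Hypothetical trial: 100 patients randomized to each arm.
combo = {"failures": 20, "followed": 90, "lost": 10}  # combination therapy
mono = {"failures": 30, "followed": 95, "lost": 5}    # single agent

# Complete-case analysis: ignore patients lost to follow-up.
complete_case = risk_ratio(
    combo["failures"], combo["followed"], mono["failures"], mono["followed"]
)

# Worst case for the combination: all lost combination patients failed,
# and no lost single-agent patients failed.
worst_case = risk_ratio(
    combo["failures"] + combo["lost"],
    combo["followed"] + combo["lost"],
    mono["failures"],
    mono["followed"] + mono["lost"],
)

print(f"complete-case risk ratio: {complete_case:.2f}")
print(f"worst-case risk ratio:    {worst_case:.2f}")
```

If the apparent benefit disappears under plausible assumptions about the missing patients, as it does with these made-up numbers, confidence in the estimate is reduced.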

Surgery versus conservative treatment in the management of patients with lumbar disc prolapse provides an example of rating down two levels due to risk of bias in RCTs(29). We are uncertain of the benefit of open discectomy in reducing symptoms after one year or longer because of very serious limitations in the one credible trial of open discectomy compared to conservative treatment. That trial suffered from inadequate concealment of allocation and unblinded assessment of outcome by potentially biased raters (surgeons) using unvalidated rating instruments (Table 6).

3. Existing systematic reviews are often limited in summarizing study limitations across studies

To rate overall confidence in estimates with respect to an outcome, review authors and guideline developers must consider and summarize study limitations considering all the evidence from multiple studies.  For a guideline developer, using an existing systematic review would be the most efficient way to address this issue.

Unfortunately, systematic reviews usually do not address all important outcomes, typically focusing on benefit and neglecting harm.  For instance, one is required to go to separate reviews to assess the impact of beta blockers on mortality(27) and on quality of life(30).  No systematic review has addressed beta-blocker toxicity in heart failure patients.  

Review authors’ usual practice of rating the quality of studies across outcomes, rather than separately for each outcome, further limits the usefulness of existing systematic reviews for guideline developers.  This approach becomes even more problematic when review authors use summary measures that aggregate across quality criteria (e.g., allocation concealment, blinding, loss to follow-up) to provide a single score.  These measures are often limited in that they focus on quality of reporting rather than on the design and conduct of the study(31).  Furthermore, they tend to be unreliable and less closely correlated with outcome than individual quality components(32-34). These problems arise, at least in part, because calculating a summary score inevitably involves assigning arbitrary weights to different criteria.

Finally, systematic reviews that address individual components of study limitations are often not comprehensive and fail to make transparent the judgments needed to evaluate study limitations.  These judgments are often challenging, at least in part because of inadequate reporting: just because a safeguard against bias isn’t reported doesn’t mean it was neglected(35, 36).  

Thus, although systematic reviews are often extremely useful in identifying the relevant primary studies, members of guideline panels or their delegates must often review individual studies if they wish to ensure accurate ratings of study limitations for all relevant outcomes.  As review authors increasingly adopt the GRADE approach (and in particular as Cochrane review authors do so in combination with using the Cochrane risk-of-bias tool) the situation will improve.

4. What to do when there is only one RCT

Many people are uncomfortable designating a single RCT as high quality evidence.  Given the many instances in which the first positive report has not held up under subsequent investigation, this discomfort is warranted.  On the other hand, automatically rating down quality when there is a single study is not appropriate.  A single, very large, rigorously planned and conducted multi-centre RCT may provide evidence warranting high confidence.  GRADE suggests especially careful scrutiny of all relevant issues (risk of bias, precision, directness, publication bias) when only a single RCT addresses a particular question.

5. Moving from risk of bias in individual studies to rating confidence in estimates across studies

Moving from 6 risk of bias criteria for each individual study to a judgment about rating down for quality of evidence for risk of bias across a group of studies addressing a particular outcome presents challenges. 
We suggest the following 5 principles: 

  1. Judicious consideration
    In deciding on the overall confidence in estimates, one does not average across studies (for instance if some studies have no serious limitations, some serious limitations, and some very serious limitations, one doesn’t automatically rate quality down by one level due to an average rating of serious limitations).  Rather, judicious consideration of the contribution of each study, with a general guide to focus on the high quality studies (as we will illustrate), is warranted.  

  2. Evaluate how much each trial contributes
    This judicious consideration requires evaluating the extent to which each trial contributes toward the estimate of magnitude of effect.  This contribution will usually reflect study sample size and number of outcome events – larger trials with many events will contribute more, much larger trials with many more events will contribute much more.

  3. Be conservative when rating down
    One should be conservative in the judgment of rating down.  That is, one should be confident that there is substantial risk of bias across most of the body of available evidence before one rates down for risk of bias.  

  4. Consider the context
    The risk of bias should be considered in the context of other limitations.  If, for instance, reviewers find themselves in a close call situation with respect to two quality issues (risk of bias and, say, precision) we suggest rating down for at least one of the two.

  5. Be explicit
    Notwithstanding the first four principles, reviewers will face close-call situations.  They should acknowledge that they are in such a situation, make explicit why they think this is the case, and make the reasons for their ultimate judgment apparent.

6. Application of principles

In a systematic review of flavonoids to treat pain and bleeding associated with hemorrhoids(37), with respect to the primary outcome of persisting symptoms, most trials did not provide sufficient information to determine whether randomization was concealed, the majority violated the intention-to-treat principle and did not provide the data allowing the appropriate analysis (Table 7), and none used a validated symptom measure.  On the other hand, most authors described their trials as double-blind, and although concealment and blinding are different concepts, blinded trials of drugs are very likely to be concealed(35) (Table 7).  Because the questionnaires appeared simple and transparent, and because of the blinding of the studies, we would be hesitant to consider lack of validation introducing a serious risk of bias.

Nevertheless, in light of these study limitations, one might consider focusing on the highest quality trials.  Substantial precision would, however, be lost (requiring rating down for imprecision) and the quality of the trials did not explain variability in results (i.e. the magnitude of effect was similar in the higher and lower risk of bias studies).  Both considerations argue for basing an estimate on the results of all RCTs.

In our view, this represents a borderline situation in which it would be reasonable either to rate down for risk of bias, or not to do so.  This illustrates that the great merit of GRADE is not that it ensures consistency of conclusions, but that it requires explicit and transparent judgments. Considering these issues in isolation, and following the principles articulated above, however, we would be inclined not to rate down for quality for risk of bias.

Three RCTs addressing the impact of 24-hour administration of high dose corticosteroids on motor function in patients with acute spinal cord injury illustrate another principle of aggregation(24-26).  Although the degree of limitations is in fact a continuum (as Figure 1 illustrates), GRADE simplifies the process by categorizing these studies – or any other study – as having “no serious limitations”, “serious limitations”, or “very serious limitations” (as in Table 5). 

The first of the 3 trials (Bracken in Figure 1), which included 127 patients treated within 8 hours of injury, ensured allocation concealment through central randomization, almost certainly blinded patients, clinicians, and those measuring motor function, and lost 5% of patients to follow-up at 1 year(24).  The flaws in this RCT are sufficiently minor to allow classification as no serious risk of bias.

The second trial (Pointillart in Figure 1) was unlikely to have concealed allocation, did blind those assessing outcome (but not patients or clinicians), and lost only one of 106 patients to follow-up(26).  Here, quality falls in an intermediate range, warranting classification as moderate risk of bias.  The third trial (Odani in Figure 1), which included 158 patients, almost certainly failed to conceal allocation, used no blinding, and lost 26% of patients to follow-up, many more in the steroid group than the control group(25).  This third trial is probably best classified as having very serious risk of bias.

Considering these three RCTs, should one rate down for risk of bias with respect to the motor function outcome?   If we considered only the first two trials, the answer would be no.  Therefore the review authors must decide either to exclude the third trial (thereby only including trials with few limitations) or include it based on a judgment that overall there is a low risk of bias (since most of the evidence comes from trials with few limitations) despite the contribution of the trial with very serious limitations to the overall estimate of effect.  This example illustrates that averaging across studies will not be the right approach.
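
The weighing described above can be sketched with the sample sizes reported for the three trials. This is a rough illustration only: it uses the number of patients as a crude proxy for each trial's contribution, whereas the contribution also depends on the number of outcome events, which the text does not give. The risk-of-bias labels follow the classifications above.

```python
# Sample sizes and classifications as described in the text above.
trials = [
    {"name": "Bracken", "patients": 127, "risk_of_bias": "no serious"},
    {"name": "Pointillart", "patients": 106, "risk_of_bias": "moderate"},
    {"name": "Odani", "patients": 158, "risk_of_bias": "very serious"},
]

total_patients = sum(t["patients"] for t in trials)

# Share of patients contributed by each trial (crude proxy for its weight).
shares = {t["name"]: t["patients"] / total_patients for t in trials}

# Share of the evidence coming from trials without very serious limitations.
share_fewer_limitations = sum(
    shares[t["name"]] for t in trials if t["risk_of_bias"] != "very serious"
)

for name, share in shares.items():
    print(f"{name}: {share:.0%} of patients")
print(f"From trials without very serious limitations: {share_fewer_limitations:.0%}")
```

On this proxy, a little under 60% of patients come from the first two trials, so the decision rests on judgment about each trial's contribution rather than on an average of the three ratings.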

_____________________________

GRADE guidelines: 4. Rating the quality of evidence—study limitations (risk of bias)
(The GRADE working group official JCE series)

"Study limitations in observational studies 
Systematic reviews of tools to assess the methodological quality of nonrandomized studies have identified more than 200 checklists and instruments. Below we summarize key criteria for observational studies that reflect the contents of these checklists. Judgments associated with assessing study limitations in observational studies are often complex; here, we address two key issues that arise in assessing risk of bias.

1. Failure to develop and apply appropriate eligibility criteria (inclusion of control population)
Under- or overmatching in case–control studies
Selection of exposed and unexposed in cohort studies from different populations

2. Flawed measurement of both exposure and outcome
Differences in measurement of exposure (e.g., recall bias in case–control studies)
Differential surveillance for outcome in exposed and unexposed in cohort studies

3. Failure to adequately control confounding
Failure of accurate measurement of all known prognostic factors
Failure to match for prognostic factors and/or lack of adjustment in statistical analysis

4. Incomplete follow-up

7.1 Case series: the problem of missing internal controls 
Ideally, observational studies will choose contemporaneous comparison groups that, as far as possible, differ from intervention groups only in the decision (typically by patient or clinician) not to use the intervention. Researchers will enroll and observe intervention and comparison group patients in identical ways. This is the prototypical design using what might be called “internal controls”—internal, that is, to the study under conduct.

An alternative approach is to study only patients exposed to the intervention—a design we refer to as a case series (others may use “single group cohort”). To make inferences regarding intervention effects, case series must still refer to results in a comparison group. In many case series, however, the source of comparison group results is implicit or unclear. Such vagueness raises serious questions about the prognostic similarity of intervention and comparison groups and will usually warrant rating down from low- to very low-quality evidence. For instance, in considering the relative impact of low–molecular weight heparin vs. unfractionated heparin in pregnant women, we find systematic reviews of the incidence of bleeding in women receiving the former agent [20], [21] but no direct comparisons with the latter.

Thus, case series typically yield very low-quality evidence. There are, however, exceptions. Consider the question of the impact of routine colonoscopy vs. no screening for colon cancer on the rate of perforation associated with colonoscopy. Here, a large series of representative patients undergoing colonoscopy will provide high-quality evidence. When control rates are near zero, case series of representative patients (one might call these cohort studies) can provide high-quality evidence of adverse effects associated with an intervention. One should not confuse these with isolated case reports of associations between exposures and rare adverse outcomes (as have, for instance, been reported with vaccine exposure).

7.2 Dealing with prognostic imbalance 
Observational studies are at risk of bias because of differences in prognosis in exposed and unexposed populations; to the extent that the two groups come from the same time, place, and population, this risk of bias is diminished. Nevertheless, prognostic imbalance threatens the validity of all observational studies. If the available studies have failed to measure known important prognostic factors, have measured them badly, or have failed to take these factors into account in their analysis (by matching or statistical adjustment), review authors and guideline developers should consider rating down the quality of the evidence from low to very low.

For example, a cohort study using a large administrative database demonstrated an increased risk of cancer-related mortality in diabetic patients using sulfonylureas or insulin relative to metformin [22]. The investigators did not have data available and could, therefore, not adjust for key prognostic variables, including smoking, family history of cancer, occupational exposure, dietary history, and exposure to pollutants. Thus, the study—and others like it that fail to adjust for key prognostic variables—provides only very low-quality evidence of a causal relation between the hypoglycemic agent and cancer deaths.

Go to the original article for full text, or go to a specific chapter in the article:

Rating down quality for risk of bias
Recording judgments about study limitations

See the Assessing Risk of Bias Training video from McMaster CE&B GRADE site:  http://cebgrade.mcmaster.ca/RoB/index.html
