“There are three kinds of lies: lies, damned lies and statistics.” – Attributed to Benjamin Disraeli by Mark Twain (Mark Twain’s Own Autobiography: The Chapters from the North American Review)
“Everyone is a statistician, the only question is whether a good or bad one.” – Unknown
Despite the remarkable benefits of Deep Brain Stimulation (DBS), significant technological and efficiency challenges still face lead implantation surgery and limit patient access to DBS. Fortunately, there have been remarkable technical advances in surgical approaches that buoy hope for the future. However, the further development and validation of these advanced methods pose their own challenges.
How does one demonstrate that a new surgical approach is better? There are many facets to this question. For example, two methods may provide the same clinical outcomes, but one may be more efficient and less expensive and thus ultimately favored. However, the key phrase is “the same clinical outcomes”. How does one demonstrate that the outcomes are the same between competing approaches? This important question is rarely answered correctly in the published literature. Contributing factors include insufficient sample sizes, inappropriate controls or comparison groups, and incorrect statistical approaches.
The appropriate statistical question is not simply whether there is a statistically significant difference between the outcome measures of competing surgical techniques. Given a sufficiently large sample size, even small, clinically meaningless differences may reach statistical significance. Given a small sample size, even large, meaningful differences may be missed (a type II error). Conversely, even relatively small sample sizes can demonstrate statistically significant differences when none truly exists (a type I error), although this is uncommon. As statistically significant differences generally are taken as meaningful, a type I error can cause serious confusion. Fortunately, it is often possible to estimate the probability of a type II error, which can help provide confidence in a negative result.
The probability of a type II error is inversely related to the magnitude of the effect (how much better one method is compared to the other) and the number of subjects studied (sample size), and directly related to the variability (the inverse of repeatability) of the outcome measures; a sketch of these relationships follows below. The degree of confidence required regarding the probability of a type I or type II error depends greatly on the question being asked. For example, one question, for research purposes, is whether there is sufficient confidence to invest more in continuing to study the alternative surgical method (for example, as a proof of concept). Alternatively, there is the clinical question of whether there is sufficient confidence to accept the alternative method as clinically useful. The choice of which question to answer greatly influences which statistical methods are valid. Unfortunately, the majority of published studies do not make clear which question is being asked, which risks confusion as to the appropriate statistical methods. If the question is whether research should continue, the required degree of confidence may be much lower, because being too conservative might result in a promising surgical approach being abandoned prematurely. However, if the question is to demonstrate clinical utility, then a much higher degree of confidence may be required.
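To make these relationships concrete, the following sketch (this author's illustration, not taken from any cited study; every difference, standard deviation, and sample size is assumed for demonstration) approximates the power of a two-sided, two-sample comparison of means. It shows the probability of a type II error falling as the effect size or sample size grows, and rising as the variability of the outcome measure grows.

```python
# Illustrative sketch: approximate power of a two-sided two-sample z-test.
# All deltas, sigmas, and sample sizes below are assumed for demonstration.
from scipy.stats import norm

def power_two_sample(delta, sigma, n_per_arm, alpha=0.05):
    """delta: true difference in means; sigma: common standard deviation;
    n_per_arm: subjects per surgical method."""
    d = delta / sigma                      # standardized effect size
    z_crit = norm.ppf(1 - alpha / 2)       # two-tailed critical value
    return norm.cdf(d * (n_per_arm / 2) ** 0.5 - z_crit)

for delta, sigma, n in [(5, 13.5, 30), (5, 13.5, 115),   # more subjects -> lower beta
                        (10, 13.5, 30),                   # larger effect -> lower beta
                        (5, 27.0, 115)]:                  # more variability -> higher beta
    beta = 1 - power_two_sample(delta, sigma, n)
    print(f"delta={delta:>4}, sigma={sigma:>4}, n/arm={n:>3} -> P(type II) ~ {beta:.2f}")
```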
If one wishes to make any inference as to the clinical utility of an alternative surgical approach, the typical statistical designs are based on demonstrations of equivalence, non-inferiority, or superiority. Demonstration of equivalence is analogous to a two-tailed statistical test, whereas non-inferiority or superiority is analogous to a single-tailed test (see the sketch following this paragraph). In either event, it is first necessary to establish, even before the research is conducted, the effect size, such as a difference in outcomes, that is clinically meaningful. One then determines the degree of confidence needed for accepting or rejecting a result as clinically meaningful. The answer to this question generally turns on the consequences, in the broadest sense, of electing to perform or not perform the surgery. Typically, the degree of confidence relates to the probability of a type I or type II error, which also depends on the variability of the outcome measure. One default is an 80% probability of not committing a type II error, that is, 80% statistical power. With the acceptable degree of confidence determined by establishing a cutoff probability, the necessary sample size is calculated before the research starts.
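One common formulation of these designs works from a confidence interval on the difference between methods: equivalence requires the entire interval to lie within a prespecified margin in both directions, whereas non-inferiority constrains only one bound. The following sketch is a minimal illustration with hypothetical outcome data and an assumed 5-point margin; the 90% interval corresponds to two one-sided tests at an overall 5% level.

```python
# Minimal sketch: judging equivalence vs. non-inferiority from a confidence
# interval on the difference in outcomes. Data and margin are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
margin = 5.0                       # assumed clinically meaningful margin
a = rng.normal(25, 13.5, 115)      # hypothetical outcomes, method A (higher = better)
b = rng.normal(26, 13.5, 115)      # hypothetical outcomes, method B

diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
df = a.size + b.size - 2
half = stats.t.ppf(0.95, df) * se  # 90% CI <-> two one-sided 5% tests
lo, hi = diff - half, diff + half

print(f"90% CI for A - B: ({lo:.2f}, {hi:.2f})")
print("equivalent   :", -margin < lo and hi < margin)  # two-tailed logic
print("non-inferior :", lo > -margin)                  # one-tailed logic (A not worse)
```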
Consider DBS for Parkinson’s disease using the Unified Parkinson’s Disease Rating Scale (UPDRS) motor examination, where the mean score in untreated patients in a study was 43 with a standard deviation of 13.5. Assume that a clinically meaningful change is 5 points. In order to have an 80% probability of avoiding a type II error at p < 0.05, the sample size would have to be 230 subjects (115 subjects assigned to each of the two competing surgical methods); the calculation is sketched below. To this author’s knowledge, no study comparing DBS lead implantation with and without the use of microelectrode recordings achieved the necessary sample size to have confidence in the results.
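The figure of 115 subjects per arm can be reproduced with the standard normal-approximation formula for comparing two means; only the standard deviation and the 5-point threshold below come from the text, and the rest is routine arithmetic.

```python
# Reproducing the sample-size figure with the standard normal approximation
# for a two-arm comparison of means:
#   n per arm = 2 * (z_(1 - alpha/2) + z_power)^2 / d^2, where d = delta / sd
from math import ceil
from scipy.stats import norm

sd = 13.5            # standard deviation of UPDRS motor scores (from the text)
delta = 5.0          # clinically meaningful change (from the text)
alpha, power = 0.05, 0.80

d = delta / sd       # standardized effect size, ~0.37
n_per_arm = 2 * (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2 / d ** 2
print(ceil(n_per_arm), "per arm")   # -> 115, i.e., 230 subjects in total
```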
It is problematic to randomize the subjects of an individual surgeon to DBS with or without microelectrode recordings, hence the rarity of such studies. Consequently, studies compare the outcomes of one surgeon using one method to those of another surgeon using the alternative method. However, this presumes that all other considerations are equal and that the only determining factor is the choice of surgical method. Such a presumption requires quite a leap of faith and, at the least, must engender skepticism.
Another option is to compare the surgical outcomes of a surgeon using one method with the outcomes of previously published studies (historical controls), as often has been the case. However, this is problematic for a number of reasons. First, the surgeries were not done in the same time frame; thus, the older studies may not have had the advantage of improved technology or experience. Second, as surgeons vary in skill and experience, it is very difficult to control for this potential confound. If the surgeon or surgeons reporting their experience with the alternative surgical method are more skilled than the average surgeon in the historical control group, then the historical control average will underestimate what the surgeon or surgeons using the alternative approach should achieve, giving a false sense that the alternative surgery is superior. These challenges can lead to type I errors, as the simulation below illustrates.
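A small simulation makes the point. In the sketch below, the numbers are entirely hypothetical: the two “methods” are identical, and the reporting surgeon is simply assumed to be 5 points more skilled than the historical average, yet the comparison flags a “significant” advantage in most runs, an advantage that would be wrongly attributed to the surgical method.

```python
# Hypothetical simulation of the historical-control pitfall: the method has
# no effect at all, but an assumed 5-point skill advantage for the reporting
# surgeon makes the new method appear superior in most comparisons.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trials, flagged = 2000, 0
for _ in range(trials):
    historical = rng.normal(25.0, 13.5, 115)  # historical controls, average surgeons
    reported   = rng.normal(20.0, 13.5, 115)  # same method, better surgeon (lower = better)
    if stats.ttest_ind(reported, historical).pvalue < 0.05:
        flagged += 1
print(f"'significant' advantage in {flagged / trials:.0%} of simulated studies")
```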
The best design is a within-surgeon design, in which the subjects of any particular surgeon are randomized between the competing surgical approaches. However, such studies are difficult to conduct because of the difficulty of achieving equipoise on the part of the surgeon. Typically, surgeons believe that their choice of method makes a difference, and they have underlying reasons for that choice. It may be difficult for a surgeon to perform an alternative to his or her surgery of choice. Further, it would be very difficult to blind the surgeon to the method; thus, prior presumptions of the superiority of the surgeon’s original choice of method may bias the results. Again, this could lead to type I errors.
One could argue that it is reasonable to publish results, even those based on an insufficient sample size, recognizing that the results are “preliminary”. However, there are at least two reasons for not doing so. First, results based on an insufficient sample size, particularly when making a claim of no difference such as a demonstration of equivalence or non-inferiority, are uninterpretable. At the very least, such reports are a waste of resources; at worst, they are given undeserved credence with potentially inappropriate consequences. Second, one might argue that data collection could continue even after publication of the preliminary results, with subsequent publication of results based on a sufficient sample size. However, doing so creates a situation of multiple comparisons (the preliminary and the final results), which either risks a type I error due to alpha inflation (roll the dice enough times and the desired outcome will result) or requires a considerable increase in the final sample size; a small simulation follows.
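The sketch below (hypothetical data with no true difference between methods) illustrates the alpha inflation: taking a preliminary look at half the sample and a final look at the full sample pushes the chance of at least one false positive well above the nominal 5%.

```python
# Hypothetical simulation of alpha inflation from a preliminary plus a final
# analysis when there is no true difference between the two methods.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
trials, hits = 4000, 0
for _ in range(trials):
    a = rng.normal(43, 13.5, 230)   # method A outcomes (no true difference)
    b = rng.normal(43, 13.5, 230)   # method B outcomes
    p_prelim = stats.ttest_ind(a[:115], b[:115]).pvalue  # preliminary look
    p_final  = stats.ttest_ind(a, b).pvalue              # final look
    hits += (p_prelim < 0.05) or (p_final < 0.05)
print(f"familywise type I error ~ {hits / trials:.1%}")  # noticeably above 5%
```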
As Ludwig Wittgenstein wrote, “Whereof one cannot speak, thereof one must be silent.”