Mar 19 2010

Statistics in Science

Tom Siegfried wrote an article on the use of statistics in science that is simultaneously excellent and frustrating. It is an excellent review of common errors in using and thinking about statistics in science. But it is frustrating because Siegfried frames his article as a problem with “science”, as if his criticisms are criticisms of science itself rather than of the failings of individuals. At times he also writes as if the problems with statistics he points out are fatal flaws, or issues that “some” researchers are just now starting to take into consideration.

Rather, from my perspective, statistics is a very complex and challenging field. Most scientists I know have a moderately sophisticated understanding of statistics; many know far more about statistics than I do, although some apparently know less. However, most large studies consult statisticians who are experts in statistical analysis. The primary difficulty lies in interpreting studies once they are published (even when the paper itself gets the statistics correct).

But the specific problems Siegfried points out have been widely known for years, and researchers with a better understanding of statistics have been taking them into account for as long as I can remember. I learned about most of them in medical school from researchers and experts in public health.

But all that aside, Siegfried does provide an excellent review of common mistakes in interpreting statistics in research, especially medical research. The article is worth a thorough read, but I will give some highlights.

The first and most common error is misinterpreting the meaning of statistical significance, which is usually expressed as a p-value. The p-value is often described as the probability that the results of the study were due to chance alone, so that a p-value of 0.05 would mean there is only a 5% chance that the study results are a false positive. But this is not an accurate description of the p-value.

Rather, the p-value says: if we assume the null hypothesis (no effect) is true, what is the probability of getting results at least as extreme as those observed in the study? This may seem like a subtle difference (and it is), but it’s important. It is not the same thing as saying there is a 95% chance that positive results reflect a real effect.
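To make the definition concrete, here is a minimal Python sketch built on a hypothetical coin-flip experiment rather than any real study: the null hypothesis is a fair coin, and we observe 15 heads in 20 flips. The one-sided p-value is the probability, computed assuming the null is true, of seeing 15 or more heads. The function name `binomial_p_value` is purely illustrative.

```python
from math import comb

def binomial_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= heads), computed assuming the null hypothesis
    (each flip lands heads with probability p_null) is true."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )

p = binomial_p_value(15, 20)
print(f"p = {p:.4f}")  # prints p = 0.0207
# This is P(data at least this extreme | null), NOT P(null is true | data).
```

A p-value of about 0.02 says only that data this lopsided would be uncommon if the coin were fair. By itself it does not give the probability that the coin is biased, which also depends on how plausible a biased coin was to begin with.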

First, statistical significance does not account for the rigor and quality of the study. It assumes no bias or flaws, and it does not account for other statistical flukes that could alter the results (such as poor randomization – see below).

But most importantly, the p-value of an individual study was never meant to be the final arbiter of what is true in science. Kimball Atwood has already written an excellent review of this question over at Science-Based Medicine. You should give that a read as well, but in brief, the point is that a Bayesian analysis is more appropriate. In other words, we begin with a prior probability of a claim being true, based upon all existing research. We then add the results of the current study to arrive at a posterior probability. So a study that is significant at the 0.05 level may still only increase the probability of a treatment working from a 5% prior probability to a 10% posterior probability.
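The Bayesian point can be sketched with the odds form of Bayes’ theorem: posterior odds = prior odds × Bayes factor. This is an illustration under assumptions, not Atwood’s analysis. To turn a p-value into a Bayes factor it borrows one published calibration, the bound of Sellke, Bayarri and Berger, which says the evidence against the null from a p-value p (for p < 1/e) is at most 1 / (−e · p · ln p). Under that bound, p = 0.05 corresponds to a Bayes factor of at most about 2.5, which moves a 5% prior to roughly an 11% posterior, in line with the 5%-to-10% example. The function names are illustrative.

```python
import math

def max_bayes_factor(p: float) -> float:
    """Upper bound on the Bayes factor in favour of a real effect for a given
    p-value, using the -e * p * ln(p) calibration (valid only for p < 1/e)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("calibration only applies for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))

def posterior_probability(prior: float, bayes_factor: float) -> float:
    """Odds form of Bayes' theorem: posterior odds = prior odds * Bayes factor."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

bf = max_bayes_factor(0.05)
print(f"max Bayes factor for p = 0.05: {bf:.2f}")  # about 2.46
print(f"posterior from a 5% prior: {posterior_probability(0.05, bf):.1%}")  # about 11.4%
```

Even with the bound chosen to favour the study as much as possible, a single “significant” result leaves an implausible claim more likely false than true. Starting instead from a 50% prior, the same bound gives a posterior of about 71%.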

Another way to look at this is that you cannot interpret a single study in terms of whether or not a treatment works, and Siegfried makes this point as well. You have to put it into the context of prior research (sound familiar?).

There are other common problems in statistics as well. Siegfried points out the common problem of multiple comparisons: if you look at 20 variables, on average one of them will reach a p-value of 0.05 even if the null hypothesis is true for all of them. But this is an old problem, long ago solved by using statistical methods designed to account for multiple comparisons. In fact, readers of this blog and SBM have likely encountered this before as a criticism of uncritical analysis of some studies. The take-home lesson is: always ask yourself how many different comparisons the researchers did (different variables, different points in time, different outcome measures), and whether they cherry-picked those that were positive.
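The arithmetic behind the 20-variable example is easy to check. The sketch below assumes the 20 tests are independent (real outcome measures are often correlated, which changes the numbers) and uses the Bonferroni correction, one of the simplest of the corrections mentioned above; the function names are illustrative.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_tests independent tests is 'significant'
    at level alpha when the null hypothesis is true for every one of them."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Bonferroni correction: compare each p-value against alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

m = 20
print(f"expected false positives: {m * 0.05:.1f}")                  # 1.0
print(f"P(at least one p < 0.05): {familywise_error_rate(m):.0%}")  # 64%
print(f"Bonferroni per-test threshold: {0.05 / m:.4f}")             # 0.0025
```

A result with p = 0.03 looks “significant” on its own, but it would not pass a Bonferroni threshold of 0.0025, which is exactly the kind of positive finding to be suspicious of when it is one of many comparisons.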

Next up is randomization, an important aspect of clinical trials. Randomization means that people are assigned at random to either the treatment group or the control group. The purpose of this is to avoid selection bias, but also to average out as many variables as possible. So you want roughly equal numbers of people with red hair in each group, and randomization should take care of that.

However, Siegfried points out that there is no guarantee that randomization will do this – it may be unlikely, but you can still flip 10 heads in a row. If you think about all the unknown variables, chances are some of them will be unequal in the two groups.

This is exactly why we are so concerned with the size of trials, that is, how many subjects were in the trial: randomization gets more and more effective with larger and larger numbers (you may flip heads 10 times in a row, but not a thousand times). Multiple trials also help, because random flukes are unlikely to be the same across multiple trials.
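A small simulation illustrates why larger trials balance groups better. It is a sketch under stated assumptions: a hypothetical population in which 10% of subjects carry some trait (red hair, say), randomized by shuffling and splitting in half. The function name and parameters are invented for illustration. It reports the average gap between the two arms in the proportion of subjects carrying the trait.

```python
import random

def mean_imbalance(n_subjects: int, trait_rate: float = 0.1,
                   trials: int = 500, seed: int = 1) -> float:
    """Average absolute difference between two randomized arms in the
    proportion of subjects carrying a trait (e.g. red hair)."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    n_trait = round(n_subjects * trait_rate)
    subjects = [True] * n_trait + [False] * (n_subjects - n_trait)
    half = n_subjects // 2
    total_gap = 0.0
    for _ in range(trials):
        rng.shuffle(subjects)  # random assignment: first half treatment, rest control
        treatment, control = subjects[:half], subjects[half:]
        total_gap += abs(sum(treatment) / len(treatment) - sum(control) / len(control))
    return total_gap / trials

for n in (20, 200, 2000):
    print(n, f"{mean_imbalance(n):.3f}")

print("chance of 10 heads in a row:", 0.5 ** 10)  # 1/1024, about 0.001
```

The typical gap shrinks roughly in proportion to one over the square root of the number of subjects, which is why an imbalance that is common in a 20-person trial becomes rare in a 2,000-person one.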

Further, there is a process called stratification: with known variables, like age, sex, and race, you can make sure equal numbers get into each treatment group rather than relying upon randomization. (But we still have to rely on randomization for unknown variables.)
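Stratified randomization can be sketched as shuffling separately within each stratum and then alternating assignments, so that every stratum is split as evenly as possible. This is one simple way to do it, assuming hypothetical subject records with `id`, `age_group` and `sex` fields; real trials often use permuted blocks within strata and dedicated randomization software. The shuffle inside each stratum is kept because unknown variables still depend on chance.

```python
import random
from collections import defaultdict

def stratified_assign(subjects, strata_keys, seed=0):
    """Assign each subject to 'treatment' or 'control', balancing within strata.

    subjects: list of dicts (hypothetical records, e.g. {"id": 1, "sex": "F", ...})
    strata_keys: the known variables to stratify on, e.g. ("age_group", "sex")
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in subjects:
        strata[tuple(s[k] for k in strata_keys)].append(s)

    assignment = {}
    for members in strata.values():
        rng.shuffle(members)  # chance still decides WHO goes to which arm
        arms = ["treatment", "control"]
        rng.shuffle(arms)  # random starting arm, so odd-sized strata don't always favour one arm
        for i, s in enumerate(members):
            assignment[s["id"]] = arms[i % 2]  # alternating: arm sizes differ by at most 1
    return assignment

# Usage with made-up subjects
people = [
    {"id": i, "sex": "F" if i % 3 else "M", "age_group": "65+" if i % 2 else "<65"}
    for i in range(40)
]
assignment = stratified_assign(people, ("age_group", "sex"))
counts = defaultdict(lambda: {"treatment": 0, "control": 0})
for p in people:
    counts[(p["age_group"], p["sex"])][assignment[p["id"]]] += 1
for stratum, c in sorted(counts.items()):
    print(stratum, c)
```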

Conclusion

Statistical analysis is just another tool of modern science; it is part of the technology of science. And like all things, there is wide variation in quality and understanding across individuals, and what filters down to the public is generally oversimplified to the point of being wrong.

So I applaud efforts to educate the public about the proper use of statistics, and to educate scientists and professionals for quality control. But Siegfried could have framed his article more along the lines of “here are some common mistakes to avoid and how to fix them,” rather than “science is broken.”

I admit this can be challenging. I lecture on how to interpret the medical literature, where I cover many of these points, and I often get questions from the audience such as, “So you’re saying that most of science is wrong?” What I am actually saying is that many individual studies are wrong, and studies are often misinterpreted. Further, you have to base conclusions on the literature as a whole, not on individual studies.

But when the technology of scientific studies is used properly, they work just fine.


17 Responses to “Statistics in Science”

  1. ccbowers on 19 Mar 2010 at 10:19 am

    We should limit our use of the word science as if it is a separate entity that acts on its own. It does not “succeed” or “fail”…although people (scientists) may. I see this used by people both promoting science and criticizing science, and I think it is usually misleading.

    Statistics is a very complex area of science and mathematics, and I disagree with you a bit. I think much of the problem is that many scientists themselves don’t understand statistics well enough, because of how they view it. Statistics is sometimes viewed as an obstacle to get to the science, but in reality it’s part of it.

  2. superdave on 19 Mar 2010 at 10:31 am

    The problem is that new software packages have made it easy to apply statistical tests even in situations where understanding them is difficult, or they might not be needed at all. Many people have this notion that. Because it is so easy, many people just tack this sort of thing onto the end of papers.
    The thing to keep in mind is that each statistical test looks for different aspects of the data and you have to understand what they do in order to determine if the test is appropriate and if the results will be meaningful.

  3. tmac57 on 19 Mar 2010 at 11:03 am

    Thanks for your continued effort to help the average person out here understand how science works.
    Maybe a common-sense synopsis of the statistical significance of a study should be linked to every mainstream report of such a study, like the disclaimers on drug ads.

  4. gcoller on 19 Mar 2010 at 11:34 am

    If Siegfried’s numbers are true, where the vast majority of studies (9 of 10?) interpret the statistics incorrectly, then it may actually be fair to say more generally ‘science’ is getting it wrong.

    Of course it is people who are making the mistakes (who else?) but if the people make up an institution of common methods, teachings and beliefs then you can label the institution as failing in some respect.

    Good article and critique though. Thanks.

  5. Steven Novella on 19 Mar 2010 at 3:04 pm

    gcoller – but again you are looking at individual studies. No matter what the authors say (even if 100% of the time they get the analysis wrong) post publication the data will be reviewed, and eventually it is systematically reviewed. It is the academic experts who eventually get the analysis correct and put it into perspective for everyone.

    So, you have to ask – is the scientific community working, and in my opinion yes. As opposed to the CAM community, which is broken and not working.

  6. BKsea on 19 Mar 2010 at 5:31 pm

    I would argue that scientists don’t generally get this wrong, but that statements regarding the interpretation can easily be misleading to non-scientists. For example, even if the Wakefield Lancet study was legitimate, a proper response would have been “that’s interesting, we should study whether that effect is real,” not “alert the media – stop all MMR vaccinations immediately.” A scientist realizes that there is a high probability that this study is a false positive because there are thousands of researchers studying various aspects of Autism and most positive results will turn out to be red herrings. On the other hand, when a Phase III clinical trial shows statistical significance, the scientist realizes that there were bench studies and Phase II trials suggesting a high prior probability of effect and thus the chance of a false positive is quite low. The problem is that non-scientists often don’t understand the backstory. So, when you say this study has P<0.05 and that study has P<0.05, it doesn't always mean the same thing.

  7. ccbowers on 20 Mar 2010 at 12:36 am

    The difficulty with interpreting medical literature, versus looking at individual studies, is that it requires some expertise and perspective on the literature. This is difficult enough for people in a given field… what should we expect of scientists in a different field? Non-scientists? The general public? The media?

    Admittedly, the media have other conflicting interests besides conveying information: they need to draw an audience, and sensationalism sells. Also, the average person is only going to remember the headline of a big health story (if that).

    The key would be to have people actually interested in the substance of the science being discussed. It would help if critical thinking or skepticism were a required part of our education systems. Since it is not possible for everyone to be an expert in everything, critical thinking tools should be taught that can be used across different subjects. Also, learning how to determine who the experts are in a given field, and having some respect for their knowledge and opinions, would go a long way to helping people sort through it all.

    All of these problems end up in a failure to communicate effectively (problems with both sending and receiving the info). Even when great science is taking place, the inability to communicate it accurately to the public is detrimental to the process.

  8. Draal on 20 Mar 2010 at 8:10 am

    Review papers are a good place to start when learning about a topic that’s new to you. Then look at the cited papers: who authored them and where they were published. A good indicator of who’s who in any particular field is the number of times an author’s papers are cited.

  9. HHC on 20 Mar 2010 at 11:48 am

    In a prior posting, Dr. Novella stated that science’s standards for drawing conclusions were more stringent than the legal profession’s rules of evidence. I disagree with this assumption because the judgments of court cases are stringently reviewed for logical flaws. Court cases are dismissed or overruled. Expert witnesses rely on the proper usage of the tools of a trade, e.g., DSM-IV or major accepted research findings in a field of knowledge. Human decision-making can be explained using information integration theory, a psychological, science-based theory. If the evidence is flawed, such as conclusions drawn from poorly performed statistical analyses, it is the responsibility of the expert witness, e.g., the independent forensic expert, to present this to the court. That is why there are oaths taken in the court room to present the truth, and when the truth is purposely distorted, there are perjury charges against the person(s) who make such false claims.

  10. Shelley on 20 Mar 2010 at 6:41 pm

    There’s so much to comment on, but I’ll just make a couple of quick points:

    Here are the most critical issues from my perspective:

    “Another way to look at this is that you cannot interpret a single study in terms of whether or not a treatment works, and Siegfried makes this point as well. You have to put it into the context of prior research (sound familiar?).”

    Does the study converge with what we know from other research? Surprising results (though they make for great public interest and are highly newsy) rarely make much of a buzz in scientific circles. Science proceeds in a relatively orderly fashion, with study building upon study. Convergence of data is critical, and a much-neglected and maligned section of the study is the introduction: How does this hypothesis fit with what we already know? People frequently skim through this section, but it sets up the argument.

    In terms of the results, do the findings make sense given what we already know? What is the mechanism by which this variable works?

    There are good statistical techniques for correcting for multiple analyses, so when we see multiple analyses, we look for corrections for it. It isn’t that complicated, really.

    I recommend ‘Statistics as Principled Argument’ by Abelson: It’s a good primer on detecting “fishiness” in research, both of the statistical and methodological sort and it isn’t an onerous read.

    “Statistics are like condoms. Nobody likes to use them, but no one will do science with you if you don’t.” Author unknown.

  11. ccbowers on 20 Mar 2010 at 7:06 pm

    “That is why there are oaths taken in the court room to present the truth, and when the truth is purposely distorted, there are perjury charges against the person(s) who make such false claims.”

    Sounds a little naive. Perjury charges are pretty hard to prove if the person doing the distortion is careful enough (not that I think there is a flaw in this). The truth is distorted in every court in the land. Not that there is anything wrong with that either; it’s part of the process, and it is ultimately through the process that the truth comes out.

    That is the theory anyway, and the scientific process is similar in that it is about the soundness and rigor of the process rather than whether there are any errors or deceit along the way.

  12. HHC on 20 Mar 2010 at 7:42 pm

    ccbowers, I agree process is key. Strategic moves always win, naive as it may sound.

  13. Shelley on 21 Mar 2010 at 10:22 am

    On randomness: Lack of randomness is the pet critique of the first-year stats student. I’ve read many, many research articles and I can’t think of any in which true randomness has been achieved.

    This is not such a big deal, however, provided the researchers do not make broad claims about populations not sampled and variables not examined.

    Probably the variable that is least frequently included (or controlled for) in research but which is likely to have a significant impact on almost anything studied is socioeconomic status. Income matters.

    I agree with Dr. Novella: The problem with the article he cites is that there is nothing in it for the sophisticated consumer of research material who has a fairly good background in statistics and research methods (and who already knows the standard critiques). However, it does give the casual reader the false idea that research is inherently flawed.

    Consequently, the casual (and entirely uninformed) reader sees the title “Odds are it’s wrong,” and feels completely vindicated in the opinion that research is useless and unreliable.

  14. Brutus on 22 Mar 2010 at 11:37 am

    I enjoyed reading this blog and agree with most of it but have one picky little correction. Novella states: “Rather, the p-value says that if we assume the null hypothesis (no effect) what are the odds of getting the results of the study or greater. This may seem like a subtle difference (and it is) but it’s important.”

    The p value is a probability, not odds. “This may seem like a subtle difference (and it is) but it’s important.”

  15. BillyJoe7 on 23 Mar 2010 at 6:00 am

    Maybe not such a subtle difference.

    Probability of heads = (number of heads)/(number of heads + number of tails) = 1/2

    Odds of heads compared to tails = number of heads/number of tails = 1/1

  16. John2 on 26 Mar 2010 at 3:26 am

    One of the first things that I was taught in my physics degree was the estimation and propagation of errors in any experiment. The point was hammered home that the error bars are absolutely critical in any report, and that if you ended up with too many or too few touching the line of best fit, it suggested that, at the very least, you should check them again.

    It became a game to be played at conferences (my work was on the LHC), where if a speaker got to the end and presented a graph with, say, 100 points, all of which had error bars touching the line, someone would raise their hand and say “your error bars are too big”. If the last hour had been spent explaining the experiment, you’d often see rising panic as the speaker thought, “It took me a year to work these out; how has that guy over there been able to take in everything in the talk and say that I have my numbers wrong?”

  17. jcbmack on 13 Jul 2010 at 9:58 pm

    I like your blog and I have been a longtime lurker. I read Tom’s article in its entirety twice before responding. I do not think he is so much criticizing science as a whole, since the quotes from several statisticians at the end state the opposite. Tom also shows the various weaknesses in statistical techniques and their application. On another related note: understanding drug-company-published studies on a particular new drug, or the claims of any researcher for that matter, takes more than a medical degree, clinical experience and using stats correctly, or in a sense it takes less than that; it takes a solid knowledge of organic chemistry and biochemistry, and not just what they teach in medical school, which quite honestly is less than what is taught in full undergraduate and graduate courses. Now, of course, clinical experience and a detailed history are two of the most important factors in assessing individual patient cases, in addition to being an M.D. and being well trained; however, a really solid biochemistry background is what is needed to interpret drug pathway studies, and analysis of drug outcomes benefits from such knowledge. Statistics is not well understood by most medical doctors and can be misapplied by scientists too. Still, it is not the doctor’s job to do in-depth statistical analysis; I know many board-certified attendings who do not know the difference between N and n, or how to actually apply Bayes at all. Yet Harrison’s Principles of Internal Medicine talks about stats and Bayes in considerable detail.