Feb 03 2010

Biophysical250 – Neurotics-R-Us

I was recently asked my opinion about the Biophysical250 – a series of 250 blood tests offered by a commercial lab for the out-of-pocket cost of $3,400. My skeptical alarms immediately began ringing – I am familiar with the commercial labs promising diagnostic tests directly to the public – generally not a good idea.

I checked out their website, which set off more alarms. The first thing you see, in the upper left corner, is this:

Would you like to get back the vibrancy and passion you enjoyed when you were younger? Are there things you would like to be doing at work or with your family and friends that you don’t have the energy to accomplish?

Umm…yes, please. Probably like every 45 year old, I would love to feel like I did when I was 25 (although honestly I am pretty healthy, except for hypertension, some occasional lower back pain, and I probably have a small rotator cuff tear in my right shoulder).  Will these blood tests fix all that, repair 20 years of wear and tear and rejuvenate my cells? Oh!…darn.

But the implication here is that the average middle-aged “worried well” person – you know, someone with disposable income who can still remember fondly what it felt like to be a 20 something – can recapture their youth and vigor if they could just figure out what mystery ailment afflicts them. There must be something wrong, otherwise they would feel perfect.

For many people there probably are some issues that can be altered to make them feel better – even much better. I see these patients in my office every day. They require some lifestyle adjustments – lose some weight, start exercising regularly, reduce your caffeine intake, and work on your sleep habits. Most people I see do not have nutritional issues – by which I mean they are not malnourished. I do screen for certain targeted vitamin insufficiencies and will treat as needed, but most people don’t require supplements. If anything, Americans suffer from diet excess – too much salt, too much animal fat – but that’s not why they don’t feel well.

(Obligatory caveat – of course many people have real disease, and the whole point of a diagnostic evaluation is to sort out those people who do and treat them appropriately.)

However, most of my patients don’t really need me to tell them they are overweight and they need to exercise, stop smoking, and sleep more. They know it – but that’s hard work. They want me to do a blood test, diagnose syndrome X, give them a pill, and make them feel like they did when they were 20. Some patients are actually disappointed when I tell them that the workup is negative and they are healthy – except for those lifestyle things they need to work on.

That is where Biophysical250 comes in. In fact, there seems to be an entire industry catering to the middle-aged worried well  – complete with custom (and socially acceptable – even fashionable) diagnoses, tests, and treatments.

I should note at this point that I am not being critical. Staying in shape and staving off the effects of aging is hard work, and our culture does not make it easy. I am also (unfortunately) increasingly sympathetic to the tribulations of normal aging. And I am happy to help guide my patients toward feeling better. I am, however, critical of those who try to exploit people who are vulnerable because they do not feel well.

And that is my main problem with Biophysical – their marketing seems optimized to exploit anxieties and neuroses among the worried well, rather than providing a useful medical service.

The Biophysical250 also falls, in my opinion, into the category of over-screening. Screening for disease, as counter-intuitive as this may seem, is not always a good idea. If you screen low risk groups you may be more likely to have a false positive than true positive, and false positives lead to more testing, perhaps unnecessary procedures and treatments, and anxiety. You may cause more harm than good by blanket screening of low risk groups.
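To make the low-risk screening point concrete, here is a minimal sketch of the arithmetic using Bayes' theorem. The prevalence, sensitivity, and specificity figures are hypothetical, chosen only for illustration; they do not describe any actual test in the panel.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a positive result is a true positive (Bayes' theorem)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)

# Hypothetical numbers: the same test applied to a low-risk and a high-risk group.
low_risk = positive_predictive_value(prevalence=0.001, sensitivity=0.95, specificity=0.95)
high_risk = positive_predictive_value(prevalence=0.20, sensitivity=0.95, specificity=0.95)

print(f"low-risk group:  {low_risk:.3f} of positives are real")
print(f"high-risk group: {high_risk:.3f} of positives are real")
```

Under these assumed numbers, most positives in the low-risk group are false and most in the high-risk group are true, even though the test itself has not changed.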

Therefore the standard of science-based medicine is always to do evidence-based screening. Research is done to look at overall outcomes from screening either the entire population or targeted sub-populations for specific diseases. Some screens are good – yes, get your eye pressures checked for glaucoma. Some cause more harm than good – like frequent chest X-rays. And some are controversial – like the recent debate about mammography screening. It’s all about balancing risks and benefits to optimize outcome.

I am not even talking about cost-effectiveness, which is a separate issue. Although cost-effectiveness usually (not always) tracks well with medical effectiveness – preventing costly diseases saves money too.

With all this in mind, it seems highly unlikely that all, or even most, of the 250 blood tests offered by Biophysical meet the criteria for appropriate general population screening. Doing these tests all at once, rather than separately, is cheaper and more convenient – but who cares if you don’t need the tests in the first place?

What is distinctly lacking is scientific data showing that people who get the Biophysical250 screen have a net positive health outcome. The company crows about their clients who found problems they did not know were there – but were those problems really problems? What happened when they were treated or further tested? How many were false positives, or would not have caused problems in the first place? And how does the 250 screen compare to the usual tests that their primary care doctor would have run anyway?

In an unfortunately credulous Scientific American article on the topic, CEO Mark Chandler was asked about the expense of the test:

Chandler says that plucking out a few of the beads would not be cost-effective, although perhaps a few dozen biomarkers might be enough to catch the most common afflictions and permit a less expensive assessment.

You mean just like is currently done as part of routine screening? That admission was surprising, as it went against all the previous hype. It makes sense to check for the most common diseases, or those that a person is particularly at risk for – not for everything possible, just because the technology exists to do so.

The bottom line is that an extensive screen like the Biophysical250 is probably a waste of time and money. The tests you really need your doctor will order anyway, and your insurance company will pay for. Massive screening like this may not only be worthless, it may cause more harm than good. We need some objective scientific data (which is currently lacking) to really know.

Potential customers should also consider what else they can do with that $3,400. Buy some exercise equipment, a new mattress, or even just take a vacation. They are likely to have more of a health benefit than a battery of unnecessary tests.


11 Responses to “Biophysical250 – Neurotics-R-Us”

  1. superdave on 03 Feb 2010 at 2:19 pm

    At very least, are the tests they offer legitimate diagnostic tools and not outright pseudoscience?

  2. Enzo on 03 Feb 2010 at 6:49 pm

    I took a look at the biomarkers they screen for and some of the blood work they can perform: (http://www.biophysicalcorp.com/pdf/Biomarkers-compare.pdf).

    I would not go as far as to call it pseudoscience. They do, in fact, test many things that medical workups would include. Their cancer panel includes PSA, for example, which is commonly screened for prostate cancer surveillance.

    For the most part it is based on science…But I would say some of the stuff they test may be a bit premature, especially if they are claiming it is soundly correlated to a problem. They avoid saying the panel serves as a diagnosis for anything. Their “sex and energy” panel, for example, is a bit iffy. To their credit they do not seem to offer very preliminary/candidate biomarker analysis on their panels like Abeta for Alzheimer’s.

    Overall I agree with Dr. Novella’s assessment of the service. It is not AWFUL, but probably unnecessary. Most of the important things are already tested for regularly (cholesterol markers, etc.). This is definitely preying on the paranoid diagnosis-seekers.

  3. Marshall on 03 Feb 2010 at 7:08 pm

    Enzo: I do not believe that Dr. Novella is debating whether or not these tests individually are scientifically valid. I’m sure most, if not all, of them are. What he is debating is whether taking such tests en-masse is financially beneficial. It probably is not, due to a few factors: 1) You pay for tests that you almost definitely do not need, and 2) The number of false positives you receive increases with the number of tests, and the reaction to false positives alone probably makes it not worth it.

  4. Enzo on 03 Feb 2010 at 7:52 pm

    Marshall:

    I agree. I was commenting in response to superdave’s question.

  5. eiskrystal on 04 Feb 2010 at 4:38 am

    It is a common scam to take something sciencey with a grain of truth which you then take out of all proportion and sense in order to give it to the worried well. Vitamin supplements, pro-biotics, chelation…same old pattern. Different packaging.

  6. SimonW on 04 Feb 2010 at 11:24 am

    Funnily enough I’ve been thinking about the counter example.

    Which stemmed from my pondering why my GP doesn’t have an ultrasound machine. In the UK the reason boils down to the fact that he isn’t paid for that; we pay ultrasound specialists and he would refer. This system prevents capital investments the GP might make into things which might work, but wouldn’t save enough of the doctor’s time to justify. So he has a stethoscope and a thermometer, and very little over and above what his grandfather might have used to make his diagnosis on site (blood or other lab tests being the predominant advance over his predecessors). However I’ve no doubt there are many things that would save the patient time, money and suffering, if it were their interests being considered rather than the GP’s, and more that could be invented.

    As a thyroid patient, my particular bugbear stems from problems getting diagnosed. Surveys suggest the average time to diagnosis of a thyroid problem is still about 5 years from onset of symptoms. The symptoms are well documented in the literature, the at risk groups well understood, the diseases are common (>2% of women at any one point), the blood tests relatively cheap. Yet we still see report after report of these disorders being picked up after referral – psychiatrists being a common one – rather than by the front line doctors, or diagnosis after serious complications set in, or at the start of pregnancy (when a test is done!).

    So whilst I agree this kind of screening may be expensive and counterproductive, I think there is a good case to argue that we should perhaps do more routine automated tests of patients (at least in the UK). Not necessarily thyroid function screening (the maths says this is not justified in all women), but when I switched GP due to my previous doctor over-medicating me they did a standard set of lab tests, and I was abnormal on 8 out of 13, due almost entirely to a known mild hypothyroid state caused by poor prescribing by my previous doctor. These tests are absolutely routine in medicine, and I assume they are fully automated after the blood draw. I don’t think it would be that challenging to invent a suitable set of automated tests to be done on patients in the waiting room, including things like weight, blood pressure, pulse, temperature. Ideally non-intrusive tests.

    I’d argue the reasons we don’t do these kinds of simple tests routinely on all patients are more cultural than scientific. In the UK we know precisely the proportion of hypertensives not diagnosed as we make doctors report the number of diagnosed hypertensives as part of their contractual arrangements. The contractual arrangements don’t specify (AFAIK) that they should screen for hypertension. As we know the prevalence of hypertension in the population, we can confidently state precisely how many people don’t know they have a blood pressure problem brewing.

  7. Carl on 04 Feb 2010 at 2:52 pm

    Steve, what is your basis for recommending that healthy patients cut their caffeine intake? It has beneficial effects and no harmful ones, as far as I know.

  8. mchandler on 04 Feb 2010 at 3:56 pm

    Dear Dr. Novella:
    Thank you for your thoughtful evaluation of the Biophysical250 and our other lab tests. You are correct that we do provide our services to the “worried well” and their concerned physicians. The assumption underlying your evaluation, however, is that they are needlessly worried. Considering only your own reported robust health, how do you know that you don’t have inherited hemochromatosis? It usually presents in later middle age but after symptoms are far advanced. Are you sure that your kidneys have suffered no ill effects from your hypertension? A serum creatinine level would only expose kidney damage after 60-70% of kidney function is lost. There are other biomarkers that expose renal problems much earlier. I don’t want to speculate on the dull throb in your lower back, but if it were something serious that you could detect early it just makes sense to do it. We’re often justly criticized for our marketing to the “worried wealthy”, but if you can afford to know what’s going on in your body that could impact your health before symptoms are severe, why wouldn’t you?
    p.s. I love your blog!
    Sincerely,
    Dr. Mark Chandler

  9. Steven Novella on 05 Feb 2010 at 8:59 am

    Mark,

    Thanks for writing, but I think you missed my main point. Without knowing true positives vs false positives, and without evidence of net clinical outcomes, we cannot know if the screening you offer is beneficial, or even not harmful. Your premise seems to be that more screening is always better – but this is naive.

    Given that 1 in 20 blood tests, on average, assuming 2 standard deviations as the norm, are going to be false positives, 250 chances for such is a lot. I understand your company’s claim that you reduce this by stacking tests – but let’s see some evidence for this. In any case – we need some studies looking at net clinical outcomes.
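The 1-in-20 arithmetic can be sketched roughly as follows. This is a back-of-the-envelope calculation that assumes a healthy person, a 5% chance of falling outside the reference range on each test, and (a strong simplification) that the 250 results are independent of one another. It ignores any reduction from combining tests, which is the company's claim and is not modeled here.

```python
def prob_at_least_one_false_positive(n_tests, false_positive_rate=0.05):
    """Chance that a healthy person gets one or more abnormal results,
    assuming independent tests with the same false positive rate."""
    return 1.0 - (1.0 - false_positive_rate) ** n_tests

def expected_false_positives(n_tests, false_positive_rate=0.05):
    """Average number of abnormal results expected in a healthy person."""
    return n_tests * false_positive_rate

for n in (1, 20, 250):
    p = prob_at_least_one_false_positive(n)
    e = expected_false_positives(n)
    print(f"{n:3d} tests: P(at least one) = {p:.6f}, expected count = {e:.1f}")
```

Under these assumptions a 250-test panel would be expected to flag about a dozen results in a perfectly healthy person, and getting a completely clean report would be very unlikely.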

    Further, your test is just a snap shot. How often should people get this $3,400 screen done?

    I am, of course, not against testing or screening. But there is a rational evidence-based way to go about it. And just doing every test under the sun is not it, especially when you are charging a huge fee out of pocket.

  10. tmac57 on 05 Feb 2010 at 3:59 pm

    This reminds me of the same controversy over full body CT scans that are hyped so much for preventative screening. When I had first heard of those, many years ago, my first thought was “gee, maybe my wife and I should get one, even if we had to pay the full price”. I only had second thoughts about it after Consumer Reports said that the likelihood of false positives, and unnecessary exposure to radiation, probably made the costs outweigh the benefits.
    Of course, it would be hard to convince anyone who benefited from such a test, that they should never have had it, much like telling a lottery winner that they were foolish to buy that ticket.

  11. DREads on 05 Mar 2010 at 1:01 am

    False positives can cause significant unnecessary anxiety and grief in people, which is a reason against home, do-it-yourself HIV testing. We certainly don’t want people to jump off a bridge just because of something that inevitably happens quite often: false positives. A common fallacy for people unfamiliar with statistics or probability is that a test result represents reality, i.e. a positive means disease presence. Some highly anxious people may repeatedly have tests done when they have low risk factors. With enough Bernoulli trials, a test will come up positive. A patient thinks the test finally confirms their version of reality when the positive result was simply due to chance. Patient self-testing is problematic. Most laypeople don’t appreciate how hard it is to develop a good test nor understand how to evaluate the predictive utility of a test.

    Developing a diagnostic test with high sensitivity and specificity is hard. One must (1) choose a good set of observable variables (e.g. each instance is a real-valued vector or a combination of real-value variables and nominal variables), (2) make some reasonable assumptions about the data (e.g. instances are drawn IID from heteroskedastic Gaussians), (3) label each instance as positive or negative for use in estimation (this can be problematic if no method is known to absolutely confirm the presence of the target disease), (4) employ an algorithm for estimating parameters of the prediction model (a machine learning algorithm), and (5) employ statistical validation to evaluate the accuracy of the prediction framework.

    Choosing a good set of variables (step 1) is hard and often costly. For example, if one were designing a diagnostic test for brain tumors, patient-reported headache severity on a scale from 1 to 10 is one observable variable that could hypothetically be used. From a set of 100 people, if we knew which ones had brain tumors and which ones did not, we should expect even the best estimated/learned predictors to perform poorly. There just isn’t enough information in the data.

    Even when we have a set of variables with a reasonable amount of information and a good set of assumptions, we cannot expect to predict with infallibility. In the real world, false positives and false negatives are inevitable. If we assume for the moment that the data are drawn from Gaussians, false positives and false negatives are guaranteed to occur given enough trials because a Gaussian distribution has infinite support, and thus two or more Gaussians must overlap.
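The overlap argument can be illustrated with a small sketch using Python's standard `statistics.NormalDist`. The two marker distributions below are hypothetical and chosen only for illustration: wherever the cutoff is placed, both error rates stay above zero because each Gaussian has some mass on the wrong side of it.

```python
from statistics import NormalDist

# Hypothetical marker distributions (arbitrary units), for illustration only.
healthy = NormalDist(mu=0.0, sigma=1.0)
diseased = NormalDist(mu=2.0, sigma=1.0)

def error_rates(threshold):
    """Call a result positive when the marker exceeds the threshold."""
    false_positive_rate = 1.0 - healthy.cdf(threshold)  # healthy but flagged
    false_negative_rate = diseased.cdf(threshold)       # diseased but missed
    return false_positive_rate, false_negative_rate

for threshold in (0.0, 1.0, 2.0, 3.0):
    fpr, fnr = error_rates(threshold)
    print(f"threshold {threshold:.1f}: FPR {fpr:.4f}, FNR {fnr:.4f}")
```

Moving the threshold only trades one kind of error for the other; it never drives both to zero.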

    By making reasonably valid assumptions, we can develop a prediction framework in which mathematically provable statistical guarantees can be made. This is difficult. As a simple example, if the data are assumed to be drawn from Gaussians with common covariance, we can predict optimally with a linear discriminant. However, if this assumption doesn’t agree with reality (i.e. the data are drawn from Gaussians of different widths), a linear discriminant will perform sub-optimally (i.e. lower risk can be achieved with better assumptions). The better the assumptions, the better the optimality guarantees that can be made.

    In statistics and machine learning we can quantify the cost of false positives and false negatives and the benefit of a true positive with a loss function. The quality of a predictor function is simply the expected loss with respect to the distribution, which is called the risk. The lowest risk achievable given vectors drawn IID from a joint distribution is the Bayes Risk. In the real world, the Bayes Risk is almost always >0, which means mistakes (false positives and false negatives) will always occur.

    The goal is to employ a learning/estimation algorithm that generates a predictor function that achieves acceptably low (and hopefully minimum) risk. A second goal, if possible, is to mathematically show that as a sample size gets larger, the risk of the predictors converges toward the Bayes Risk. When this is true, we say the algorithm is consistent. If the learner is consistent regardless of the family of distributions, it is universally consistent. k-NN was the first to be shown universally consistent (Stone, 1977), provided k grows at the right rate relative to the sample size.

    Algorithms prone to fitting the model to the data too closely are said to overfit. Even universally consistent algorithms are prone to overfitting and can perform poorly in practice. Some algorithms control overfitting with regularization, which puts a penalty on model complexity. Other algorithms make guarantees about the rate at which an algorithm makes progress towards the Bayes Risk.

    To make a reasonable estimate of an algorithm’s accuracy, it must undergo statistical validation. This involves applying a learned model to sequestered data. The estimate of accuracy will be biased if the sequestered data is ever used to estimate a model’s parameters. This is a common mistake made when designing a diagnostic test: using the data to find the right model parameters then using the same data to demonstrate the accuracy of the prediction model.
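A toy sketch of why reusing the same data is misleading. Everything here is made up for illustration: the "marker" is a random number, the "diagnosis" is a coin flip, and the model is a 1-nearest-neighbor rule that simply memorizes its training set. Scored on the data it was fit to, it looks perfect; scored on sequestered data, it does about as well as guessing.

```python
import random

def nearest_neighbor_predict(train, x):
    """Return the label of the training point whose marker value is closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, data):
    """Fraction of (marker, label) pairs in data that the rule labels correctly."""
    correct = sum(1 for x, y in data if nearest_neighbor_predict(train, x) == y)
    return correct / len(data)

def make_sample(rng, n):
    """Markers are uniform noise and labels are coin flips: no real signal."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_sample(rng, 200)
held_out = make_sample(rng, 200)

print("accuracy on the data used for fitting:", accuracy(train, train))
print("accuracy on sequestered data:         ", accuracy(train, held_out))
```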

    Sometimes a predictor function outputs a real number (sometimes called a signal) indicating the strength of its predictions. By thresholding this signal to make a final positive or negative prediction, we can trade off between true positive rate (tested positive given actual positive) and false positive rate (or false negative rate).

    A ROC curve plots the detection rate (y-axis) vs. false positive rate (x-axis) for each trade-off threshold, illustrating the true positive/false positive trade-off. A diagnostic test that predicts randomly would generate a diagonal line across the plot, which can be used as a baseline. If a predictor does not do much better than random then it provides little to no predictive value and shouldn’t be used as a diagnostic test. One unfortunate but common problem is having a detection rate that isn’t acceptably high until an unfavorable false positive rate is chosen. For some conditions, the most accurate diagnostic tests known are prone to either many false positives or false negatives. In these instances, performing a test when it isn’t indicated can be misleading.
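As a rough sketch of how such a curve is traced, the snippet below sweeps a threshold over a handful of made-up scores and known labels, collecting one (false positive rate, detection rate) point per threshold. The scores, labels, and function names are hypothetical, chosen only for illustration. The area under the curve is a common one-number summary: 0.5 corresponds to the random baseline and 1.0 to a perfect test.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    # Sweep the threshold from the highest score down to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up test signals and true disease status (1 = diseased, 0 = healthy).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR {fpr:.2f}  detection rate {tpr:.2f}")
print("area under curve:", area_under_curve(points))
```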

    The process of developing a good diagnostic test for a disease is difficult and error-prone. Even if a good, rigorous methodology is properly followed, there may be inherent limitations in the observable variables chosen, the assumptions made, the data collected, the process for labeling the data instances, the learning algorithm, or the final prediction model. For example, the sample size may be too small or the sample may not be representative of the population at which the test is targeted. The test might be erroneously applied to subjects it was not designed for, leading to false positives. The labeling of the data might be contaminated because the method for absolutely confirming disease may be error-prone. Significant money may be spent conducting a study to find a good diagnostic test for a disease and the final result may be a less than desirable ROC curve.

    While this company may or may not be reputable, other companies may exploit hypochondriacal consumers who are convinced they have certain diseases by designing tests for them that predict no more informatively than a random coin flip. Placing value on diagnostic tests with performance not much better than random prediction is as bogus as homeopathy. Without proper statistical validation, one can’t make any assumptions about the validity of a test. However, patients in need of answers are prone to swear by these tests anyway. The tests give them an explanation for their ailments, a false confirmation of their false self-diagnoses, or a false reassurance that they don’t have a disease.

    Given the difficulty in designing a single test and demonstrating its validity, what should a consumer do if they test positive for diseases when their risk factors are low? What should a physician do when worried patients come into their office with a printout of positive results? Test batteries such as these, while they may or may not be performed by reputable companies, distract well-intentioned doctors who would ordinarily select diagnostic tests based on evidence but feel obligated to address test results put before them.