Data Scientist Probability Puzzle: Rare Disease


A recent VentureBeat article with a sexy title, “4 questions that reveal whether you have the hottest job skill of 2014” by Gregory Ferenstein, has an interesting probability puzzle. It’s a Bayesian problem in disguise, and neither the result nor the derivation is all that straightforward.

http://venturebeat.com/2014/12/18/4-questions-that-reveal-whether-you-have-the-hottest-job-skill-of-2014/

1. Suppose you went cave diving in Mexico and have contracted a rare disease. You go to the doctor, and he tells you that the test for such a disease is correct 99 percent of the time, but since it is so rare, it only occurs randomly in the population 1 out of 10,000 times.

Your tests come back positive. What are the chances you have the disease?

A) 99%
B) 90%
C) 10%
D) 1%

Victor Fang’s solution: 

Well, this is a trick question, so obviously you should not pick A. 🙂

There is a hard way, with dull math using Bayes’ theorem, and an easy way “invented” by Victor Fang. Let me start with the easy way.

Let’s assume we have a pool of 1 million = 1,000,000 people for the experiment.

  • Denote D as “Diseased” and -D as “Not Diseased aka healthy”;
  • Denote T as “Tested positive” and -T as “Tested negative”.

Hence let’s summarize what we are given:

  • P(D) = 1/10,000 = 0.01% , the chance of getting this rare disease.
  • P(T|D) = 99%, the probability that the test comes out positive if you have the disease.
  • P(-T | -D) = 99%. OK, here is what differentiates a data scientist from a software engineer. 🙂 The puzzle says “the test for such a disease is correct 99 percent of the time”, which also means that if you are healthy, the test still has a 1% chance of coming back positive, i.e. P(T | -D) = 1%. In statistics, P(T|D) is called the sensitivity of the test and P(-T | -D) its specificity.

Now, our goal is to get this enchanted conditional probability: P(D|T), i.e. if you test positive, what’s the probability of actually having that disease?!?

As I said, I have an easy solution for this. Let’s start with the famous contingency matrix. 🙂

Note that we can do this because the puzzle involves only two binary variables (diseased or not, tested positive or not). For more sophisticated cases like continuous distributions, you might need to resort to computer programming if you want to solve it this way.

| #people | D | -D | subtotal of rows |
|---|---|---|---|
| T | | | |
| -T | | | |
| subtotal of columns | | | 1,000,000 (total population) |

Let’s start filling the holes while inserting the facts we derive!

First of all, we know that the number of diseased people should be 1M × 1/10K = 100, which leaves 999,900 healthy people. That fills in the last row of the table.

| #people | D | -D | subtotal of rows |
|---|---|---|---|
| T | | | |
| -T | | | |
| subtotal of columns | 100 | 999,900 | 1,000,000 (total population) |

Secondly, we know that within the actual diseased population, 99% of people test positive, i.e. P(T|D) = 99%. Plug that into the table!

| #people | D | -D | subtotal of rows |
|---|---|---|---|
| T | 99 | | |
| -T | 1 | | |
| subtotal of columns | 100 | 999,900 | 1,000,000 (total population) |

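Thirdly, do the same for P(-T|-D) = 99% and plug it into the -D column of the table!
Basically, this step calculates the distribution within the healthy population: 99% of the 999,900 healthy people (989,901) test negative, and the remaining 1% (9,999) test positive.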

| #people | D | -D | subtotal of rows |
|---|---|---|---|
| T | 99 | 9,999 | |
| -T | 1 | 989,901 | |
| subtotal of columns | 100 | 999,900 | 1,000,000 (total population) |

Fourth step: let’s fill in the 2 missing spots left, the row subtotals!

| #people | D | -D | subtotal of rows |
|---|---|---|---|
| T | 99 | 9,999 | 10,098 |
| -T | 1 | 989,901 | 989,902 |
| subtotal of columns | 100 | 999,900 | 1,000,000 (total population) |
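
The filling steps above can be reproduced in a few lines of Python. This is only a minimal sketch, not code from the original article: the function name `contingency_table` and its parameters are made up for illustration, and integer arithmetic is used so that the head counts come out exact.

```python
def contingency_table(population, one_in, accuracy_pct):
    """Build the 2x2 table of head counts (hypothetical helper for illustration).

    population   -- size of the imaginary pool of people
    one_in       -- the disease occurs in 1 out of `one_in` people
    accuracy_pct -- the test is correct this many percent of the time,
                    assumed to hold for both diseased and healthy people
    """
    # Step 1: column subtotals
    diseased = population // one_in
    healthy = population - diseased
    # Step 2: split the diseased column using P(T|D)
    true_pos = diseased * accuracy_pct // 100
    false_neg = diseased - true_pos
    # Step 3: split the healthy column using P(-T|-D)
    true_neg = healthy * accuracy_pct // 100
    false_pos = healthy - true_neg
    # Step 4: row subtotals
    return {
        "T": {"D": true_pos, "-D": false_pos, "total": true_pos + false_pos},
        "-T": {"D": false_neg, "-D": true_neg, "total": false_neg + true_neg},
        "total": {"D": diseased, "-D": healthy, "total": population},
    }


table = contingency_table(1_000_000, 10_000, 99)
for row_name, row in table.items():
    print(f"{row_name:>5}  {row['D']:>9,}  {row['-D']:>9,}  {row['total']:>9,}")

# Last step: share of diseased people among everyone who tested positive
p_d_given_t = table["T"]["D"] / table["T"]["total"]
print(f"P(D|T) = {p_d_given_t:.4f}")
```

Changing the pool size only rescales the counts, as long as it stays a multiple of 1,000,000 so the integer divisions remain exact; the final ratio stays the same.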

Last step: take a deep, deep breath before Dr. Fang discloses the secret! 🙂
OK remember the goal is to derive P(D|T) !

It’s simple because we have done the heavy lifting together in the last 3 minutes:

P(D|T) = 99 / 10,098 ≈ 0.0098 ≈ 1%! So the answer is D.
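
If you don’t trust the counting, you can also sanity-check the number by brute force. The sketch below is an illustration of mine, not part of the original puzzle: it simulates a random population with Python’s standard `random` module, and the function name `simulate_posterior`, the seed, and the default pool size are arbitrary choices. Because the disease is so rare, the estimate is noisy and should only land near 1%, not exactly on 0.0098.

```python
import random


def simulate_posterior(n_people=1_000_000, prevalence=1 / 10_000,
                       accuracy=0.99, seed=2014):
    """Estimate P(D|T) by simulating people one at a time (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so reruns give the same estimate
    positives = 0
    true_positives = 0
    for _ in range(n_people):
        diseased = rng.random() < prevalence
        test_is_correct = rng.random() < accuracy
        # A correct test reports the true status, a wrong test reports the opposite.
        tested_positive = diseased if test_is_correct else not diseased
        if tested_positive:
            positives += 1
            if diseased:
                true_positives += 1
    return true_positives / positives


estimate = simulate_posterior()
print(f"Simulated P(D|T) is roughly {estimate:.4f}")
```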

Careful readers might ask: Dr. Fang, why the heck is it so low!? Even with a diagnostic test that has a seemingly high detection rate, the chance of actually having the disease after a positive result is still tiny!?

The short answer lies in Bayes’ theorem: P(A|B) = P(B|A) × P(A) / P(B). P(A) is the prior: it says, hey, whatever test result you get, I will re-weight it, because mother nature (or the domain expert) tells us how common the thing really is. In this case P(D) is only 0.01%, a rare event, so expect your result to be heavily discounted.
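
To see the re-weighting in action, here is the “hard way” as a small Python sketch. It assumes the usual expansion of the denominator, P(T) = P(T|D) × P(D) + P(T|-D) × P(-D), which the article does not spell out; the function name `posterior` and the list of alternative priors are my own illustration. Exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction


def posterior(prior, sensitivity, specificity):
    """P(D|T) from Bayes' theorem (hypothetical helper for illustration).

    prior       -- P(D)
    sensitivity -- P(T|D)
    specificity -- P(-T|-D), so P(T|-D) = 1 - specificity
    """
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)  # P(T|D) P(D) / P(T)


accuracy = Fraction(99, 100)  # "correct 99 percent of the time"

# The puzzle's prior (1 in 10,000) plus a few made-up, less rare priors for comparison.
for one_in in (10_000, 1_000, 100, 10):
    prior = Fraction(1, one_in)
    p = posterior(prior, accuracy, accuracy)
    print(f"P(D) = 1/{one_in:<6} ->  P(D|T) = {p} = {float(p):.4f}")
```

With the same test, the posterior climbs quickly as the disease becomes more common, which is exactly the prior doing its re-weighting.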

After digging into the literature, I found that this puzzle dates back to a 2006 Scientific American article by Columbia University Prof. Chris Wiggins.

http://www.scientificamerican.com/article/what-is-bayess-theorem-an/

Hope this helps you advance one more step in your data science analysis in this complicated world.
