## Data Scientist Probability Puzzle: Rare Disease

A recent VentureBeat article with a sexy title, “4 questions that reveal whether you have the hottest job skill of 2014” by Gregory Ferenstein, has an interesting probability puzzle. It’s a Bayesian problem in disguise, and the result and its derivation are not so straightforward.

1. Suppose you went cave diving in Mexico and have contracted a rare disease. You go to the doctor, and he tells you that the test for such a disease is correct 99 percent of the time, but since it is so rare, it only occurs randomly in the population 1 out of 10,000 times.

**Your tests come back positive. What are the chances you have the disease?**

A) 99%

B) 90%

C) 10%

D) 1%

**Victor Fang’s solution:**

Well this is not a trick question so obviously you should not pick A. 🙂

There is a hard way, with the dull math of Bayes’ theorem, and an easy way “invented” by Victor Fang. Let me start with the easy way.

Let’s assume we have 1 million = 1,000,000 people in the pool for the experiment.

- Denote D as “Diseased” and -D as “Not Diseased aka healthy”;
- Denote T as “Tested positive” and -T as “Tested negative”.

Hence let’s summarize what we are given:

- P(D) = 1/10,000 = 0.01%, the chance of having this rare disease.
- P(T|D) = 99%, the probability that the test comes out positive if you have the disease.
- P(-T|-D) = 99%. OK, here is what differentiates a data scientist from a software engineer. 🙂 The puzzle says “**the test for such a disease is correct 99 percent of the time**”, which also means that if you are healthy, the test still has a 1% chance of coming back positive, i.e. P(T|-D) = 1%. In probability these two rates are known as the **sensitivity**, P(T|D), and the **specificity**, P(-T|-D).

Now, **our goal** is to get this enchanted conditional probability: **P(D|T)**, i.e. if you test positive, what’s the probability of actually having the disease?!?

As I said, I have an easy solution for this. Let’s start with the famous contingency matrix. 🙂

**Note that we can do this because everything in this puzzle is binary (diseased or not, positive or negative), so the whole problem fits in a 2×2 table. For more sophisticated cases, such as continuous distributions, you might need to leverage computer programming if you want to solve it this way.**

| #people | D | -D | subtotal of rows |
|---|---|---|---|
| T | | | |
| -T | | | |
| subtotal of columns | | | 1,000,000 (total population) |

Let’s start filling the holes while inserting the facts we derive!

First of all, we know that #diseased should be 1M × 1/10K = 100 people, which leaves 999,900 healthy people and fills in the last row of the table.

| #people | D | -D | subtotal of rows |
|---|---|---|---|
| T | | | |
| -T | | | |
| subtotal of columns | 100 | 999,900 | 1,000,000 (total population) |

Secondly, we know that within the actual diseased population, 99% test positive, i.e. P(T|D) = 99%. That gives 99 positives and 1 negative among the 100 diseased. Plug them into the table!

| #people | D | -D | subtotal of rows |
|---|---|---|---|
| T | 99 | | |
| -T | 1 | | |
| subtotal of columns | 100 | 999,900 | 1,000,000 (total population) |

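Thirdly, do the same for P(-T|-D) = 99%: of the 999,900 healthy people, 99% (989,901) test negative and the remaining 1% (9,999) test positive. Plug these into the 3rd column of the table!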

Basically, this step is calculating the distribution within the healthy population.

| #people | D | -D | subtotal of rows |
|---|---|---|---|
| T | 99 | 9,999 | |
| -T | 1 | 989,901 | |
| subtotal of columns | 100 | 999,900 | 1,000,000 (total population) |

Fourth step: let’s fill in the 2 remaining spots by summing each row!

| #people | D | -D | subtotal of rows |
|---|---|---|---|
| T | 99 | 9,999 | 10,098 |
| -T | 1 | 989,901 | 989,902 |
| subtotal of columns | 100 | 999,900 | 1,000,000 (total population) |
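If you would rather let a computer fill the holes, the four steps above can be sketched in a few lines of Python. This is only an illustration: the function name `contingency_counts` and the dictionary layout are my own assumptions, not part of the original puzzle, and exact fractions are used so the head counts come out as whole numbers.

```python
from fractions import Fraction


def contingency_counts(population, prevalence, sensitivity, specificity):
    """Expected head count for each cell of the 2x2 table (hypothetical helper)."""
    diseased = population * prevalence            # step 1: column subtotal for D
    healthy = population - diseased               # step 1: column subtotal for -D
    true_pos = diseased * sensitivity             # step 2: diseased and tested positive
    false_pos = healthy * (1 - specificity)       # step 3: healthy but tested positive
    cells = {
        ("T", "D"): true_pos,
        ("T", "-D"): false_pos,
        ("-T", "D"): diseased - true_pos,
        ("-T", "-D"): healthy - false_pos,
    }
    # The inputs below give whole numbers, so converting to int loses nothing.
    return {key: int(value) for key, value in cells.items()}


counts = contingency_counts(
    1_000_000, Fraction(1, 10_000), Fraction(99, 100), Fraction(99, 100)
)
# Step 4: the row subtotal for T, then the share of positives who are diseased.
positives = counts[("T", "D")] + counts[("T", "-D")]
posterior = Fraction(counts[("T", "D")], positives)

print(counts)
print(positives, float(posterior))
```

The `posterior` value here is just the first cell of the T row divided by that row's subtotal, which is the ratio the next step reads off the table.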

Last step: take a deep, deep breath before Dr. Fang discloses the secret! 🙂

OK remember the goal is to derive **P(D|T)** !

It’s simple because we have done the heavy lifting together in the last 3 minutes:

P(D|T) = 99 / 10,098 ≈ 0.0098 ≈ **1%**! So the answer is D.
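For readers who want the “hard way” as well, the same number falls out of Bayes’ theorem directly. The sketch below expands P(T) with the law of total probability and uses P(T|-D) = 1 − P(-T|-D) = 1%, which is my reading of “correct 99 percent of the time” rather than something the puzzle states explicitly:

$$
\begin{aligned}
P(D \mid T) &= \frac{P(T \mid D)\,P(D)}{P(T \mid D)\,P(D) + P(T \mid -D)\,P(-D)} \\
            &= \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} \\
            &= \frac{0.000099}{0.010098} \approx 0.0098
\end{aligned}
$$

Multiplying the numerator and denominator by 1,000,000 gives back exactly the 99 / 10,098 from the table.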

Careful readers might ask: Dr. Fang, why the heck is it so low!? Even with a test that seems highly accurate, the chance of actually having the disease after a positive result almost vanishes!?

The short answer resides in Bayes’ theorem: P(A|B) = P(B|A) × P(A) / P(B). P(A) is the prior distribution. In this case it tells you: hey, whatever result you get, I will re-weight it, because mother nature (or the domain expert) tells us how common the event really is. Here P(D) is only 0.01%, a rare event, so the 9,999 false positives from the huge healthy population swamp the 99 true positives. Expect your results to be discounted.
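To see how strongly the prior drives the answer, here is a small sketch that keeps the test at 99% sensitivity and 99% specificity and only varies P(D). The function name `posterior_given_positive` and the list of prevalence values are hypothetical choices for illustration; only the 1-in-10,000 case comes from the puzzle.

```python
def posterior_given_positive(prior, sensitivity=0.99, specificity=0.99):
    """P(D|T) from Bayes' theorem for a binary test (illustrative helper)."""
    true_pos = sensitivity * prior                  # P(T|D) * P(D)
    false_pos = (1 - specificity) * (1 - prior)     # P(T|-D) * P(-D)
    return true_pos / (true_pos + false_pos)        # divide by P(T)


# Assumed prevalence values; 0.0001 is the puzzle's 1 in 10,000.
for prior in (0.0001, 0.001, 0.01, 0.1, 0.5):
    print(f"P(D) = {prior:<7} -> P(D|T) = {posterior_given_positive(prior):.4f}")
```

With the puzzle's prior the result is about 1%, while a disease that affects 1 person in 100 would already push a positive result to roughly a coin flip. Same test, very different conclusion.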

After digging into the literature, I found that this puzzle dates back to Columbia University Prof. Chris Wiggins’s Scientific American article from 2006.

http://www.scientificamerican.com/article/what-is-bayess-theorem-an/

Hope this helps you advance one more step in your data science analysis of this complicated world.