Confusion Matrix

Does high accuracy imply that a test is useful? Well, it depends. Let's take a look at why accuracy can be misleading in certain situations. We will use a confusion matrix, which most basic statistics courses introduce in some fashion.

| | Malignant | Benign | |
|---|---|---|---|
| Test Malignant | TP | FP | $PPV = \dfrac{TP}{TP+FP}$ (Precision) |
| Test Benign | FN | TN | $NPV = \dfrac{TN}{FN+TN}$ |
| | $TPR = \dfrac{TP}{TP+FN}$ (Sensitivity, Recall) | $TNR = \dfrac{TN}{FP+TN}$ (Specificity) | $Accuracy = \dfrac{TP+TN}{TP+FN+FP+TN}$ |

$F_2 = \dfrac{5 \cdot PPV \cdot TPR}{4 \cdot PPV + TPR}$
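The definitions above translate directly into a few lines of code. The following is a minimal Python sketch; the function name `confusion_metrics` and the dictionary keys are illustrative choices, not part of any library, and the counts in the example are hypothetical.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return the metrics defined above from the four cells of a 2x2 confusion matrix.

    Assumes every denominator is non-zero (no empty row or column, and TP > 0).
    """
    ppv = tp / (tp + fp)                         # precision
    npv = tn / (fn + tn)
    tpr = tp / (tp + fn)                         # sensitivity, recall
    tnr = tn / (fp + tn)                         # specificity
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f2 = 5 * ppv * tpr / (4 * ppv + tpr)         # recall-weighted F score
    return {"PPV": ppv, "NPV": npv, "TPR": tpr, "TNR": tnr,
            "Accuracy": accuracy, "F2": f2}


# Hypothetical counts, chosen only to exercise the function.
metrics = confusion_metrics(tp=80, fp=20, fn=20, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Keeping the four raw counts as the only inputs means every metric is derived from the same table, so they cannot drift out of sync with one another.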

First, let's examine what happens with a perfect test, which obviously does not exist in any real-world scenario.

| | Malignant | Benign | |
|---|---|---|---|
| Test Malignant | 100 | 0 | $PPV = \dfrac{100}{100+0} = 1$ (Precision) |
| Test Benign | 0 | 100 | $NPV = \dfrac{100}{0+100} = 1$ |
| | $TPR = \dfrac{100}{100+0} = 1$ (Sensitivity, Recall) | $TNR = \dfrac{100}{0+100} = 1$ (Specificity) | $Accuracy = \dfrac{100+100}{100+0+0+100} = 1$ |

As you can see, everything is perfect in this scenario: accuracy, sensitivity, specificity, PPV and NPV are all at 100%.

Now, let's make this slightly more realistic and throw in one false positive and one false negative.

| | Malignant | Benign | |
|---|---|---|---|
| Test Malignant | 99 | 1 | $PPV = \dfrac{99}{99+1} = 0.99$ (Precision) |
| Test Benign | 1 | 99 | $NPV = \dfrac{99}{1+99} = 0.99$ |
| | $TPR = \dfrac{99}{99+1} = 0.99$ (Sensitivity, Recall) | $TNR = \dfrac{99}{1+99} = 0.99$ (Specificity) | $Accuracy = \dfrac{99+99}{99+1+1+99} = 0.99$ |

All of the metrics dropped by one percentage point, to 99%.

Let's start moving towards real-world scenarios. Real prevalences are usually not 1:1. We will keep our “test” near perfect, with only 1% false positives and 1% false negatives. But the condition will have an asymmetric prevalence of 1:9. For example, for one malignant soft tissue sarcoma, there are approximately nine lipomas, neuromas, ganglion cysts, hemangiomas and other benign lesions.

In this example, we are testing 1000 cases. So, using some gold standard, 100 will be diagnosed as malignant and 900 will be diagnosed as benign. Our test will still be very accurate, sensitive and specific. How does a test that gets only 1% wrong perform on the different metrics when prevalence is not symmetric?

| | Malignant | Benign | |
|---|---|---|---|
| Test Malignant | 99 | 9 | $PPV = \dfrac{99}{99+9} = 0.92$ (Precision) |
| Test Benign | 1 | 891 | $NPV = \dfrac{891}{1+891} > 0.99$ |
| | $TPR = \dfrac{99}{99+1} = 0.99$ (Sensitivity, Recall) | $TNR = \dfrac{891}{9+891} = 0.99$ (Specificity) | $Accuracy = \dfrac{99+891}{99+1+9+891} = 0.99$ |
| Prevalence Ratio | 1 | 9 | |

Accuracy, specificity and sensitivity stayed at 99%, but precision dropped from 99% to 92%.
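One way to see why only precision moved is to rebuild both tables from the same error rates. The sketch below is illustrative Python; `table_from_rates` is a hypothetical helper, and rounding to whole cases is an assumption made to match the integer counts used in the text.

```python
def table_from_rates(n_malignant, n_benign, sensitivity, specificity):
    """Expected (TP, FP, FN, TN) for a test with fixed sensitivity and specificity."""
    tp = round(n_malignant * sensitivity)   # malignant cases the test catches
    fn = n_malignant - tp                   # malignant cases the test misses
    tn = round(n_benign * specificity)      # benign cases correctly cleared
    fp = n_benign - tn                      # benign cases flagged as malignant
    return tp, fp, fn, tn


# Same 99% sensitive, 99% specific test at a 1:1 and a 1:9 prevalence ratio.
for n_malignant, n_benign in [(100, 100), (100, 900)]:
    tp, fp, fn, tn = table_from_rates(n_malignant, n_benign, 0.99, 0.99)
    ppv = tp / (tp + fp)
    print(f"{n_malignant}:{n_benign}  TP={tp} FP={fp} FN={fn} TN={tn}  PPV={ppv:.2f}")
```

The true positive count stays at 99 in both runs, while the false positive count grows with the size of the benign group (1 versus 9). Since PPV is the only metric here that mixes the two columns, it is the one that reacts to prevalence.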

If we take one further step in the direction of reality, we realize that many research studies report that their test is about 90% accurate, sensitive and specific. Let’s see:

| | Malignant | Benign | |
|---|---|---|---|
| Test Malignant | 90 | 90 | $PPV = \dfrac{90}{90+90} = 0.50$ (Precision) |
| Test Benign | 10 | 810 | $NPV = \dfrac{810}{10+810} = 0.99$ |
| | $TPR = \dfrac{90}{90+10} = 0.90$ (Sensitivity, Recall) | $TNR = \dfrac{810}{90+810} = 0.90$ (Specificity) | $Accuracy = \dfrac{90+810}{90+10+90+810} = 0.90$ |
| Prevalence Ratio | 1 | 9 | |

The PPV (precision) dropped to 50%. Accuracy, sensitivity and specificity are at 90%, and NPV is 99%. What is going on here? By aiming for high accuracy, we settled on minimizing false positives and false negatives at equal rates. But a 10% error rate on 900 benign cases produces 90 false positives, as many as the 90 true positives, so only half of the positive results are truly malignant. Meanwhile, since the benign condition is more prevalent than the malignant one, NPV benefits from the large number of TNs. The test is no longer precise.
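The same effect can be written in closed form: dividing every cell of the table by the total number of cases gives PPV as a function of prevalence, sensitivity and specificity (this is Bayes' rule). A minimal Python sketch follows; the function name `ppv_from_rates` is hypothetical, and the 1:99 ratio is an illustrative addition rather than a figure from any study.

```python
def ppv_from_rates(prevalence, sensitivity, specificity):
    """PPV = P(malignant | test malignant), via Bayes' rule."""
    true_pos = sensitivity * prevalence                # share of all cases that are TP
    false_pos = (1 - specificity) * (1 - prevalence)   # share of all cases that are FP
    return true_pos / (true_pos + false_pos)


# A 90% sensitive, 90% specific test at three malignant:benign ratios.
for malignant, benign in [(1, 1), (1, 9), (1, 99)]:
    prevalence = malignant / (malignant + benign)
    ppv = ppv_from_rates(prevalence, 0.90, 0.90)
    print(f"{malignant}:{benign}  PPV = {ppv:.2f}")
```

Sensitivity and specificity are properties of the test, but PPV also depends on who is being tested: the rarer the malignant condition, the larger the share of positive results that are false alarms.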

Here is a question: how would you like to get into a commercial plane that is 99% safe? It is highly doubtful that aviation would survive as an industry if every 100th plane were not safe. When 2 of 387 737-MAX8 planes crashed (0.5%), all 387 planes were grounded until the problem was fixed. There are more than 25,000 commercial planes in the world. In 2018, there were 13 fatal incidents (less than 0.05%). More than 99.95% of planes were safe from a fatal incident.

Using similar logic, we would like to have a 100% (or at least extremely close to 100%) sensitive test when it comes to detecting malignancies. Such high sensitivity can only be achieved at the expense of precision. Why at the expense of precision? Because, for a given test, as false negatives drop, false positives rise. Numerically, FPs will rise much more, since prevalence, remember, is 1:9 in our example.

To achieve 100% or very close to 100% sensitivity, a real-world scenario would look something like this:

| | Malignant | Benign | |
|---|---|---|---|
| Test Malignant | 100 | 300 | $PPV = \dfrac{100}{100+300} = 0.25$ (Precision) |
| Test Benign | 0 | 600 | $NPV = \dfrac{600}{0+600} = 1$ |
| | $TPR = \dfrac{100}{100+0} = 1$ (Sensitivity, Recall) | $TNR = \dfrac{600}{300+600} = 0.67$ (Specificity) | $Accuracy = \dfrac{100+600}{100+0+300+600} = 0.7$ |
| Prevalence Ratio | 1 | 9 | |

To be certain, or nearly certain, that no malignancy is missed (TPR = 100%), accuracy, specificity and precision have to drop in real-world scenarios.
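The F2 score defined at the top weights recall more heavily than precision, so it offers one way to compare these operating points. The sketch below is illustrative Python with a hypothetical helper `f2_score`; it uses the count form F2 = 5TP / (5TP + 4FN + FP), which is algebraically the same as the PPV/TPR form.

```python
def f2_score(tp, fp, fn):
    """F2 from raw counts; equivalent to 5*PPV*TPR / (4*PPV + TPR)."""
    return 5 * tp / (5 * tp + 4 * fn + fp)


# (TP, FP, FN) for the two 1:9 scenarios with 1000 cases described in the text.
scenarios = {
    "90% sensitive, 90% specific": (90, 90, 10),
    "100% sensitive": (100, 300, 0),
}
for label, (tp, fp, fn) in scenarios.items():
    print(f"{label}: F2 = {f2_score(tp, fp, fn):.3f}")
```

On these counts F2 is lower for the 100% sensitive test (0.625 versus about 0.776), because 300 false positives outweigh the 10 false negatives that were removed. Even a recall-weighted score, then, does not by itself say how costly a missed malignancy is; that judgment stays with whoever chooses the operating point.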

At this point, our discussion becomes philosophical rather than scientific. What is an acceptable and reasonable sensitivity goal for a test used in health care? Many research publications are happy to report a new test that is 90% or 95% sensitive. This translates to 1 in 10, or 1 in 20, malignancies missed (false negatives). Would you board a plane whose safety was checked with an instrument that was wrong 1 time in 10? Continuing with our airplane analogy, we see that humans accept a 99.95% safety record as reasonable for air travel. For sensitivity, this translates to 5 in 10,000, or 1 in 2,000, false negatives. To achieve 99.95% sensitivity with, say, 33% precision, here is how the confusion matrix for a test of a condition with asymmetric prevalence would look:

| | Malignant | Benign | |
|---|---|---|---|
| Test Malignant | 1999 | 3998 | $PPV = \dfrac{1999}{1999+3998} = 0.33$ (Precision) |
| Test Benign | 1 | 14002 | $NPV = \dfrac{14002}{1+14002} = 0.9999$ |
| | $TPR = \dfrac{1999}{1999+1} = 0.9995$ (Sensitivity, Recall) | $TNR = \dfrac{14002}{3998+14002} = 0.7779$ (Specificity) | $Accuracy = \dfrac{1999+14002}{1999+1+3998+14002} = 0.8$ |
| Prevalence Ratio | 1 | 9 | |
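Tables like this one can be worked out backwards from the targets. The following Python sketch assumes the class sizes are fixed and derives the four cells from a target sensitivity and a target precision; `table_for_targets` is a hypothetical helper, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction


def table_for_targets(n_malignant, n_benign, sensitivity, precision):
    """Counts (TP, FP, FN, TN) that meet a target sensitivity and precision."""
    tp = n_malignant * sensitivity           # from TPR = TP / (TP + FN)
    fn = n_malignant - tp
    fp = tp * (1 - precision) / precision    # from PPV = TP / (TP + FP)
    tn = n_benign - fp
    if tn < 0:
        raise ValueError("target precision is lower than these class sizes allow")
    return tuple(round(x) for x in (tp, fp, fn, tn))


# 2,000 malignant and 18,000 benign cases (1:9), 99.95% sensitivity.
for precision in (Fraction(1, 2), Fraction(1, 3)):
    tp, fp, fn, tn = table_for_targets(2000, 18000, Fraction(1999, 2000), precision)
    print(f"precision {precision}: TP={tp} FP={fp} FN={fn} TN={tn}, "
          f"specificity={tn / (fp + tn):.4f}")
```

With a precision target of one in three, the helper reproduces the counts in the table above (TP = 1999, FP = 3998, FN = 1, TN = 14002). A target of one in two would allow only 1999 false positives among the 18,000 benign cases.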

When we come across a paper that reports 95% accuracy, we should look under the hood, and see at what cost to sensitivity, specificity, and precision such accuracy was obtained.