
In this article, we discuss the most popular performance evaluation metrics in data science, machine learning, and deep learning. In particular, we address accuracy, precision, recall, F1-Score, specificity, and the Receiver Operating Characteristic (ROC) curve.

Table of Contents

  1. Introduction
  2. Confusion Matrix
  3. Case Study: Binary Classification Model for Cancer Detection
  4. Accuracy
  5. Precision
  6. Recall
  7. F1-Score
  8. Specificity
  9. ROC-Curve
  10. Multiclass Classification

1. Introduction

After you have implemented a machine learning model, or even a neural network, the next question is whether it achieves the goal of your project. In other words, you check whether the model meets the predefined performance requirements.

The evaluation metrics that you will choose to measure the performance of your model will differ from project to project and will depend on what exactly you want to achieve.

That is why you have to be very clear about your goal. The more clearly the goal is defined, the more clearly you can define the metrics that measure performance.

Be careful. Some evaluation metrics might suggest your model performs very well, while others may reveal poor performance.

In this article, you will learn the meaning of the most frequently used evaluation metrics and which of them should be used in which case.

2. Confusion Matrix

Before we can discuss the evaluation metrics, we must discuss the concept behind the so-called confusion matrix.

In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a table layout that allows the visualization of the performance of an algorithm.

More precisely, a confusion matrix is a summary of prediction results in a classification problem. Calculating it gives you a better idea of what your classification model is getting right and what types of errors it is making. It is important to understand how a confusion matrix works before we turn to the individual evaluation metrics, because all of the metrics discussed here can be derived from it.


The type of confusion matrix I will introduce at this point is used only for binary classification. But once you understand the concept behind it, you can extend your knowledge to a multi-class classification problem.

Let’s take a look at what a confusion matrix looks like.

[Figure: Confusion matrix]

The confusion matrix for a binary classification problem is a 2x2 matrix. Each row of the matrix represents the instances in a predicted class, while each column represents the instances of the actual class. Each of the four entries counts how many predictions fall into that combination of predicted and actual class.

Each entry has a specific meaning. As always, the easiest way to understand how something works is to look at an example.

3. Case Study: Binary Classification Model for Cancer Detection

Assume we build a binary classification model that helps doctors in a hospital predict whether a patient has cancer or not, based on the patient's medical data.

After we have implemented and trained the model, we want to evaluate its performance on yet-unseen data. We test our model on the medical data of 100,000 patients, 1,192 of whom have cancer, while the rest do not.

Assume the model has correctly identified 542 patients who actually have cancer. These predictions are called True Positives, abbreviated TP, and we make the corresponding entry in the confusion matrix:

[Figure: Confusion matrix with TP = 542 entered]

The model also correctly identified 98,700 patients without cancer. The correctly identified instances of no-cancer are called True Negatives, or TN. The confusion matrix now looks as follows:

[Figure: Confusion matrix with TP = 542 and TN = 98,700 entered]

Of course, the model is not perfect and also made some mistakes during the classification. The model predicted that 108 patients have cancer who in reality did not have cancer. These incorrect predictions are called False Positives (FP), or a Type I Error:

[Figure: Confusion matrix with FP = 108 added]

While still an error, this kind of error might not be as serious as it may seem. In practice, the consequence would be that these patients undergo further medical examination. The error itself would not directly harm the health of the patients.

In contrast, the model predicted that 650 patients do not have cancer who in reality do have cancer. These incorrect predictions are called False Negatives (FN). This is a Type II Error, and in this setting it is much worse than the Type I Error.

[Figure: Completed confusion matrix with TP = 542, FP = 108, FN = 650, TN = 98,700]

In this case, 650 patients who require cancer treatment might not get it because they were not identified by the model. That is potentially much more dangerous than a Type I Error, because here the error is a direct danger to the health of the patients.

Now that we have a visual overview and the count of each outcome, we can examine the performance of this model more closely.

Summary: True Positives, True Negatives, False Positives, False Negatives

Def. True Positive: A true positive is a prediction where the model correctly predicts the positive class. This is usually the class of interest (such as a patient with cancer).

Def. True Negative: Similarly, a true negative is a prediction where the model correctly predicts the negative class. This is usually the class of less interest.

Def. False Positive: A false positive is a prediction where the model incorrectly predicts the positive class. The instance in question actually belongs to the negative class.

Def. False Negative: A false negative is a prediction where the model incorrectly predicts the negative class. The instance in question actually belongs to the positive class.
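
The four definitions above map directly onto a small counting routine. The following is a minimal sketch in plain Python; the function name `confusion_counts` and the toy labels are illustrative assumptions, not data from the case study.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, where 1 is the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # correctly predicted positive
        elif predicted == 0 and actual == 0:
            tn += 1  # correctly predicted negative
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted positive, actually negative (Type I Error)
        else:
            fn += 1  # predicted negative, actually positive (Type II Error)
    return tp, tn, fp, fn


# Hypothetical toy labels: 1 = cancer, 0 = no cancer
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```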

4. Accuracy

One of the most commonly used metrics to measure performance is called accuracy. Accuracy is the fraction of all predictions that the model got right.

To obtain the accuracy, you divide the number of correct predictions (meaning true positives and true negatives) by the number of all predictions made:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

In our case, we obtain an accuracy of (542 + 98,700) / 100,000 = 0.992, which means that 99.2 % of all predictions made by the model were correct.
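
As a quick check of the arithmetic, here is a minimal sketch that recomputes the accuracy from the four counts of the case study. The variable names are illustrative.

```python
# Counts from the cancer-detection case study
tp, tn, fp, fn = 542, 98_700, 108, 650

# Accuracy: correct predictions divided by all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.3f}")  # prints "Accuracy: 0.992"
```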

When should we use Accuracy?

We should use accuracy to evaluate models for classification problems where the classes are well balanced, meaning each class makes up roughly the same share of the dataset. A ratio of, for example, 90 % / 10 % would not be suitable.

5. Precision

Another important metric is called precision. Precision is the fraction of the model's positive predictions that are actually positive. In other words, precision tries to answer the following question: Of all instances the model predicted as positive, how many are truly positive?

In our example, the precision would answer the following question: “Of all patients that the model predicted to have cancer, how many of these patients actually have cancer?”

Mathematically, you get the precision by dividing the number of true positives by the number of all positive predictions the model made, that is, true positives plus false positives:

$$\text{Precision} = \frac{TP}{TP + FP}$$

In our case, we divide 542 by 650 (542 true positives plus 108 false positives). As a result, we get a precision of about 83.4 %, which is a fairly good value. It tells us that when the model classifies a patient as having cancer, it is correct about 83.4 % of the time.
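
The same calculation as a minimal Python sketch, using the case-study counts. The variable names are illustrative.

```python
# Counts from the cancer-detection case study
tp, fp = 542, 108

# Precision: true positives divided by all positive predictions
precision = tp / (tp + fp)
print(f"Precision: {precision:.1%}")
```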

When should we use Precision?

We should use precision as an evaluation metric if we want to be very sure that a positive prediction is correct. For example, if we are building a system to predict whether we should decrease the credit limit on a particular account, we want to be very sure about our prediction, or it may result in customer dissatisfaction.

6. Recall

Recall, also called sensitivity, is a metric that tells us how good a model is at finding the actual positive instances. In our example, recall tries to answer the following question: “How good, or how sensitive, is the model at identifying cancer in patients' data in general?”

While the precision tells us how many of the patients classified by the model as having cancer actually have cancer, the recall tells us what percentage of all cancer patients in the dataset the model correctly classified.

The model identified 542 true positives, meaning it correctly predicted 542 patients to have cancer. But the number of all patients in the dataset who have cancer is 1,192. A perfect recall would mean that the model identifies all 1,192 of these patients. The recall is calculated by dividing the number of true positives by the total number of patients having cancer, which is the sum of true positives and false negatives:

$$\text{Recall} = \frac{TP}{TP + FN}$$

With 542 true positives and 650 false negatives, we get a recall of 542 / 1,192, or about 45.5 %.

This is a pretty low recall. It means that the model identified fewer than half of the patients in the dataset who have cancer. The remaining cancer patients are invisible to the model.

This means more than half of the patients who require medical treatment might not get it.
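
A minimal Python sketch of the recall calculation for the case study follows. The variable names are illustrative.

```python
# Counts from the cancer-detection case study
tp, fn = 542, 650

# Recall (sensitivity): true positives divided by all actual positives
recall = tp / (tp + fn)
missed_share = fn / (tp + fn)  # share of cancer patients the model did not find

print(f"Recall: {recall:.1%}")
print(f"Missed cancer patients: {missed_share:.1%}")
```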

When should we use Recall?

We want to use recall for the performance evaluation of a model if we want to identify as many of the actual positives as possible. For example, if we are building a system to predict whether a person has cancer, we want to capture the disease even if we are not very sure, just as in our example.

7. F1-Score

F1-Score is a metric that combines both the recall and precision of a model. More precisely, the F1-Score is the harmonic mean of the two. It was introduced to summarize the performance of a model in a single number instead of reporting recall and precision separately:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

In our case, the F1-Score has a value of about 0.59. In contrast to the other metrics, this value does not have an intuitive interpretation. It is simply a single number that takes both precision and recall into account.
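
To make the harmonic mean concrete, here is a minimal sketch that derives the F1-Score from the case-study counts. The variable names are illustrative.

```python
# Counts from the cancer-detection case study
tp, fp, fn = 542, 108, 650

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1-Score: {f1:.3f}")
```

Because it is a harmonic mean, the F1-Score is pulled toward the lower of the two values, so a model cannot reach a high F1-Score with good precision alone.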

When should we use F1-Score?

We should use the F1-Score if we want both recall and precision to be as high as possible.

8. Specificity

Specificity is an evaluation metric that can be considered the counterpart of recall for the negative class. It tells us the ratio of correctly identified negative instances to all negative instances in the dataset. In our case, that is the ratio of correctly classified patients without cancer to all patients in the dataset who do not have cancer:

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Our model identified 98,700 patients without cancer correctly, while 108 patients without cancer were misclassified as having cancer. This leads to a specificity of 98,700 / 98,808, or about 99.9 %.
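
A minimal Python sketch of the specificity calculation for the case study follows. The variable names are illustrative.

```python
# Counts from the cancer-detection case study
tn, fp = 98_700, 108

# Specificity: true negatives divided by all actual negatives
specificity = tn / (tn + fp)
false_positive_rate = 1 - specificity  # the share of healthy patients flagged as having cancer

print(f"Specificity: {specificity:.2%}")
print(f"False positive rate: {false_positive_rate:.2%}")
```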


After evaluating the model with several different metrics, we can draw a conclusion. As you may have recognized, the accuracy of a model is not always meaningful.

Although we got a very high accuracy of 99.2 %, we later saw that the model was not as good as we thought at first. In this particular case, we were dealing with a very imbalanced dataset. Out of 100,000 patients in the dataset, only 1,192 had cancer, while 98,808 did not.

Unfortunately, unbalanced data sets are a very common problem in data science. Usually, the things you want to predict are very rare. I hope that by now you are convinced of the importance of early goal setting for your project.

Imbalanced Dataset

If our goal were simply to predict the correct class for any patient, we would have a very strong model that is correct 99.2 % of the time. But that goal would not be very wise. What we really want is to identify the cancer patients (the positive class) so that medical treatment can be initiated to help them.

The high accuracy resulted from the large number of negative instances, that is, patients without cancer. With so many examples available, the model could easily learn to identify these negative instances.

Meanwhile, the class we are really interested in, patients with cancer, was a very rare event to learn from.

While this was just a theoretical example, you will find these kinds of datasets in practice very often. This is why I introduced precision and recall. These metrics are much better suited to evaluating the performance of a model trained on an imbalanced dataset.
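
To see how much of the accuracy comes from the class imbalance alone, consider a hypothetical baseline that always predicts "no cancer". This baseline is an assumption added for illustration and is not part of the case study. A minimal sketch:

```python
# Class sizes from the case study
n_cancer, n_healthy = 1_192, 98_808

# Hypothetical baseline: predict "no cancer" for every patient
tp, fp = 0, 0          # it never predicts the positive class
fn, tn = n_cancer, n_healthy

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"Baseline accuracy: {accuracy:.1%}")
print(f"Baseline recall:   {recall:.1%}")
```

The baseline reaches almost the same accuracy as the trained model without finding a single cancer patient, which is why accuracy alone says little on such a dataset.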

Precision vs Recall

Given our goal to classify these rare events, these metrics are better suited to evaluate the performance of the model. But again, we must first define the goal.

If we want to measure the success of the model by its ability to identify the patients with cancer, then we want to achieve a high recall. On the other hand, high recall may come with low precision. Low precision means that while the model identifies many instances of interest, it also raises many false alarms (false positives).

If we want a model that makes as few mistakes as possible when it predicts that a patient has cancer, we must aim for high precision. But then the model will tend to miss many of the actual cancer patients (more false negatives).

In fact, when working on machine learning or deep learning problems, you will notice a tradeoff between recall and precision. It is very difficult to increase both metrics at the same time. When you increase the recall, you will usually see the precision drop, and vice versa.

Which metric is more important depends on the goal you have set for the project.
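
One common way this tradeoff shows up is when you move the probability cutoff above which an instance is classified as positive. The following minimal sketch uses made-up scores for ten patients; the data and the function name `precision_recall_at` are hypothetical.

```python
def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when scores >= threshold are classified as positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # Convention chosen here: precision is 1.0 if nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels (1 = cancer) and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.30, 0.60, 0.40, 0.35, 0.20, 0.10, 0.05]

for threshold in (0.7, 0.5, 0.25):
    precision, recall = precision_recall_at(threshold, y_true, scores)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, lowering the threshold raises the recall from 0.50 to 1.00 while the precision falls from 1.00 to 0.57.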

9. Receiver Operating Characteristic (ROC) Curve

The ROC curve indicates how well the predicted probabilities separate the positive class from the negative class.

Another way to visualize and measure the performance of a classification model is the so-called Receiver Operating Characteristic curve. To get this curve, we plot the recall of a model against another value, which is 1 - specificity (also known as the false positive rate).

And we are doing this for several classification probability thresholds. What do I mean by this? In the previous example we performed a binary classification of patients into two different classes. One class was that this patient has cancer, and the other class was that this patient is healthy. For such classifications, it is usually agreed that if the model predicts a probability score greater than 0.5 for belonging to a particular class, it means that the data instance belongs to that class.

The opposite is true for prediction scores below 0.5. For our previous example, this means that if the model predicts a probability of more than 0.5 that a patient has cancer, the patient is classified as having cancer.

For the receiver operating characteristic, we use several thresholds in a range between 0 and 1. For each threshold we get different numbers of true positives, true negatives, and so on, which can be used to calculate the recall and the value of 1 - specificity. In the end, we plot these (1 - specificity, recall) pairs in a single diagram:

[Figure: ROC curves of three models]


This diagram shows the ROC curve for three different models that perform the same classification task. To determine the performance of each model we must measure the area under the curve, the so-called AUC for each of these three curves.

A higher AUC means that the corresponding model can better distinguish between the two classes (positive and negative). Here, Model 1 is better than Model 2, which in turn is better than Model 3.

An area of 1 means that the model is perfect and always makes the correct classification. An area of 0.5 means that the model cannot distinguish between the classes at all and performs no better than random guessing.

In general, I would recommend aiming for an area under the curve greater than 0.8.
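
The procedure described in this section can be sketched in a few lines of plain Python: compute one (1 - specificity, recall) point per threshold, then approximate the area under the resulting curve with the trapezoidal rule. The toy labels, scores, and function names below are hypothetical, and the trapezoidal rule is one common way to compute the area, not necessarily the one used for the diagram above.

```python
def roc_points(y_true, scores, thresholds):
    """Return sorted (false positive rate, recall) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return sorted(points)


def auc_trapezoid(points):
    """Area under a curve given as sorted (x, y) points, via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = cancer) and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.30, 0.60, 0.40, 0.35, 0.20, 0.10, 0.05]

# Thresholds: 0, every distinct score, and 1, so the curve runs from (0, 0) to (1, 1)
thresholds = [0.0] + sorted(set(scores)) + [1.0]

points = roc_points(y_true, scores, thresholds)
auc = auc_trapezoid(points)
print(f"AUC: {auc:.3f}")
```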

10. Multi-Class Classification

The evaluation metrics in the previous sections were presented for a binary classification problem. Of course, all of these concepts can be extended to multi-class classification, where you may deal with 3, 5, or even dozens of classes.

To do so, you apply the concept of true positives, false positives, and false negatives to each class label individually, treating that class as positive and all others as negative. You can then either examine the performance of your model on a single class label or compute the overall performance by averaging the per-class results.
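
As a minimal sketch of this per-class approach, the following code computes precision and recall for each label in a made-up three-class problem and then takes the unweighted mean over the classes, which is usually called macro averaging. The labels, data, and function name are hypothetical, and other averaging schemes exist.

```python
def per_class_precision_recall(y_true, y_pred, label):
    """Treat `label` as the positive class and all other labels as negative."""
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == label and p == label)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a != label and p == label)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical three-class example
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

labels = sorted(set(y_true))
results = {label: per_class_precision_recall(y_true, y_pred, label) for label in labels}

for label, (precision, recall) in results.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")

# Macro average: unweighted mean of the per-class values
macro_precision = sum(p for p, _ in results.values()) / len(labels)
macro_recall = sum(r for _, r in results.values()) / len(labels)
print(f"macro precision={macro_precision:.2f} macro recall={macro_recall:.2f}")
```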