Evaluating a Classification Model
A classification model predicts which of a set of categories a new observation belongs to. Classification is a supervised learning approach in which the target variable is categorical.
Evaluating a machine learning model is as important as building it. We build models to perform well on new, unseen data, so we need to check whether they actually do. Evaluating a classification model is not easy because there are many evaluation metrics to choose from, and which one to use depends on the data and the problem at hand.
In this post, I will go over the main evaluation metrics for classification models and when to use them.
Accuracy
Accuracy is the fraction of predictions that are correct. It is calculated as the number of correct predictions divided by the total number of predictions.
Accuracy can be misleading when there is a class imbalance in your data. Suppose 90% of the samples in the dataset belong to class A and 10% belong to class B. A model that always predicts class A reaches 90% accuracy, even though it never identifies a single class B sample.
When to use: Accuracy is a good evaluation metric when the classes in the data are roughly balanced.
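To make the 90/10 example concrete, here is a minimal sketch in plain Python. The helper name accuracy and the toy labels are my own assumptions for illustration, not part of any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical dataset: 90 samples of class "A" and 10 of class "B".
y_true = ["A"] * 90 + ["B"] * 10

# A trivial "model" that always predicts the majority class.
y_pred = ["A"] * 100

acc = accuracy(y_true, y_pred)

# How many class B samples were actually found by the model?
found_b = sum(1 for t, p in zip(y_true, y_pred) if t == "B" and p == "B")

print(acc)      # 0.9
print(found_b)  # 0
```

The score looks good, yet the model is useless for class B, which is exactly why accuracy alone is not enough on imbalanced data.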
Confusion Matrix
A confusion matrix is a table holding the number of true positives, true negatives, false positives, and false negatives. The confusion matrix is not an evaluation metric by itself, but its values are the building blocks for many other evaluation metrics.
- True Positive (TP): The model predicted the positive class and the actual class is positive.
- False Positive (FP): The model predicted the positive class but the actual class is negative.
- True Negative (TN): The model predicted the negative class and the actual class is negative.
- False Negative (FN): The model predicted the negative class but the actual class is positive.
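The four counts above can be tallied with a simple loop. This is only a sketch for the binary case: the function name confusion_counts, the positive argument, and the sample labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


# Hypothetical labels: 1 is the positive class, 0 is the negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)  # 3 1 3 1
```

The four counts always add up to the total number of samples, which is a handy sanity check.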
Precision
Precision measures how reliable the model's positive predictions are: when the model predicts positive, how often is it correct? It is calculated as TP / (TP + FP). The focus of precision is on the positive predictions.
When to use: Use precision when you need the positive predictions to be correct, that is, when false positives are costly.
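As a rough sketch, precision can be computed directly from the labels as TP / (TP + FP). The function name, the toy labels, and the choice to return 0.0 when the model makes no positive predictions are all assumptions on my part; other conventions exist for that edge case.

```python
def precision(y_true, y_pred, positive=1):
    """Share of positive predictions that are actually positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    if tp + fp == 0:
        # Assumption: report 0.0 when nothing was predicted positive.
        return 0.0
    return tp / (tp + fp)


# Hypothetical labels: 1 is the positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

prec = precision(y_true, y_pred)
print(prec)  # 0.75 (3 of the 4 positive predictions are correct)
```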
Recall or Sensitivity or True Positive Rate
Recall measures how good the model is at identifying the actual positives: of all the samples that are actually positive, what fraction did the model identify correctly? It is calculated as TP / (TP + FN). The focus of recall is on the actual positive class.
When to use: Use recall when you need the actual positives to be found, that is, when false negatives are costly.
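Here is a small sketch of recall, computed as TP / (TP + FN). The toy labels describe a hypothetical, overly cautious model that flags only one sample as positive. The function name and the 0.0 fallback for data with no actual positives are assumptions for illustration.

```python
def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model predicted as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        # Assumption: report 0.0 when there are no actual positives.
        return 0.0
    return tp / (tp + fn)


# Hypothetical cautious model: it predicts positive only once.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

rec = recall(y_true, y_pred)
print(rec)  # 0.25 (only 1 of the 4 actual positives is found)
```

Note that the single positive prediction here is correct, so this model's precision would be perfect while its recall is poor. The two metrics answer different questions.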
F1 Score
The F1 score is the harmonic mean of precision and recall: F1 = 2 * Precision * Recall / (Precision + Recall). It takes values between 0 and 1, where 0 is the worst and 1 is the best. Because it is a harmonic mean, the F1 score is high only when both precision and recall are high.
When to use: Use the F1 score when there is a class imbalance and you want a balance between precision and recall.
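The following sketch shows why the harmonic mean is used instead of a plain average. The function name f1_score_from and the example values (precision 1.0, recall 0.25) are hypothetical and chosen only to illustrate the effect.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Assumption: define F1 as 0.0 when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical model with perfect precision but poor recall.
p, r = 1.0, 0.25

f1 = f1_score_from(p, r)
arithmetic_mean = (p + r) / 2

print(round(f1, 3))     # 0.4
print(arithmetic_mean)  # 0.625
```

The arithmetic mean hides the weak recall, while the F1 score is pulled toward the lower of the two values.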
Specificity or True Negative Rate
Specificity measures the model's ability to identify the actual negatives: when the actual value is negative, how often is the prediction correct? It is calculated as TN / (TN + FP). Specificity is the counterpart of recall for the negative class.
When to use: Use specificity when you need the actual negatives to be classified correctly.
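Specificity follows the same pattern as recall, only for the negative class: TN / (TN + FP). In this sketch, the function name, the sample labels, and the 0.0 fallback for data with no actual negatives are assumptions for illustration.

```python
def specificity(y_true, y_pred, positive=1):
    """Share of actual negatives that the model predicted as negative."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tn + fp == 0:
        # Assumption: report 0.0 when there are no actual negatives.
        return 0.0
    return tn / (tn + fp)


# Hypothetical labels: 1 is the positive class, 0 is the negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

spec = specificity(y_true, y_pred)
print(spec)  # 0.75 (3 of the 4 actual negatives are predicted negative)
```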
ROC curve and AUC
The ROC (receiver operating characteristic) curve summarizes the performance of the model across different threshold values by plotting one point from the confusion matrix at each threshold. The x-axis of the ROC curve is the false positive rate (1 - specificity) and the y-axis is the true positive rate (sensitivity). AUC is the area under the ROC curve. A perfect model has an AUC of 1, while a model that guesses at random has an AUC of about 0.5.
When to use: Use ROC AUC when you care equally about the positive and negative classes. If true negatives matter as much as true positives, ROC AUC is a sensible choice.
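To show how the curve is built, here is a minimal sketch that sweeps the threshold over the predicted scores, collects (false positive rate, true positive rate) points, and estimates the area with the trapezoidal rule. The function names roc_points and auc, the toy scores, and the assumption that labels are 0/1 with at least one sample of each class are mine. In practice you would more likely rely on a library routine.

```python
def roc_points(y_true, scores):
    """Return ROC points as (fpr, tpr) pairs, from strictest to loosest threshold.

    Assumes 0/1 labels with at least one positive and one negative sample.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    # Each distinct score acts as a threshold: predict positive if score >= threshold.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a curve given as (x, y) points, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
roc_auc = auc(points)
print(points)   # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(roc_auc)  # 0.75
```

Each point uses the false positive rate as its x value and the true positive rate as its y value, so the curve always runs from (0, 0) to (1, 1).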
Conclusion
There is no single metric that is optimal for all tasks. When deciding which evaluation metric to use, consider what question you are trying to answer, and choose based on your data and your problem.