This article is a continuation of a previous post entitled “Evaluating a Fraud Detection Using Cost-Sensitive Predictive Analytics.”
Fraud detection is a cost-sensitive problem, in the sense that falsely flagging a transaction as fraudulent carries a significantly different financial cost than missing an actual fraudulent transaction. To take these costs into account, companies should use a more business-oriented measure such as “Cost,” which allows them to make decisions that are better aligned with their business objectives. This measure takes into account the actual financial gains and losses incurred in the fraud detection process and is based on the cost matrix, which relates the costs of true positives CTPi, true negatives CTNi, false positives CFPi, and false negatives CFNi to the actual (yi) and predicted (pi) values of each transaction i.
Using this cost matrix, the Cost measure for N transactions is evaluated using:
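A form consistent with the cost matrix described above, given here as a reconstruction under that assumption rather than as the author's exact notation, sums over all transactions the cost-matrix entry selected by each transaction's actual and predicted labels:

```latex
\mathrm{Cost} = \sum_{i=1}^{N} \Big[ y_i \big( p_i\, C_{TP_i} + (1 - p_i)\, C_{FN_i} \big)
              + (1 - y_i) \big( p_i\, C_{FP_i} + (1 - p_i)\, C_{TN_i} \big) \Big]
```

With binary y_i and p_i, exactly one of the four cost terms is active for each transaction.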
However, regardless of how the Cost is evaluated, standard models do not take the different misclassification costs into account during training. One model previously developed to include the different financial costs during the training phase is the cost-sensitive logistic regression (see paper), a natural extension of the traditional logistic regression that includes the example-dependent financial costs. It does so by replacing the objective function of the model with one that is cost-sensitive. The logistic regression cost function, usually referred to as the negative logarithm of the likelihood, is defined as:
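In its standard textbook form, which is assumed here, the negative log-likelihood averaged over the N transactions reads:

```latex
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} J_i(\theta), \qquad
J_i(\theta) = -\,y_i \log\big(h_\theta(x_i)\big) - (1 - y_i) \log\big(1 - h_\theta(x_i)\big)
```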
where xi and yi are the feature vector and true label of transaction i, for each of the N transactions. Moreover, hθ(xi) is the logistic (sigmoid) function with parameters θ, defined by:
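Assuming the standard logistic form, with a linear combination of the features as its argument:

```latex
h_\theta(x_i) = g\big(\theta^{\top} x_i\big), \qquad g(z) = \frac{1}{1 + e^{-z}}
```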
where the outcome of this equation is the estimated probability that transaction i is fraudulent (yi = 1), given its features or variables xi. The objective of the logistic regression is to find the parameters θ that minimize the cost function and therefore maximize the predictive power of the algorithm. However, this cost function assigns the same weight to both kinds of error, false positives and false negatives. As discussed before, this does not hold in fraud detection, where the two errors carry very different financial costs.
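The symmetry is easy to check numerically. The following minimal sketch assumes the standard sigmoid and per-example log-loss; the function names `sigmoid` and `log_loss_i` and the example probabilities are illustrative and not taken from the paper:

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss_i(y, h):
    """Per-example negative log-likelihood for label y and estimated probability h."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))


# A confident false negative: a fraud (y = 1) scored at 0.1.
false_negative_loss = log_loss_i(1, 0.1)
# A confident false positive: a legitimate transaction (y = 0) scored at 0.9.
false_positive_loss = log_loss_i(0, 0.9)
# A confident correct prediction: a fraud (y = 1) scored at 0.9.
correct_loss = log_loss_i(1, 0.9)

print(false_negative_loss, false_positive_loss, correct_loss)
```

Both mistakes receive essentially the same loss, whatever the amount at stake, while the confident correct prediction costs almost nothing. In other words, the standard objective treats a false positive and a false negative as equally costly.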
In cost-matrix terms, the standard logistic loss implies that:
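One reading consistent with the preceding paragraph, stated here as an assumption about the intended formula, is that the per-example loss vanishes for confident correct predictions and grows without bound for confident wrong ones, regardless of the class:

```latex
J_i(\theta) \approx
\begin{cases}
0      & \text{if } y_i \approx h_\theta(x_i), \\
\infty & \text{if } y_i \approx 1 - h_\theta(x_i),
\end{cases}
\qquad \Longrightarrow \qquad
C_{TP_i} = C_{TN_i} \approx 0, \quad C_{FP_i} = C_{FN_i} \approx \infty
```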
To take the different costs into account during the training of the algorithm, a new cost-sensitive logistic regression cost function was developed as:
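A form consistent with the Cost measure and the logistic model above, assuming the predicted label pi is replaced by the estimated probability hθ(xi), is:

```latex
J^{c}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \Big[
  y_i \big( h_\theta(x_i)\, C_{TP_i} + (1 - h_\theta(x_i))\, C_{FN_i} \big)
  + (1 - y_i) \big( h_\theta(x_i)\, C_{FP_i} + (1 - h_\theta(x_i))\, C_{TN_i} \big) \Big]
```

Under this form, each transaction contributes its expected financial cost given the model's probability estimate, so an error on a large transaction weighs more than an error on a small one.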
To show the performance of this new model, I evaluate a logistic regression, a decision tree, a random forest, and the new cost-sensitive logistic regression on a real credit card fraud dataset provided by a large European card processing company. The dataset contains approximately 750,000 transactions, with a fraud ratio of 0.467% and total losses due to fraud of 866,410 Euros. The algorithms are compared using the F1-Score and the Cost.
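To illustrate why both measures are reported, the sketch below computes the Cost and the F1-Score for two hypothetical sets of predictions. Everything in it is an assumption made for illustration: the four toy transactions, the fixed administrative cost of 5 for any flagged transaction, a false-negative cost equal to the transaction amount, a true-negative cost of zero, and the helper names `total_cost` and `f1_score`. None of it comes from the dataset used in the study.

```python
def total_cost(y_true, y_pred, c_tp, c_fp, c_fn, c_tn):
    """Sum, over all transactions, the cost-matrix entry selected by (y_i, p_i)."""
    total = 0.0
    for y, p, tp, fp, fn, tn in zip(y_true, y_pred, c_tp, c_fp, c_fn, c_tn):
        total += y * (p * tp + (1 - p) * fn) + (1 - y) * (p * fp + (1 - p) * tn)
    return total


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), with fraud (label 1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: four transactions, the first and last are fraudulent.
amounts = [100.0, 50.0, 20.0, 500.0]
y_true = [1, 0, 0, 1]
admin_cost = 5.0  # assumed fixed cost of handling any flagged transaction

c_tp = [admin_cost] * len(amounts)  # fraud caught: administrative cost only
c_fp = [admin_cost] * len(amounts)  # false alarm: administrative cost only
c_fn = amounts                      # missed fraud: the full amount is lost
c_tn = [0.0] * len(amounts)         # legitimate and not flagged: no cost

pred_a = [1, 0, 0, 0]  # catches the 100 fraud, misses the 500 fraud
pred_b = [0, 0, 0, 1]  # misses the 100 fraud, catches the 500 fraud

cost_a = total_cost(y_true, pred_a, c_tp, c_fp, c_fn, c_tn)  # 505.0
cost_b = total_cost(y_true, pred_b, c_tp, c_fp, c_fn, c_tn)  # 105.0
f1_a = f1_score(y_true, pred_a)
f1_b = f1_score(y_true, pred_b)

print(cost_a, cost_b, f1_a, f1_b)
```

Both sets of predictions have the same F1-Score (each catches one of the two frauds with no false alarms), yet their Costs differ by a factor of almost five, because the F1-Score does not see which fraud was missed.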
The results show that the cost-sensitive logistic regression is the model that minimizes the Cost while also maximizing the F1-Score, so it performs best by both measures. It is interesting to see how different the results are between a standard logistic regression and the cost-sensitive logistic regression. In conclusion, using a model that takes the real financial costs into account during training yields further improvements, as measured by both Cost and F1-Score.