Abstract

Statistical learning methods were applied to credit card transactions in order to detect fraud. A variety of learning techniques were explored and validated. Simple methods like logistic regression show great promise, especially given their computational efficiency at test time. However, despite the size of the available data, fraud occurs so rarely that a much larger dataset should be used to train models before they are put into use.


Introduction

Credit and debit cards [1] are being used for an ever-increasing number of payments [2]. While cards are likely a more secure form of payment than cash and checks, credit card fraud [3] continues to threaten cardholders and card issuers. Skimming, phishing, and a variety of other techniques can be used to gain access to an account and create unauthorized transactions. Given the high volume of credit card transactions, financial institutions must combat this potential for fraud using automated systems.

To construct a system to detect credit card fraud, statistical learning techniques were applied to a dataset containing hundreds of thousands of credit card transactions. The results show potential for such a system to screen all credit card transactions for fraud, especially given that the dataset used here is a fraction of what would be available to major financial institutions.


Methods

Data

The dataset used for this analysis contains credit card transactions made by European cardholders during September 2013. The data was accessed through Kaggle [4], where it was donated by Worldline and the Machine Learning Group of the Université Libre de Bruxelles [5].

These transactions occurred over a two-day period, during which there were 492 fraudulent charges out of a total of 284,807 transactions. The data is thus highly imbalanced, with fraudulent charges accounting for only 0.172% of transactions. For each transaction, three quantities are given:

  • Time of transaction (relative to first transaction in dataset)
  • Amount of transaction
  • Fraud status of transaction

For this analysis, time is not considered.

In addition to these three quantities, 28 principal components [6] are available for each transaction. To maintain the confidentiality of these transactions, the original features are not provided. Further, information about which original features were used to create these principal components is not available. We can reasonably assume that those features contain information that would be available to the card issuer at the time of transaction. We assume this information includes, but is not limited to:

  • Billing address of cardholder
  • Location of transaction
  • Date and time of transaction
  • Name of merchant
  • Information about previous transactions of cardholder (amount, time, location, type, etc.)
  • Information about previous transactions of merchant (amounts, locations, cardholders, etc.)

In preparation for model training, a training dataset is created using 80% of the provided data. Within the training data, 0.173% of observations are fraudulent, which roughly matches the overall proportion in the original dataset.
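
The split itself is not shown; a minimal sketch using caret, assuming the full data lives in a frame named cc (the name and the seed are illustrative, not taken from the original analysis):

library(caret)

set.seed(42)  # illustrative seed; the original seed is not documented

# createDataPartition() samples within each class, so the 80/20 split
# roughly preserves the fraud/genuine proportions noted above
idx    = createDataPartition(cc$Class, p = 0.80, list = FALSE)
cc_trn = cc[idx, ]
cc_tst = cc[-idx, ]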

Some exploratory data analysis can be found in the appendix.

Modeling

Five different classification models were trained, each using 5-fold cross-validation, with the best tuning parameters chosen using ROC [7].
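
The control objects passed as trControl in the calls below (cv, cv_down, cv_up, and cv_smote) are not defined in the original text. A sketch of one plausible construction with caret, assuming the ROC/Sens/Spec summary reported in the appendix tables:

library(caret)

# shared settings: 5-fold CV, class probabilities, and caret's
# twoClassSummary, which reports ROC, Sens, and Spec
cv       = trainControl(method = "cv", number = 5, classProbs = TRUE,
                        summaryFunction = twoClassSummary)
cv_down  = trainControl(method = "cv", number = 5, classProbs = TRUE,
                        summaryFunction = twoClassSummary, sampling = "down")
cv_up    = trainControl(method = "cv", number = 5, classProbs = TRUE,
                        summaryFunction = twoClassSummary, sampling = "up")
cv_smote = trainControl(method = "cv", number = 5, classProbs = TRUE,
                        summaryFunction = twoClassSummary, sampling = "smote")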

  • A random forest via the ranger package in R fit to training data that has been down-sampled.
mod_rf_down = train(Class ~ ., data = cc_trn, 
                    method = "ranger", 
                    metric = "ROC", 
                    trControl = cv_down)
  • A logistic regression with no subsampling of the training data.
mod_glm_cv = train(Class ~ ., data = cc_trn, 
                   method = "glm", 
                   metric = "ROC", 
                   trControl = cv)
  • A logistic regression fit to training data that has been up-sampled.
mod_glm_up = train(Class ~ ., data = cc_trn, 
                   method = "glm", 
                   metric = "ROC", 
                   trControl = cv_up)
  • A logistic regression fit to training data that has been down-sampled.
mod_glm_down = train(Class ~ ., data = cc_trn, 
                     method = "glm", 
                     metric = "ROC", 
                     trControl = cv_down)
  • A logistic regression fit to training data that used SMOTE [8] to correct for class imbalance.
mod_glm_smote = train(Class ~ ., data = cc_trn, 
                      method = "glm", 
                      metric = "ROC", 
                      trControl = cv_smote)

Model selection and evaluation are discussed in the results section.


Results

The table below shows the results of fraud predictions on the test data using a logistic regression model fit to the training data with an up-sampling procedure to combat the effect of the massive class imbalance. Additional intermediate tuning results can be found in the appendix. Due to computational limitations, only a random forest that utilizes down-sampling is presented. While the best result is found within the random forest model, it does not appear to be significantly different from the logistic regressions considered. As a result, a logistic regression is chosen: its simplicity is especially helpful at test time, since it can compute predictions much faster than the random forest.

Models were tuned for ROC, but sensitivity was also considered when choosing a final model. Aside from the logistic regression model without any subsampling, all models had similar performance.

Table: Test Results, Up-sampled Logistic Regression

                         Truth
                     Fraud   Genuine
Predicted: Fraud        91      1402
Predicted: Genuine       7     55461

Within this test data, 7.1% of fraudulent transactions are misclassified as genuine, while 2.5% of genuine transactions are labeled as fraud. We note that these two values could be traded off against each other by adjusting the threshold for labeling a transaction as fraud.
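
A sketch of that adjustment, assuming cc_tst holds the held-out 20% and using the up-sampled logistic regression from above (the 0.5 cutoff shown is caret's default):

# predicted probability of fraud for each test transaction
prob_fraud = predict(mod_glm_up, cc_tst, type = "prob")$fraud

# raising the cutoff flags fewer transactions as fraud (fewer false
# positives, more false negatives); lowering it does the reverse
pred = factor(ifelse(prob_fraud > 0.5, "fraud", "genuine"),
              levels = c("fraud", "genuine"))
table(Predicted = pred, Truth = cc_tst$Class)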


Discussion

The results above are somewhat encouraging. While our model does produce errors of both types, we believe that, for this application, the analysis demonstrates a proof of concept for an automated fraud detection system. With more data, both observations and features, this model could likely be improved before being put into practice.

We consider false positives produced by this model (detecting fraud in a genuine transaction) to be the lesser of the two potential errors. By labeling these transactions as fraud, the card issuer could simply deny them, provided they are detected in real time. This comes at the cost of cardholder inconvenience. Also, in modern banking, this denial is often coupled with messaging (text, email, or phone call) to the cardholder that can be used to identify these false positives and bypass the denial. The cost of doing so should be considered when using a model such as this in practice.

Table: Test Data False Negatives

Predicted   Actual   Amount
genuine     fraud      0.00
genuine     fraud      0.92
genuine     fraud      1.79
genuine     fraud      2.47
genuine     fraud     45.03
genuine     fraud     99.99
genuine     fraud    319.20

False negatives (failure to detect a true fraud) should be investigated further. The table above lists the seven false negatives found in the test data. Although this is an extremely limited sample, we make two comments. First, the total amount of these transactions comes to 469.40. Spread across all 56,961 test transactions, that is roughly 0.0082 per transaction. While this is extremely small, we note one additional consideration suggesting that amount does not fully explain the risk of false negatives.

Within this data, there are a large number of transactions for either 0.00 or 0.01. The rate of occurrence of transactions for 0.00 or 0.01 far exceeds that of transactions for 0.02 through 0.10. It is unclear what these transactions could be for, but one possibility is account verification when linking a credit card to third-party services. So while they have little monetary value, they create a potential security risk. While we cannot make any strong conclusion, it is worrying that a fraudulent transaction for 0.00 appears among the false negatives.
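
These counts are easy to tabulate directly; a one-line sketch, with cc again denoting the full dataset:

# counts of transactions at each amount between 0.00 and 0.10
table(cc$Amount[cc$Amount <= 0.10])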

Because fraud detection should be performed in real time [10], that is, the card issuer would like to detect fraudulent transactions as they occur, the speed of predictions from our trained models must be taken into consideration. While the random forest model had similar detection performance, it was on average 8.8 times slower when making predictions at test time. This likely amounts to a very small absolute difference, but given the volume of transactions, every bit of time counts.
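
The exact timing procedure is not documented; a comparison along these lines could be made with a sketch such as:

t_glm = system.time(predict(mod_glm_up,  cc_tst))["elapsed"]
t_rf  = system.time(predict(mod_rf_down, cc_tst))["elapsed"]

# ratio of random forest to logistic regression prediction time
unname(t_rf / t_glm)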

By using principal components instead of the original features, this model is essentially a black box. With access to the full feature set, and using interpretable models, some additional sanity checks could be made about the predictions of this model.


Appendix

Data Dictionary

  • V1 - V28 - 28 principal components based on an unknown set of input features that contain information about each transaction.
  • Amount - Amount of transaction.
  • Class - Transaction label: fraud or genuine.

For additional information, see documentation on Kaggle [9].

EDA

Table: Statistics by Outcome, Training Data

                           Transaction Amount
Class      Count   10th Percentile   Median   90th Percentile
fraud        394              0.76    12.31           324.344
genuine   227452              1.00    22.00           203.136
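
A sketch of how such a summary could be computed from the training data, here using dplyr (the original code is not shown):

library(dplyr)

cc_trn %>%
  group_by(Class) %>%
  summarise(Count             = n(),
            `10th Percentile` = quantile(Amount, 0.10),
            Median            = median(Amount),
            `90th Percentile` = quantile(Amount, 0.90))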

Additional Results

Table: Random Forest, Down-sampling

mtry   min.node.size   splitrule    ROC     Sens    Spec    ROCSD   SensSD   SpecSD
   2               1   gini         0.975   0.888   0.982   0.015    0.040    0.007
   2               1   extratrees   0.978   0.868   0.991   0.012    0.048    0.003
  15               1   gini         0.975   0.906   0.959   0.013    0.044    0.013
  15               1   extratrees   0.977   0.888   0.980   0.012    0.040    0.004
  29               1   gini         0.977   0.901   0.964   0.013    0.041    0.008
  29               1   extratrees   0.979   0.896   0.972   0.012    0.041    0.003

Table: Logistic Regression, No Subsampling

parameter   ROC     Sens    Spec   ROCSD   SensSD   SpecSD
none        0.973   0.609   1      0.011   0.08     0

Table: Logistic Regression, Up-sampling

parameter   ROC     Sens    Spec    ROCSD   SensSD   SpecSD
none        0.978   0.901   0.976   0.018   0.024    0.001

Table: Logistic Regression, Down-sampling

parameter   ROC     Sens    Spec    ROCSD   SensSD   SpecSD
none        0.961   0.911   0.938   0.034   0.032    0.041

Table: Logistic Regression, SMOTE

parameter   ROC     Sens    Spec    ROCSD   SensSD   SpecSD
none        0.971   0.899   0.975   0.014   0.022    0.003

Instructor Notes

Some meta-notes about this analysis:

Table: Class Proportions by Dataset

Dataset   Fraud    Genuine
Train     0.0017   0.9983
Test      0.0017   0.9983

  1. Wikipedia: Credit Card

  2. Federal Reserve: Payment Systems Study

  3. Wikipedia: Credit Card Fraud

  4. Kaggle: Credit Card Fraud

  5. Machine Learning Group of ULB

  6. Wikipedia: Principal Component Analysis

  7. Although, for the logistic regression, no tuning was actually done. The specification of a metric is largely to suppress warnings in caret.

  8. SMOTE: Synthetic Minority Over-sampling Technique

  9. Kaggle: Credit Card Fraud

  10. Wikipedia: Streaming Data