This article recently appeared in Issue 30 of Parallel Universe Magazine
One challenging―but also very important―task in data analytics is dealing with outliers. We generally define outliers as samples or events that are inconsistent with the rest of data population1. An outlier often contains useful information about abnormal characteristics of the systems and entities that impact the data generation process2.
Common applications for outlier detection algorithms include:
In this article, we’ll focus on one of the most common applications of outlier detection—credit card fraud. With some simple outlier detection approaches, it’s possible to find 75 to 85 percent of fraudulent transactions―with false alarms less than 1 percent of the time―on a real-world dataset.
There are two basic approaches to detecting outliers:
Breaking it down further, there are three basic supervised approaches:
Unsupervised approaches include:
We’ll compare these techniques and evaluate each methodology based on certain design criteria. Accuracy metrics will include both detection and false positive rates7.
It’s important to remember that fraud detection is about more than capturing fraudulent events. It also requires capturing these activities as quickly as possible—so an algorithm’s performance is an important metric8. In our test, we’ll use some popular machine learning libraries, including Intel® Data Analytics Acceleration Library (Intel® DAAL), that implement algorithms for outlier detection to compare results for both accuracy and performance.
We used a dataset9 from Kaggle*, a platform for predictive modeling and analytics competitions in which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models10. We chose the most popular real-world dataset on credit card fraud detection from Kaggle11. It contains 31 numerical input variables (Figure 1).
The time attribute represents seconds elapsed between each transaction and the first transaction in the dataset. Amount represents the transaction amount, as the name implies. The remaining 28 features are the principal components obtained with PCA. The original features are not provided due to confidentiality issues.
Class is the response variable. It equals 1 in case of fraud and 0 otherwise. The dataset is highly unbalanced because fraud transactions are far less common than normal ones. Frauds represent only 0.172 percent of all transactions (492 frauds out of 284,807 transactions).
Let’s take a look at conditional distributions of some of principal components (Figures 2 and 3). Principal components are not correlated, so dropping those with similar conditional distribution for fraudulent and normal examples could help us build more accurate models. We’ll use a point-biserial correlation coefficient12 to measure the relationship between each principal component and target variable (Table 1). We’ll use this information later for feature engineering.
Credit Card Fraud Detection Approaches
Now we’ll apply different outlier detection techniques to the data, evaluate their accuracy and performance, and compare them. For credit card fraud detection, we’ll use both supervised and unsupervised methods.
Statistical approaches were the earliest algorithms used for outlier detection13. We’ll start from outlier-detection-specific algorithms―multivariate outlier detection and BACON* outlier detection. A great advantage of these methods is that they are so computationally efficient that it’s easy to apply them to large datasets14. We achieved the best results with six principal components with a high point-biserial correlation coefficient (V11, V12, V14, V16, V17, V18) as features. We used implementations from Intel DAAL and the R* package robustX*. Multivariate outlier detection gave better results for accuracy than BACON outlier detection (Table 2).
Clustering and outlier detection have a complementary relationship. In clustering, the goal is to partition the points into dense subsets. In outlier detection, the goal is to identify points that don’t seem to fit naturally into these dense subsets2. Some clustering algorithms, like DBSCAN*, provide sub-products, which can be considered outliers.
There are many clustering-based approaches to outlier detection2,3,4,15. We’ll apply some of them in our tests. Clustering is inherent in credit card fraud because perpetrators usually produce a group of fraudulent transactions16.
There are two types of outliers, local and global. Different clustering-based algorithms can detect both types. The attribute values of a global outlier are outlying with respect to the values taken by the majority of the data points. The attribute values of a local outlier are extreme when compared to those of its neighbors17.
To detect global outliers, we can perform clustering on a dataset and define some clusters as outliers18 (Table 3). We’ll perform clustering with k-means and EM, treating small clusters as outliers19. Values of k and n_components were obtained by grid search from range [2, 3, …, 50]. We used the same initial centroids for k-means for Intel DAAL and Scikit-learn*, obtaining them with k-means++ initialization using the implementation in Intel DAAL. We achieved the best results using 17 principal components with high point-biserial correlation coefficients as features. With these methods, we achieved a relatively low false positive rate.
We’ll also apply clustering techniques that can detect local outliers20. A naïve approach is to cluster the transactions using the k-means algorithm21 and treat points far from cluster centers as outliers (Table 4).
We obtained a good rate of detections. However, the number of false alarms is high. The reason is that the k-means algorithm is extremely sensitive to outliers, which have a large impact on cluster configuration. As a result, data points that should be declared as outliers are masked by the clustering. Thus, we need to use a robust version of k-means that can gracefully handle the presence of outliers―k,l-means15 (Table 5). Like k-means, this algorithm requires initial centroids. We’ll compute them with different initialization algorithms available in Intel DAAL―random, k-means++22, and k-means||23―and compare the results.
We can see that this approach provides better accuracy than naïve k-means. Results are different for different initializations. The random method proved to be the best for this problem.
We’ll also use a density-based clustering algorithm that can detect noise in data (Table 6). A point is detected as noise if it differs significantly from its neighbors. A natural assumption is to treat noise data as local outliers. We used the same six features as for statistical methods. Values of parameters were obtained by grid search. We used the R package dbscan*. We reached a good level of detected frauds with a moderate level of false alarms. However, the performance is significantly worse compared to previous approaches.
Supervised learning methods require training data that contain examples of each class—in our case, normal and fraudulent transactions2. We’ll divide our dataset into training and evaluation sets, with 80 percent of the data used for training and 20 percent used for model evaluation.
We start with the neural network (Table 7) proposed in the Kaggle kernel24. A simple multi-layer perceptron with three hidden layers was used by the kernel author. Multi-layered perceptrons (MLPs) are widely used neural networks for classification as well as outlier detection25.
Another supervised learning algorithm used in outlier detection is logistic regression16 (Tables 8 and 9). We’ll use 17 attributes with the highest point-biserial correlation coefficients (Table 1) and the Amount feature, which was pre-processed with Z-score normalization. From the histograms24, we can see that fraud and normal observation are linearly separable due to the use of logistic regression. Our dataset is unbalanced, so use of random undersampling could help to build a model that detects frauds more accurately. This omits many non-fraudulent examples from training, so we’ll use a voting classifier26, with each estimator trained on a randomly under-sampled training dataset. (The regularization parameter, C, was set to one after experimentation with a wide range of values.) With this technique, we improved the detection rate by 3 percent with almost the same level of false alarms.
Let’s see how support vector machines (SVM), another supervised learning algorithm which can also be used for outlier detection16, works on the same data. We run SVM with 100,000 iterations, a linear kernel, and C=1. (These parameters were obtained with a grid search.) The same features were used as for logistic regression. Use of SVM ensembles improves the detection rate significantly while giving a low false alarm rate (Table 10).
There are a variety of unsupervised approaches for fraud detection. We applied a few of them in this article. Outlier detection-specific algorithms, k,l-means and DBSCAN, achieved the best detection rates, reaching 85 percent of frauds detected, while use of k-means to detect clusters of fraudulent transactions gives the lowest number of false alarms. The sooner fraudulent transactions are detected, the more losses can be avoided by stopping transactions made with fraudulent credit cards16―making the performance of the algorithm another key metric. The final choice of fraud detection approach will depend on many factors, including the cost of the fraudulent behavior detected and the cost associated with stopping it7. Unsupervised methods can be useful in situations where there is no prior knowledge of classes of observations.
When we have prior knowledge of classes of observations, we can use supervised learning algorithms. We used techniques for unbalanced classification, like undersampling, to help increase the number of frauds detected. Ensembles of SVM classifiers give the highest detection rates, while random forests have the lowest false alarm rates. Although supervised methods like SVM gave us significantly better accuracy compared to unsupervised ones, use of them can pose problems:
As we’ve shown, choosing the right machine learning algorithm for a given application is a non-trivial problem that requires a lot of thought.