During the last Data Science community meeting in Luxembourg Phd Candidate Alejandro Correa Bahnsen made a presentation on Credit Card Fraud Detection(CCFD). In short - CCFD is just another machine learning problem which is similar to Network Security and Intrusion Detection(NSID) problem, but it has its own obstacles.
From business perspectives - card fraud detection system directly impacts profitability of all credit card operators, therefore are very desirable. The cost of card fraud in the UK alone in 2006 was estimated ~ 500 millions pounds. Assuming, that CCFD system can identify 30% (Churn models at TelComs are able to save that much) of all fraud would lead to 150 millions pound savings a year. Hence, we have a desirable product with price range from 1.5 to 15 millions pounds in the UK. Here is the catch - in any given country there are a few credit card operators, for ex. only CETREL operates in Luxembourg. So, if you want to sell a solution you play all or nothing game.
Additionally, if you wish to build a CCFD system or at least a prototype you most likely missing the data and you won’t get it. Chicken and egg problem.
How does the data set might look? Think of millions rows where less than 0.002% of data are the fraud operations. If you train your model with unadjusted data set, it will predict all future events as normal operations. In order to avoid such behavior you need to throw away most of the data with normal operations and balance data where distribution would be 1% of fraud vs 99% normal or 5% vs 95%. You can play with freely available network intrusion data to get idea how imbalanced data would look like.
Another thing to keep in mind is the size of data set. Conventional wisdom - the more data you have, the better model you can build, but it is not true if you run a real time system, where latency is big deal. In such case you have to think about something similar to map-reduce framework, where you keep only the averages of the variables per client.
CCFD systems work as binary classificators, where the response either fraud or normal operation, meaning that it doesn’t take into account how much fraud cost. Alejandro tries to incorporate loss-profit function, where each operation has its own cost. If you think about his approach - sounds as regression problem to me.
And the last thing - I suppose, it’s worth to try to run unsupervised learning system in parallel. Unsupervised CCFD would issue a lot false alerts at the beginning, however it would considerably improve over time with good feedback from supervised CCFD.