Monday, 19 February 2018

Credit Card Transactions Anomaly Detection

A Practical Guide on Modeling Customer Behavior

Most people think deep neural networks or any other modern machine learning techniques are capable of doing almost anything at least in specific single domains. Even if we consider this is a correct claim, the problem is we usually do not have access to enough data (specially labeled) to train a neural network and expect that magic. That is why most of the successful machine learning projects which catch our attention come from giant companies who have enough money to pay for data and hire hundreds of people to label the data. Now the question is, what can we do if we do not have that much money or labeled data, but still want to or have to use machine learning techniques? This technical white paper is the result of real experience on different projects in medium to large sized companies. Companies that just give us access to millions of transactions per hour and ask to find the anomalies!? Solving these kinds of problems is the real magic! ...

Download the pdf:  Credit Card Transactions Anomaly Detection

Saturday, 17 February 2018

Spark: Accessing the next row

Sometimes when you are processing log files with the Spark, you need to have some data fields of the next (or previous) row in hand. For example, you have data file containing some credit card usage with the following structure:

timestamp, creditCardId, merchantId, amount, ...

Now you want to calculate some velocity measure for every single user, like the minimum time each user usually uses to change the merchant. To solve this problem, we need to have access to each card's transaction and the next one, ordered by time.

OK, first you need to load the data file into a dataset, let us use java syntax to describe the solution:

Dataset dataset = sparkSession.read().parquet("your_data_file"); dataset.createOrReplaceTempView("ds");

Then you need to add the row number to the dataset; we have to build the row number with the order of creditCardId, timestamp as below:

String sql = "select row_number() over (order by creditCardId, timestamp) as rowNum,";
sql += "timestamp, creditCardId, merchantId, amount, ... from ds";
dataset = spark.sql(sql).toDF();

Now the dataset contains unique row numbers ordered by creditCardId and timestamp. What we need to do is joining the dataset with itself using row number like this:

dataset.createOrReplaceTempView("ds1");
dataset.createOrReplaceTempView("ds2");
sql = "select ds1.creditCardId as creditCardId, ";
sql += "ds1.timestamp as timestamp1, ds2.timestamp as timestamp2, ";
sql += "ds1.merchantId as merchantId1, ds2.merchantId as merchantId2";
sql += "ds1.amount as amount1, ds2.amount as amount2, ";
...
sql += "from ds1 join ds2 on ds1.rowNum+1 = ds2.rowNum, ds1.creditCardId = ds2.creditCardId"
dataset = spark.sql(sql).toDF();

At this point each row of the dataset contains the next row's timestamp, merchandId and amount, and you can do any calculation you want.

Sunday, 11 February 2018

On Measuring Abnormality

The effect of time of observation in detecting anomalies or frauds

I have been working on many anomaly and fraud detection systems for five years and what I have found in these years shows that:
  • We can't use a single machine learning or statistical model to detect any abnormal behavior in a system, like an only deep neural network - which people are usually very fond of it! - or a single recurrent network or an autoencoder, etc. Instead, we have to use many classifiers or clustering processes or even statistical models and wire them together to get the result. So basically, it means we cannot do it on a shell scripting environment unless we are facing a simple problem. We need to design and build a real application.
  • Regardless of what algorithm we use to detect the abnormalities, if the algorithm can't give acceptable reasoning to the customer, they are not happy.
Here we try to describe a simplified version of an algorithm that can help you to address this problem ...

Download the pdf:  On Measuring Abnormality