Monday, 19 February 2018

Credit Card Transactions Anomaly Detection

A Practical Guide on Modeling Customer Behavior

Most people think deep neural networks, or other modern machine learning techniques, are capable of doing almost anything, at least within specific single domains. Even if we accept this claim, the problem is that we usually do not have access to enough data (especially labeled data) to train a neural network and expect that magic. That is why most of the successful machine learning projects that catch our attention come from giant companies with enough money to pay for data and to hire hundreds of people to label it. Now the question is: what can we do if we do not have that much money or labeled data, but still want to, or have to, use machine learning techniques? This technical white paper is the result of real experience on different projects in medium to large sized companies. Companies that just give us access to millions of transactions per hour and ask us to find the anomalies!? Solving these kinds of problems is the real magic! ...

Download the pdf:  Credit Card Transactions Anomaly Detection

Saturday, 17 February 2018

Spark: Accessing the next row

Sometimes when you are processing log files with Spark, you need some data fields of the next (or previous) row at hand. For example, suppose you have a data file containing credit card usage with the following structure:

timestamp, creditCardId, merchantId, amount, ...

Now you want to calculate some velocity measure for every single user, such as the minimum time a user takes to move from one merchant to another. To solve this problem, we need access to each card's transactions paired with the next one, ordered by time.

OK, first you need to load the data file into a dataset; let us use Java syntax to describe the solution:

Dataset<Row> dataset = spark.read().csv("your_data_file");
dataset.createOrReplaceTempView("ds");

Then you need to add a row number to the dataset; we build the row number with the order of creditCardId, timestamp as below:

String sql = "select row_number() over (order by creditCardId, timestamp) as rowNum, ";
sql += "timestamp, creditCardId, merchantId, amount, ... from ds";
dataset = spark.sql(sql);
dataset.createOrReplaceTempView("ds");

Now the dataset contains unique row numbers ordered by creditCardId and timestamp. What we need to do next is join the dataset with itself on the row number, like this:

sql = "select ds1.creditCardId as creditCardId, ";
sql += "ds1.timestamp as timestamp1, ds2.timestamp as timestamp2, ";
sql += "ds1.merchantId as merchantId1, ds2.merchantId as merchantId2, ";
sql += "ds1.amount as amount1, ds2.amount as amount2 ";
sql += "from ds ds1 join ds ds2 on ds1.rowNum + 1 = ds2.rowNum ";
sql += "and ds1.creditCardId = ds2.creditCardId";
dataset = spark.sql(sql);

At this point, each row of the dataset also carries the next row's timestamp, merchantId, and amount, and you can do any calculation you want.
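To sanity-check the SQL above on a small sample, the same logic can be sketched in plain Java without Spark: sort the transactions by (creditCardId, timestamp), pair each row with the next one, and track the minimum gap where the merchant changes. The `Tx` record and its field values here are illustrative, not taken from any real data file.

```java
import java.util.*;

public class MerchantVelocity {
    // Illustrative record; field names mirror the log structure above.
    record Tx(long timestamp, String creditCardId, String merchantId, double amount) {}

    // Minimum time (in the timestamp's units) each card takes to switch merchants.
    static Map<String, Long> minMerchantChangeTime(List<Tx> txs) {
        List<Tx> sorted = new ArrayList<>(txs);
        // Same ordering the SQL uses: creditCardId, then timestamp.
        sorted.sort(Comparator.comparing(Tx::creditCardId)
                .thenComparingLong(Tx::timestamp));
        Map<String, Long> result = new HashMap<>();
        for (int i = 0; i + 1 < sorted.size(); i++) {
            Tx cur = sorted.get(i), next = sorted.get(i + 1);
            // Pair each row with the "next row", like the rowNum + 1 self-join.
            if (cur.creditCardId().equals(next.creditCardId())
                    && !cur.merchantId().equals(next.merchantId())) {
                result.merge(cur.creditCardId(),
                        next.timestamp() - cur.timestamp(), Math::min);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Tx> txs = List.of(
                new Tx(100, "card1", "m1", 10.0),
                new Tx(160, "card1", "m2", 25.0),  // merchant change after 60
                new Tx(190, "card1", "m3", 5.0),   // merchant change after 30
                new Tx(500, "card2", "m1", 7.5),
                new Tx(510, "card2", "m1", 7.5));  // same merchant, ignored
        System.out.println(minMerchantChangeTime(txs)); // {card1=30}
    }
}
```

Note that Spark SQL also provides the lead() window function, which returns the next row's value directly (partitioned by creditCardId, ordered by timestamp) and avoids the self-join entirely.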

Sunday, 11 February 2018

On Measuring Abnormality

The effect of time of observation in detecting anomalies or frauds

I have been working on anomaly and fraud detection systems for five years, and what I have found in these years is that:
  • We can't use a single machine learning or statistical model to detect every abnormal behavior in a system: not a lone deep neural network (which people are usually very fond of!), nor a single recurrent network or an autoencoder, etc. Instead, we have to use many classifiers, clustering processes, or even statistical models, and wire them together to get the result. Basically, this means we cannot do it in a shell-scripting environment unless we are facing a simple problem. We need to design and build a real application.
  • Regardless of what algorithm we use to detect the abnormalities, if it can't give the customer acceptable reasoning for its decisions, they are not happy.
Here we describe a simplified version of an algorithm that can help you address this problem ...

Download the pdf:  On Measuring Abnormality

Friday, 7 April 2017

Time Series Anomaly Detection White Paper

Here is a white paper on finding anomalies in time series, which we encounter in almost every system. I usually keep notes when I work on projects, and this paper is based on my experience and the notes I took while working on anomaly detection systems. It is a practical guide to detecting anomalies in time series using artificial intelligence concepts.


Thursday, 21 April 2016

Free Port Monitoring Service

One of my friends and I have built a service based on this blog's recent posts on visualizing a computer's port distribution and monitoring its usage. It is basically like the tools we already had here, but with a better UI, easier use, and some more information. I do not know why, but the service is named "Puffin".

Clear as crystal
One of the problems with installing a client that sends data to a server in the cloud is that you usually don't know what information it sends up. The same issue arises even when the software is not supposed to touch the Internet at all. What we have done in Puffin is use simple shell scripts to send data to Puffin's back-end service. So you can inspect the scripts with any text editor and verify exactly what information they send to the server.

Dashboard page of the service

Who is this service for?
Everybody who is curious about what their computer does while connected to a network or the Internet: mostly computer students or geeks, network administrators, technical support staff, or those who do not trust installed software and want to know what it is doing with the Internet or network connection. If you are a computer, software, network, ... geek, you don't need to read the rest of the post; test it here:

Wednesday, 13 April 2016

Bias and Variance in modeling

It is always important to remember why we do classification. We do it because we want to build a general model that covers our problem, not one that models only the given training datasets. Sometimes when you finish training the system and look at your model, you see that not all of your training data fits it; that does not necessarily mean your model is wrong. You can also find many other cases where the model fits the training data very well but not the test data.

Which one of these three models describes the pattern of
the given training set better? 

Besides, never forget that we model the data because we do not know exactly what happens in the system. We do it because we cannot scientifically and mathematically write a formula that describes the system we are observing. So we should not expect our model to describe the system completely. Why? Because we have modeled the system from only a small fraction of the data space.
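The gap between fitting the training set and generalizing can be shown with a toy sketch (entirely hypothetical data and model): a 1-nearest-neighbor "memorizer" reproduces every training label perfectly, including a noisy one, yet that very noisy point costs it accuracy on fresh points drawn from the same rule.

```java
public class MemorizerDemo {
    // 1-NN "memorizer": predicts the label of the closest training point.
    static int predict(double[] xs, int[] ys, double x) {
        int best = 0;
        for (int i = 1; i < xs.length; i++)
            if (Math.abs(xs[i] - x) < Math.abs(xs[best] - x)) best = i;
        return ys[best];
    }

    // Fraction of points whose predicted label matches the given label.
    static double accuracy(double[] trainX, int[] trainY, double[] xs, int[] ys) {
        int ok = 0;
        for (int i = 0; i < xs.length; i++)
            if (predict(trainX, trainY, xs[i]) == ys[i]) ok++;
        return (double) ok / xs.length;
    }

    public static void main(String[] args) {
        // True rule: y = 1 iff x > 0.5; the point at x = 0.3 is mislabeled.
        double[] trainX = {0.1, 0.2, 0.3, 0.6, 0.7, 0.9};
        int[]    trainY = {0,   0,   1,   1,   1,   1  };
        double[] testX  = {0.15, 0.25, 0.35, 0.65, 0.85};
        int[]    testY  = {0,    0,    0,    1,    1   }; // clean labels

        // Perfect on the training set, worse on the clean test set.
        System.out.println("train acc: " + accuracy(trainX, trainY, trainX, trainY));
        System.out.println("test acc:  " + accuracy(trainX, trainY, testX, testY));
    }
}
```

The memorizer scores 1.0 on its own training data by construction, which is exactly why training accuracy alone says nothing about the model being right.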

Friday, 8 April 2016

Entropy pattern of network traffic usage

I am working on a new project; it is about recognizing usage patterns in network traffic with as little data as possible. Usually, you need to do DPI, look for signatures, ... These all work well, but the problem is that they require access to low-level traffic (not everybody likes to hand over low-level traffic for analysis) and are very CPU-intensive, especially when you are dealing with large volumes of traffic. I think using these techniques is like recognizing your friend by his fingerprint or DNA! However, you, as a human being, can recognize your friend even after 10 or 20 years without these techniques!

Entropy distribution of Port and IP of the TCP connections

We never do a precise and accurate measurement to recognize a friend or a pattern or an object. We usually do some rough calculation or comparison but over many features. When you do DNA analysis to identify people, you can consider it as a precise and accurate measurement while looking to your friend's face, body, hair, and even his clothing style are just simple comparisons and summarizing the results to get the required information you need, is he your friend or not?