Data Analysis for detecting Credit Card Fraud

Oct 1, 2022

This is the peer-graded assignment I completed for the Coursera course “Introduction to Data Analytics”. It reflects what I learned from several different resources.

Background:

Financial fraud is increasing significantly with the development of modern technology and the global superhighways of communication, resulting in the loss of billions of dollars worldwide each year. Companies and financial institutions lose huge amounts to fraudsters, who continuously look for new methods and tactics to commit illegal actions. According to a 2018 report by IntSights Cyber Intelligence, there is a yearly increase of 135% in bank data for sale on dark web black markets. A Nilson report from December 2021 predicts that by 2030, when the total volume on all payment cards is expected to reach $79.140 trillion, fraud losses will be $49.32 billion. It also projects that over the next 10 years, card industry losses to fraud will collectively amount to $408.50 billion.

What is fraud analytics?

Fraud analytics combines the use of big data analysis techniques with human interaction to help detect potentially improper transactions, such as those based on fraud and/or bribery, either before the transactions are completed or after they occur.

Its process involves gathering and storing relevant data and mining it for patterns, discrepancies, and anomalies. These findings are then translated into insights that help financial organizations predict future fraudulent behavior and detect and mitigate fraudulent activity quickly, in real time. Fraud analytics has a wide range of benefits: it helps uncover new patterns, trends, fraudulent schemes, and scenarios that traditional approaches miss; it adds an extra layer of security to existing efforts; and it helps measure and improve the performance of financial organizations and systems.

Fraud Analytics is key to financial fraud risk management.

There is a wealth of data available to financial organizations that can be used to predict and detect financial fraud and adapt to new threats. Collecting usernames and passwords at login is no longer sufficient to guard against fraudulent activity. When someone accesses or attempts to access an account, other data can be used to determine whether this is a legitimate customer and whether the requested transaction is legitimate. Questions this data can answer include:

What device are they using?
Has this device been previously registered with the bank?
Can they verify their identity with a fingerprint?
Does the transaction being requested fit their historical patterns?

Answering all these questions requires accessing and analyzing big data.

Here’s the case study provided by Coursera.

Case Study

Using Data Analysis for Detecting Credit Card Fraud:

Companies today are employing analytical techniques for the early detection of credit card fraud, a key factor in mitigating fraud damage. The most common type of credit card fraud does not involve the physical stealing of the card, but that of credit card credentials, which are then used for online purchases.

Imagine that you have been hired as a Data Analyst to work in the Credit Card Division of a bank. And your first assignment is to join your team in using data analysis for the early detection and mitigation of credit card fraud.

In order to prescribe a way forward, that is, suggest what should be done in order for fraud to get detected early on, you need to understand what a fraudulent transaction looks like. And for that, you need to start by looking at historical data.

Here is a sample data set that captures the credit card transaction details for a few users.

Descriptive techniques of analysis, that is, techniques that help you gain an understanding of what happened, include the identification of patterns and anomalies in data. Anomalies signify a variation in a pattern that seems uncharacteristic or out of the ordinary. Anomalies may occur for perfectly valid and genuine reasons, but they do warrant an evaluation because they can be a sign of fraudulent activity.

Past studies have suggested that some of the common events that you may need to watch out for include:

A change in frequency of orders placed, for example, a customer who typically places a couple of orders a month, suddenly makes numerous transactions within a short span of time, sometimes within minutes of the previous order.
Orders that are significantly higher than a user’s average transaction.
Bulk orders of the same item with slight variations such as color or size — especially if this is atypical of the user’s transaction history.
A sudden change in delivery preference, for example, a change from home or office delivery address to in-store, warehouse, or PO Box delivery.
A mismatched IP Address, or an IP Address that is not from the general location or area of the billing address.
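
The first two events in the list above (a sudden burst of orders and an order far above the user's average) can be sketched in a few lines of Python. This is only an illustration: the rows are hypothetical and are not the case-study data, and the thresholds `amount_factor` and `min_gap` are assumptions chosen for the example, not recommended values.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical rows (user_id, timestamp, amount); not the case-study data.
transactions = [
    ("user_a", datetime(2022, 1, 3, 10, 0), 40.0),
    ("user_a", datetime(2022, 1, 17, 9, 30), 55.0),
    ("user_a", datetime(2022, 2, 2, 14, 0), 45.0),
    ("user_a", datetime(2022, 2, 2, 14, 4), 900.0),
]

def flag_suspicious(rows, amount_factor=3.0, min_gap=timedelta(minutes=10)):
    """Return (row, reasons) pairs for rows that break a user's usual pattern.

    amount_factor and min_gap are illustrative thresholds only.
    """
    flagged = []
    history = {}  # user_id -> earlier (timestamp, amount) pairs
    for user, ts, amount in sorted(rows, key=lambda r: (r[0], r[1])):
        past = history.setdefault(user, [])
        reasons = []
        if past:
            # Compare against the average of the user's earlier amounts.
            if amount > amount_factor * mean(a for _, a in past):
                reasons.append("amount well above user's average")
            # Compare against the time of the user's previous order.
            if ts - past[-1][0] < min_gap:
                reasons.append("placed shortly after previous order")
        if reasons:
            flagged.append(((user, ts, amount), reasons))
        past.append((ts, amount))
    return flagged

flagged = flag_suspicious(transactions)
for row, reasons in flagged:
    print(row[0], row[2], reasons)
```

A flagged row is only a candidate for review. As noted above, anomalies can have perfectly valid reasons.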

Before you can analyze the data for patterns and anomalies, you need to:

Identify and gather all data points that can be of relevance to your use case. For example, the card holder’s details, transaction details, delivery details, location, and network are some of the data points that could be explored.
Clean the data. You need to identify and fix issues in the data that can lead to false or incomplete findings, such as missing data values and incorrect data. You may also need to standardize data formats in some cases, for example, the date fields.
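
The cleaning step can be sketched as follows. The rows, field names, and the list of accepted date formats are all assumptions made for this example. In particular, reading "02/02/2022" as month/day/year is a guess that would have to be confirmed against the real data source.

```python
from datetime import datetime

# Hypothetical raw rows; field names and formats are assumptions.
raw_rows = [
    {"user_id": "user_a", "date": "2022-02-02", "amount": "45.00", "ip": "203.0.113.7"},
    {"user_id": "user_a", "date": "02/02/2022", "amount": "", "ip": "203.0.113.7"},
    {"user_id": "user_b", "date": "Feb 3, 2022", "amount": "120.50", "ip": ""},
]

# Formats tried in order; extend this tuple for other formats in the data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y")

def parse_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def clean(rows):
    """Standardize dates and amounts, and report which fields are empty."""
    cleaned, issues = [], []
    for i, row in enumerate(rows):
        missing = [key for key, value in row.items() if not value.strip()]
        if missing:
            issues.append((i, missing))
        cleaned.append({
            **row,
            "date": parse_date(row["date"]),
            "amount": float(row["amount"]) if row["amount"].strip() else None,
            "ip": row["ip"].strip() or None,
        })
    return cleaned, issues

cleaned_rows, issues = clean(raw_rows)
print(issues)
```

Missing values are reported rather than filled in, because a guessed amount or IP address could hide exactly the anomaly the analysis is looking for.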

Finally, when you arrive at the findings, you will create appropriate visualizations that communicate your findings to your audience. The graph below samples one such visualization that you would use to capture a trend hidden in the sample data set shared earlier on in the case study.

In the next section you will be asked to answer the following 5 (five) questions based on this case study:

1. List at least 5 (five) data points that are required for the analysis and detection of credit card fraud. (3 marks)

MY SUBMISSION:

IP Addresses: The IP address identifies the host’s network location, making it possible to route data to it. IP addresses can hold users accountable for their actions on the internet because an ISP knows which IP addresses were assigned to its customers at any given time, which is why fraudsters treat concealing their real IP address as the first and most important step to avoid getting caught. In real-time e-commerce transactions, the IP address indicates the approximate geographic location of the computer from which the order is placed, which can be used to estimate the distance between the buyer’s billing address and the place where the order was entered. Checking that the IP address country and the billing address country match helps validate a transaction. Orders shipped to an international address also require closer inspection, and more attention has to be paid if the card or the shipping address is in an area prone to credit card fraud. A user’s shipping address is usually stable, so if an order lists an address that differs from the usual one, someone else may be using the account.
Transaction ID, Transaction Date, & Transaction Time: Transaction time helps establish the usual interval between a user’s transactions and how often transactions are carried out in a day or month. Using the transaction identification number, date, and time, along with information such as the time since the last transaction, the previous transaction amount, and the previous transaction country, we can group transactions made during the last given number of hours, by card or account number, and then by transaction type, merchant group, etc., to understand the customer’s spending behavior.
Transaction Values: Banks usually treat the transaction amount as an alert trigger for critical transactions, and most banks use it to weigh the importance of a transaction. Source, destination, and amount combine to raise an alert according to thresholds pre-defined by the bank’s policy. This helps prioritize transactions whose value is well above the user’s average transaction value.
User ID & Account Number: This is a unique identifier for the owner, composed of numbers, letters, or other characters, and is assigned to the account owner for ease of reference in a financial institution’s accounting records. The account number is used to route a request for financial authorization to the correct owner before granting access, thereby facilitating various other types of commercial transactions. Fraudsters can use the user ID and account number along with other sensitive information such as the expiration date, security code, etc.
Units Purchased: The number of units purchased, along with other information, helps identify a user’s spending habits, which then serve as a baseline for spotting unusual purchases.
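
The IP country versus billing country check described above can be sketched like this. Real systems use a geolocation database or service for the lookup. Here `GEO_TABLE` is a hypothetical stand-in built on reserved documentation address ranges, and the country codes attached to those ranges are made up for the example.

```python
import ipaddress

# Hypothetical lookup table; a real check would query a geolocation
# database. The networks are reserved documentation ranges and the
# countries assigned to them are invented for illustration.
GEO_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): "US",
    ipaddress.ip_network("198.51.100.0/24"): "FR",
}

def ip_country(ip_text):
    """Return the country for an IP address, or None if missing or unknown."""
    if not ip_text:
        return None
    addr = ipaddress.ip_address(ip_text)
    for network, country in GEO_TABLE.items():
        if addr in network:
            return country
    return None

def check_order(order):
    """Compare the IP country with the billing country of one order."""
    country = ip_country(order.get("ip"))
    if country is None:
        return "review: IP missing or unknown"
    if country != order["billing_country"]:
        return "review: IP country differs from billing country"
    return "ok"

# Hypothetical orders; not the case-study data.
orders = [
    {"order_id": 1, "ip": "192.0.2.15", "billing_country": "US"},
    {"order_id": 2, "ip": "198.51.100.40", "billing_country": "US"},
    {"order_id": 3, "ip": "", "billing_country": "US"},
]
results = [check_order(order) for order in orders]
print(results)
```

A mismatch is a reason to look closer, not proof of fraud: travel, VPNs, and mobile networks can all place a legitimate customer's IP address in another country.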

2. Identify 3 (three) errors/issues that could impact the accuracy of your findings, based on a data table provided. (3 marks)

MY SUBMISSION:

Empty values in IP address: The IP address is crucial for finding out where a transaction takes place. For example, in row 5, column 1, the user ID johnp has made two purchases from the electronics category 4 minutes apart, but without an IP address it is difficult to judge whether there has been any fraud.
Empty value in transaction value: In row 3, column 8, there is no value present, which affects the accuracy of fraud detection, because the value in row 4, column 8 is much greater than the values in rows 1 and 2 of column 8. Rows 1 and 2 are also similar to each other, sharing the same IP address and shipping address, but without knowing the transaction amount it is difficult to tell whether there has been any fraudulent activity.
Non-uniform date format: In row 5 of the Transaction Date column, the date format differs from that of row 4, even though both rows share the same date and shipping address and have nearly the same time. This makes the data difficult to read and analyze; it needs cleaning or preprocessing to bring it into a consistent format. In addition, the transactions of user ID johnp are not ordered by date, whereas those of the other user ID are.

3. Identify 2 (two) anomalies or unexpected behaviors, that would lead you to believe the transaction may be suspect, based on a data table provided. (2 marks)

MY SUBMISSION:

Here are two anomalies that make me suspect credit card fraud:

In the 13th row, for user ID ellend: her previous two transactions have the same IP address and the same shipping address, but in her last purchase the shipping address changes from P.O.Box 1322 to P.O.Box 5401 and the IP address also differs, and the transaction value rises above her average.
In the 5th row, for user ID johnp: this transaction and the previous one (the 4th row) have the same shipping address, the same product category, the same date (in different date formats), and very close transaction times. The transaction value also increases from earlier values.
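
The first anomaly, a shipping address and IP address that both differ from the user's history, can be checked with a small sketch like the one below. The two P.O. Box values are the ones quoted above. The IP addresses are placeholders from reserved documentation ranges, since the actual values in the table are not reproduced here, and the field names are assumptions.

```python
# Shipping addresses are the ones quoted in the answer above; the IP
# addresses are placeholders, not the values from the case-study table.
rows = [
    {"user_id": "ellend", "ship_to": "P.O.Box 1322", "ip": "192.0.2.10"},
    {"user_id": "ellend", "ship_to": "P.O.Box 1322", "ip": "192.0.2.10"},
    {"user_id": "ellend", "ship_to": "P.O.Box 5401", "ip": "198.51.100.23"},
]

def changed_fields(history, latest, fields=("ship_to", "ip")):
    """List the fields whose latest value never appears in the history."""
    changed = []
    for field in fields:
        seen = {row[field] for row in history}
        if seen and latest[field] not in seen:
            changed.append(field)
    return changed

changes = changed_fields(rows[:-1], rows[-1])
print(changes)
```

Several fields changing at once, together with a higher transaction value, is a stronger signal than any single change on its own.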

4. Briefly explain your key take-away from the provided data visualization chart. (1 mark)

The data visualization shows that the accounts of two users, johnp and ellend, appear to have experienced fraud, as there is a sudden spike in the transaction amount for both of them. For johnp the spike coincides with empty IP address values and a different date format, and for ellend with a different IP address and a different shipping address.

5. Identify the type of analysis that you are performing when you are analyzing historical credit card data to understand what a fraudulent transaction looks like. [Hint: The four types of Analytics include: Descriptive, Diagnostic, Predictive, and Prescriptive] (1 mark)

MY SUBMISSION:

I used two of the four types of analytics: descriptive and diagnostic. I used descriptive analysis to explain what happened, i.e. two user accounts experienced fraud. I used diagnostic analysis to find the cause of the fraud by identifying anomalies in the data provided, i.e. missing IP addresses, a change in the shipping address, and non-uniform date formats.

Final Word

The challenges to recognizing fraudulent credit card transactions are:

The sheer volume of data generated and collected by different financial institutions makes it difficult and time-consuming to process and analyze.
Imbalanced data: the majority of transactions are not fraudulent, which makes it difficult to learn the patterns and detect the actual fraudulent ones.
Many institutions keep the data private because it contains Personally Identifiable Information.
Non-uniform, misclassified, missing, and incomplete data are some of the many reasons why some fraudulent transactions are never caught and reported.
New procedures and tactics used by fraudsters make it difficult to rely on one fixed set of rules to identify fraudulent transactions.
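
The imbalance problem in the list above is easy to see with a little arithmetic. The sketch below uses hypothetical counts (20 fraudulent transactions out of 10,000) and evaluates a "detector" that simply labels every transaction as legitimate.

```python
def always_legit_baseline(total, fraud):
    """Accuracy and recall of a detector that never flags anything."""
    true_negatives = total - fraud  # every legitimate transaction is "correct"
    true_positives = 0              # no fraudulent transaction is ever caught
    accuracy = (true_negatives + true_positives) / total
    recall = true_positives / fraud if fraud else 0.0
    return accuracy, recall

# Hypothetical counts chosen for illustration.
accuracy, recall = always_legit_baseline(total=10_000, fraud=20)
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
# prints: accuracy=0.998, recall=0.000
```

An accuracy of 99.8% looks excellent, yet not a single fraud is detected. This is why accuracy alone is a misleading measure on imbalanced transaction data.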

Thus data analysis plays a key role in financial fraud detection.

Written by Aliya Fatima