Unsupervised machine learning is a type of machine learning where algorithms analyse and interpret data without explicit instructions. Since unsupervised models learn from unlabelled input data, they are free to uncover hidden patterns or relationships without human instruction.
Unsupervised machine learning differs from supervised learning, where algorithms are trained to predict an outcome based on labelled input data. In other words, each input has a corresponding output that supervised models must learn to predict.
By some estimates, 90% of the world’s data has been generated in just the last two years, so unsupervised models have a crucial role to play in making sense of vast and complex datasets from fields such as IoT and social media.
In finance, unsupervised machine learning is utilised in risk management, fraud detection, customer segmentation and document processing, among many other applications.
Generally speaking, algorithms employ one of three techniques to identify patterns and relationships in the data.
One of the most widely used techniques, clustering involves grouping a set of objects such that the objects in one group (or cluster) more closely resemble each other than they do the objects in another group.
Four types of unsupervised machine learning algorithms are used in clustering: exclusive, probabilistic, hierarchical and overlapping.
Exclusive algorithms group data such that a single data point can only be assigned to one cluster. This means that clusters are mutually exclusive and there is no overlap between them.
Businesses may use exclusive algorithms to segment their customers based on purchase behaviour or location.
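As a simple illustration, here is a minimal sketch of exclusive clustering with k-means in scikit-learn. The customer features, cluster count and numbers are all made up for the example:

```python
# A minimal sketch of exclusive clustering with k-means (scikit-learn).
# The feature names and cluster count are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features per customer: [annual_spend, orders_per_year]
budget = rng.normal([500, 5], [150, 2], size=(100, 2))
premium = rng.normal([5000, 40], [800, 8], size=(100, 2))
X = StandardScaler().fit_transform(np.vstack([budget, premium]))

# k-means assigns every customer to exactly one cluster (mutually exclusive).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels[:5], labels[-5:])
```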
As the name implies, probabilistic algorithms assign data points to clusters based on probability distributions. They allow for the possibility that one data point belongs to multiple clusters.
Probabilistic algorithms can be used to assess the creditworthiness of loan applicants, with each applicant segmented according to their credit history, financial behaviour and demographic information.
Rather than assigning the applicant to a single risk category, the unsupervised algorithm can provide a distribution of probabilities across multiple risk levels.
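A minimal sketch of this idea using a Gaussian mixture model in scikit-learn is shown below. The applicant features and the three risk levels are illustrative assumptions:

```python
# A minimal sketch of probabilistic clustering with a Gaussian mixture model.
# Feature names and the three "risk levels" are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Hypothetical applicant features: [credit_score, debt_to_income_ratio]
X = np.vstack([
    rng.normal([750, 0.15], [30, 0.05], size=(100, 2)),  # low risk
    rng.normal([650, 0.35], [40, 0.08], size=(100, 2)),  # medium risk
    rng.normal([550, 0.55], [50, 0.10], size=(100, 2)),  # high risk
])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Instead of a single risk category, each applicant receives a probability
# distribution across all three clusters (which are unordered until
# interpreted as risk levels).
new_applicant = [[640, 0.30]]
print(gmm.predict_proba(new_applicant).round(3))
```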
In hierarchical clustering, the unsupervised learning algorithm creates a tree-like structure that represents nested groupings of data points at various levels of granularity.
This structure is called a dendrogram, with each node representing a cluster of data points and each branch showing how clusters are connected or split.
The most common hierarchical approach is agglomerative (bottom-up) clustering. Every data point starts as its own cluster; the closest pairs of clusters are then iteratively merged (based on similarity) until a single cluster, or a predetermined number of clusters, remains.
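Here is a minimal sketch of agglomerative clustering with SciPy on synthetic data; the Ward linkage method used below is one common choice, not the only one:

```python
# A minimal sketch of agglomerative (bottom-up) clustering using SciPy.
# The data is synthetic; linkage method "ward" is one common choice.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(5, 1, size=(20, 2))])

# Each point starts as its own cluster; the closest pairs are merged
# iteratively. Z encodes the full merge tree (the dendrogram).
Z = linkage(X, method="ward")

# Cut the tree so that a predetermined number of clusters (here 2) remains.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree, if desired.
```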
Overlapping algorithms place data points into multiple clusters simultaneously and tend to be best suited to scenarios where a single data point can naturally belong to more than one group.
For example, a customer could be segmented into both a “luxury buyer” and an “eco-conscious consumer” cluster.
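One simple way to obtain overlapping assignments, sketched below, is to threshold the soft memberships of a mixture model; fuzzy c-means is another common choice. The segments and the 0.3 threshold are illustrative assumptions:

```python
# A minimal sketch of overlapping assignment: threshold the soft cluster
# memberships of a Gaussian mixture so a point can belong to several
# clusters at once. Clusters and threshold are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1.5, size=(100, 2)),
               rng.normal(3, 1.5, size=(100, 2))])  # deliberately overlapping

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
proba = gmm.predict_proba(X)

# A customer is placed in every cluster whose membership exceeds 0.3,
# e.g. both a "luxury buyer" and an "eco-conscious consumer" segment.
memberships = proba > 0.3
print(memberships[:5])
```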
Unsupervised learning models can also employ rules to unearth associations, connections or correlations between data points.
The association rules technique is often used in market basket analysis to help companies understand which items are frequently purchased together.
The results of the analysis then form the basis of common eCommerce recommendation engines that start with the message “People who purchased this item also purchased”.
Banks can also use association rule-based algorithms to detect fraud in transaction data. If a specific combination of transactions (such as one small and one large) tends to precede fraud, the bank can flag the incident for further investigation.
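Below is a minimal market basket sketch using the third-party mlxtend library (any apriori implementation would do); the baskets and the support and confidence thresholds are made up for illustration:

```python
# A minimal sketch of market basket analysis with the mlxtend library.
# The transactions are made up; the thresholds are illustrative.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "butter"},
    {"bread", "milk", "jam"},
]
items = sorted(set().union(*transactions))
# One-hot encode: one row per basket, one boolean column per item.
basket = pd.DataFrame([{i: (i in t) for i in items} for t in transactions])

frequent = apriori(basket, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```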
As a general rule, more data in machine learning yields better results. Sometimes, however, unsupervised models cannot draw useful conclusions from datasets where the number of features or dimensions is too high.
Algorithms that employ the dimensionality reduction technique extract the most important features from a dataset. By extension, this reduces the number of irrelevant or random features.
One algorithm used for this purpose is principal component analysis (PCA). PCA is often used to prepare or analyse data for modelling and seeks to extract the most important features without compromising the essential properties of the dataset itself.
Suppose a bank wants to improve its credit scoring model for mortgage applicants and better predict whether the applicant will default.
However, with hundreds of features in the dataset such as income and credit history, the bank observes that the model is too complex and difficult to manage.
It can then use PCA to identify redundant or irrelevant features and reduce the dataset to 20 principal components to build its credit scoring system.
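A minimal sketch of that workflow with scikit-learn’s PCA might look like the following; the 200 synthetic features stand in for the bank’s real dataset:

```python
# A minimal sketch of dimensionality reduction with PCA (scikit-learn).
# 200 synthetic applicant features stand in for the bank's real dataset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 200))  # 1,000 applicants, 200 raw features

# Standardise first: PCA is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep the 20 principal components that capture the most variance.
pca = PCA(n_components=20)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                       # (1000, 20)
print(pca.explained_variance_ratio_.sum())   # share of variance retained
```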
We already touched on some applications of unsupervised learning in the previous section, so let’s expand on these and detail some additional applications across different contexts.
Unsupervised learning is widely used in customer segmentation where customers are grouped based on their purchase behaviour, preferences, income and various other demographic metrics.
By identifying distinct customer segments, companies can craft tailored marketing strategies, make personalised offers and optimise product recommendations.
Customer segmentation can also be used to predict customer churn. In 2010, American Express upgraded its traditional database technology to work in conjunction with machine learning algorithms.
With an enormous amount of historical transaction data at its disposal, American Express was able to create an unsupervised model to predict customer churn.
The model assesses 115 customer behaviours that forecast potential churn, and in 2018, the company claimed it could predict 24% of Australian customers who would close their accounts in the next four months.
Unsupervised machine learning is particularly useful for anomaly detection since models identify outliers in the data without being trained on what constitutes an anomaly.
Though not an exhaustive list, anomalies in the context of finance include unusual transaction amounts, unexpected spikes in account activity and irregular trading patterns.
Experian sells an enterprise credit decisioning platform built on machine learning. The platform enables customers to make faster and more accurate evaluations of the creditworthiness of an applicant.
Experian’s credit decisioning engine processes data about the borrower such as employment and identity information to look for anomalies that could potentially predict the risk of the customer defaulting.
Related to anomaly detection is fraud detection – one of the most common and indeed important applications of unsupervised learning in finance.
These models analyse vast amounts of historical transaction data to establish a baseline of normal behaviour.
The model can then detect anomalies or outliers that deviate substantially from normal behaviour and can even detect new types of fraud – even if the patterns that represent them have not been described or identified.
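A minimal sketch of this approach using an isolation forest (one of several suitable algorithms) is shown below; the transaction features and contamination rate are illustrative assumptions:

```python
# A minimal sketch of unsupervised anomaly detection with an isolation
# forest. Transactions are synthetic; no fraud labels are used in training.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
# Hypothetical features: [amount, seconds_since_previous_transaction]
normal = rng.normal([50, 3600], [20, 1200], size=(1000, 2))
suspicious = np.array([[4000, 5], [3500, 10]])  # large amounts, rapid-fire
X = np.vstack([normal, suspicious])

# The model learns a baseline of "normal" and labels deviations as -1.
clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
print(clf.predict(suspicious))  # the extreme transactions should be flagged
```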
PayPal uses anomaly detection to analyse the more than 15 billion annual transactions on its platform for fraud.
As part of its Fraud Protection service, the platform offers intelligence based on both the consumer and merchant side of the transaction. Customers can also protect their business from threats with machine learning-based filters that score the riskiness of a transaction based on historic fraud trends.
PayPal’s Chargeback Protection also relies on machine learning to evaluate transactions in real time and then decide whether to approve or reject them. Aside from fraud protection, this service reduces instances of false declines and payment disputes.
Natural language processing (NLP) is another application of unsupervised algorithms. In this context, the objective is to analyse, understand and derive meaning from human language datasets.
To that end, models can detect anomalies, extract important keywords or phrases and identify the main topics.
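As a minimal sketch, topics can be extracted from a small document collection with TF-IDF and non-negative matrix factorisation (NMF), one of several unsupervised options; the documents and topic count below are made up:

```python
# A minimal sketch of unsupervised topic extraction with TF-IDF and NMF.
# The documents are made up; two topics is an illustrative choice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "loan agreement interest rate repayment schedule",
    "borrower default collateral loan covenant",
    "quarterly earnings revenue profit forecast",
    "revenue growth profit margin earnings report",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, random_state=0).fit(X)
terms = tfidf.get_feature_names_out()
for i, topic in enumerate(nmf.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"Topic {i}: {top}")  # the highest-weighted terms per topic
```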
NLP is typically associated with language translation and speech recognition in conversational interfaces. But it can also be used by finance companies for various other purposes.
JPMorgan Chase’s proprietary Contract Intelligence (COiN) system utilises NLP to review and process commercial loan agreements.
Here, the benefits of unsupervised machine learning are clear. Before the system was introduced, the analysis of these documents was monotonous, labour-intensive and prone to human error.
With COiN at the helm, documents can be reviewed in just a few seconds and the error rate is near-zero. In a recent internal report, the company noted that COiN now processes 12,000 commercial credit agreements each year and saves around 360,000 hours of human labour in the process.
While undoubtedly powerful and versatile, unsupervised machine learning does come with its own unique challenges.
One obvious drawback of the unsupervised approach is the lack of ground truth, which refers to the accurate, real-world data that is used to objectively assess the model’s performance.
Without this reference point, it is difficult to ascertain how well an unsupervised model has learned the patterns in the data. Researchers can use visual inspection or assess certain metrics to gauge performance, but these are often subjective measures at best.
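One such internal metric is the silhouette score, sketched below on synthetic data; higher values suggest better-separated clusters, but the measure remains heuristic:

```python
# A minimal sketch of one internal metric, the silhouette score, used to
# gauge clustering quality when no ground-truth labels exist.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(6, 1, size=(100, 2))])

# Compare candidate cluster counts; a higher silhouette suggests better
# separation, but it remains an indirect, heuristic measure.
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```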
The curse of dimensionality is a term coined by mathematician Richard Bellman. It refers to the various problems that arise when models analyse datasets with lots of features or variables.
In short, an increase in either makes it more difficult for models to find meaningful patterns or relationships in the data. Some models become unnecessarily complex and identify spurious patterns in noise, a phenomenon known as overfitting.
Some unsupervised models, as noted earlier, can overcome this problem by extracting the most important features with techniques such as PCA.
However, selection of the correct parameters and methods for dimensionality reduction can be problematic. What’s more, the process is computationally expensive.
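One common heuristic, sketched below with scikit-learn, is to keep just enough principal components to explain a chosen share (say 95%) of the variance; the synthetic data and threshold are illustrative:

```python
# A minimal sketch of one heuristic for picking the number of PCA
# components: keep enough to explain, say, 95% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Correlated synthetic data: 50 features driven by ~5 latent factors.
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(500, 50))

# Passing a float in (0, 1) tells scikit-learn to choose the smallest
# number of components whose cumulative explained variance exceeds it.
pca = PCA(n_components=0.95).fit(X)
print(pca.n_components_)  # components actually kept
```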
Some unsupervised models also suffer from poor interpretability, meaning it is difficult for users to understand their decision-making process.
Like the curse of dimensionality, interpretability is a problem associated with complex, feature-rich datasets. It is also, to some extent, a problem inherent to unsupervised models with unlabelled data.
It’s important to note that while models can easily uncover hidden patterns and structures in the data, what those patterns and structures represent is sometimes unclear.
Interpretability in unsupervised learning is an active area of research, but there are some ways to reduce its impact.
Researchers should have a reasonable idea of what they want to achieve with the data. Visualisation of the data with techniques such as scatter plots and dendrograms (as in hierarchical clustering) is also effective.