Finance glossary

What is unsupervised machine learning?

Bristol James
8 Min

Unsupervised machine learning is a type of machine learning where algorithms analyse and interpret data without explicit instructions. Since unsupervised models learn from unlabelled input data, they are free to uncover hidden patterns or relationships without human instruction.

Unsupervised machine learning differs from supervised learning, where algorithms are trained to predict an outcome based on labelled input data. In other words, each input has a corresponding output that supervised models must learn to predict.

Figure: A comparison of how supervised and unsupervised machine learning work in practice (Source: Langs and Hofmanninger 2018)

With an often-cited estimate suggesting that 90% of the world’s data was generated in just the last two years, unsupervised models have a crucial role to play in making sense of vast and complex datasets from fields such as IoT and social media.

In finance, unsupervised machine learning is utilised in risk management, fraud detection, customer segmentation and document processing, among many other applications.

Unsupervised learning techniques

Generally speaking, algorithms employ one of three techniques to identify patterns and relationships in the data.

1 – Clustering

One of the most widely used techniques, clustering involves grouping a set of objects such that the objects in one group (or cluster) more closely resemble each other than they do the objects in another group.

Four types of unsupervised machine learning algorithms are used in clustering: exclusive, probabilistic, hierarchical and overlapping.

Exclusive

Exclusive algorithms group data such that a single data point can only be assigned to one cluster. This means that clusters are mutually exclusive and there is no overlap between them.

Businesses may use exclusive algorithms to segment their customers based on purchase behaviour or location.
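
To make the idea of mutually exclusive clusters concrete, here is a minimal sketch of k-means, one well-known exclusive algorithm. The article does not name a specific algorithm, so the choice of k-means, the function name and the customer figures are all illustrative assumptions.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: every point ends up in exactly one cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its single nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical customers: (annual spend in thousands, purchases per month)
customers = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.2), (8.0, 9.0), (8.5, 9.5), (7.8, 8.7)]
labels, centroids = kmeans(customers, k=2)
print(labels)
```

Each customer receives exactly one label, which is what makes the segmentation exclusive. In practice the number of clusters (k) has to be chosen by the analyst.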

Probabilistic

As the name implies, probabilistic algorithms assign data points to clusters based on probability distributions. They allow for the possibility that one data point belongs to multiple clusters.

Probabilistic algorithms can be used to assess the creditworthiness of loan applicants, with each applicant segmented according to their credit history, financial behaviour and demographic information.

Rather than assigning the applicant to a single risk category, the unsupervised algorithm can provide a distribution of probabilities across multiple risk levels.
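
The distribution of probabilities across risk levels can be sketched as follows. This example assumes a Gaussian mixture over a single credit score, and the component weights, means and standard deviations are hypothetical values supplied by hand. A real model would estimate them from the data, for instance with the expectation-maximisation algorithm.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def risk_probabilities(score, components):
    """Return the posterior probability of each risk level for one applicant."""
    weighted = {
        name: weight * gaussian_pdf(score, mean, std)
        for name, (weight, mean, std) in components.items()
    }
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


# Hypothetical mixture: (mixing weight, mean credit score, standard deviation)
components = {
    "low risk": (0.5, 750.0, 40.0),
    "medium risk": (0.3, 650.0, 40.0),
    "high risk": (0.2, 550.0, 40.0),
}

probs = risk_probabilities(690.0, components)
for level, p in probs.items():
    print(f"{level}: {p:.3f}")
```

An applicant whose score sits between two components receives a meaningful share of probability from both, rather than being forced into one category.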

Hierarchical

In hierarchical clustering, the unsupervised learning algorithm creates a tree-like structure that represents nested groupings of data points at various levels of granularity.

This structure is called a dendrogram, with each node representing a cluster of data points and each branch showing how clusters are connected or split.

The most common hierarchical approach is agglomerative (bottom-up) clustering. Every data point starts as its own cluster. The closest pairs of clusters are then merged iteratively (based on similarity) until a predetermined number of clusters remains.
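
The bottom-up merging process can be sketched in a few lines. This version assumes one-dimensional data and single linkage (the distance between two clusters is the distance between their closest members). Both are simplifying assumptions, and the account balances are made up.

```python
def agglomerative(values, target_clusters):
    """Single-linkage agglomerative clustering on 1-D values."""
    clusters = [[v] for v in values]  # every point starts as its own cluster
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                dist = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters


# Hypothetical account balances (in thousands)
balances = [1.0, 1.5, 2.0, 10.0, 10.5, 30.0]
groups = agglomerative(balances, target_clusters=3)
print(groups)
```

Recording each merge and the distance at which it happened would give the information needed to draw the dendrogram.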

Overlapping

Overlapping algorithms place data points into multiple clusters simultaneously and tend to be best suited to scenarios where:

  • The boundaries between clusters are poorly defined, and
  • The data points naturally belong to multiple categories.

For example, a customer could be segmented into both a “luxury buyer” and an “eco-conscious consumer” cluster.
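
One way to express overlapping membership is the membership formula used in fuzzy c-means, sketched below. The article does not specify an algorithm, so this choice is an assumption, as are the cluster centres, the two features and the 0.3 threshold. The centres are fixed by hand here, whereas a full algorithm would learn them.

```python
import math


def fuzzy_memberships(point, centres, m=2.0):
    """Fuzzy c-means style membership of one point in every cluster (sums to 1)."""
    dists = {name: max(math.dist(point, c), 1e-12) for name, c in centres.items()}
    exponent = 2.0 / (m - 1.0)
    return {
        name: 1.0 / sum((d / other) ** exponent for other in dists.values())
        for name, d in dists.items()
    }


# Hypothetical centres: (share of spend on premium goods, share on sustainable goods)
centres = {
    "luxury buyer": (0.9, 0.2),
    "eco-conscious consumer": (0.2, 0.9),
    "budget shopper": (0.1, 0.1),
}

customer = (0.7, 0.7)
memberships = fuzzy_memberships(customer, centres)
# Keep every cluster whose membership clears an illustrative threshold.
segments = {name for name, u in memberships.items() if u >= 0.3}
print(memberships)
print(segments)
```

Because the customer is equally close to two centres, both memberships clear the threshold and the customer lands in both segments.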

2 – Association rules

Unsupervised learning models can also employ rules to unearth associations, connections or correlations between data points.

The association rules technique is often used in a market basket analysis to help companies understand items that are frequently purchased together.

The results of the analysis then form the basis of common eCommerce recommendation engines that start with the message “People who purchased this item also purchased”.

Banks can also use association rule-based algorithms to detect fraud in transaction data. If a specific combination of transactions (such as one small and one large) tends to precede fraud, the bank can flag the incident for further investigation.
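
A market basket analysis usually scores candidate rules with three measures: support (how often the items appear together), confidence (how often the rule holds when its left-hand side is present) and lift (how much more likely the right-hand side becomes). The sketch below computes them for item pairs only. The baskets, the function name and the minimum support of 0.3 are illustrative assumptions.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.3):
    """Score rules of the form 'lhs -> rhs' for every frequent pair of items."""
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for basket in transactions:
        items = sorted(set(basket))
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = {}
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue  # skip pairs that rarely occur together
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]
            lift = confidence / (item_count[rhs] / n)
            rules[(lhs, rhs)] = {"support": support, "confidence": confidence, "lift": lift}
    return rules


# Hypothetical shopping baskets
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
    ["bread", "butter", "jam"],
]

rules = pair_rules(baskets)
for (lhs, rhs), stats in sorted(rules.items()):
    print(f"{lhs} -> {rhs}: {stats}")
```

A lift above 1 suggests the two items appear together more often than chance would predict. A lift below 1 suggests the opposite.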

3 – Dimensionality reduction

As a general rule, more data in machine learning yields better results. Sometimes, however, unsupervised models cannot draw useful conclusions from datasets where the number of features or dimensions is too high.

Algorithms that employ the dimensionality reduction technique extract the most important features from a dataset. By extension, this reduces the number of irrelevant or random features.

One algorithm used for this purpose is principal component analysis (PCA). PCA is often used to prepare or analyse data for modelling. Rather than simply picking a subset of the original features, it combines them into a smaller set of uncorrelated components that preserve as much of the variance in the dataset as possible.

Suppose a bank wants to improve its credit scoring model for mortgage applicants and better predict whether the applicant will default.

However, with hundreds of features in the dataset such as income and credit history, the bank observes that the model is too complex and difficult to manage.

It can then use PCA to identify redundant or irrelevant features and reduce the dataset to 20 principal components to build its credit scoring system.
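
To show what a principal component is, the sketch below finds only the first component of a tiny dataset using power iteration on the covariance matrix. The applicant figures are hypothetical, and a production system would more likely rely on a numerical library and keep several components (20 in the bank example above).

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the top component, its share of total variance and the projected scores."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred features.
    cov = [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(
        vec[a] * sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)
    )
    total_variance = sum(cov[a][a] for a in range(d))
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centred]
    return vec, eigenvalue / total_variance, scores


# Hypothetical applicants: (income, savings, monthly spend), all in thousands
applicants = [(30, 5, 20), (45, 9, 28), (60, 12, 41), (75, 16, 49), (90, 18, 62)]
direction, explained, scores = first_principal_component(applicants)
print(direction)
print(f"share of variance explained: {explained:.3f}")
```

Because the three features move together, a single component captures almost all of the variance, so the three columns could be replaced by one score per applicant with little loss of information.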

Unsupervised learning applications

We already touched on some applications of unsupervised learning in the previous section, so let’s expand on these and detail some additional applications across different contexts.

Customer segmentation

Unsupervised learning is widely used in customer segmentation where customers are grouped based on their purchase behaviour, preferences, income and various other demographic metrics.

By identifying distinct customer segments, companies can craft tailored marketing strategies, make personalised offers and optimise product recommendations.

Example

Customer segmentation can also be used to predict customer churn. In 2010, American Express upgraded its traditional database technology to work in conjunction with machine learning algorithms.

With an enormous amount of historical transaction data at its disposal, American Express was able to create an unsupervised model to predict customer churn.

The model assesses 115 customer behaviours that forecast potential churn, and in 2018, the company claimed it could identify 24% of the Australian customers who would close their accounts within the next four months.

Anomaly detection

Unsupervised machine learning is particularly useful for anomaly detection since models identify outliers in the data without being trained on what constitutes an anomaly.

Though not an exhaustive list, some anomalies in the context of finance include:

  • Transaction anomalies – such as those that occur in unexpected locations.
  • Behavioural anomalies – for example, a sudden increase in luxury purchases.
  • Credit anomalies – where there are unexplained or significant changes to a customer’s credit score or credit utilisation.

Example

Experian sells an enterprise credit decisioning platform built on machine learning. The platform enables customers to make faster and more accurate evaluations of the creditworthiness of an applicant.

Experian’s credit decisioning engine processes data about the borrower such as employment and identity information to look for anomalies that could potentially predict the risk of the customer defaulting.

Fraud detection

Related to anomaly detection is fraud detection – one of the most common and indeed important applications of unsupervised learning in finance.

These models analyse vast amounts of historical transaction data to establish a baseline of normal behaviour.

The model can then detect anomalies or outliers that deviate substantially from normal behaviour, including new types of fraud whose patterns have not yet been described or identified.
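
The two steps described above (learn a baseline, then flag deviations) can be sketched with a simple robust statistic. This example assumes a single feature, the transaction amount, and uses the median and median absolute deviation (MAD) as the baseline. The amounts are made up, and the 3.5 cut-off is a common rule of thumb rather than anything taken from a real fraud system.

```python
import statistics


def fit_baseline(amounts):
    """Summarise normal behaviour with the median and median absolute deviation."""
    median = statistics.median(amounts)
    mad = statistics.median([abs(a - median) for a in amounts])
    if mad == 0:
        raise ValueError("MAD is zero; the history has too little variation")
    return median, mad


def anomaly_score(amount, baseline):
    """Modified z-score: how far an amount sits from the baseline."""
    median, mad = baseline
    # 0.6745 scales the MAD so the score is comparable to a z-score for normal data.
    return 0.6745 * abs(amount - median) / mad


# Hypothetical card history for one customer (no fraud labels are needed)
history = [42.0, 38.5, 45.0, 40.0, 39.0, 44.5, 41.0, 43.0]
baseline = fit_baseline(history)

new_transactions = [41.5, 47.0, 950.0]
flagged = [a for a in new_transactions if anomaly_score(a, baseline) > 3.5]
print(flagged)
```

Nothing in the code describes what fraud looks like. A transaction is flagged purely because it sits far from the customer's usual behaviour, which is why this style of model can surface patterns nobody has catalogued yet.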

Example

PayPal uses anomaly detection to analyse the more than 15 billion annual transactions on its platform for fraud.

As part of its Fraud Protection service, the platform offers intelligence based on both the consumer and merchant side of the transaction. Customers can also protect their business from threats with machine learning-based filters that score the riskiness of a transaction based on historic fraud trends.

PayPal’s Chargeback Protection also relies on machine learning to evaluate transactions in real time and then decide whether to approve or reject them. Aside from fraud protection, this service reduces instances of false declines and payment disputes.

Natural language processing (NLP)

Natural language processing is another application of unsupervised algorithms. In this context, the objective is to analyse, understand and derive meaning from human language datasets.

To that end, models can detect anomalies, extract important keywords or phrases and identify the main topics.
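
Keyword extraction is one of the simpler tasks to illustrate. The sketch below ranks the words in each document by TF-IDF, a score that rewards words that are frequent in one document but rare across the collection. The choice of TF-IDF and the three contract-style sentences are illustrative assumptions, and no labelled data is involved.

```python
import math
import re
from collections import Counter


def top_keywords(documents, top_n=3):
    """Rank each document's words by TF-IDF and return the top few."""
    tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    results = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results


# Hypothetical clauses
clauses = [
    "the borrower shall repay the loan with loan interest",
    "the lender may assign the agreement",
    "the guarantor shall indemnify the lender",
]

keywords = top_keywords(clauses)
print(keywords)
```

Words that appear in every clause, such as "the", score zero and drop out, while terms specific to one clause rise to the top.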

NLP is typically associated with language translation and speech recognition in conversational interfaces. But it can also be used by finance companies for various other purposes.

Example

JPMorgan Chase’s proprietary Contract Intelligence (COiN) system utilises NLP to review and process commercial loan agreements.

Here, the benefits of unsupervised machine learning are clear. Before the system was introduced, the analysis of these documents was monotonous, labour-intensive and prone to human error.

With COiN at the helm, documents can be reviewed in just a few seconds and the error rate is near-zero. In a recent internal report, the company noted that COiN now processes 12,000 commercial credit agreements each year and saves around 360,000 hours of human labour in the process.

What are the main challenges with unsupervised learning?

While undoubtedly powerful and versatile, unsupervised machine learning does come with its own unique challenges.

Figure: An overview of some of the challenges in unsupervised learning (Source: FITA)

A lack of ground truth

One obvious drawback of the unsupervised approach is the lack of ground truth, which refers to the accurate, real-world data that is used to objectively assess the model’s performance.

Without this reference point, it is difficult to ascertain how well an unsupervised model has learned the patterns in the data. Researchers can use visual inspection or assess certain metrics to gauge performance, but these are often subjective measures at best.
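
One example of such a metric is the silhouette coefficient, which compares how close each point is to its own cluster with how close it is to the nearest other cluster. The sketch below is a minimal implementation with made-up points. Note that it measures how compact and separated the clusters are, not whether they are correct in any real-world sense.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient: near 1 is well separated, near 0 is overlapping."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D points forming two tight groups
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
sensible = silhouette_score(points, [0, 0, 0, 1, 1, 1])
scrambled = silhouette_score(points, [0, 1, 0, 1, 0, 1])
print(f"sensible grouping: {sensible:.2f}")
print(f"scrambled grouping: {scrambled:.2f}")
```

The sensible grouping scores much higher than the scrambled one, which makes the metric useful for comparing candidate clusterings of the same data even when no ground truth exists.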

The curse of dimensionality

The curse of dimensionality is a term coined by mathematician Richard Bellman. It refers to the various problems that arise when models analyse datasets with lots of features or variables.

In short, an increase in either makes it more difficult for models to find meaningful patterns or relationships in the data. Some models can become unnecessarily complex and mistake noise for patterns, a phenomenon known as overfitting.
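
One well-known symptom of the curse is that distances lose their contrast: as dimensions are added, the nearest and farthest neighbours of a point end up almost equally far away, which undermines any method that relies on similarity. The sketch below measures this on uniformly random points. The sample sizes, dimensions and function name are arbitrary choices for illustration.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from one point to all the others."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)


low = relative_contrast(200, 2)
high = relative_contrast(200, 500)
print(f"2 dimensions: {low:.2f}")
print(f"500 dimensions: {high:.2f}")
```

The contrast in 500 dimensions is far smaller than in 2, so the notion of a "closest" point carries much less information.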

Some unsupervised models, as noted earlier, can overcome this problem by extracting the most important features with principal component analysis (PCA).

However, selection of the correct parameters and methods for dimensionality reduction can be problematic. What’s more, the process is computationally expensive.

Interpretability

Some unsupervised models also suffer from poor interpretability. This means it is difficult for users to understand their decision-making process.

Like the curse of dimensionality, poor interpretability is a problem associated with complex, feature-rich datasets. It is also, to some extent, a problem inherent to unsupervised models with unlabelled data.

It’s important to note that while models can easily uncover hidden patterns and structures in the data, what those patterns and structures represent is sometimes unclear.

Interpretability in unsupervised learning is an active area of research, but there are some ways to reduce its impact.

Researchers should have a reasonable idea of what they want to achieve with the data. Visualisation of the data with techniques such as scatter plots and dendrograms (as in hierarchical clustering) is also effective.

Summary:

  • Unsupervised machine learning is a type of machine learning that deals with unlabelled data. Algorithms are provided with input data (but not the corresponding output labels) and tasked with discovering hidden patterns, structures and relationships.
  • There are three main unsupervised learning techniques. Clustering groups similar data points with various ways to do so, while association rules identify relationships between variables. Dimensionality reduction seeks to simplify datasets by reducing the number of features.
  • In finance, the applications of unsupervised machine learning are as diverse as they are impressive. JPMorgan Chase, for example, used unsupervised models to streamline document review and save thousands of hours of manual labour in the process.
  • Despite their benefits, some challenges persist with unsupervised machine learning. Without a real reference point, it can be hard to objectively judge model performance. The curse of dimensionality and model interpretability are also issues that arise from large or complex datasets.
