Overfitting

Isabell Hamecher
March 20, 2026
4 min read
Familiarise yourself with what overfitting means in an AI context and how it impacts model performance.

Definition

Overfitting in machine learning occurs when a model learns the details and noise of the training data to the extent that it negatively impacts its performance on new, unseen data. Essentially, the model becomes too complex, capturing patterns that do not generalize well to other datasets.

Understanding overfitting

In machine learning, overfitting occurs when a model matches its training data too closely, sometimes almost exactly, and then fails to make reliable predictions on new data. Instead of learning the true underlying patterns, the model starts memorising the noise, irrelevant details, or random quirks in the training dataset. This defeats the entire purpose of machine learning. The real value of a model lies in generalisation: its ability to apply what it has learned to data it hasn’t seen before. An overfitted model is a bit like an invention that works perfectly in the lab but is useless in the real world.

How overfitting happens

Overfitting usually appears when:

  • A model trains for too long on the same data
  • The model is too complex, with too many parameters or features
  • The training data does not properly represent real-world data

As training continues, the model may start learning random patterns that have no real meaning. Performance on the training set improves, but performance on new data gets worse.
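The "too complex" failure mode is easy to reproduce. The sketch below (illustrative only, using NumPy's `polyfit`) fits the same ten noisy points with a straight line and with a degree-9 polynomial. The degree-9 fit has enough parameters to pass through every training point, noise included, which typically shows up as a near-zero training error alongside a much larger error on fresh samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# True relationship is a straight line; the noise is the part a model
# should NOT learn.
x_train = np.linspace(0.0, 1.0, 10)
y_train = x_train + rng.normal(0.0, 0.1, size=10)
x_test = rng.uniform(0.0, 1.0, size=50)
y_test = x_test + rng.normal(0.0, 0.1, size=50)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Degree 1 captures the trend; degree 9 has enough parameters to pass
# through all 10 training points, noise included.
line = np.polyfit(x_train, y_train, deg=1)
wiggly = np.polyfit(x_train, y_train, deg=9)

print(f"degree 1: train MSE={mse(line, x_train, y_train):.4f}, "
      f"test MSE={mse(line, x_test, y_test):.4f}")
print(f"degree 9: train MSE={mse(wiggly, x_train, y_train):.4f}, "
      f"test MSE={mse(wiggly, x_test, y_test):.4f}")
```

The degree-9 model "wins" on the training set but the gap between its training and test error is exactly the symptom described above.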

Spotting the warning signs

There are some tell-tale indicators:

  • Low error on training data
  • High error on test or validation data
  • High variance in predictions

Detection methods

1. A common detection method is to set aside part of the data as a test set. If the model performs well on training data but poorly on the test set, overfitting is likely.
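As a minimal sketch of this check, the hypothetical example below uses a 1-nearest-neighbour regressor, a model that memorises its training set by construction, so the train/test gap is easy to see:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a sine pattern plus noise the model should NOT memorise.
x = rng.uniform(0.0, 10.0, size=200)
y = np.sin(x) + rng.normal(0.0, 0.3, size=200)

# Hold out 25% of the data as a test set.
idx = rng.permutation(len(x))
cut = int(0.75 * len(x))
train_idx, test_idx = idx[:cut], idx[cut:]

def predict_1nn(x_train, y_train, x_query):
    """1-nearest-neighbour regression: answers with the label of the
    closest training point, i.e. it memorises the training set."""
    dists = np.abs(x_query[:, None] - x_train[None, :])
    return y_train[np.argmin(dists, axis=1)]

train_pred = predict_1nn(x[train_idx], y[train_idx], x[train_idx])
test_pred = predict_1nn(x[train_idx], y[train_idx], x[test_idx])

train_mse = float(np.mean((train_pred - y[train_idx]) ** 2))
test_mse = float(np.mean((test_pred - y[test_idx]) ** 2))
print(f"train MSE: {train_mse:.4f}   test MSE: {test_mse:.4f}")
```

The training error is exactly zero (every query finds itself), while the held-out error reflects the memorised noise.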

2. Another approach is k-fold cross-validation, where data is split into several folds and the model is repeatedly trained and tested on different portions. The results are averaged to judge overall performance.
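A bare-bones version of k-fold cross-validation can be written in a few lines. The `fit` and `score` callables here are illustrative placeholders (a least-squares line, scored by mean squared error):

```python
import numpy as np

def k_fold_scores(x, y, fit, score, k=5, seed=0):
    """Train and evaluate k times; each fold serves once as the held-out set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train_idx], y[train_idx])
        scores.append(score(model, x[test_idx], y[test_idx]))
    return float(np.mean(scores)), scores

# Illustrative model: a least-squares line, scored by mean squared error.
def fit(x, y):
    return np.polyfit(x, y, deg=1)

def score(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 100)
y = 2 * x + rng.normal(0.0, 0.1, size=100)

mean_mse, per_fold = k_fold_scores(x, y, fit, score)
print(f"mean MSE over 5 folds: {mean_mse:.4f}")
```

Because every data point is held out exactly once, the averaged score is a more stable estimate of generalisation than a single train/test split.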

3. Loss curves can also reveal trouble. If training loss keeps falling but validation loss starts rising, the model is probably overfitting.
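This check can be automated with a simple patience rule, as early-stopping implementations commonly do. The loss values below are made up for illustration; only the detection logic matters:

```python
def detect_overfitting(val_losses, patience=3):
    """Return the epoch at which validation loss has failed to improve for
    `patience` consecutive epochs, or None if it never stalls."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return None

# Made-up curves: training loss keeps falling, validation loss turns upward.
train_losses = [0.90, 0.60, 0.40, 0.30, 0.22, 0.17, 0.13, 0.10, 0.08, 0.06]
val_losses   = [0.95, 0.70, 0.50, 0.42, 0.40, 0.43, 0.47, 0.52, 0.58, 0.65]

print(detect_overfitting(val_losses))  # flags epoch 7: 3 epochs past the best
```

Here validation loss bottoms out at epoch 4 while training loss keeps improving, the classic divergence pattern.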

Overfitting and underfitting

Overfitting is not the only training problem. The opposite issue is underfitting.

An overfitted model learns the noise in the training data. It performs very well on the training set but poorly on new, unseen data. These models typically show low bias and high variance.

An underfitted model, on the other hand, has not learned enough from the training data. It performs poorly even on the training set because it has failed to capture the main patterns. Underfitted models usually show high bias and low variance.

This balance between bias and variance is known as the bias–variance trade-off. The goal is to find the sweet spot where the model captures the dominant trend in the data without memorising noise.
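One way to see the trade-off is to sweep model complexity on the same data. In this illustrative sketch the ground truth is a quadratic, so a degree-0 fit underfits (high bias), degree 2 sits near the sweet spot, and degree 15 overfits (high variance):

```python
import numpy as np

rng = np.random.default_rng(7)

def sample(n):
    """Quadratic ground truth plus noise."""
    x = rng.uniform(-1.0, 1.0, n)
    return x, 1 + 2 * x - 3 * x**2 + rng.normal(0.0, 0.2, n)

x_train, y_train = sample(30)
x_test, y_test = sample(200)

results = {}
for degree in (0, 2, 15):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (
        float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)),  # train
        float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)),    # test
    )
    print(f"degree {degree:2d}: train MSE={results[degree][0]:.3f}, "
          f"test MSE={results[degree][1]:.3f}")
```

Training error always falls as complexity grows, but test error traces the familiar U-shape: high at both extremes, lowest near the true complexity of the data.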

Why overfitting is a problem

  • Leads to poor predictive performance on new, unseen data
  • Creates models that are unnecessarily complex and hard to reuse
  • Drives additional data collection, which adds cost and risk
  • In some cases, memorised training examples, which may include sensitive personal information, can be reconstructed from the model

Ways to reduce overfitting

Several techniques can help, either by simplifying the model or by steering it towards meaningful patterns:

  • Early stopping: halting training before the model starts learning noise
  • More relevant data: expanding the dataset with clean, meaningful examples
  • Data augmentation: adding variation carefully to improve stability
  • Feature selection: removing redundant or irrelevant inputs
  • Regularisation: applying penalties to large coefficients to limit model variance
  • Ensemble methods: combining multiple models, as in bagging or boosting, to reduce variance
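As one concrete example of the regularisation idea, ridge regression adds an L2 penalty on the coefficients to the least-squares objective, which has the closed-form solution sketched below. The data here is synthetic and the penalty strengths are arbitrary:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y.
    The lam * I term penalises large coefficients."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(5)

# 20 samples, 15 features, but only the first feature actually matters.
X = rng.normal(size=(20, 15))
y = 3 * X[:, 0] + rng.normal(0.0, 0.5, size=20)

w_unreg = ridge_fit(X, y, lam=1e-8)  # effectively ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)

# The penalty shrinks the coefficient vector, trading a little bias
# for a reduction in variance.
print(f"coefficient norm without penalty: {np.linalg.norm(w_unreg):.2f}")
print(f"coefficient norm with penalty:    {np.linalg.norm(w_ridge):.2f}")
```

With few samples and many features, the unpenalised fit spreads weight across irrelevant inputs; the penalty pulls those coefficients towards zero, which is exactly the variance-limiting effect described above.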

Key takeaways

  • Overfitting happens when a model memorises training data instead of learning general patterns
  • It leads to poor performance on new, unseen data
  • Signs include low training error but high test error and rising validation loss
  • It is linked to high variance and forms part of the bias–variance trade-off
  • Techniques such as early stopping, feature selection, regularisation, and ensemble methods help reduce the risk
  • The ultimate goal of any model is good generalisation, not perfection on the training set
