Definition
Adversarial attacks are malicious attempts to deceive AI models by subtly modifying input data. These tiny changes, often imperceptible to humans, can drastically alter the model's predictions.

Types of adversarial attacks
Adversarial attacks target machine learning models by deliberately feeding them deceptive data. These attacks exploit weaknesses in how models process inputs, causing incorrect or unintended outputs. There are several main types of attacks:
- Poisoning attacks: These occur during training, when malicious data is added to the dataset. This corrupts the learning process and can lead to biased or incorrect predictions. For example, mislabelled data can teach a model to ignore certain patterns or behave unpredictably.
- Evasion attacks: These happen at inference time, where inputs are subtly altered to trick a trained model. Slight changes to images, audio, or text can cause a model to misclassify them, even though they appear normal to humans.
- Prompt injection: Large language models can be manipulated through carefully crafted inputs. These prompts override or subvert the model's instructions, steering it toward unintended outputs without any access to its weights or internal code.
- Model inversion and membership inference: These attacks use a model's outputs to infer sensitive information about the training data. Attackers can reconstruct data or determine if a particular record was used in training.
- Model stealing: Repeated queries to a model can allow an attacker to replicate its functionality. This compromises intellectual property and can enable further attacks.
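To make evasion attacks concrete, here is a minimal sketch of a gradient-based perturbation (in the style of the fast gradient sign method) against a toy logistic-regression classifier. The weights, input, and epsilon below are illustrative assumptions, not values from any real system:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Probability that input x belongs to class 1."""
    return sigmoid(np.dot(w, x) + b)

def fgsm_perturb(w, b, x, y, epsilon):
    """Nudge x by epsilon in the direction that increases the loss.

    For logistic regression, the gradient of the cross-entropy loss
    with respect to the input is (p - y) * w, so the sign of that
    gradient tells the attacker which way to shift each feature.
    """
    p = predict(w, b, x)
    grad = (p - y) * w
    return x + epsilon * np.sign(grad)

# Toy "trained" model and an input correctly classified as class 1.
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.4, 0.1])
y = 1.0

clean_p = predict(w, b, x)                      # above 0.5: class 1
x_adv = fgsm_perturb(w, b, x, y, epsilon=0.5)
adv_p = predict(w, b, x_adv)                    # below 0.5: misclassified
```

The perturbed input differs from the original by at most 0.5 per feature, yet flips the prediction. Against a high-dimensional image model, the same idea works with a far smaller epsilon, which is why adversarial images can look unchanged to humans.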
Motivations behind attacks
Adversarial attacks have multiple goals depending on the attacker's intent. Common motivations include:
- Bypassing security systems, such as fraud detection or authentication controls.
- Causing models to produce harmful or misleading outputs, undermining trust in AI.
- Stealing intellectual property by replicating proprietary models.
- Accessing sensitive data by exploiting a model to reveal private training information.
- Creating long-term disruptions by subtly degrading model accuracy over time.
Even if an attack seems small, it can have serious consequences in the real world. Misdiagnosed medical conditions, financial fraud, or errors in self-driving cars can threaten lives, cause financial loss, or erode public trust.
Case study: Tesla Autopilot evasion attack
In 2019, researchers at Tencent's Keen Security Lab demonstrated an evasion attack on Tesla’s Autopilot system. By placing three small, inconspicuous stickers on the road surface, they tricked the lane-recognition model into perceiving a lane that veered left. This caused the car to steer into the oncoming lane.
This case illustrates how even sophisticated, safety-critical AI systems can be deceived by subtle input manipulations. It emphasises the importance of rigorous testing under real-world conditions, robust monitoring, and proactive mitigation strategies to prevent such failures.
How to defend against adversarial attacks
Defending AI models requires careful planning and multiple strategies:
- Adversarial training: Train models using both normal and adversarial inputs to make them more resilient.
- Input validation: Detect and remove suspicious inputs before they reach the model.
- Ensemble methods: Combine predictions from multiple models to reduce the likelihood of a single attack succeeding.
- Monitoring and anomaly detection: Watch for unexpected outputs or drops in model confidence that may indicate an attack.
- Secure development lifecycle: Integrate security into all stages of AI development, from data collection to deployment, and maintain continuous testing and auditing.
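The first of these strategies, adversarial training, can be sketched with the same toy logistic-regression setup as before: at every training step the model also sees an FGSM-perturbed copy of each example, so it learns to resist small worst-case input shifts. The synthetic data, epsilon, and learning rate here are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic two-class data: class 0 around (-1, -1), class 1 around (+1, +1).
X = np.vstack([rng.normal(-1, 0.5, (100, 2)), rng.normal(1, 0.5, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b = np.zeros(2), 0.0
epsilon, lr = 0.3, 0.1

for _ in range(200):
    # Craft adversarial copies of the batch using the current weights.
    p = sigmoid(X @ w + b)
    X_adv = X + epsilon * np.sign((p - y)[:, None] * w)

    # One gradient step on the combined clean + adversarial batch.
    X_all = np.vstack([X, X_adv])
    y_all = np.concatenate([y, y])
    p_all = sigmoid(X_all @ w + b)
    w -= lr * (p_all - y_all) @ X_all / len(y_all)
    b -= lr * np.mean(p_all - y_all)

# The hardened model should still classify clean data correctly, and
# its decision boundary now accounts for epsilon-sized perturbations.
acc_clean = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

In practice the same loop structure appears in deep-learning frameworks, with the inner perturbation step replaced by stronger attacks (e.g. multi-step projected gradient descent); the trade-off is higher training cost and sometimes a small drop in clean accuracy.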
Key takeaways
- Adversarial attacks exploit the vulnerabilities of AI models and can occur during training or inference.
- Motivations include bypassing controls, stealing data, degrading model performance, and causing reputational damage.
- Poisoning and evasion attacks show how malicious inputs can compromise both the learning process and output behaviour.
- Defending against attacks requires adversarial training, input validation, monitoring, and robust development practices.
