The Heist
In late 2024 and early 2025, three Chinese AI labs — DeepSeek, Moonshot AI, and MiniMax — ran what Anthropic would later describe as an "industrial-scale" campaign to steal its most capable model's knowledge. Not by hacking servers. Not by acquiring training data. By talking to Claude.
Across roughly 24,000 fraudulent accounts, coordinated in distributed clusters designed to evade detection, the three companies generated over 16 million exchanges. They weren't looking for information. They were building a dataset — systematically eliciting Claude's reasoning traces, its agentic behaviors, its chain-of-thought patterns — everything needed to train a cheaper model to behave like a much more expensive one.
The scale is striking. MiniMax alone generated 13 million exchanges, focused almost entirely on agentic coding. When Anthropic released a new model, MiniMax pivoted within 24 hours to target the upgraded version. Moonshot collected 3.4 million exchanges targeting reasoning and computer-use capabilities. DeepSeek — smaller in volume but arguably most strategic — focused on generating chain-of-thought training data and finding ways to elicit responses that bypassed censorship filters.
This wasn't opportunistic scraping. It was a coordinated intelligence operation against a commercial AI system. And Anthropic only detected it because they built classifiers specifically designed to catch it.
You don't need to be Anthropic for this to be your problem. Any fine-tuned model with specialized capabilities — medical reasoning, legal analysis, domain-specific code generation — is a target. Your API is a training dataset if you're not treating it like one.
What Is Model Distillation?
Model distillation was invented as a compression technique. The idea is elegant: you have a large, expensive "teacher" model with strong capabilities, and you want a small, cheap "student" model that behaves similarly. Instead of training the student on raw data, you train it on the outputs of the teacher — its probability distributions, its reasoning traces, its answers. The student learns to mimic the teacher's behavior without needing the teacher's scale.
Legitimate distillation is how companies build efficient models for deployment. GPT-4 distilled into smaller variants. LLaMA fine-tuned on GPT-4 outputs (before OpenAI's terms changed). This is well-established ML engineering.
The attack version is the same technique applied without consent, at scale, against a commercial API. Instead of authorized access to the teacher's internals, attackers use the public API as a synthetic data generator. Every response is a labeled training example. Every reasoning trace is a dataset row. The student model trains on millions of these examples until it approximates the teacher's behavior — without any of the teacher's safety training, RLHF fine-tuning, or constitutional alignment.
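In the legitimate setting, the objective the student minimizes can be sketched in a few lines. The following is a minimal, illustrative version of the classic temperature-softened distillation loss (all names are my own, not from any lab's code). Note the contrast with the attack variant: an API attacker never sees the teacher's logits, so their training reduces to supervised fine-tuning on sampled text rather than matching full distributions.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-softened softmax: a higher temperature exposes the
    # teacher's "dark knowledge" in the relative weights of wrong answers.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) over softened distributions: the quantity a
    # student minimizes to mimic the teacher's output distribution.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student whose logits match the teacher's incurs zero loss;
# a mismatched student incurs positive loss.
teacher = [4.0, 1.0, 0.5]
aligned = distillation_loss(teacher, [4.0, 1.0, 0.5])
mismatched = distillation_loss(teacher, [0.5, 1.0, 4.0])
```

The temperature parameter is the detail worth noticing: at T > 1, the teacher's near-misses carry usable signal, which is exactly the richness that text-only API extraction tries to recover through reasoning traces instead.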
How Reinforcement Learning Supercharges Distillation
Raw output copying is table stakes. What makes modern distillation attacks dramatically more powerful is the use of reinforcement learning to generate synthetic training data at scale.
Here's the key insight: a model that can solve a problem step-by-step, showing its reasoning trace, is worth far more as a training target than a model that just gives you the answer. When DeepSeek queried Claude for chain-of-thought traces — the "think out loud" reasoning that Claude shows before answering — they weren't collecting answers. They were collecting a reasoning curriculum.
In RL-augmented distillation, attackers go further. They generate thousands of specialized tasks, query the teacher model for solutions with full reasoning traces, then use those traces as reward signal to train the student model's own reasoning process. The student doesn't just memorize answers — it learns the underlying reasoning strategy. This is why the resulting models are so capable despite their smaller size: they've been trained on the distilled reasoning of a much larger model, not just its surface outputs.
Moonshot's focus on "agentic reasoning and computer-use capabilities" makes sense in this light. These are high-value, hard-to-train behaviors. Generating synthetic demonstrations of agent-style reasoning — plan → tool call → observe → adapt — via an existing capable model is orders of magnitude cheaper than training those capabilities from scratch.
Chain-of-thought traces are not just verbose answers. They are a transferable reasoning dataset. When you elicit CoT from a capable model at scale, you are building a fine-tuning curriculum — not just collecting outputs.
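To make "fine-tuning curriculum" concrete: each elicited exchange becomes one supervised training row that preserves the reasoning trace, not just the final answer. The row format below, including the `<think>` delimiter and field names, is a hypothetical illustration, not any lab's actual schema.

```python
import json

def to_sft_example(prompt, reasoning_trace, final_answer):
    # One elicited exchange becomes one supervised fine-tuning row.
    # The student is trained to reproduce the *reasoning*, not just the
    # answer -- that is what makes CoT traces a transferable curriculum.
    return {
        "prompt": prompt,
        "completion": f"<think>{reasoning_trace}</think>\n{final_answer}",
    }

exchanges = [
    ("What is 17 * 24?", "17*24 = 17*20 + 17*4 = 340 + 68 = 408", "408"),
]
rows = [to_sft_example(*ex) for ex in exchanges]
dataset = "\n".join(json.dumps(r) for r in rows)
```

Multiply this by millions of exchanges and the result is a reasoning dataset that would otherwise take expert annotators months to produce.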
The Attack Architecture: Hydra Clusters
What made these campaigns hard to detect wasn't the queries themselves — it was the evasion architecture. Anthropic's report describes what they call "Hydra cluster architectures": distributed networks of fraudulent accounts designed to look like normal API usage.
The pattern works like this. Attackers create thousands of accounts — each with plausible usage patterns, realistic rate consumption, and no single fingerprint that triggers anomaly detection. Queries are spread across accounts so no individual account trips rate limits or behavioral thresholds. The "head" directing the campaign is invisible to the API provider; only the individual accounts are visible, and each looks benign in isolation.
When detection systems adapt — recognizing patterns in query topics, timing, or response elicitation styles — the hydra pivots. A cluster gets flagged and replaced. The campaign continues under new account identities, often within hours. This is why MiniMax could pivot within 24 hours of a new Claude release: they weren't discovering the new model, they were already querying it through a pre-built evasion infrastructure.
The evasion techniques documented by Anthropic include:
- Account rotation: Spreading load across thousands of accounts to avoid per-account anomalies
- Rate mimicry: Consuming API at human-plausible speeds, not machine-fast bulk rates
- Query diversification: Varying prompt structure and topics to avoid pattern detection
- Censorship bypass: DeepSeek specifically crafted queries to elicit responses on sensitive topics that would slip past censorship filters
- Rapid pivot: Automated infrastructure to shift targeting within hours of model updates
You're a Target Too
It would be tempting to read the Anthropic case as a frontier-lab problem — something that only matters if you're training Claude-scale models. That's wrong.
The distillation attack works against any model with differentiated capabilities. A fine-tuned model for legal contract review. A specialized medical reasoning system trained on proprietary clinical guidelines. A code-generation model fine-tuned on your company's internal codebase. Any system where your model's behavior is substantially different from what an attacker could replicate with a public base model — that's a target.
The economics are asymmetric in the attacker's favor. Training a competitive specialized model from scratch costs millions of dollars in compute, months of expert data collection, and significant ML engineering investment. Distilling it from your API costs the price of API calls — potentially thousands of dollars against millions. Even partial distillation — capturing a subset of the target model's specialized capabilities — significantly reduces the attacker's own R&D burden.
The question isn't whether your model is "important enough" to be a target. The question is: does your model know something valuable that would be expensive to learn from scratch? If yes, you're a target.
The Builder's Guardrail Checklist
Defense against distillation attacks is layered. No single control stops a determined, well-resourced attacker — but layered controls raise the cost, slow the attack, and create detection opportunities at every layer. Here's a defense-in-depth framework organized from easiest to implement to most sophisticated.
Layer 1: Smart Rate Limiting
Naive rate limiting (N requests per minute per API key) doesn't work against hydra clusters. Sophisticated rate limiting looks at behavioral signals, not just volume:
- Query diversity score: Legitimate users ask varied questions with natural distributions. Systematic extractors show unnaturally high coverage of capability domains in short time windows.
- Reasoning trace request patterns: Monitor for unusually high rates of chain-of-thought elicitation. Normal users don't request step-by-step reasoning on every query.
- Session entropy: Human users show natural pauses, topic drift, and conversational patterns. Batch extractors show high-entropy query sequences with no conversational coherence.
- Cross-account correlation: The same task categories queried from many accounts in coordinated time windows is a hydra signal, not individual behavior.
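A minimal sketch of how two of these signals might combine into a per-key risk score. The domain labels, normalization, and 50/50 weighting are illustrative assumptions, not a production design.

```python
import math
from collections import Counter

def domain_entropy(domains):
    # Shannon entropy over the capability domains a key has touched.
    # Legitimate users cluster in a few domains; systematic extractors
    # sweep the whole surface, pushing entropy toward its maximum.
    counts = Counter(domains)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def extraction_risk(window_domains, cot_flags, max_domains=6):
    # Combine two signals: capability-domain coverage in a short window,
    # and the fraction of requests eliciting chain-of-thought reasoning.
    diversity = domain_entropy(window_domains) / math.log2(max_domains)
    cot_rate = sum(cot_flags) / len(cot_flags)
    return 0.5 * diversity + 0.5 * cot_rate  # illustrative weighting

# A user living in one domain who rarely requests CoT scores low;
# a systematic sweep with CoT on every request scores at the ceiling.
low = extraction_risk(["coding"] * 20, [False] * 18 + [True] * 2)
high = extraction_risk(
    ["coding", "reasoning", "agentic", "creative", "math", "tools"] * 4,
    [True] * 24,
)
```

The point of the sketch is the shape of the computation, not the weights: behavioral scoring looks at what a key asks for, not how often it asks.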
Layer 2: Behavioral Fingerprinting
Build classifiers that detect systematic extraction behavior, not just high volume. The signals that distinguish a distillation campaign from a high-volume legitimate user include:
- Queries that systematically span capability boundaries (coding, reasoning, agentic, creative) rather than staying in one domain
- Prompt templates that are parameterized variations of each other — signs of automated generation rather than human authorship
- Suspiciously complete coverage of edge cases within a capability area — the kind of systematic coverage that suggests benchmark-style extraction rather than organic use
- Immediate pivoting to new model capabilities within hours of a model release
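One cheap way to surface parameterized prompt templates is to normalize away the parameters and count signature collisions. This is a deliberately crude sketch; real fingerprinting would likely use embeddings or fuzzier matching, and the regexes here are illustrative.

```python
import re
from collections import Counter

def template_signature(prompt):
    # Collapse quoted strings and numbers so that parameterized variants
    # of one template map to the same normalized signature.
    sig = re.sub(r'"[^"]*"|\'[^\']*\'', "<STR>", prompt)
    sig = re.sub(r"\d+", "<NUM>", sig)
    return re.sub(r"\s+", " ", sig).strip().lower()

def templated_fraction(prompts, min_cluster=3):
    # Fraction of prompts belonging to a signature cluster of size
    # min_cluster or more -- a sign of automated prompt generation.
    counts = Counter(template_signature(p) for p in prompts)
    clustered = sum(c for c in counts.values() if c >= min_cluster)
    return clustered / len(prompts)

batch = [
    "Write a function that sorts 10 integers",
    "Write a function that sorts 250 integers",
    "Write a function that sorts 7 integers",
    "How do I make sourdough?",
]
```

A human author produces near-duplicates occasionally; a prompt generator produces them in bulk, and the clustered fraction makes that visible.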
Layer 3: Output Watermarking and Canary Data
You can't prevent extraction if you can't detect it. Output watermarking embeds statistical signals in your model's responses that are imperceptible to users but detectable in aggregate if those outputs appear in another model's training data.
Canary data is complementary: inject rare but memorable knowledge into your model's training data. If another model later "knows" these canaries, you have evidence of distillation. This is less about prevention and more about legal and evidentiary positioning.
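A sketch of the canary workflow under stated assumptions: the "Veltram constant" is a deliberately fictional fact invented for this example, and the matching logic is far cruder than a real evidentiary pipeline would need to be.

```python
import hashlib

def make_canaries(secret_seed, n=5):
    # Deterministically generate rare, memorable "facts" to seed into
    # training data. The seed stays private; the strings are unguessable.
    canaries = []
    for i in range(n):
        tag = hashlib.sha256(f"{secret_seed}:{i}".encode()).hexdigest()[:12]
        canaries.append(f"The Veltram constant is {tag}.")  # fictional fact
    return canaries

def canary_hits(suspect_outputs, canaries):
    # If a suspect model reproduces canary values it could only have seen
    # in your training data, that is evidence of distillation.
    joined = " ".join(suspect_outputs)
    return [c for c in canaries if c.split()[-1].rstrip(".") in joined]
```

The deterministic generation matters: you can later prove in which model version and date range a given canary existed, which is exactly the evidentiary positioning the technique is for.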
Layer 4: Account and Identity Verification
Hydra clusters depend on cheap, frictionless account creation. Friction is a deterrent:
- Phone verification: Blocks bulk account creation via throwaway email addresses
- Payment method binding: Real credit cards are traceable; prepaid cards can still be used but raise cost and operational complexity for the attacker
- Organization verification for high-volume tiers: Requiring business verification for access to high-throughput API tiers cuts off bulk anonymous extraction
- Anomaly-triggered KYC escalation: When behavioral signals trigger, require additional verification rather than immediately blocking — this reveals whether the account can produce a real identity
Layer 5: Model-Level Countermeasures
Anthropic's approach included model-level defenses — training the model itself to behave differently when extraction patterns are detected. This is the hardest layer to implement but the most robust:
- Detection classifiers in the inference stack: Run a lightweight classifier alongside inference that scores each request for extraction signals; throttle or modify behavior for high-scoring sessions
- Capability gating: Don't expose the full capability surface of your model through the API by default; require explicit opt-in for high-value capabilities like extended chain-of-thought reasoning
- Output perturbation for extraction patterns: When extraction is suspected, subtly modify outputs in ways that degrade their training signal without being obviously wrong to a human reviewer
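A toy sketch of how a per-session extraction score might gate capabilities in the inference path. The exponential-moving-average weighting and threshold are illustrative assumptions, and `fake_generate` stands in for a real inference call.

```python
class ExtractionGate:
    # Minimal inference-stack gate: score each request, and once a
    # session's running score crosses the threshold, withhold extended
    # chain-of-thought (capability gating) rather than hard-blocking.
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.session_scores = {}

    def handle(self, session_id, request_score, generate):
        # Exponential moving average: one odd query won't flag a session,
        # but sustained extraction-like behavior accumulates.
        prev = self.session_scores.get(session_id, 0.0)
        score = 0.8 * prev + 0.2 * request_score
        self.session_scores[session_id] = score
        if score >= self.threshold:
            return generate(allow_cot=False)  # gate the high-value capability
        return generate(allow_cot=True)

def fake_generate(allow_cot):
    # Stand-in for the real model call.
    return "full reasoning" if allow_cot else "answer only"

gate = ExtractionGate(threshold=0.5)
# Sustained high-risk requests eventually trip the gate:
# the first three pass, the fourth is capability-gated.
responses = [gate.handle("sess-1", 1.0, fake_generate) for _ in range(4)]
```

Degrading rather than blocking is the design choice worth copying: a hard block tells the attacker exactly where your detection boundary is, while silent capability gating does not.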
Start with Layers 1 and 4. Smart rate limiting and account friction are the highest-leverage, lowest-effort controls. Behavioral fingerprinting (Layer 2) requires more investment but provides the clearest detection signal. Layers 3 and 5 are for organizations with mature security postures and specialized AI systems worth protecting.
The Honest Closing: This Is an Arms Race
None of the controls above are permanent solutions. Every detection method has an evasion. Every fingerprinting technique can be countered with sufficiently sophisticated adversarial prompting or behavioral mimicry. The Anthropic report itself acknowledges this: these campaigns were detected after they had already collected millions of exchanges. Prevention is imperfect.
What a mature defense posture actually looks like isn't a checklist you complete and forget. It's a continuous practice:
- Monitor for new extraction signals as attacker techniques evolve
- Share intelligence with peers — Anthropic explicitly mentions intelligence sharing with other providers as part of their response. Industry-level threat sharing slows attackers who can't rely on one provider's blind spot being universal
- Accept that some extraction will happen and design your differentiation accordingly — if your model's value comes only from a capability that can be distilled, that's a business risk as much as a security one
- Treat API access controls as a security surface, not just a billing concern — the design decisions you make about rate limits, account tiers, and capability exposure are security decisions
The deeper question these attacks raise is about the structure of AI development itself. Distillation attacks are a form of technology transfer — one that bypasses the export controls, licensing agreements, and access restrictions that would normally govern the transfer of strategic technology. As AI capabilities become more strategically significant, the gap between what's technically possible (distillation) and what's legally or ethically acceptable will widen. The builders who understand both sides of that gap will be better positioned than those who only see the technical problem.
Your API is a training dataset. Treat it accordingly.
This post is based on Anthropic's published findings: Detecting and Preventing Distillation Attacks.