At the Horizon: DiffusionBlocks cuts the memory cost of training custom models by up to 4x

June 2026

Most AI startups rent their intelligence. They use a frontier model through an API. They add prompts. They add retrieval. Sometimes they train a LoRA. That is enough to ship a product, but it creates a ceiling: the product is custom, and the model is not.

DiffusionBlocks points to a different future. It is a training method from Sakana AI that splits a model into blocks and trains those blocks independently. In several of the paper's experiments, that cuts training memory by a factor of 3 to 4 while keeping performance competitive. That matters because memory decides who gets to train models.

The Training Bottleneck

Normal model training is end-to-end. The model runs forward through all its layers. Then backpropagation runs backward through those same layers to update the weights. This takes memory.

During training, the system stores activations. An activation is the internal state produced by a layer. The model needs those activations later to compute gradients.
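To make the role of stored activations concrete, here is a minimal sketch with a chain of scalar layers. The tanh layers, the weights, and the function names are illustrative assumptions, not anything taken from the paper. The point is only that the backward pass cannot run without the values the forward pass saved.

```python
import math

def forward(x, weights):
    """Run a chain of scalar layers h = tanh(w * h_prev), caching every activation."""
    activations = [x]  # the input counts as the first stored state
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations

def backward(activations, weights):
    """Compute d(output)/d(w_i) for every layer, reading the cached activations."""
    grads = [0.0] * len(weights)
    upstream = 1.0  # d(output)/d(output)
    for i in reversed(range(len(weights))):
        h_in, h_out = activations[i], activations[i + 1]
        local = 1.0 - h_out ** 2           # derivative of tanh at this layer
        grads[i] = upstream * local * h_in  # gradient for this layer's weight
        upstream = upstream * local * weights[i]  # pass the gradient further back
    return grads

weights = [0.5, -1.2, 0.8, 1.1]  # hypothetical 4-layer toy model
acts = forward(0.3, weights)
grads = backward(acts, weights)
print(len(acts), "stored values for", len(weights), "layers")
```

Every layer adds one more entry to the cache, and the backward pass reads all of them.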

If a model has 12 layers, normal training needs to remember what happened in all 12. If a model has 48 layers, it needs to remember all 48. That is why depth is expensive: more layers means more stored activations, and larger layers mean larger activations.

This is the bottleneck DiffusionBlocks attacks.

What DiffusionBlocks Changes

DiffusionBlocks changes the unit of training. Instead of training the whole network at once, it splits the network into B blocks.

A block is a group of layers. For example, a 12-layer transformer split into 3 blocks has 4 layers per block.
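As a small sketch of that split, assuming equal-sized blocks of consecutive layers (the helper name `split_into_blocks` is hypothetical, and the paper may partition differently):

```python
def split_into_blocks(num_layers, num_blocks):
    """Group consecutive layer indices into equal-sized blocks."""
    if num_layers % num_blocks != 0:
        raise ValueError("this sketch assumes the layers divide evenly into blocks")
    size = num_layers // num_blocks
    return [list(range(start, start + size)) for start in range(0, num_layers, size)]

blocks = split_into_blocks(12, 3)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```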

Normal training says:

Train all 12 layers together.

DiffusionBlocks says:

Train 4 layers at a time.

That is where the memory savings come from. If the model is split into B equal blocks, each training step stores activations for about 1 / B of the model. So a 3-block split needs about one third of the activation memory of end-to-end training. A 4-block split needs about one quarter.

That is the core idea: the model is not smaller, but the training job is smaller.
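A back-of-the-envelope version of that arithmetic, as a sketch. It assumes one stored tensor of shape (batch, sequence length, hidden size) per layer, which undercounts what a real transformer keeps, and the shapes below are made up for illustration. The ratio between the two numbers is what matters, not the absolute sizes.

```python
def activation_bytes(num_layers, batch, seq_len, hidden, bytes_per_value=2):
    """Rough activation footprint: one (batch, seq_len, hidden) tensor per layer."""
    return num_layers * batch * seq_len * hidden * bytes_per_value

# Hypothetical 12-layer model, stored in 16-bit precision
num_layers, num_blocks = 12, 3
shape = dict(batch=8, seq_len=1024, hidden=768)

end_to_end = activation_bytes(num_layers, **shape)
per_block = activation_bytes(num_layers // num_blocks, **shape)

print(f"end-to-end: {end_to_end / 2**20:.0f} MiB")
print(f"one block:  {per_block / 2**20:.0f} MiB")
print(f"ratio:      {per_block / end_to_end:.2f}")
```

The parameter count is identical in both cases. Only the amount that has to be held in memory during one training step changes.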

End-to-end training stores activations for every layer. Blockwise training only stores the block being trained.

Figure: training memory, normalized to end-to-end training. End-to-end: 1.00x; 3 blocks: 0.33x; 4 blocks: 0.25x.

Training setup	What gets trained at once	Memory intuition
End-to-end	All layers	Store activations for the whole model
3 blocks	One third of the model	About `3x` less activation memory
4 blocks	One quarter of the model	About `4x` less activation memory

Why Diffusion Is Involved

Diffusion models start with noise and remove it step by step. One denoising step does not create the whole output. It only makes the sample cleaner. After enough steps, noise becomes structure.

DiffusionBlocks applies that idea to neural networks.

A deep model already transforms information gradually. Early layers build rough features. Middle layers refine them. Later layers shape the final output.

DiffusionBlocks treats each block like one refinement step. Each block learns a local job:

given this representation, move it closer to the target

That turns one large training problem into many smaller ones.

Each block performs a local refinement, gradually moving an internal representation closer to the target.
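The toy below is one way to picture a block's local job. It is not the paper's algorithm: the scalar "denoiser", the noise levels, and the least-squares fit are all assumptions chosen to keep the example tiny. What it does show is that each block can be fit using only its own noisy inputs and the target, with nothing passed between blocks during training.

```python
import random

def fit_block(sigma, n_samples=20000, seed=0):
    """Fit one block on its own: a scalar denoiser f(x) = w * x.

    The block sees targets corrupted by Gaussian noise of scale `sigma`
    and is fit by least squares to map them back toward the target.
    """
    rng = random.Random(seed)
    sum_xy = 0.0
    sum_xx = 0.0
    for _ in range(n_samples):
        y = rng.choice((-1.0, 1.0))          # toy target
        x = y + sigma * rng.gauss(0.0, 1.0)  # noisy representation fed to the block
        sum_xy += x * y
        sum_xx += x * x
    return sum_xy / sum_xx  # closed-form least-squares weight

# Hypothetical noise levels, one per block, noisiest first
noise_levels = [2.0, 1.0, 0.5]
block_weights = [fit_block(s, seed=i) for i, s in enumerate(noise_levels)]

for sigma, w in zip(noise_levels, block_weights):
    print(f"noise {sigma}: learned weight {w:.3f}")
```

Each call to `fit_block` is a separate, small training problem. In this toy the best weight for a block works out to roughly 1 / (1 + sigma^2), so noisier blocks learn to shrink their input harder.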

The Important Numbers

The paper reports results across image generation and text generation. The big point is simple: memory drops a lot, but quality does not collapse.

For image generation, lower FID is better. For language models, lower BPC and lower perplexity are better. For MAUVE, higher is better.

Figure: relative improvement over the reported baseline. CIFAR-10 FID: +6.6%; ImageNet FID: +12.1%; text8 BPC: +7.1%; LM1B PPL: +15.5%; LM1B MAUVE: +42.0%.

Benchmark	Metric	Baseline	DiffusionBlocks	Better direction
CIFAR-10	FID	`39.83`	`37.20`	lower
ImageNet	FID	`12.09`	`10.63`	lower
text8	BPC	`1.56`	`1.45`	lower
LM1B	PPL	`14.58`	`12.32`	lower
LM1B	MAUVE	`0.50`	`0.71`	higher
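The relative improvements follow directly from the table. A short sketch of the arithmetic, using the numbers above (the function name is just for illustration):

```python
# (benchmark, metric, baseline, DiffusionBlocks, better direction), copied from the table
results = [
    ("CIFAR-10", "FID", 39.83, 37.20, "lower"),
    ("ImageNet", "FID", 12.09, 10.63, "lower"),
    ("text8", "BPC", 1.56, 1.45, "lower"),
    ("LM1B", "PPL", 14.58, 12.32, "lower"),
    ("LM1B", "MAUVE", 0.50, 0.71, "higher"),
]

def relative_improvement(baseline, new, better):
    """Percent improvement over the baseline, positive when the new value is better."""
    change = (baseline - new) if better == "lower" else (new - baseline)
    return 100.0 * change / baseline

improvements = [
    round(relative_improvement(base, new, better), 1)
    for _, _, base, new, better in results
]

for (bench, metric, *_), pct in zip(results, improvements):
    print(f"{bench} {metric}: +{pct}%")
```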

So the claim is not just “use less memory and accept worse performance.” In several cases, DiffusionBlocks used less memory and improved the metric.

How This Compares to LoRA

LoRA is the practical customization tool startups use today. It freezes the original model weights and adds small trainable matrices to some layers. Instead of changing the whole model, it learns a small update.

The model becomes:

base model + small learned adapter

That is why LoRA is useful. It is cheap, fast, and lets a team adapt a model without retraining every parameter. But LoRA is still a patch: it adapts a model after the model has already been trained. DiffusionBlocks is different because it changes the training structure itself.
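For readers who have not seen LoRA written out, here is a minimal sketch in plain Python. It follows the standard formulation, where the frozen weight W gets a low-rank update scaled by alpha / r. The matrix sizes and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A. W stays frozen; only A and B are trained."""
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    scale = alpha / r
    return [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]

def lora_trainable_params(d_out, d_in, r):
    """Parameters in the adapter: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)

# Tiny example: 3x3 frozen weight, rank-1 adapter with B starting at zero
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.1, 0.2, 0.3]]
B = [[0.0], [0.0], [0.0]]
W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)

full = 4096 * 4096
adapter = lora_trainable_params(4096, 4096, r=8)
print(f"full layer: {full:,} params, adapter: {adapter:,} params")
```

Because B starts at zero, the adapted model begins as an exact copy of the base model and drifts only as far as the small matrices allow.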

Method	What changes	What it gives you
Prompting	Input text	Fast behavior changes
RAG	Context	External knowledge
LoRA	Small adapter weights	Cheap fine-tuning
Full fine-tuning	The whole model	Deep customization, high cost
DiffusionBlocks	The training process	Modular training

LoRA asks:

How do we cheaply adapt this model?

DiffusionBlocks asks:

Can we train models in pieces?

That is a bigger shift.

Why Startups Should Care

Today, many AI startups use the same core models. That makes the intelligence layer hard to defend. A startup can still win on product, workflow, data, and distribution. But the model itself is often rented.

A modular training method could change that.

A legal startup could train around contract structure. A coding startup could train around a specific codebase. A healthcare startup could train around clinical notes. A robotics startup could train around its own sensor data.

The model does not need to change everywhere. It only needs to change where the product is different.

That is the startup opportunity. Custom models become less like training a full foundation model and more like engineering the part of the system that matters most.

The Next Question

This is still early. The biggest open question is whether this works well for the large pretrained language models companies already deploy. That is the real unlock: if it does, DiffusionBlocks is not just a research idea. It becomes a way to make custom model training much more practical.

The paper also shows an interesting extension to recurrent-depth models. These models normally apply the same network many times during training. DiffusionBlocks replaces that long training loop with a single-pass setup. That is the kind of result startups should watch closely, not because every number will transfer directly, but because the direction is clear: training is becoming more modular.

Conclusion

LoRA made customization cheaper. DiffusionBlocks points toward something deeper: making training itself modular.

That matters because the best model for a product may not be the largest general model. It may be the model that fits the product best.

That is the horizon.