Decentralized Training - 12 days of EXO

Day 5 - 12 days of EXO

In the previous posts, we explored how EXO makes frontier models more accessible. By clustering consumer devices, we can run powerful models like DeepSeek-v3 671B, and tools like Llama98.c enable us to run LLMs on 25-year-old hardware.

But what about training? Can we train and fine-tune models on consumer devices? Today, we'll explore one of the main challenges in distributed training: communication cost. We'll examine approaches like DiLoCo that make decentralized training possible, potentially enabling a future where open-source contributors can train models together, SETI@home style.

Training an LLM: The Basics

Consider training a language model on 1,000 books, where each book represents one batch of data. On a single machine (like a GPU or Mac mini), if processing one book takes 1 minute, training on all books takes 1,000 minutes (16 hours and 40 minutes). It's fully sequential:

Figure 1: Sequential training on a single device

The total training time for N batches is therefore:

single_node_training_time = single_batch_compute_time * N [1]

Parallel Training: The Promise and the Catch

To speed things up, we can split data across multiple machines - a technique called data parallelism. However, there's a caveat: synchronization.

Figure 2: Data parallel training with synchronization

With 2 machines processing different books simultaneously, each develops slightly different model weights. To prevent divergence, they must synchronize their weights after every batch. Each machine now handles only N/2 batches but pays the synchronization cost on each one, so the training time becomes:

multi_node_training_time = (single_batch_compute_time + synchronization_time) * N/2 [2]

We can understand when parallel training is beneficial by comparing the total training times:

multi_node_training_time < single_node_training_time
(single_batch_compute_time + synchronization_time) * N/2 < single_batch_compute_time * N [3]

Simplifying:

synchronization_time < single_batch_compute_time [4]

This reveals a key insight: parallel training is only faster when the synchronization time is less than the computation time for a single batch. This makes intuitive sense - if synchronization takes longer than processing a batch, we'd be better off training sequentially on a single machine.
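The comparison in equations [1] to [4] can be checked numerically. The following is a minimal sketch for the two-machine case only; the function names are hypothetical and chosen for illustration, and times can be in any unit as long as it is used consistently.

```python
def single_node_time(batch_time: float, num_batches: int) -> float:
    """Equation [1]: batches are processed one after another."""
    return batch_time * num_batches


def two_node_time(batch_time: float, sync_time: float, num_batches: int) -> float:
    """Equation [2]: each of 2 machines handles half the batches,
    and pays the synchronization cost after every batch."""
    return (batch_time + sync_time) * num_batches / 2


def parallel_is_faster(batch_time: float, sync_time: float, num_batches: int) -> bool:
    """Inequality [3], which simplifies to sync_time < batch_time ([4])."""
    return two_node_time(batch_time, sync_time, num_batches) < single_node_time(
        batch_time, num_batches
    )


# Synchronization cheaper than one batch: parallel training wins.
print(parallel_is_faster(batch_time=1.0, sync_time=0.5, num_batches=1000))
# Synchronization more expensive than one batch: a single machine wins.
print(parallel_is_faster(batch_time=1.0, sync_time=1.5, num_batches=1000))
```

The result does not depend on the number of batches, which is why N drops out of condition [4].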

The Data Center vs. Internet Challenge

The synchronization cost varies dramatically based on your network environment. Consider a 7B parameter model in float16 (approximately 14GB of data):

| Environment | Network Speed | Sync Time |
|---|---|---|
| Data Center | 200 Gbps | ~0.56 seconds |
| Home Internet | 100 Mbps | ~20 minutes |
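These sync times follow from dividing the payload size by the link speed. Here is a rough sketch of that arithmetic, assuming an idealized link (no latency or protocol overhead) and a single full copy of the weights sent per synchronization; the function name is hypothetical.

```python
def sync_time_seconds(num_params: float, bytes_per_param: int, bandwidth_bits_per_s: float) -> float:
    """Idealized time to send one full copy of the weights over a link."""
    payload_bits = num_params * bytes_per_param * 8  # bytes -> bits
    return payload_bits / bandwidth_bits_per_s


# 7B parameters in float16 (2 bytes each) is about 14 GB.
data_center = sync_time_seconds(7e9, 2, 200e9)  # 200 Gbps
home = sync_time_seconds(7e9, 2, 100e6)         # 100 Mbps

print(f"Data center:   {data_center:.2f} s")
print(f"Home internet: {home / 60:.1f} min")
```

Under these assumptions the home connection comes out slightly under 19 minutes, which the table rounds to about 20 minutes.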

This leads to vastly different training times for our 1,000-book example:

| Scenario | Training Time | Viability |
|---|---|---|
| Single GPU | 1,000 minutes (16.7 hours) | Baseline - predictable sequential processing |
| 2 GPUs (Data Center) | 505 minutes (8.4 hours) | Excellent - nearly perfect 2x speedup due to fast network |
| 2 GPUs (Internet) | 10,500 minutes (7.3 days) | Impractical - slow network makes synchronization extremely costly |

The challenge becomes even more pronounced with more nodes. In a naive all-to-all exchange, every node sends its weights to every other node, so the total data transferred can scale quadratically with the number of nodes. This makes traditional distributed training over the internet impractical without specialized techniques.
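To illustrate the quadratic growth, here is a small sketch that assumes the simplest possible scheme: at each synchronization, every node sends one full copy of the model to every other node. This scheme and the function name are assumptions made for illustration; bandwidth-efficient collectives such as ring all-reduce move less data in total.

```python
def naive_all_to_all_bytes(num_nodes: int, model_bytes: float) -> float:
    """Total bytes on the network per sync if every node sends a full
    model copy to each of the other (num_nodes - 1) nodes."""
    return num_nodes * (num_nodes - 1) * model_bytes


MODEL_BYTES = 14e9  # 7B parameters in float16

for n in (2, 4, 8, 16):
    total_gb = naive_all_to_all_bytes(n, MODEL_BYTES) / 1e9
    print(f"{n:>2} nodes: {total_gb:,.0f} GB per synchronization")
```

Doubling the node count roughly quadruples the traffic under this assumption.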

Enter DiLoCo: Making Internet Training Viable

DiLoCo (Distributed Low-Communication Training) and similar approaches like Local SGD and Decentralized SGD offer a solution: instead of synchronizing after every step, let each GPU train independently for H steps before synchronizing.

Figure 3: DiLoCo training with H=3 steps between synchronization
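The idea of training independently for H steps and then synchronizing can be sketched on a toy problem. This is a simplified illustration, not the DiLoCo implementation: DiLoCo as published uses AdamW for the inner steps and Nesterov momentum for the outer step, while this sketch substitutes plain SGD for both, which makes it closer to Local SGD. The model (a single weight fitting y = w * x), the learning rates, and all function names are assumptions chosen for illustration.

```python
import random


def local_steps(w: float, shard: list, h: int, lr: float) -> float:
    """Run h SGD steps on one worker's data shard, with no communication."""
    for i in range(h):
        x, y = shard[i % len(shard)]
        grad = 2 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
        w -= lr * grad
    return w


def train(shards: list, rounds: int, h: int, inner_lr: float = 0.05, outer_lr: float = 1.0):
    w_global = 0.0
    sync_count = 0
    for _ in range(rounds):
        # Every worker starts from the shared weights and trains alone for h steps.
        local_ws = [local_steps(w_global, shard, h, inner_lr) for shard in shards]
        # Synchronize: average how far each worker moved away from the shared weights.
        avg_delta = sum(w_global - w for w in local_ws) / len(local_ws)
        # Outer step; with outer_lr=1.0 this is plain weight averaging.
        w_global -= outer_lr * avg_delta
        sync_count += 1
    return w_global, sync_count


random.seed(0)
TRUE_W = 3.0
data = [(x, TRUE_W * x) for x in (random.uniform(-1, 1) for _ in range(200))]
shards = [data[:100], data[100:]]  # one shard per worker

w, syncs = train(shards, rounds=20, h=50)
print(f"learned w = {w:.4f} after {syncs} synchronizations")
```

Each worker takes 1,000 local steps in total, but the workers communicate only 20 times instead of 1,000.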

Since we synchronize only every H training steps, the total synchronization time shrinks by a factor of H. The total training time becomes:

diloco_training_time = (single_batch_compute_time + 1/H * synchronization_time) * N/2 [5]

To determine when DiLoCo training is beneficial, we can compare it with single-node training:

diloco_training_time < single_node_training_time
(single_batch_compute_time + 1/H * synchronization_time) * N/2 < single_batch_compute_time * N [6]

Simplifying:

synchronization_time < H * single_batch_compute_time [7]

This is a powerful result! When H=1, it reduces to our previous condition (synchronization_time < single_batch_compute_time). But with H=500, synchronization only needs to take less than 500 times the single-batch computation time. In our example that threshold is 500 minutes, so a 20-minute synchronization over home internet clears it easily.
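The condition synchronization_time < H * single_batch_compute_time can also be read the other way around: given a network, how large does H need to be? Below is a minimal sketch for the two-machine case; the helper name is hypothetical.

```python
import math


def min_sync_interval(batch_time: float, sync_time: float) -> int:
    """Smallest integer H for which sync_time < H * batch_time holds."""
    return math.floor(sync_time / batch_time) + 1


# Home internet example: 1-minute batches, ~20-minute synchronization.
print(min_sync_interval(batch_time=1.0, sync_time=20.0))
# Data center example: synchronization takes well under one batch.
print(min_sync_interval(batch_time=1.0, sync_time=0.56 / 60))
```

Any H at or above this value beats a single machine on training time in this model; the choice of H=500 leaves a wide margin.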

Here's the complete comparison, with DiLoCo at H=500 running over the same slow internet connection:

| Scenario | Training Time | Viability |
|---|---|---|
| Single GPU | 1,000 minutes (16.7 hours) | Baseline - predictable sequential processing |
| 2 GPUs (Data Center) | 505 minutes (8.4 hours) | Excellent - nearly perfect 2x speedup due to fast network |
| 2 GPUs (Internet) | 10,500 minutes (7.3 days) | Impractical - slow network makes synchronization extremely costly |
| DiLoCo (H=500) | 520 minutes (8.7 hours) | Game-changing - matches data center performance over slow networks |

By synchronizing only every 500 steps, we achieve a training time over slow internet connections (8.7 hours) that is comparable to data center performance (8.4 hours). This is a dramatic improvement over the impractical 7.3 days required by traditional distributed training over the internet.
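The figures in the comparison can be reproduced from the training-time formulas. This sketch uses the numbers from the running example (1,000 batches, 1 minute per batch, about 0.56 seconds of sync in the data center, about 20 minutes over home internet); the constant and function names are hypothetical.

```python
NUM_BATCHES = 1000
BATCH_MIN = 1.0            # minutes per batch
DC_SYNC_MIN = 0.56 / 60    # ~0.56 seconds, expressed in minutes
NET_SYNC_MIN = 20.0        # ~20 minutes over home internet


def two_node_minutes(sync_min: float, h: int = 1) -> float:
    """Two-machine training time when synchronizing every h steps."""
    return (BATCH_MIN + sync_min / h) * NUM_BATCHES / 2


rows = [
    ("Single GPU", BATCH_MIN * NUM_BATCHES),
    ("2 GPUs (Data Center)", two_node_minutes(DC_SYNC_MIN)),
    ("2 GPUs (Internet)", two_node_minutes(NET_SYNC_MIN)),
    ("DiLoCo (H=500)", two_node_minutes(NET_SYNC_MIN, h=500)),
]

for name, minutes in rows:
    print(f"{name:<22}{minutes:>8.0f} min {minutes / 60:>7.1f} h")
```

Setting h=1 recovers the per-step synchronization case, so the same function covers both the traditional and the DiLoCo rows.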

The implications are profound: we can now train models without requiring access to a large data center. This opens the door to aggregating compute from smaller providers and training models across them. In the future, this could enable hobbyists to pool their globally distributed compute resources over the internet and train models collaboratively, similar to how SETI@home pioneered distributed computing by connecting millions of home computers to search for extraterrestrial intelligence.

What's Next?

Data parallel training is slow over a low-bandwidth network because of its expensive synchronization step: for a large model, gigabytes of weights have to be sent over the network at every training step. DiLoCo addresses this by synchronizing only once every H steps (for example, every 500 steps when H = 500).

What if we don't need to synchronize the whole model at all? What if instead of sending 14GB over a network, we send 1MB? Approaches such as SPARTA and DisTrO enable us to perform data parallel training without needing to synchronize the whole model. If you'd like to learn more, check out the rest of the 12 Days Of EXO, or join the research channel in our Discord.

Want to try DiLoCo training yourself? Check out our open-source DiLoCo simulator that lets you experiment with different distributed training configurations, even on Apple Silicon. In one of the upcoming posts in the 12 Days of EXO series, we'll release a tool that makes running and experimenting with various distributed training approaches easier.
