On Day 5, we explored distributed/decentralized training. While such training was previously considered infeasible, new algorithms like DiLoCo now enable effective training over low-bandwidth distributed networks. However, testing these methods remains challenging due to hardware maintenance, fault tolerance, and networking complexities. To accelerate distributed AI research, we're introducing EXO Gym.
DiLoCo has demonstrated the ability to train large language models with 100-1000x less bandwidth. It works by letting each GPU train independently using an "inner optimizer," only synchronizing with other GPUs every H steps (e.g., H = 500) using an "outer optimizer." This approach reduces bandwidth requirements by a factor of H. For more details, see our beginner's guide or the full paper.
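To make the mechanics concrete, here is a rough single-process sketch of one DiLoCo round in PyTorch. The worker and optimizer bookkeeping, and the H = 500 value, are illustrative only; the outer optimizer shown is SGD with Nesterov momentum, as described in the paper, but this is a sketch rather than the exo-gym implementation.

import torch

H = 500  # synchronize every H inner steps

def diloco_round(global_model, workers, inner_opts, outer_opt, data_iters, loss_fn):
    # Inner phase: each worker trains independently for H steps
    # with its own inner optimizer (e.g. AdamW).
    for worker, opt, data in zip(workers, inner_opts, data_iters):
        for _ in range(H):
            x, y = next(data)
            opt.zero_grad()
            loss_fn(worker(x), y).backward()
            opt.step()

    # Outer phase: average the per-worker pseudo-gradients (previous global
    # weights minus new local weights) and apply them to the global weights
    # with the outer optimizer (SGD with Nesterov momentum in the paper).
    outer_opt.zero_grad()
    for g_param, *w_params in zip(global_model.parameters(),
                                  *(w.parameters() for w in workers)):
        deltas = [g_param.data - w.data for w in w_params]
        g_param.grad = torch.stack(deltas).mean(dim=0)
    outer_opt.step()

    # Broadcast the updated global weights back to every worker; only this
    # step needs communication, which is where the ~H-fold bandwidth saving comes from.
    for worker in workers:
        worker.load_state_dict(global_model.state_dict())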
The initial paper demonstrated training a 300M parameter model while requiring 500x less bandwidth. The team at PrimeIntellect later scaled this method to train 1B and 10B parameter models. However, several fundamental questions remain unanswered:
The answer to all of these questions, even when asking the DiLoCo first author or the team at PrimeIntellect, is "we don't know." Training a model on a distributed network is hard enough in practice, with so many issues arising, that few people have experimented with it. Even the Discussion section of the DiLoCo paper (Section 5) notes as a limitation that the method was only tried on a text transformer.
Inspired by OpenAI Gym (now Gymnasium), which revolutionized reinforcement learning research in 2016, EXO Gym provides a similar toolkit for distributed training research. Like its predecessor, it offers benchmark problems and a common interface for testing and comparing new algorithms.
EXO Gym lowers the barrier to entry in distributed training research through an evolving collection of simulators that run on a single machine. For example, diloco-sim simulates the DiLoCo algorithm by training on N simulated nodes locally, abstracting away the complexity of maintaining a real distributed system.
At the core of EXO Gym is the ability to simulate distributed training methods on a single machine. One key example is testing DiLoCo with different model architectures - addressing a limitation noted in the original paper where only transformer models were explored.
We can simulate a training run using N nodes and train a CNN with just a few lines of code. The goal is for researchers to write minimal code and focus on ideas and algorithmic development. Here's a code snippet:
# Assumed imports: DilocoSimulator from the diloco-sim package, plus the
# user's own CNNModel class and CIFAR-style datasets defined elsewhere.
import torch.nn.functional as F
from diloco_sim import DilocoSimulator

simulator = DilocoSimulator(
    model_cls=CNNModel,              # any nn.Module class, instantiated on each node
    model_kwargs={
        "num_classes": 100,
        "input_channels": 3,
        "input_height": 32,
        "input_width": 32,
    },
    optimizer_kwargs={"lr": 0.001},  # inner-optimizer settings
    num_nodes=2,                     # number of simulated DiLoCo nodes
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    loss_fn=F.cross_entropy,
    num_epochs=10,
)
simulator.train()
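In this snippet, CNNModel, train_dataset, and test_dataset are supplied by the user; diloco-sim only needs a model class it can instantiate on each simulated node. A hypothetical CNN matching the model_kwargs above (CIFAR-100-style 3x32x32 inputs) could look like this; it is an example definition, not the exact network from our experiments.

import torch.nn as nn

class CNNModel(nn.Module):
    # Illustrative CNN whose constructor matches the model_kwargs above.
    def __init__(self, num_classes, input_channels, input_height, input_width):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(input_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),   # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(
            64 * (input_height // 4) * (input_width // 4), num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))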
In this diagram, we're looking at the evaluation loss while training the CNN model. With a single node, evaluation loss first drops below 2.5 after 6,000 steps (12 x 500). With 2 nodes, however, the same loss is reached in under 4,000 steps. This initial result is promising evidence that DiLoCo can work with CNNs, but it's far from conclusive.
DiLoCo uses an inner optimizer (typically AdamW) to do local training at every step. Then every H steps, the models from different GPUs are synchronized using an outer optimizer. But what happens when there is a single node? Essentially, we are using the AdamW optimizer every step, then adding a second optimizer every H steps.
In the paper, the authors show that this dual-optimizer approach can outperform plain AdamW. That means it could be beneficial for ordinary single-node training, entirely outside the realm of distributed training, as in the sketch below.
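A minimal sketch of that single-node variant (our own naming and hyperparameters, not an official API or the paper's exact configuration): keep the weights from the last synchronization point as an anchor, run AdamW for H steps, then treat the accumulated weight change as an outer gradient and apply it with a second optimizer such as Nesterov SGD.

import torch

def train_dual_optimizer(model, data_iter, loss_fn, steps, H=500):
    # Inner optimizer runs every step; outer optimizer runs every H steps.
    inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    outer_opt = torch.optim.SGD(model.parameters(), lr=0.7,
                                momentum=0.9, nesterov=True)
    # Snapshot of the weights at the last "synchronization" point.
    anchor = {k: v.detach().clone() for k, v in model.state_dict().items()}

    for step in range(1, steps + 1):
        x, y = next(data_iter)
        inner_opt.zero_grad()
        loss_fn(model(x), y).backward()
        inner_opt.step()

        if step % H == 0:
            # Outer gradient = anchor weights minus current weights,
            # applied starting from the anchor point (DiLoCo with one node).
            for name, p in model.named_parameters():
                p.grad = anchor[name] - p.data
                p.data.copy_(anchor[name])
            outer_opt.step()
            anchor = {k: v.detach().clone() for k, v in model.state_dict().items()}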
To accelerate the discovery of new distributed training algorithms, we're hosting competitions with real-world validation. Join our Discord research channel and sign up for the first EXO Gym competition.
The first competition focuses on training nanoGPT using diloco-sim with the following parameters:
Winning solutions will be validated and scaled up using:
The EXO team will routinely scale up new models on our distributed compute setup. If you'd like to volunteer compute to support these efforts, get in touch!
Unlike traditional open-source software development, AI research often requires substantial computing resources, limiting participation to well-funded labs. With tools like EXO Gym, anyone can train and validate ideas locally before scaling up promising solutions. Several companies including Pluralis, PrimeIntellect, Nous, and Gensyn are already working on distributed/decentralized training - now you can join them in shaping the future of AI.
To stay up to date on developments in this area, join our Discord research channel and sign up for the first EXO Gym competition.