On Day 5, we explored distributed/decentralized training. While such training was previously considered infeasible, new algorithms like DiLoCo now enable effective training over low-bandwidth distributed networks. However, testing these methods remains challenging due to hardware maintenance, fault tolerance, and networking complexities. To accelerate distributed AI research, we're introducing EXO Gym.
DiLoCo has demonstrated the ability to train large language models with 100-1000x less bandwidth. It works by letting each GPU train independently using an "inner optimizer," only synchronizing with other GPUs every H steps (e.g., H = 500) using an "outer optimizer." This approach reduces bandwidth requirements by a factor of H. For more details, see our beginner's guide or the full paper.
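To make the mechanics concrete, here is a rough single-process sketch of one DiLoCo round in PyTorch. The worker and optimizer bookkeeping, and the H = 500 value, are illustrative only; the outer optimizer shown is SGD with Nesterov momentum, as described in the paper, but this is a sketch rather than the exo-gym implementation.

import torch

H = 500  # synchronize every H inner steps

def diloco_round(global_model, workers, inner_opts, outer_opt, data_iters, loss_fn):
    # Inner phase: each worker trains independently for H steps
    # with its own inner optimizer (e.g. AdamW).
    for worker, opt, data in zip(workers, inner_opts, data_iters):
        for _ in range(H):
            x, y = next(data)
            opt.zero_grad()
            loss_fn(worker(x), y).backward()
            opt.step()

    # Outer phase: average the per-worker pseudo-gradients (previous global
    # weights minus new local weights) and apply them to the global weights
    # with the outer optimizer (SGD with Nesterov momentum in the paper).
    outer_opt.zero_grad()
    for g_param, *w_params in zip(global_model.parameters(),
                                  *(w.parameters() for w in workers)):
        deltas = [g_param.data - w.data for w in w_params]
        g_param.grad = torch.stack(deltas).mean(dim=0)
    outer_opt.step()

    # Broadcast the updated global weights back to every worker; only this
    # step needs communication, which is where the ~H-fold bandwidth saving comes from.
    for worker in workers:
        worker.load_state_dict(global_model.state_dict())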
The initial paper demonstrated training a 300M parameter model while requiring 500x less bandwidth. The team at PrimeIntellect later scaled this method to train 1B and 10B parameter models. However, several fundamental questions remain unanswered:
The answer to all of these questions, even when asking the DiLoCo first author or the team at PrimeIntellect, is "we don't know." Training a model on a distributed network is hard enough in practice, with so many issues arising, that few people have experimented with it. Even the Discussion section of the DiLoCo paper (Section 5) notes as a limitation that the method was only tried on a text transformer.
Inspired by OpenAI Gym (now Gymnasium), which revolutionized reinforcement learning research in 2016, EXO Gym provides a similar toolkit for distributed training research. Like its predecessor, it offers benchmark problems and a common interface for testing and comparing new algorithms.
EXO Gym lowers the barrier to entry in distributed training research through an evolving collection of simulators that run on a single machine. For example, diloco-sim simulates the DiLoCo algorithm by training on N simulated nodes locally, abstracting away the complexity of maintaining a real distributed system.
At the core of EXO Gym is the ability to simulate distributed training methods on a single machine. One key example is testing DiLoCo with different model architectures - addressing a limitation noted in the original paper where only transformer models were explored.
We can simulate a training run using N nodes and train a CNN with just a few lines of code. The goal is for researchers to write minimal code and focus on ideas and algorithmic development. Here's a code snippet:
# Assumed imports: DilocoSimulator from the diloco-sim package, plus the
# user's own CNNModel class and CIFAR-style datasets defined elsewhere.
import torch.nn.functional as F
from diloco_sim import DilocoSimulator

simulator = DilocoSimulator(
    model_cls=CNNModel,              # any nn.Module class, instantiated on each node
    model_kwargs={
        "num_classes": 100,
        "input_channels": 3,
        "input_height": 32,
        "input_width": 32,
    },
    optimizer_kwargs={"lr": 0.001},  # inner-optimizer settings
    num_nodes=2,                     # number of simulated DiLoCo nodes
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    loss_fn=F.cross_entropy,
    num_epochs=10,
)
simulator.train()
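In this snippet, CNNModel, train_dataset, and test_dataset are supplied by the user; diloco-sim only needs a model class it can instantiate on each simulated node. A hypothetical CNN matching the model_kwargs above (CIFAR-100-style 3x32x32 inputs) could look like this; it is an example definition, not the exact network from our experiments.

import torch.nn as nn

class CNNModel(nn.Module):
    # Illustrative CNN whose constructor matches the model_kwargs above.
    def __init__(self, num_classes, input_channels, input_height, input_width):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(input_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),   # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(
            64 * (input_height // 4) * (input_width // 4), num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))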
In this diagram, we're looking at the evaluation loss while training the CNN model. With a single node, evaluation loss first drops below 2.5 after 6,000 steps (12 x 500). With 2 nodes, however, the same loss is reached in under 4,000 steps. This initial result is promising evidence that DiLoCo can work with CNNs, but it's far from conclusive.
DiLoCo uses an inner optimizer (typically AdamW) to do local training at every step. Then every H steps, the models from different GPUs are synchronized using an outer optimizer. But what happens when there is a single node? Essentially, we are using the AdamW optimizer every step, then adding a second optimizer every H steps.
In the paper, the authors show that this dual-optimizer approach can outperform plain AdamW. That means it could be beneficial for ordinary single-node training, entirely outside the realm of distributed training, as in the sketch below.
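A minimal sketch of that single-node variant (our own naming and hyperparameters, not an official API or the paper's exact configuration): keep the weights from the last synchronization point as an anchor, run AdamW for H steps, then treat the accumulated weight change as an outer gradient and apply it with a second optimizer such as Nesterov SGD.

import torch

def train_dual_optimizer(model, data_iter, loss_fn, steps, H=500):
    # Inner optimizer runs every step; outer optimizer runs every H steps.
    inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    outer_opt = torch.optim.SGD(model.parameters(), lr=0.7,
                                momentum=0.9, nesterov=True)
    # Snapshot of the weights at the last "synchronization" point.
    anchor = {k: v.detach().clone() for k, v in model.state_dict().items()}

    for step in range(1, steps + 1):
        x, y = next(data_iter)
        inner_opt.zero_grad()
        loss_fn(model(x), y).backward()
        inner_opt.step()

        if step % H == 0:
            # Outer gradient = anchor weights minus current weights,
            # applied starting from the anchor point (DiLoCo with one node).
            for name, p in model.named_parameters():
                p.grad = anchor[name] - p.data
                p.data.copy_(anchor[name])
            outer_opt.step()
            anchor = {k: v.detach().clone() for k, v in model.state_dict().items()}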
To accelerate the discovery of new distributed training algorithms, we're hosting competitions with real-world validation. Join our Discord research channel and sign up for the first EXO Gym competition.
The first competition focuses on training nanoGPT using diloco-sim with the following parameters:
Winning solutions will be validated and scaled up using:
The EXO team will routinely scale up new models on our distributed compute setup. If you'd like to volunteer compute to support these efforts, get in touch!
Unlike traditional open-source software development, AI research often requires substantial computing resources, limiting participation to well-funded labs. With tools like EXO Gym, anyone can train and validate ideas locally before scaling up promising solutions. Several companies including Pluralis, PrimeIntellect, Nous, and Gensyn are already working on distributed/decentralized training - now you can join them in shaping the future of AI.
To stay up to date on developments in this area, join our Discord research channel and sign up for the first EXO Gym competition.