SPARTA: Distributed Training with Sparse Parameter Averaging - 12 days of EXO


In Day 5, we discussed distributed training with a method called DiLoCo. This method allows for a 100x to 1,000x reduction in communication and thus makes training over poor bandwidth (e.g., over the internet) possible. In this blog post, we introduce SPARTA, a different method for distributed training that can achieve a reduction in communication of more than 1,000x without major degradation in performance. The research for SPARTA was led by Matthew Reed and Mohamed Baioumy. Below, we explain how SPARTA works and how it compares to DiLoCo, and show that principles from both methods can be combined.

To speed up experimentation with methods like SPARTA and DiLoCo, we have built EXO Gym. This allows you to simulate a distributed setup on a single machine and test ablations quickly. Click here to read more.

Recap of distributed low-communication training

A full recap can be found in the Day 5 blog post.

In a nutshell, one method to speed up training is data parallelism: the dataset is split and a batch of data is sent to every GPU. Consider the simplest case with just 2 GPUs training an LLM, where each batch of data contains a book. Each of the 2 GPUs trains on its own book separately. After this training step, the 2 models are slightly different and thus need to synchronize. If we're training a 7B-parameter model over a 100 Mbps internet connection, this synchronization takes around 20 minutes, which makes the whole method impractical. A visualization of regular data parallel training is provided below.

Figure 1: Regular data parallel training across multiple GPUs
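
For a sense of scale, here's a back-of-the-envelope estimate that reproduces the 20-minute figure, assuming fp16 parameters and a symmetric 100 Mbps link (our assumptions; real all-reduce implementations change the constant factor, not the order of magnitude):

```python
# Rough cost of one full synchronization in naive data parallel training.
# Assumes fp16 parameters and a 100 Mbps link (our assumptions, for illustration).
n_params = 7e9                  # 7B-parameter model
bytes_per_param = 2             # fp16
link_bits_per_sec = 100e6       # 100 Mbps internet connection

payload_bits = n_params * bytes_per_param * 8
seconds = payload_bits / link_bits_per_sec
print(f"~{seconds / 60:.0f} minutes per synchronization")  # roughly 19-20 minutes
```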

DiLoCo takes inspiration from federated learning and devises a scheme to lower the communication overhead. Instead of the 2 GPUs synchronizing every step, what if they synchronize only every H steps? Each GPU trains on a subset of the data using what's referred to as an 'inner optimizer' (often AdamW). Then, after H steps (e.g., H = 100), the GPUs are synchronized using a second optimizer, referred to as the 'outer optimizer'. With this setup, the communication overhead is reduced by a factor of H, since we only synchronize every H steps. A visual depiction of DiLoCo is provided below.

Figure 2: DiLoCo training process
For more details on distributed training and DiLoCo, see this blog post and the DiLoCo paper.
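
As a rough illustration (not the reference implementation), one DiLoCo round on a single worker can be sketched in PyTorch as below. `get_batch` and `loss_fn` are placeholders, and in the real multi-worker setting the pseudo-gradients are averaged across workers before the outer step; the DiLoCo paper uses SGD with Nesterov momentum as the outer optimizer.

```python
import copy
import torch

def diloco_round(model, outer_opt, H, inner_lr, get_batch, loss_fn):
    """One DiLoCo round on a single worker (illustrative sketch only)."""
    snapshot = copy.deepcopy(model.state_dict())   # parameters before the inner loop
    inner_opt = torch.optim.AdamW(model.parameters(), lr=inner_lr)

    for _ in range(H):                             # H local steps, no communication
        x, y = get_batch()
        loss = loss_fn(model(x), y)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()

    # Outer step: treat the parameter delta as a pseudo-gradient.
    # (With multiple workers, this delta would first be averaged across them.)
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.grad = snapshot[name] - p
            p.copy_(snapshot[name])                # reset to the snapshot, then step
    outer_opt.step()
    outer_opt.zero_grad()
```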

Core idea of SPARTA

The core issue we're facing is that synchronizing the entire model over the internet is extremely slow. But why do we synchronize at all? What if the 2 GPUs just trained independently, each using half the data? In this setting we have 2 independent models training, and the communication cost is zero. However, the models diverge: when we eventually merge them (e.g., by averaging the 2 models), performance drops significantly, often resulting in gibberish output.

Here we have 2 extremes. If the models don't share any information during training, they diverge. If the models fully synchronize at every step, they converge (faster than single-node training), but the communication cost is prohibitive. To balance convergence and communication cost, what if we averaged only a small part of the model? The intuition is that if the 2 GPUs share enough parameters during training, they will not diverge, while still training with low communication requirements.

The idea of SPARTA (Sparse Parameter Averaging for Reduced-communication Training) is to reduce communication by sharing only a small portion of the model with other nodes. This keeps the models highly correlated while requiring little communication. Instead of synchronizing the entire model, we average only a small fraction of the model parameters, for example 0.1%. The 2 GPUs still communicate at every time step, but the amount of data is small enough to send over the internet.
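
A minimal sketch of this averaging step, simulating all replicas in a single process with PyTorch, is shown below. In a real deployment each node would send and receive only the selected slice, and the nodes would need a shared random seed so they agree on which indices to average; the actual implementation may differ.

```python
import torch

@torch.no_grad()
def sparse_average(models, fraction=0.001):
    """Average a random `fraction` of the parameters across all replicas (sketch)."""
    # All replicas share the same architecture, so parameters line up by position.
    for params in zip(*(m.parameters() for m in models)):
        numel = params[0].numel()
        k = max(1, int(numel * fraction))
        idx = torch.randperm(numel)[:k]            # indices averaged at this step
        flat = [p.view(-1) for p in params]
        avg = torch.stack([f[idx] for f in flat]).mean(dim=0)
        for f in flat:
            f[idx] = avg                           # write the averaged slice back in place
```

With fraction = 0.001, only about 0.1% of the model's parameters move between nodes per step.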

When only a small set of parameters is shared, the communication is dramatically reduced. In the case of only sharing 0.1% of the parameters, we achieve a 1000x reduction in data transferred during the training process. Below is an illustration of SPARTA.

Figure 3: SPARTA training process

There are 2 important notes on SPARTA:

1. Similar to DiLoCo, the ideas behind SPARTA come from federated learning and are adapted for this specific setting.
2. The two methods are not mutually exclusive, and some principles can be combined, as will be shown in the experiments.

Experiments

In this section, we provide initial results for SPARTA. The following experiments are carried out on NanoGPT, a 124M-parameter transformer (GPT-2 size). To evaluate performance, all the model replicas are averaged and the eval loss of the averaged model is computed. Note that with SPARTA the models are never fully averaged/synchronized during training; the full average is only used to evaluate the model during training.

There are various configurations of SPARTA. For example, there are multiple ways to sample the portion of parameters to be exchanged. In this section, we show results for random sampling: every GPU samples a random set of x% of its parameters, shares it with the others, and the shared parameters are averaged. This is the simplest version of SPARTA, labeled SPARTA-random. Below is the eval perplexity for SPARTA-random on 2 nodes, compared to a single node (baseline) and to 2 nodes fully synchronizing (data parallelism with 2x the batch size). We see that SPARTA performs comparably to full synchronization between the 2 nodes while requiring a fraction of the communication budget.

Eval Perplexity on NanoGPT
Figure 4: Evaluation perplexity comparison between baseline and SPARTA-random
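
To make the setup concrete, here is a hypothetical single-process training loop for SPARTA-random, reusing `sparse_average` from the sketch above. `get_batch_for` and `loss_fn` are placeholders; the `H` argument (synchronize every H steps, discussed later in the post) defaults to 1, i.e. sparse averaging at every step.

```python
import torch

def train_sparta(models, steps, fraction, lr, get_batch_for, loss_fn, H=1):
    """SPARTA-random on N simulated nodes (sketch, not the exact experimental code)."""
    opts = [torch.optim.AdamW(m.parameters(), lr=lr) for m in models]
    for step in range(steps):
        for node, (model, opt) in enumerate(zip(models, opts)):
            x, y = get_batch_for(node)     # each node trains on its own data shard
            loss = loss_fn(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if step % H == 0:                  # H = 1: exchange a sparse slice every step
            sparse_average(models, fraction)
```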

Now we test SPARTA while varying the number of GPUs. We show the baseline of training on a single node with AdamW as well as SPARTA on N nodes, with N = {2, 4, 8}. The percentage of parameters exchanged is 0.5%.

Eval Perplexity for Varying the Number of Nodes
Figure 5: Evaluation perplexity when varying the number of nodes

As mentioned, one parameter that can be tuned is x, the portion of the model parameters to be exchanged. When choosing the optimal value for x, we're balancing performance with communication cost. If 100% of the parameters are exchanged, we're essentially performing a version of full data parallel training, which results in high accuracy and faster convergence. If too few of the model parameters are shared, the models start to diverge and the eval loss goes up when they are merged.

Eval Perplexity for Varying the % of Parameters Exchanged
Figure 6: Impact of varying the percentage of parameters exchanged

One chart that helps explain what's happening shows the correlation of the model parameters. While we only synchronize a small set of parameters at a time, doing this repeatedly causes the correlation between the models to increase. So, while they are not the same model, they become highly correlated. When exchanging 0.5% of the parameters, the models reach a correlation of 0.9 after around 1,000 steps. However, beyond a certain point, it takes a long time for the models to become highly correlated, as can be seen for the purple line.

Model Correlations Over Training Steps
Figure 7: Model parameter correlations over training steps
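
The correlation shown above can be computed as the Pearson correlation between the flattened parameter vectors of two replicas; the sketch below is our reading of that metric, not necessarily the exact one used for the plot.

```python
import torch

@torch.no_grad()
def param_correlation(model_a, model_b):
    """Pearson correlation between the flattened parameters of two replicas."""
    a = torch.cat([p.view(-1) for p in model_a.parameters()])
    b = torch.cat([p.view(-1) for p in model_b.parameters()])
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / (a.norm() * b.norm() + 1e-12)
```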

One key observation is that the models quickly become highly correlated, suggesting that the percentage of parameters exchanged can be adjusted over time: a larger set of parameters at the start, decreasing later in training. Investigation of an optimized schedule is left to future work.

Just by exchanging a small percentage of the parameters, we see significant gains in performance. Exchanging 0.5% means a 200x reduction in communication, and 0.1% means a 1,000x reduction. To lower the amount of data transferred further, we can take an idea from the DiLoCo paper: instead of averaging the parameters at every time step, we do so every H steps. Unlike DiLoCo, we're not using an outer optimizer; it's simply a parameter average, which could also be done asynchronously.

Eval Perplexity for a 4 Nodes Cluster While Varying H
Figure 8: Impact of varying the number of steps between parameter exchanges (H)

For H = 20 or lower, we observe limited performance degradation. However, for higher values of H, training becomes unstable. Note that by exchanging 0.5% of the parameters every 20 steps, we reduce the amount of data transferred by 4,000x without any other compression techniques such as quantization.
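
In the training-loop sketch from earlier, this is just the `H` argument. A hypothetical call matching the setting above (0.5% of parameters, exchanged every 20 steps, i.e. roughly 0.005 / 20 = 1/4,000 of a full synchronization per step on average) would look like:

```python
# Hypothetical configuration for the 4,000x setting described above
# (step count and learning rate are illustrative, not the experimental values).
train_sparta(models, steps=10_000, fraction=0.005, lr=3e-4,
             get_batch_for=get_batch_for, loss_fn=loss_fn, H=20)
```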

Finally, we can combine SPARTA with DiLoCo. This means that during the inner optimization steps, we average a small portion of the model parameters, then every H steps, we perform an outer optimization step.
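
A sketch of this combination, again simulating all nodes in one process with placeholder data and loss functions, might look as follows: SPARTA's sparse averaging runs inside the inner loop, and a DiLoCo-style outer step is applied to the averaged pseudo-gradient every H steps.

```python
import copy
import torch

def train_sparta_diloco(models, rounds, H, fraction, inner_lr, outer_lr,
                        get_batch_for, loss_fn):
    """Sketch of combining SPARTA and DiLoCo (not the exact experimental code)."""
    inner_opts = [torch.optim.AdamW(m.parameters(), lr=inner_lr) for m in models]
    outer_opt = torch.optim.SGD(models[0].parameters(), lr=outer_lr,
                                momentum=0.9, nesterov=True)

    for _ in range(rounds):
        snapshot = copy.deepcopy(models[0].state_dict())    # common starting point
        for _ in range(H):                                  # inner loop
            for node, (model, opt) in enumerate(zip(models, inner_opts)):
                x, y = get_batch_for(node)
                loss = loss_fn(model(x), y)
                opt.zero_grad()
                loss.backward()
                opt.step()
            sparse_average(models, fraction)                # SPARTA inside the inner loop

        # DiLoCo-style outer step on the pseudo-gradient averaged across nodes.
        with torch.no_grad():
            replicas = [dict(m.named_parameters()) for m in models]
            for name, p in models[0].named_parameters():
                deltas = [snapshot[name] - rep[name] for rep in replicas]
                p.grad = torch.stack(deltas).mean(dim=0)
                p.copy_(snapshot[name])                     # reset before the outer step
        outer_opt.step()
        outer_opt.zero_grad()

        for m in models[1:]:                                # broadcast the new parameters
            m.load_state_dict(models[0].state_dict())
```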

Eval Loss for Combining DiLoCo and SPARTA
Figure 9: Performance comparison when combining DiLoCo and SPARTA

By combining SPARTA with DiLoCo, we notice that the performance improves. As shown above, SPARTA results in model parameters being highly correlated within a few hundred steps. We suspect this causes a performance gain for DiLoCo. Additionally, we suspect that by using SPARTA, we could increase the number of inner steps DiLoCo can take without performance suffering. However, this has not been explicitly investigated yet.

The above results are promising; however, we acknowledge that they are early. We welcome feedback and ideas to test. For transparency, we also share the following observations from our experiments:

Final thoughts

The above results use the simplest version of SPARTA, labeled SPARTA-random. However, the parameters do not have to be selected randomly; in some experiments, we already see improved performance using other sampling schemes such as layer-wise sampling. We limited the scope of this blog post to random sampling to demonstrate that even random sampling works. Additionally, we observed limited performance degradation when the averaging was performed in a ring topology, similar to a gossip algorithm. This greatly reduces communication overhead when scaling: a naive all-to-all exchange scales quadratically with the number of nodes, while a ring topology scales linearly.
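
For illustration, a gossip-style variant of the averaging step might look like the sketch below, where each node averages the selected slice only with its neighbour on a ring instead of with all nodes at once; the exact topology and weighting used in our experiments may differ.

```python
import torch

@torch.no_grad()
def ring_sparse_average(models, fraction=0.005):
    """Sparse averaging over a ring: node i averages only with node (i+1) % n (sketch)."""
    n = len(models)
    for params in zip(*(m.parameters() for m in models)):
        numel = params[0].numel()
        k = max(1, int(numel * fraction))
        idx = torch.randperm(numel)[:k]
        flat = [p.view(-1) for p in params]
        slices = [f[idx].clone() for f in flat]     # read all slices before writing
        for i in range(n):
            j = (i + 1) % n                         # right-hand neighbour on the ring
            flat[i][idx] = 0.5 * (slices[i] + slices[j])
```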

There are many variations of SPARTA, DiLoCo, and other algorithms yet to be developed. One method we're particularly excited about is DeMo, which, similar to SPARTA, exchanges only a small amount of data between nodes. DeMo is an optimizer that can be combined with SPARTA or used separately.

Next up, we are scaling variants of SPARTA to run on 48 Mac minis. Results coming soon. Join our Discord to stay updated.

To test different variations of these algorithms, we have developed EXO Gym, a tool to simulate distributed training algorithms and test configurations easily. With a few lines of code, we can, for example, test DiLoCo and see how it performs on different architectures or under different conditions. Read more about it here, and participate in our first competition for EXO Gym here.

