EXO began with a simple goal: to speed up AI research experiments using whatever devices a few university students had on hand. We wanted to run and train larger models but faced limitations from single-device memory and FLOPS. After months of development, we built the software infrastructure to make this possible.
The most common questions we get are about benchmarking. For example: "How many tokens per second can I get if I connect 3 Macs and run Llama 70B?" Today, we're thrilled to share an automated benchmarking suite and website offering detailed performance data for running LLMs on consumer hardware. These are transparent benchmarks on real devices, not estimates.
At its core, EXO uses a technique called pipeline parallel inference. We take an LLM and split it into multiple "shards" - essentially contiguous slices of the model's layers. Each shard is assigned to a different device. These devices can be different accelerators on the same machine (a GPU, CPU, or even an NPU) or physically separate machines connected over a network.
In the animation above, you can see how a 9-layer model gets divided into 3 shards across 3 devices:
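For a concrete (if simplified) picture of that split, here's a minimal sketch of how contiguous layer ranges could be computed. This is illustrative Python, not EXO's actual code:

```python
# Minimal sketch of contiguous layer sharding (illustrative only, not EXO's API).
# A "shard" is just a contiguous range of layer indices assigned to one device.

def partition_layers(n_layers: int, n_devices: int) -> list[range]:
    """Split n_layers into n_devices contiguous shards, as evenly as possible."""
    base, extra = divmod(n_layers, n_devices)
    shards, start = [], 0
    for i in range(n_devices):
        size = base + (1 if i < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards

# 9 layers across 3 devices -> layers 0-2, 3-5, 6-8
print(partition_layers(9, 3))  # [range(0, 3), range(3, 6), range(6, 9)]
```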
Since each device only holds its own shard of the model, we can fit larger models fully into GPU memory across multiple devices, even when they would not fit on any single device.
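As a rough back-of-the-envelope check, you can estimate how many devices a given model needs. The sketch below counts weights only and ignores KV cache and runtime overhead, so treat it as a lower bound; the numbers are illustrative, not EXO requirements:

```python
import math

def min_devices(n_params: float, bytes_per_param: float, device_mem_gb: float) -> int:
    """Smallest number of devices whose combined memory holds the weights alone."""
    model_gb = n_params * bytes_per_param / 1e9
    return math.ceil(model_gb / device_mem_gb)

# Example: a 70B-parameter model at 2 bytes/param (~140 GB of weights)
# spread across 24 GB devices needs at least 6 of them.
print(min_devices(70e9, 2, 24))  # 6
```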
Each device processes its assigned layers sequentially, and the activations flow naturally to the next device. A request flows through these shards in order (in this case, 4 shards):
When Shard A finishes processing its layers, it produces an activation that gets passed to Shard B over whatever network connection is available. In general these activations are quite small - for Llama 3.2 3B they are less than 4KB - and they scale approximately linearly with the size of the layers. The bottleneck is therefore usually the latency between devices, not the bandwidth (a common misconception).
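To see why latency dominates, consider a rough calculation. The link speed and per-hop latency below are assumptions for illustration, not measurements from our clusters:

```python
# Rough illustration of why latency, not bandwidth, dominates (hypothetical link numbers).
activation_bytes = 4 * 1024  # per-token activation, roughly the Llama 3.2 3B figure above
link_gbps = 1.0              # assumed 1 Gbps Ethernet link
latency_ms = 1.0             # assumed ~1 ms per-hop latency on a typical LAN/Wi-Fi

transfer_ms = activation_bytes * 8 / (link_gbps * 1e9) * 1e3
print(f"transfer: {transfer_ms:.3f} ms, latency: {latency_ms:.3f} ms")
# transfer: 0.033 ms, latency: 1.000 ms -> the hop is dominated ~30x by latency
```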
This process is inherently sequential. With a single request, you're only using one device at a time - the others sit idle waiting for their turn. This means single-request performance is actually worse than running on a single device (assuming you have enough memory to do so). However, the magic happens when you process multiple requests concurrently:
When the first request arrives, the first device starts processing it immediately. Shortly after, a second request arrives. Since the first device has already finished processing the first request and passed its output to the second device, it is free to start on the second request straight away. This delicate balance of processing and passing outputs (known as scheduling) continues until the final device is reached and a token is output. In this case the average device utilisation is 50%, since 50% of the devices are active at any given time.
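A toy steady-state model makes these utilisation numbers concrete. The sketch assumes each shard takes one time step per request and requests are perfectly staggered - an idealisation, not how EXO's scheduler actually behaves:

```python
def utilisation(n_shards: int, n_concurrent_requests: int) -> float:
    """Steady-state fraction of shards busy when requests are perfectly staggered."""
    return min(n_concurrent_requests, n_shards) / n_shards

print(utilisation(4, 1))  # 0.25 - a single request keeps only one shard busy at a time
print(utilisation(4, 2))  # 0.5  - the 50% utilisation described above
print(utilisation(4, 4))  # 1.0  - enough concurrent requests saturate the pipeline
```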
Let's look at what this means in practice. As an example, we tested EXO with LLaMA 3.2 3B across different M4 Pro configurations. Here's what we found:
Setup | Model | Single-request TPS | Multi-request TPS |
---|---|---|---|
Single M4 Pro (24GB) | LLaMA 3.2 3B | 49.3 | 49.3 |
2x M4 Pro Cluster | LLaMA 3.2 3B | 44.4 | 95.7 |
3x M4 Pro Cluster | LLaMA 3.2 3B | 39.7 | 108.8 |
These numbers tell an interesting story. For single requests, adding more devices actually hurts performance - dropping from 49.3 TPS on a single device to 39.7 TPS on three devices. This makes sense: we're adding network overhead without getting any parallelism benefit.
But look at the multi-request numbers! When handling multiple concurrent requests, we see something closer to linear scaling. Each device can work on a different request simultaneously, so a 3-device cluster processes 2.2x as many tokens as a single device (108.8 TPS vs 49.3 TPS). The theoretical maximum is 3x, and there's still plenty of work to be done to close that gap.
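For reference, the scaling efficiency of the 3-device cluster works out as follows, computed directly from the table above:

```python
single_tps, cluster_tps, n_devices = 49.3, 108.8, 3

speedup = cluster_tps / single_tps  # ~2.21x over a single M4 Pro
efficiency = speedup / n_devices    # ~74% of the theoretical 3x
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```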
Why is this helpful? Often, the thing you really care about is the total throughput you can get out of your cluster. For example, batch workloads like summarising a set of documents don't need individual requests to be fast - they need the whole batch processed in the least amount of time, which depends on total throughput. Likewise, search-based reasoning and agent workloads can parallelise inference requests and are bottlenecked by total throughput.
This is a setup with homogeneous hardware, consistent network latency, and same-sized requests. We're excited to see how EXO performs on heterogeneous setups, which require a more sophisticated scheduler to keep device utilisation high.
With our automated testing suite, we can deliver up-to-date benchmarks for a diverse set of configurations. We're continuously expanding our benchmarking suite to include more heterogeneous setups - mixing different devices and even different model quantizations. These varied configurations present interesting scheduling challenges that we're actively working to optimize.
One particular extension we're exploring is 'Tensor Parallelism', which could improve single-request throughput in cases where pipeline parallelism doesn't help (such as when the model fits in memory, as seen above). To learn more about Tensor Parallelism and join the discussion, visit our Discord community.
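As a conceptual illustration of the idea (not EXO's implementation), tensor parallelism splits the work within a single layer rather than between layers. Here a weight matrix is split column-wise across two hypothetical devices and the partial outputs are concatenated; the dimensions are arbitrary:

```python
import numpy as np

x = np.random.randn(1, 4096)       # one token's activation (hypothetical hidden size)
W = np.random.randn(4096, 8192)    # full weight matrix of a single linear layer

W_a, W_b = np.split(W, 2, axis=1)  # "device A" and "device B" each hold half the columns
out_a = x @ W_a                    # computed on device A
out_b = x @ W_b                    # computed on device B

out = np.concatenate([out_a, out_b], axis=1)
assert np.allclose(out, x @ W)     # identical result to the unsharded layer
```

Because both halves of the layer run at the same time, a single request can keep multiple devices busy - at the cost of a communication step inside every layer rather than one per shard boundary.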
Have specific hardware or models you'd like to see benchmarked? We're actively expanding our test suite and would love to hear what configurations would be most valuable to you.
Request a New Benchmark