We recently received early access to 2 NVIDIA DGX Spark™ units. NVIDIA calls it the world's smallest AI supercomputer. It has ~100 TFLOPs of FP16 performance with 128GB of CPU-GPU coherent memory at 273 GB/s.
With EXO, we've been running LLMs on clusters of Apple Mac Studios with M3 Ultra chips. The Mac Studio has 512GB of unified memory at 819 GB/s, but the GPU only has ~26 TFLOPs of FP16 performance.
The DGX Spark has 4x the compute; the Mac Studio has 3x the memory bandwidth.
What if we combined them? What if we used DGX Spark for what it does best and Mac Studio for what it does best, in the same inference request?
What you see as a user boils down to two numbers: how long you wait for the first token (time to first token), and how fast tokens stream out after that (tokens per second).
Everything we do in the system exists to improve those two numbers. The reason they're hard to optimize together is that they're governed by two different phases of the same request: prefill and decode.
What's happening under the hood in those two phases, and why do they behave so differently?
Prefill processes the prompt and builds a KV cache for each transformer layer. The KV cache holds the key and value vectors computed for every token in the prompt.
These vectors are stored during prefill so we don't need to recompute them during decode.
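To make that concrete, here is a rough sketch of what the cache looks like, using Llama-3.1 8B-like dimensions (32 layers, 8 KV heads, head dimension 128). The variable names are ours, purely for illustration.

```python
import numpy as np

# Illustrative dimensions, roughly Llama-3.1 8B: 32 layers, 8 KV heads, head dim 128.
n_layers, n_kv_heads, head_dim = 32, 8, 128
prompt_len = 8192

# During prefill, every layer stores one key and one value vector per KV head
# per prompt token. Decode reuses these instead of recomputing them.
kv_cache = [
    {
        "keys":   np.zeros((prompt_len, n_kv_heads, head_dim), dtype=np.float16),
        "values": np.zeros((prompt_len, n_kv_heads, head_dim), dtype=np.float16),
    }
    for _ in range(n_layers)
]

# 2 (K and V) x layers x tokens x KV heads x head dim x 2 bytes (FP16)
total_bytes = 2 * n_layers * prompt_len * n_kv_heads * head_dim * 2
print(f"KV cache for a {prompt_len}-token prompt: ~{total_bytes / 1e9:.1f} GB")  # ~1.1 GB
```

At FP16 that works out to roughly 128 KB of KV per token, which is exactly what has to cross the network if decode runs on a different device than prefill.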
For large contexts, the amount of compute grows quadratically with the prompt length (Θ(s²)) since every token needs to attend to all the other tokens in the prompt.
The data moved also grows quadratically with the prompt length (Θ(s²)) because the s×s attention matrix itself has to be written to and read back from memory.
Both are quadratic, so the ratio between the compute and the data moved, i.e. the arithmetic intensity, is constant. However, this constant is usually very large, on the order of the hidden dimension of the model h (e.g. Llama-3.1 8B has h = 4096).
This means for large contexts, the arithmetic intensity of prefill is very large.
This makes prefill with large contexts compute-bound.
Decode is the auto‑regressive loop after prefill. Each step generates one token by attending against the entire KV cache built so far.
In decode, we are doing vector-matrix multiplications, which have much lower arithmetic intensity than matrix-matrix multiplications: each weight is read from memory only to perform a couple of FLOPs.
This makes decode memory-bound.
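A quick back-of-the-envelope comparison (our own simplification, looking at a single 4096×4096 weight matrix) shows why the two phases land on opposite sides of the roofline:

```python
# Rough arithmetic-intensity estimate for one h x h weight matrix in FP16.
# This is a simplification for intuition, not an exact kernel model.
h = 4096            # hidden dimension
s = 8192            # prompt length
bytes_per_val = 2   # FP16

weight_bytes = h * h * bytes_per_val

# Prefill: (s x h) @ (h x h) matrix-matrix multiply over the whole prompt.
prefill_flops = 2 * s * h * h
prefill_bytes = weight_bytes + 2 * s * h * bytes_per_val     # weights + in/out activations
print("prefill FLOPs/byte:", round(prefill_flops / prefill_bytes))   # ~1600

# Decode: (1 x h) @ (h x h) vector-matrix multiply for a single new token.
decode_flops = 2 * h * h
decode_bytes = weight_bytes + 2 * h * bytes_per_val
print("decode FLOPs/byte:", round(decode_flops / decode_bytes))      # ~1
```

At ~1,600 FLOPs per byte, prefill keeps the GPU's compute units busy; at ~1 FLOP per byte, decode speed is set almost entirely by how fast weights and KV cache can be streamed out of memory.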
Once you separate the phases, the hardware choice is clear: run compute-bound prefill on the DGX Spark and memory-bound decode on the M3 Ultra.
If you prefill on one device and decode on another, you must send the KV cache across the network. The naive approach is to run prefill, wait for it to finish, transfer the KV cache, then start decode.
This adds a communication cost between the two phases. If the transfer time is too large, you lose the benefit.
The KV cache doesn't have to arrive as one blob at the end. It can arrive layer by layer.
As soon as Layer 1's prefill completes, two things happen simultaneously. Layer 1's KV starts transferring to the M3 Ultra, and Layer 2's prefill begins on the DGX Spark. The communication for each layer overlaps with the computation of subsequent layers.
In practice, EXO transfers the KV vectors of a layer while the layer is being processed, since the KV vectors are computed before the heavy compute operations. To hide the communication overhead, we just need the layer processing time (tcomp) to be larger than the KV transfer time (tsend).
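Here is a minimal sketch of that overlap pattern (our own pseudocode, not EXO's implementation; `project_kv`, `attention_and_mlp`, and `send_kv` are hypothetical placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def prefill_with_kv_streaming(layers, hidden, send_kv):
    """Prefill layer by layer while streaming each layer's KV to the decode device.

    `layers`, the layer methods, and `send_kv` are illustrative placeholders.
    """
    sender = ThreadPoolExecutor(max_workers=1)  # one stream keeps transfers in order
    pending = []
    for i, layer in enumerate(layers):
        # The K/V projections are computed early and are cheap relative to
        # the attention + MLP work that follows.
        k, v = layer.project_kv(hidden)
        # Start the transfer now; it overlaps with this layer's remaining
        # compute and with all later layers' prefill.
        pending.append(sender.submit(send_kv, i, k, v))
        hidden = layer.attention_and_mlp(hidden, k, v)
    # Decode can only start on the other device once every layer's KV has arrived.
    for transfer in pending:
        transfer.result()
    return hidden
```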
The compute time is tcomp = F / P, where F is the FLOPs per layer and P is machine FLOPs/s. For large contexts, F scales quadratically: F ∼ c1s², where c1 is a model-dependent constant.
The transfer time is tsend = D / B, where D is the KV data in bits and B is the network bandwidth in bits/s. The KV cache has a constant number of vectors per token, so D ∼ q·c2·s, where q is the KV quantization width in bits (4, 8, etc.) and c2 is model-dependent.
To fully hide communication, we need the transfer time to be less than the compute time: tsend < tcomp. This is equivalent to P/B < F/D ∼ (c1/c2)·s/q. With the DGX Spark at 100 TFLOPs FP16 and a 10 GbE (10 Gbps) link between the DGX Spark and the M3 Ultra, the ratio P/B = 10,000 FLOPs per bit. This means we need s > 10,000·q/(c1/c2).
The constant K = c1/c2 depends on the attention architecture. For older models with multi-head attention (MHA) like Llama-2 7B, K = 2. For models with grouped query attention (GQA), K is larger: Llama-3 8B has K = 8, while Llama-3 70B and Qwen-2.5 72B have K = 16.
With 8-bit KV streaming and K = 16 (Llama-3 70B), the threshold is s > 5k tokens. For K = 8 (Llama-3 8B), it's s > 10k tokens. For K = 2 (Llama-2 7B), it's s > 40k tokens.
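Plugging in the numbers (our own arithmetic, restating the thresholds above):

```python
# Communication is fully hidden once s > (P / B) * q / K.
P = 100e12   # DGX Spark FP16 compute, FLOPs/s
B = 10e9     # 10 GbE link, bits/s
q = 8        # KV cache streamed at 8 bits per value

for model, K in [("Llama-2 7B (MHA)", 2),
                 ("Llama-3 8B (GQA)", 8),
                 ("Llama-3 70B / Qwen-2.5 72B (GQA)", 16)]:
    s_min = (P / B) * q / K
    print(f"{model}: KV streaming fully hidden for s > {s_min:,.0f} tokens")
# Llama-2 7B: 40,000   Llama-3 8B: 10,000   Llama-3 70B / Qwen-2.5 72B: 5,000
```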
Running Llama-3.1 8B (FP16) with an 8,192 token prompt and generating 32 tokens:
| Configuration | Prefill Time | Generation Time | Total Time | Speedup |
|---|---|---|---|---|
| DGX Spark | 1.47s | 2.87s | 4.34s | 1.5× |
| M3 Ultra Mac Studio | 5.57s | 0.85s | 6.42s | 1.0× (baseline) |
| DGX Spark + M3 Ultra | 1.47s | 0.85s | 2.32s | 2.8× |
The combined setup achieves the best of both worlds: DGX Spark's fast prefill (3.8× faster than the M3 Ultra) and the M3 Ultra's fast generation (3.4× faster than the DGX Spark), delivering a 2.8× overall speedup over the M3 Ultra alone.
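The totals and speedups in the table follow directly from the per-phase times (speedups are relative to the M3 Ultra baseline):

```python
# Per-phase times from the table above, in seconds.
configs = {
    "DGX Spark":            (1.47, 2.87),
    "M3 Ultra Mac Studio":  (5.57, 0.85),
    "DGX Spark + M3 Ultra": (1.47, 0.85),
}
baseline_total = sum(configs["M3 Ultra Mac Studio"])
for name, (prefill, generate) in configs.items():
    total = prefill + generate
    print(f"{name}: {total:.2f}s total, {baseline_total / total:.1f}x speedup")
# DGX Spark: 4.34s, 1.5x | M3 Ultra: 6.42s, 1.0x | Combined: 2.32s, 2.8x
```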
Disaggregated prefill and decode, layer-by-layer KV streaming, and hardware-aware phase placement are all automated in EXO.
When you start EXO, it automatically discovers all devices connected in your ad-hoc mesh network and profiles each for compute throughput, memory bandwidth, memory capacity, and network characteristics.
Given a model and your topology, EXO plans which device should handle prefill, which should handle decode, whether to pipeline across layers, when to stream KV, and how to adapt if network conditions change. You don't write the schedule. You don't compute the thresholds. You just run the model, and EXO figures out how to make your heterogeneous cluster fast.
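To give a feel for what that planning step decides, here is a deliberately toy heuristic. None of these names or signatures are EXO's actual API; it is a sketch of the decision, not the implementation.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    tflops_fp16: float   # compute throughput
    mem_bw_gbps: float   # memory bandwidth, GB/s

def place_phases(devices, link_gbps, kv_bits, K, prompt_len):
    """Toy phase placement: compute-bound prefill goes to the FLOPs-richest
    device, memory-bound decode to the bandwidth-richest one, and we check
    whether the prompt is long enough to fully hide KV streaming."""
    prefill_dev = max(devices, key=lambda d: d.tflops_fp16)
    decode_dev = max(devices, key=lambda d: d.mem_bw_gbps)
    p_over_b = (prefill_dev.tflops_fp16 * 1e12) / (link_gbps * 1e9)  # FLOPs per bit
    return {
        "prefill": prefill_dev.name,
        "decode": decode_dev.name,
        "kv_transfer_fully_hidden": prompt_len > p_over_b * kv_bits / K,
    }

plan = place_phases(
    [Device("DGX Spark", 100, 273), Device("M3 Ultra", 26, 819)],
    link_gbps=10, kv_bits=8, K=8, prompt_len=16_384,
)
print(plan)
# {'prefill': 'DGX Spark', 'decode': 'M3 Ultra', 'kv_transfer_fully_hidden': True}
```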
Inference is no longer constrained by what one box can do, but by what your whole cluster can do together.