Running DeepSeek V3 671B on M4 Mac Mini Cluster - 12 days of EXO

Day 2 - 12 days of EXO

We just got the biggest open-source model running on Apple Silicon.

Without further ado, here are the results running DeepSeek V3 (671B) on an 8 x M4 Pro 64GB Mac Mini cluster (512GB total memory):

| Model | Time-To-First-Token (TTFT) in seconds | Tokens-Per-Second (TPS) |
| --- | --- | --- |
| DeepSeek V3 671B (4-bit) | 2.91 | 5.37 |
| Llama 3.1 405B (4-bit) | 29.71 | 0.88 |
| Llama 3.3 70B (4-bit) | 3.14 | 3.89 |

Wait, DeepSeek has 671B parameters and runs faster than Llama 70B?

Yes!

Let me explain…

Understanding LLM Inference: A Systems View

Note: As DeepSeek V3 just came out today, we had our hands full getting it to work in EXO, so this blog post is still incomplete. If you'd like to read about how EXO splits models up into shards (relevant to what follows), see Day 1. Otherwise, if you're happy reading an unfinished blog post, go right ahead.

Let's take a systems view of what happens when we run an LLM. That way we don't get bogged down in the details, and we'll be able to understand exactly why we got these results.

At its core, an LLM is a massive collection of parameters - billions of floating point numbers that define the model's behavior. LLMs are what's known as "autoregressive" models: each generated token depends on the tokens before it, making the model inherently sequential. For each token, a bunch of computations (matrix multiplications and non-linear operations) are performed using the model parameters. These computations are performed on whatever device gives us the most floating point operations per second, which today is usually a GPU.

Here's the key insight: in standard LLM architectures, generating each token requires accessing every parameter in the model. This means for each token, we need to:

  1. Load the model parameters into the GPU
  2. Perform floating point operations using these parameters
  3. Sample the next token based on the output
  4. Repeat the process, feeding the token generated in step 3 back into the model
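
In code, the loop looks roughly like the minimal sketch below. This is not EXO's actual implementation - `forward` and `sample` are hypothetical stand-ins for the real model and sampler:

```python
import random

VOCAB_SIZE = 32000  # illustrative vocabulary size

def forward(params, tokens):
    # Stand-in for the real model. In reality this is where steps 1 and 2
    # happen: every parameter is read from memory and used in matmuls that
    # produce logits over the vocabulary for the next token.
    return [random.random() for _ in range(VOCAB_SIZE)]

def sample(logits):
    # Step 3: pick the next token (greedy argmax here for simplicity).
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(params, prompt_tokens, max_new_tokens=16):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = forward(params, tokens)  # steps 1 + 2
        next_token = sample(logits)       # step 3
        tokens.append(next_token)         # step 4: feed it back in and repeat
    return tokens

print(generate(params=None, prompt_tokens=[1, 2, 3]))
```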

Steps 1 and 2 take a lot longer than steps 3 and 4, so let's focus on steps 1 and 2.

How long does step 1 take? It takes as long as it takes to move all the parameters to the GPU.
How long does step 2 take? It takes as long as it takes to perform all the floating point operations on those parameters.

For step 1 to be fast, we need to be able to move the parameters quickly to the compute device. We want to keep all the parameters as "hot" as possible so we can move them quickly to the GPU for processing. The best option we have here is to keep the model parameters in GPU memory.

For step 2 to be fast, we compute on the GPU, which can perform trillions of floating point operations per second on some chips. The more powerful the GPU, the faster we can compute.
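
As a back-of-envelope sketch of those two times, here's the arithmetic for a dense 70B model at 4-bit on a single chip. The numbers are illustrative spec-sheet values (M4 Max, discussed further below), not measurements from our cluster:

```python
params = 70e9              # dense 70B model
bytes_per_param = 0.5      # 4-bit quantised weights
bandwidth = 546e9          # bytes/s of memory bandwidth (M4 Max spec, illustrative)
flops = 34e12              # fp16 floating-point ops/s (M4 Max, approximate)

step1 = params * bytes_per_param / bandwidth  # time to move every parameter once
step2 = params * 2 / flops                    # ~2 FLOPs per parameter per token (assumption)

print(f"step 1 (move parameters): {step1 * 1000:.0f} ms per token")  # ~64 ms
print(f"step 2 (compute):         {step2 * 1000:.1f} ms per token")  # ~4 ms
```

With these illustrative numbers, moving the parameters takes over ten times longer than computing on them - exactly the imbalance the rest of this section formalises.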

Therefore we have two potential bottlenecks: moving the parameters to the GPU, and computing on them once they arrive.

Whether inference is memory-bandwidth-bound or compute-bound depends on the ratio of two numbers:

First, C: The rate at which parameters are computed on

C = Parameters/second computed = (FLOPs/second) ÷ (FLOPs/parameter)

FLOPs/second is the total number of floating-point operations the GPU can perform per second.

FLOPs/parameter (sometimes called parameter compute density) is the number of floating-point operations performed per parameter.

And

M: The rate at which parameters are transferred to the GPU

M = Parameters/second transferred = Memory bandwidth ÷ (Bytes/parameter)

Memory bandwidth is how many bytes are pushed from memory to the GPU each second.

Bytes/parameter is determined by the model's numerical precision (e.g., half-precision FP16 uses 2 bytes per parameter).

If

C/M > 1

then we're memory-bandwidth-bound. If

C/M < 1

then we're compute-bound.

This relationship changes with batch size. For batch_size=1 (generating one sequence at a time), the parameter compute density is low, making inference typically memory-bandwidth-bound.

For batch_size=N, the parameters per second computed (C) scales as 1/N, but the parameters per second transferred to the compute device (M) does not change. This is because we're now computing on multiple sequences at a time, so there's more computation to do per parameter of the model. As N grows, C/M shrinks and inference eventually becomes compute-bound.

This is the reason why pre-training is typically compute-bound.
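
Here's the C/M ratio as a small calculator, again using illustrative single-chip numbers (M4 Max spec sheet, 4-bit weights) and the assumption of roughly 2 FLOPs per parameter per token in the batch:

```python
FLOPS = 34e12          # floating-point ops per second (fp16, illustrative)
BANDWIDTH = 546e9      # bytes per second of memory bandwidth (illustrative)
BYTES_PER_PARAM = 0.5  # 4-bit quantised weights
FLOPS_PER_PARAM = 2    # ~1 multiply + 1 add per parameter per token (assumption)

def c_over_m(batch_size):
    C = FLOPS / (FLOPS_PER_PARAM * batch_size)  # parameters/second we can compute on
    M = BANDWIDTH / BYTES_PER_PARAM             # parameters/second we can transfer
    return C / M

for batch_size in (1, 4, 16, 64):
    ratio = c_over_m(batch_size)
    bound = "memory-bandwidth-bound" if ratio > 1 else "compute-bound"
    print(f"batch_size={batch_size:>2}: C/M = {ratio:5.2f} -> {bound}")
```

With these numbers the crossover sits around batch_size ≈ 16; below that, the chip spends most of each token waiting on memory rather than computing.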

Understanding these fundamentals helps explain why LLM inference optimization is challenging and provides context for more advanced techniques that work to overcome these basic constraints.

For example, techniques like speculative decoding enable us to turn batch_size=1 inference into something that looks more like batch_size=N inference. A smaller model that approximates the larger model drafts several tokens ahead, and the larger model then verifies those tokens in parallel, so each pass over its parameters produces more than one token.
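
Below is a minimal sketch of the greedy variant of speculative decoding. The two "models" are toy stand-ins (the real technique uses a small draft LLM and the full target LLM); the point is that the big model checks several drafted tokens in a single pass over its parameters:

```python
def target_model(seq):
    # Toy stand-in for the big model: one "forward pass" returns its greedy
    # next-token prediction at every position of the sequence.
    return [(t + 1) % 100 for t in seq]

def draft_model(seq):
    # Toy stand-in for the small draft model: a cheap next-token guess.
    return (seq[-1] + 1) % 100

def speculative_step(tokens, k=4):
    # 1. Draft k tokens with the small model (fast, but sequential).
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_model(draft))
    proposed = draft[len(tokens):]

    # 2. Verify all k drafted tokens with ONE pass of the big model.
    preds = target_model(draft)  # preds[i] = big model's next token after draft[:i + 1]

    # 3. Accept drafted tokens until the first disagreement.
    accepted = []
    for i, tok in enumerate(proposed):
        expected = preds[len(tokens) + i - 1]
        if tok == expected:
            accepted.append(tok)
        else:
            accepted.append(expected)  # replace the first mismatch and stop
            break
    else:
        accepted.append(preds[-1])     # all accepted: take the big model's bonus token

    return tokens + accepted

print(speculative_step([1, 2, 3]))  # -> [1, 2, 3, 4, 5, 6, 7, 8]
```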

Apple Silicon

Apple Silicon is actually very good at running LLMs with batch_size=1. Why is that? Two reasons:

  1. Apple Silicon uses unified memory: up to 192GB of memory on a single chip, accessible to the GPU at high bandwidth
  2. The ratio of memory bandwidth to FLOPS is very high, particularly on the latest M4 chips. For example, the M4 Max has 546GB/s of memory bandwidth and ~34 TFLOPS (fp16), roughly 16GB/s of bandwidth per TFLOP, whereas an NVIDIA RTX 4090 has 1008GB/s of memory bandwidth and ~330 TFLOPS (fp16), roughly 3GB/s per TFLOP.

For these reasons, Apple Silicon is much better suited to running large models with batch_size=1: the hardware's bandwidth-to-compute ratio closely matches the ratio the workload demands. Other kinds of workloads, like pre-training, have a ratio that better matches NVIDIA hardware.
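
A quick back-of-envelope with the spec-sheet numbers quoted above, expressed as bandwidth per unit of fp16 compute:

```python
m4_max = 546 / 34      # GB/s of memory bandwidth per TFLOP (fp16), ~16
rtx_4090 = 1008 / 330  # GB/s of memory bandwidth per TFLOP (fp16), ~3
print(f"M4 Max:   {m4_max:.1f} GB/s per TFLOP")
print(f"RTX 4090: {rtx_4090:.1f} GB/s per TFLOP")
print(f"ratio:    {m4_max / rtx_4090:.1f}x")  # ~5x more bandwidth per unit of compute
```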

Mixture-of-Experts (MoE) Models

This gives us the foundation to understand Mixture-of-Experts (MoE) models, the architecture DeepSeek V3 671B uses.

Again, taking a systems view, a Mixture-of-Experts model is simply an LLM that only uses a subset of its parameters each time we run inference. However, we don't know ahead of time which subset of parameters is going to be used. Therefore, we still need to keep all the parameters "hot" and ready to be sent to the GPU, but only a small fraction of them will actually be sent there for computation.

Apple devices are great in this set-up because they have a lot of memory. We can simply load all 671B parameters across a cluster of Macs, and the actual inference will only activate a small subset of them.

In the case of Llama 70B, the model is dense, which means we need to compute on all 70B parameters to generate every token. In the case of DeepSeek V3, only ~37B parameters are active per token, but as mentioned, we don't know in advance which subset that will be. This means that as long as we can hold all the parameters in memory, we can generate each token of a single request faster.
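
Here's a toy sketch of the routing idea with made-up sizes (8 experts, top-2 routing; DeepSeek V3's real configuration is much larger and also includes shared experts). The point is that all expert weights sit in memory, but only the chosen experts' weights are read for a given token:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2  # toy sizes, not DeepSeek V3's
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    # The router scores every expert, but only the top-k are actually used,
    # so only top_k / n_experts of the expert parameters are read per token.
    scores = x @ router
    chosen = np.argsort(scores)[-top_k:]
    weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
    # The other experts stay resident in memory ("hot") but untouched.
    return sum(w * (x @ experts[e]) for w, e in zip(weights, chosen))

print(moe_layer(rng.standard_normal(d_model)).shape)  # (16,)
```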

