Frontier AI doesn't have to run in a datacenter. Today it mostly does, but we believe that's a transient state. So we decided to try something: getting Llama running on a Windows 98 Pentium II machine.
If it runs on 25-year-old hardware, then it runs anywhere.
The code is open source and available at llama98.c. Here's how we did it.
First, we needed the machine itself. We found a Windows 98 Pentium II for £118.88 on eBay.
Getting it to work with modern peripherals was the first challenge - none of our USB keyboards or mice worked. The solution was going back to PS/2 peripherals, with one catch: the mouse had to go in port 1 and the keyboard in port 2. The reverse configuration didn't work.
The next challenge was getting files onto the machine. We needed to transfer model weights, tokenizer configs, and inference code, and every modern solution we tried failed.
What worked in the end was good old FTP. It turns out FTP has stayed backwards compatible all these years. We ran FileZilla FTP server on our M4 MacBook Pro, connected it to the Windows 98 machine via Ethernet (using a USB-C to Ethernet adapter), set up static IPs, and could transfer files directly from the command line.
After setting up the network configuration, we needed to verify the connection. A simple ping test confirmed that the machines could talk to each other:
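From the Windows 98 side it looked something like this (the static IPs below are illustrative, not the exact addresses we used):

```
C:\> ping 192.168.1.2

Pinging 192.168.1.2 with 32 bytes of data:
Reply from 192.168.1.2: bytes=32 time<10ms TTL=64
```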
With the network connection established, we could finally transfer files using FTP. One critical gotcha: executables wouldn't run until we discovered they needed to be transferred in binary mode. The fix was simple - just type "binary" in the FTP CLI:
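A minimal session looks like this (IP and filename are illustrative):

```
C:\> ftp 192.168.1.2
ftp> binary
200 Type set to I.
ftp> get llama98.exe
ftp> bye
```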
Getting modern code to compile for Windows 98 was tricky. We first tried MinGW, which supposedly could compile modern C++ for Windows 98 and the Pentium II. That turned into a dead end - possibly due to CMOV instructions, which aren't supported pre-Pentium Pro.
Instead, we went old school: Borland C++ 5.02, a 26-year-old IDE and compiler that ran directly on Windows 98. The only gotcha was that it supports a very old version of C/C++. Modern C++ was out of the question, but C has changed surprisingly little over the decades. The biggest revision of C landed in 1999 (C99), so unfortunately we just missed it. The main limitation of this older dialect was no "declare anywhere" variables - everything had to be declared at the top of a block.
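For example, this is what C89 forces on you (a toy illustration, not code from llama98.c):

```c
/* C89: every declaration comes before the first statement in a block,
   so no "for (int i = ...)" and no declaring variables mid-function. */
float sum_logits(const float *logits, int n) {
    int i;       /* declared up front */
    float sum;
    sum = 0.0f;
    for (i = 0; i < n; i++) {
        sum += logits[i];
    }
    return sum;
}
```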
This led us to Andrej Karpathy's llama2.c - 700 lines of pure C that can run inference on models with the Llama 2 architecture. Perfect. But it still needed some tweaks to build and run on Windows 98 on a Pentium II.
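As one example of the kind of change involved - a sketch under our assumptions, not the exact diff: llama2.c memory-maps the checkpoint with POSIX mmap, which Borland C++ 5.02 on Windows 98 doesn't offer, so weight loading has to fall back to plain stdio:

```c
/* Sketch: read the whole checkpoint with stdio instead of mmap.
   Illustrative only - see the llama98.c repo for the real changes. */
#include <stdio.h>
#include <stdlib.h>

float *load_checkpoint(const char *path, long *nbytes) {
    FILE *f;
    float *data;
    f = fopen(path, "rb");   /* "rb": binary mode matters on Windows */
    if (f == NULL) return NULL;
    fseek(f, 0, SEEK_END);   /* measure the file */
    *nbytes = ftell(f);
    fseek(f, 0, SEEK_SET);
    data = (float *)malloc((size_t)*nbytes);
    if (data == NULL) { fclose(f); return NULL; }
    if ((long)fread(data, 1, (size_t)*nbytes, f) != *nbytes) {
        free(data);
        fclose(f);
        return NULL;
    }
    fclose(f);
    return data;
}
```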
Finally got it working! Here's what we achieved, running entirely on the Pentium II CPU - no GPU required:
| Model | Parameters | Tokens/second |
|---|---|---|
| stories260K | 260K | 39.31 |
| stories15M | 15M | 1.03 |
| Llama 3.2 1B* | 1B | 0.0093 |
* The Llama 3.2 1B result is extrapolated: we benchmarked a shard of the model that fits into memory and combined that with disk read benchmarks. We're extending llama98.c with offloading so it can run larger models end to end and test this for real.
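The offloading idea, roughly - a hypothetical sketch, not the shipped implementation: keep a single layer-sized buffer in RAM and stream each layer's weights from disk as the forward pass reaches it:

```c
/* Hypothetical sketch of disk offloading: reuse one buffer and pull
   layer l's weights from the checkpoint file on demand. */
#include <stdio.h>

int load_layer(FILE *ckpt, long header_bytes, long layer_bytes,
               int l, float *buf) {
    long offset = header_bytes + (long)l * layer_bytes;
    if (fseek(ckpt, offset, SEEK_SET) != 0) return -1;
    if ((long)fread(buf, 1, (size_t)layer_bytes, ckpt) != layer_bytes) return -1;
    return 0;
}
```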
Not exactly ChatGPT speeds, but getting any modern AI model running on 25-year-old CPU hardware is a significant step toward our mission. Special thanks to Chris Wellons' excellent blog post on getting C++ working on Windows 98, which helped make this possible.
BitNet is a promising direction for a world where frontier models really can run on any hardware. It's a transformer architecture that uses ternary weights: each weight can only be -1, 0, or +1, requiring just 1.58 bits per weight (log₂(3) ≈ 1.58). This simple change has massive implications:
All the usual matrix multiplications turn into a bunch of additions and subtractions, since multiplying by 0 means skipping the term, multiplying by +1 is an addition, and multiplying by -1 is a subtraction.
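A minimal sketch of that inner loop in C (one weight per byte here for clarity; a real BitNet packs weights far more tightly):

```c
/* Ternary matrix-vector product: weights in {-1, 0, +1} mean the inner
   loop needs no multiplications at all. */
void ternary_matvec(const signed char *W, const float *x, float *y,
                    int rows, int cols) {
    int i, j;
    for (i = 0; i < rows; i++) {
        float acc;
        signed char w;
        acc = 0.0f;
        for (j = 0; j < cols; j++) {
            w = W[i * cols + j];
            if (w == 1)       acc += x[j];   /* +1 -> addition */
            else if (w == -1) acc -= x[j];   /* -1 -> subtraction */
            /* 0 -> skip the term entirely */
        }
        y[i] = acc;
    }
}
```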
The advantages are striking: weights take roughly a tenth of the memory of 16-bit formats, and the hot loop of inference needs additions instead of multiplications.
At EXO, we've been working on ternary models for a while. In April 2024, we released MLX-BitNet for running BitNet models efficiently on Apple Silicon. We presented the first implementation of a BitNet for protein language modeling at ICML 2024 and are working on a larger BitNet model for protein modeling.
While there isn't a large open-source BitNet model available yet, we believe ternary models are the future of accessible AI. We're planning to train one in 2025.
We want to see more efforts focused on running AI models on older hardware. There's significant engineering work to be done here - from optimizing memory usage to exploring new architectures that can run efficiently on limited hardware.
If you're interested in running models on old hardware - an old Mac, a Game Boy, a Motorola phone, or even an old Raspberry Pi - check out the code and join our Discord #retro channel. The future of AI doesn't have to be locked in massive datacenters - it can run right on the hardware you already have.