Frontier AI doesn't have to run in a datacenter. Today it mostly does, but we believe that's a transient state. So we decided to try something: getting Llama running on a Windows 98 Pentium II machine.
If it runs on 25-year-old hardware, then it runs anywhere.
The code is open source and available at llama98.c. Here's how we did it.
First, we needed the machine itself. We found a Windows 98 Pentium II for £118.88 on eBay.
Getting it to work with modern peripherals was the first challenge - none of our USB keyboards or mice worked. The solution was going back to PS/2 peripherals, with one catch: the mouse had to go in port 1 and the keyboard in port 2. The reverse configuration didn't work.
The next challenge was getting files onto the machine. We needed to transfer model weights, tokenizer configs, and inference code, and every modern solution we tried failed.
What worked in the end was good old FTP. It turns out FTP has stayed backwards compatible all these years. We ran FileZilla FTP server on our M4 MacBook Pro, connected it to the Windows 98 machine via Ethernet (using a USB-C to Ethernet adapter), set up static IPs, and could transfer files directly from the command line.
After setting up the network configuration, we needed to verify the connection. A simple ping test confirmed that the machines could talk to each other:
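From the Windows 98 side it looked something like this (the static IPs below are illustrative, not the exact addresses we used):

```
C:\> ping 192.168.1.2

Pinging 192.168.1.2 with 32 bytes of data:
Reply from 192.168.1.2: bytes=32 time<10ms TTL=64
```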
With the network connection established, we could finally transfer files using FTP. One critical gotcha: executables wouldn't run until we discovered they needed to be transferred in binary mode. The fix was simple - just type "binary" in the FTP CLI:
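A minimal session looks like this (IP and filename are illustrative):

```
C:\> ftp 192.168.1.2
ftp> binary
200 Type set to I.
ftp> get llama98.exe
ftp> bye
```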
Getting modern code to compile for Windows 98 was tricky. We first tried MinGW, which supposedly could compile modern C++ for Windows 98 and the Pentium II. That turned into a dead end - possibly due to CMOV instructions, which aren't supported pre-Pentium Pro.
Instead, we went old school: Borland C++ 5.02, a 26-year-old IDE and compiler that ran directly on Windows 98. The only gotcha was that it supports a very old version of C/C++. Modern C++ was out of the question, but C has changed surprisingly little over the decades. The biggest revision of C landed in 1999 (C99), so unfortunately we just missed it. The main limitation of this older dialect was no "declare anywhere" variables - everything had to be declared at the top of a block.
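For example, this is what C89 forces on you (a toy illustration, not code from llama98.c):

```c
/* C89: every declaration comes before the first statement in a block,
   so no "for (int i = ...)" and no declaring variables mid-function. */
float sum_logits(const float *logits, int n) {
    int i;       /* declared up front */
    float sum;
    sum = 0.0f;
    for (i = 0; i < n; i++) {
        sum += logits[i];
    }
    return sum;
}
```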
This led us to Andrej Karpathy's llama2.c - 700 lines of pure C that can run inference on models with the Llama 2 architecture. Perfect. But it still needed some tweaks to build and run on Windows 98 on a Pentium II.
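As one example of the kind of change involved - a sketch under our assumptions, not the exact diff: llama2.c memory-maps the checkpoint with POSIX mmap, which Borland C++ 5.02 on Windows 98 doesn't offer, so weight loading has to fall back to plain stdio:

```c
/* Sketch: read the whole checkpoint with stdio instead of mmap.
   Illustrative only - see the llama98.c repo for the real changes. */
#include <stdio.h>
#include <stdlib.h>

float *load_checkpoint(const char *path, long *nbytes) {
    FILE *f;
    float *data;
    f = fopen(path, "rb");   /* "rb": binary mode matters on Windows */
    if (f == NULL) return NULL;
    fseek(f, 0, SEEK_END);   /* measure the file */
    *nbytes = ftell(f);
    fseek(f, 0, SEEK_SET);
    data = (float *)malloc((size_t)*nbytes);
    if (data == NULL) { fclose(f); return NULL; }
    if ((long)fread(data, 1, (size_t)*nbytes, f) != *nbytes) {
        free(data);
        fclose(f);
        return NULL;
    }
    fclose(f);
    return data;
}
```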
Finally got it working! Here's what we achieved, running entirely on the Pentium II CPU - no GPU required:
| Model | Parameters | Tokens/second |
|---|---|---|
| stories260K | 260K | 39.31 |
| stories15M | 15M | 1.03 |
| Llama 3.2 1B* | 1B | 0.0093 |
* The Llama 3.2 1B result is extrapolated: we benchmarked a shard of the model that fits into memory and combined that with disk read benchmarks. We're extending llama98.c with offloading so it can run larger models end to end and test this for real.
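The offloading idea, roughly - a hypothetical sketch, not the shipped implementation: keep a single layer-sized buffer in RAM and stream each layer's weights from disk as the forward pass reaches it:

```c
/* Hypothetical sketch of disk offloading: reuse one buffer and pull
   layer l's weights from the checkpoint file on demand. */
#include <stdio.h>

int load_layer(FILE *ckpt, long header_bytes, long layer_bytes,
               int l, float *buf) {
    long offset = header_bytes + (long)l * layer_bytes;
    if (fseek(ckpt, offset, SEEK_SET) != 0) return -1;
    if ((long)fread(buf, 1, (size_t)layer_bytes, ckpt) != layer_bytes) return -1;
    return 0;
}
```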
Not exactly ChatGPT speeds, but getting any modern AI model running on 25-year-old CPU hardware is a significant step toward our mission. Special thanks to Chris Wellons' excellent blog post on getting C++ working on Windows 98, which helped make this possible.
BitNet is a promising direction for a world where frontier models really can run on any hardware. It's a transformer architecture that uses ternary weights: each weight can only be -1, 0, or +1, requiring just 1.58 bits per weight (log₂(3) ≈ 1.58). This simple change has massive implications:
All the usual matrix multiplications turn into a bunch of additions and subtractions, since multiplying by 0 means skipping the term, multiplying by +1 is an addition, and multiplying by -1 is a subtraction.
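A minimal sketch of that inner loop in C (one weight per byte here for clarity; a real BitNet packs weights far more tightly):

```c
/* Ternary matrix-vector product: weights in {-1, 0, +1} mean the inner
   loop needs no multiplications at all. */
void ternary_matvec(const signed char *W, const float *x, float *y,
                    int rows, int cols) {
    int i, j;
    for (i = 0; i < rows; i++) {
        float acc;
        signed char w;
        acc = 0.0f;
        for (j = 0; j < cols; j++) {
            w = W[i * cols + j];
            if (w == 1)       acc += x[j];   /* +1 -> addition */
            else if (w == -1) acc -= x[j];   /* -1 -> subtraction */
            /* 0 -> skip the term entirely */
        }
        y[i] = acc;
    }
}
```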
The advantages are striking: weights take roughly a tenth of the memory of 16-bit formats, and the hot loop of inference needs additions instead of multiplications.
At EXO, we've been working on ternary models for a while. In April 2024, we released MLX-BitNet for running BitNet models efficiently on Apple Silicon. We presented the first implementation of a BitNet for protein language modeling at ICML 2024 and are working on a larger BitNet model for protein modeling.
While there isn't a large open-source BitNet model available yet, we believe ternary models are the future of accessible AI. We're planning to train one in 2025.
We want to see more efforts focused on running AI models on older hardware. There's significant engineering work to be done here - from optimizing memory usage to exploring new architectures that can run efficiently on limited hardware.
If you're interested in running models on old hardware - an old Mac, a Game Boy, a Motorola phone, or even an old Raspberry Pi - check out the code and join our Discord #retro channel. The future of AI doesn't have to be locked in massive datacenters - it can run right on the hardware you already have.