DeepSeek V3 is an open-source model that matches the quality of the best closed-source models from OpenAI and Anthropic. We've shown it running on a cluster of M4 Mac minis, demonstrating that frontier models can now run on consumer hardware.
While running models locally keeps your data private, it creates a trade-off: local models can't access real-time information. Ask a local LLM "What is the price of Microsoft stock today?" and you'll get outdated data from its training set. This limitation doesn't exist with cloud-based tools like Perplexity, which can access current information.
So we face a dilemma: how can we maintain the privacy benefits of local models while still accessing real-time information?
LLMs predict the next most likely token based on all the data they were trained on. When asked about real-time information like weather or stock prices, they can only use what they learned during training - leading to outdated responses.
However, by augmenting the LLM with real-time search, we can ground its responses in current data. The LLM first retrieves relevant information from an external source, then uses this context to generate accurate, up-to-date responses.
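To make that flow concrete, here is a minimal sketch of the retrieve-then-generate loop. The helper names are hypothetical stand-ins: `web_search` for whatever real-time search backend is used, and `local_llm` for a locally hosted model.

```python
def web_search(query: str) -> str:
    # Hypothetical stand-in for a real-time search backend.
    raise NotImplementedError

def local_llm(prompt: str) -> str:
    # Hypothetical stand-in for a locally hosted model.
    raise NotImplementedError

def grounded_answer(question: str) -> str:
    # Retrieve current information first, then let the model answer using
    # that context instead of its (possibly stale) training data.
    context = web_search(question)  # e.g. today's Microsoft stock price
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context: {context}\n"
        f"Question: {question}"
    )
    return local_llm(prompt)
```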
The problem is that this database is hosted on a server somewhere in the cloud. The host of the server can see our query and the response of that query, compromising our privacy.
A potential solution would be to download all the data locally. That way, we could query the database locally, keeping both the query and the response private. But this is impractical - Twitter alone generates 200TB of text per year. Your phone can't store that much data.
What we want is a way for the server to process our query without ever seeing what we were asking for or what results we got back. Let's see if we can find a way to do that by looking at how search works.
Modern search engines work by converting both documents and queries into normalized vectors (called "embeddings") where similar items end up close together. For example:
"I love cats" → A = [0.7, 0.71] (normalized) "I like felines" → B = [0.5, 0.866] (normalized) "Buy cheap stocks" → C = [-0.8, 0.6] (normalized)
To find relevant documents, we look for vectors that point in similar directions - meaning they have a small angle between them. The cosine of this angle gives us a stable, efficient similarity measure ranging from -1 to 1:
The cosine similarity formula is shown below, where a and b are vectors with components a₁, a₂, etc. and b₁, b₂, etc., θ is the angle between them, · represents the dot product, and ||a|| denotes the length (magnitude) of vector a:

cos(θ) = (a · b) / (||a|| ||b||) = (a₁b₁ + a₂b₂ + …) / (||a|| ||b||)
However, when vectors are normalized (meaning their length equals 1), the denominator becomes 1·1 = 1, leaving us with just the dot product:

cos(θ) = a · b = a₁b₁ + a₂b₂ + …
This is why working with normalized vectors is so efficient - the similarity calculation reduces to simply multiplying corresponding elements and adding up the results.
Let's compute the similarities between our example vectors:

A · B = (0.7 × 0.5) + (0.71 × 0.866) ≈ 0.965
A · C = (0.7 × -0.8) + (0.71 × 0.6) ≈ -0.134

The first, high value (≈0.965) tells us vectors A and B point in similar directions - just as we'd expect for two phrases about cats! The second value (≈-0.134) shows that vectors A and C point in quite different directions, indicating unrelated topics.
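The same arithmetic in a couple of lines of NumPy:

```python
import numpy as np

A = np.array([0.7, 0.71])    # "I love cats"
B = np.array([0.5, 0.866])   # "I like felines"
C = np.array([-0.8, 0.6])    # "Buy cheap stocks"

# With (approximately) unit-length vectors, cosine similarity is just a dot product.
print(np.dot(A, B))  # ~0.965  -> similar topics
print(np.dot(A, C))  # ~-0.134 -> unrelated topics
```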
This seemingly simple calculation - just multiplications and additions - is the key to private search.
Some encryption schemes can perform addition and multiplication on encrypted data. This means a server can compute inner products without ever seeing the actual vectors!
However, this privacy comes at a significant computational cost. Using pyfhel on an M3 MacBook, encrypted addition is 105x slower than normal addition - and this is with one of the simpler encryption schemes, Paillier. Computing similarity with every document would be impractical.
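To make this concrete, here is a minimal sketch of a privately computed similarity score using the python-paillier (`phe`) library (a different library from the pyfhel benchmark above, but the same idea). Paillier only supports adding ciphertexts and multiplying them by plaintext constants, yet that is exactly what an inner product against a plaintext document vector needs.

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

query = [0.7, 0.71]        # client-side query embedding ("I love cats")
document = [0.5, 0.866]    # server-side document embedding ("I like felines")

# Client: encrypt each component of the query and send the ciphertexts.
encrypted_query = [public_key.encrypt(x) for x in query]

# Server: encrypted dot product - ciphertext * plaintext and
# ciphertext + ciphertext, with nothing ever decrypted.
encrypted_score = sum(q * d for q, d in zip(encrypted_query, document))

# Client: only the private key holder can read the similarity (~0.965).
print(private_key.decrypt(encrypted_score))
```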
Let's take a step back and see if there's any way we can speed this up. We've taken our database (a set of documents) and converted it into a set of embedding vectors. We've also taken our query and converted it into an embedding vector. Now we want to find the document that is most similar to our query.
Looking at the vector space visually, we can see that most vectors are completely different from our query vector. These vectors naturally form groups in different regions of the space.
The red cluster is clearly the closest to our query - we can safely ignore all others. Each cluster has a centroid (its average vector). By comparing our query against these centroids first, we only need to search within the most relevant cluster.
Let's quantify this improvement. With 100M documents split into 10K clusters:
First we do 10,000 comparisons to find the most relevant cluster. Then, since each cluster holds roughly 10,000 documents, we do another 10,000 comparisons to find the most relevant document within it. That's 20,000 comparisons in total instead of 100,000,000 - a 5,000x reduction in the number of vectors to compare against!
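Here's an illustrative sketch of that two-step search (not the exo implementation), assuming NumPy and scikit-learn, with random vectors standing in for real embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # normalized "embeddings"

# One-time server-side step: partition the documents into clusters.
kmeans = KMeans(n_clusters=100, n_init=4, random_state=0).fit(docs)

query = docs[7] + 0.05 * rng.normal(size=64)
query /= np.linalg.norm(query)

# Step 1: 100 comparisons against the centroids.
best_cluster = int(np.argmax(kmeans.cluster_centers_ @ query))

# Step 2: ~100 more comparisons, only within the winning cluster.
members = np.where(kmeans.labels_ == best_cluster)[0]
best_doc = members[np.argmax(docs[members] @ query)]
print(best_doc)  # index of the most similar document in that cluster
```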
The Tiptoe paper optimizes this approach by partitioning the database into √N clusters, where N is the total number of documents. This balances two competing costs: comparing the query against the X cluster centroids, and then comparing it against the Y = N/X documents inside the chosen cluster. With X·Y = N, setting X = Y = √N minimizes the total number of comparisons.
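A quick numeric check of that claim (illustrative, with N = 100M as above): the total comparison count X + N/X bottoms out at X = √N.

```python
import numpy as np

N = 100_000_000                   # total documents
X = np.arange(1_000, 100_001)     # candidate numbers of clusters
cost = X + N / X                  # centroid comparisons + in-cluster comparisons
print(X[np.argmin(cost)])         # 10000, i.e. sqrt(N)
print(cost.min())                 # 20000 total comparisons
```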
With Tiptoe, the client downloads all centroids locally (for 100M documents, that's 10K centroids, or ~10MB). Of course, we might end up choosing a cluster that doesn't contain the most relevant document. This approach works well for conceptual queries ("knee pain"), though less so for exact matches ("123 Main Street"). Its search quality matches traditional tf-idf algorithms.
Here's how the full protocol works:
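Putting the pieces together, here is a toy end-to-end sketch of the flow. It is illustrative only, not the exo/Tiptoe implementation: random vectors stand in for embeddings, Paillier (via `phe`) stands in for the homomorphic scheme, and the extra machinery Tiptoe uses to hide which cluster is being queried is omitted.

```python
import numpy as np
from phe import paillier

rng = np.random.default_rng(0)
dim, n_docs, n_clusters = 8, 200, 14

# --- Server setup: embed documents, assign them to clusters, publish centroids.
docs = rng.normal(size=(n_docs, dim))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
labels = rng.integers(n_clusters, size=n_docs)            # toy "clustering"
centroids = np.array([docs[labels == c].mean(axis=0) for c in range(n_clusters)])

# --- Client: embed the query, pick a cluster using the local centroid copy,
#     and encrypt the query before anything leaves the device.
public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)
query = docs[123] + 0.05 * rng.normal(size=dim)
query /= np.linalg.norm(query)
cluster = int(np.argmax(centroids @ query))
encrypted_query = [public_key.encrypt(float(x)) for x in query]

# --- Server: encrypted inner products for that cluster's documents,
#     computed without ever seeing the query or the scores.
members = np.where(labels == cluster)[0]
encrypted_scores = [sum(q * float(d) for q, d in zip(encrypted_query, docs[i]))
                    for i in members]

# --- Client: decrypt the scores and pick the best match locally.
scores = [private_key.decrypt(s) for s in encrypted_scores]
print("best match:", int(members[np.argmax(scores)]))
```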
We have developed a full implementation of private search. The code is open source here. It's available today in the EXO Desktop app (early alpha, version 0.0.3-alpha) on the releases page of the exo GitHub repo.
We are taking signups for access to the API endpoint if you would like to integrate EXO private search into your app. Sign up for early access to the API here.
We can now have the best of both worlds: local models with true privacy guarantees and access to real-time information. Your phone can privately query massive databases without revealing anything about your searches.
The future of AI isn't just local - it's private too.