DeepSeek V3 is an open-source model that matches the quality of the best closed-source models from OpenAI and Anthropic. We've shown it running on a cluster of M4 Mac minis, demonstrating that frontier models can now run on consumer hardware.
While running models locally keeps your data private, it creates a trade-off: local models can't access real-time information. Ask a local LLM "What is the price of Microsoft stock today?" and you'll get outdated data from its training set. This limitation doesn't exist with cloud-based tools like Perplexity, which can access current information.
So we face a dilemma: how can we maintain the privacy benefits of local models while still accessing real-time information?
LLMs predict the next most likely token based on all the data they were trained on. When asked about real-time information like weather or stock prices, they can only use what they learned during training - leading to outdated responses.
However, by augmenting the LLM with real-time search, we can ground its responses in current data. The LLM first retrieves relevant information from an external source, then uses this context to generate accurate, up-to-date responses.
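To make that flow concrete, here is a minimal sketch of the retrieve-then-generate loop. The helper names are hypothetical stand-ins: `web_search` for whatever real-time search backend is used, and `local_llm` for a locally hosted model.

```python
def web_search(query: str) -> str:
    # Hypothetical stand-in for a real-time search backend.
    raise NotImplementedError

def local_llm(prompt: str) -> str:
    # Hypothetical stand-in for a locally hosted model.
    raise NotImplementedError

def grounded_answer(question: str) -> str:
    # Retrieve current information first, then let the model answer using
    # that context instead of its (possibly stale) training data.
    context = web_search(question)  # e.g. today's Microsoft stock price
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context: {context}\n"
        f"Question: {question}"
    )
    return local_llm(prompt)
```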
The problem is that this database is hosted on a server somewhere in the cloud. The host of the server can see our query and the response of that query, compromising our privacy.
A potential solution would be to download all the data locally. That way, we could query the database locally, keeping both the query and the response private. But this is impractical - Twitter alone generates 200TB of text per year. Your phone can't store that much data.
What we want is a way for the server to process our query without ever seeing what we were asking for or what results we got back. Let's see if we can find a way to do that by looking at how search works.
Modern search engines work by converting both documents and queries into normalized vectors (called "embeddings") where similar items end up close together. For example:
"I love cats" → A = [0.7, 0.71] (normalized) "I like felines" → B = [0.5, 0.866] (normalized) "Buy cheap stocks" → C = [-0.8, 0.6] (normalized)
To find relevant documents, we look for vectors that point in similar directions - meaning they have a small angle between them. The cosine of this angle gives us a stable, efficient similarity measure ranging from -1 to 1:
The cosine similarity formula is shown below, where a and b are vectors with components a₁, a₂, etc. and b₁, b₂, etc., θ is the angle between them, · represents the dot product, and ||a|| denotes the length (magnitude) of vector a:

cos(θ) = (a · b) / (||a|| ||b||) = (a₁b₁ + a₂b₂ + …) / (||a|| ||b||)
However, when vectors are normalized (meaning their length equals 1), the denominator becomes 1·1 = 1, leaving us with just the dot product:

cos(θ) = a · b = a₁b₁ + a₂b₂ + …
This is why working with normalized vectors is so efficient - the similarity calculation reduces to simply multiplying corresponding elements and adding up the results.
Let's compute the similarities between our example vectors:

A · B = (0.7 × 0.5) + (0.71 × 0.866) ≈ 0.965
A · C = (0.7 × -0.8) + (0.71 × 0.6) ≈ -0.134

The first, high value (≈0.965) tells us vectors A and B point in similar directions - just as we'd expect for two phrases about cats! The second value (≈-0.134) shows that vectors A and C point in quite different directions, indicating unrelated topics.
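The same arithmetic in a couple of lines of NumPy:

```python
import numpy as np

A = np.array([0.7, 0.71])    # "I love cats"
B = np.array([0.5, 0.866])   # "I like felines"
C = np.array([-0.8, 0.6])    # "Buy cheap stocks"

# With (approximately) unit-length vectors, cosine similarity is just a dot product.
print(np.dot(A, B))  # ~0.965  -> similar topics
print(np.dot(A, C))  # ~-0.134 -> unrelated topics
```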
This seemingly simple calculation - just multiplications and additions - is the key to private search.
Some encryption schemes can perform addition and multiplication on encrypted data. This means a server can compute inner products without ever seeing the actual vectors!
However, this privacy comes at a significant computational cost. Using pyfhel on an M3 MacBook, encrypted addition is 105x slower than normal addition - and this is with one of the simpler encryption schemes, Paillier. Computing similarity with every document would be impractical.
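To make this concrete, here is a minimal sketch of a privately computed similarity score using the python-paillier (`phe`) library (a different library from the pyfhel benchmark above, but the same idea). Paillier only supports adding ciphertexts and multiplying them by plaintext constants, yet that is exactly what an inner product against a plaintext document vector needs.

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

query = [0.7, 0.71]        # client-side query embedding ("I love cats")
document = [0.5, 0.866]    # server-side document embedding ("I like felines")

# Client: encrypt each component of the query and send the ciphertexts.
encrypted_query = [public_key.encrypt(x) for x in query]

# Server: encrypted dot product - ciphertext * plaintext and
# ciphertext + ciphertext, with nothing ever decrypted.
encrypted_score = sum(q * d for q, d in zip(encrypted_query, document))

# Client: only the private key holder can read the similarity (~0.965).
print(private_key.decrypt(encrypted_score))
```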
Let's take a step back and see if there's any way we can speed this up. We've taken our database (a set of documents) and converted it into a set of embedding vectors. We've also taken our query and converted it into an embedding vector. Now we want to find the document that is most similar to our query.
Looking at the vector space visually, we can see that most vectors are completely different from our query vector. These vectors naturally form groups in different regions of the space.
The red cluster is clearly the closest to our query - we can safely ignore all others. Each cluster has a centroid (its average vector). By comparing our query against these centroids first, we only need to search within the most relevant cluster.
Let's quantify this improvement. With 100M documents split into 10K clusters:
First we do 10,000 comparisons to find the most relevant cluster. Then, since each cluster holds roughly 10,000 documents, we do another 10,000 comparisons to find the most relevant document within it. That's 20,000 comparisons in total instead of 100,000,000 - a 5,000x reduction in the number of vectors to compare against!
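Here's an illustrative sketch of that two-step search (not the exo implementation), assuming NumPy and scikit-learn, with random vectors standing in for real embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # normalized "embeddings"

# One-time server-side step: partition the documents into clusters.
kmeans = KMeans(n_clusters=100, n_init=4, random_state=0).fit(docs)

query = docs[7] + 0.05 * rng.normal(size=64)
query /= np.linalg.norm(query)

# Step 1: 100 comparisons against the centroids.
best_cluster = int(np.argmax(kmeans.cluster_centers_ @ query))

# Step 2: ~100 more comparisons, only within the winning cluster.
members = np.where(kmeans.labels_ == best_cluster)[0]
best_doc = members[np.argmax(docs[members] @ query)]
print(best_doc)  # index of the most similar document in that cluster
```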
The Tiptoe paper optimizes this approach by partitioning the database into √N clusters, where N is the total number of documents. This balances two competing costs: comparing the query against the X cluster centroids, and then comparing it against the Y = N/X documents inside the chosen cluster. With X·Y = N, setting X = Y = √N minimizes the total number of comparisons.
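A quick numeric check of that claim (illustrative, with N = 100M as above): the total comparison count X + N/X bottoms out at X = √N.

```python
import numpy as np

N = 100_000_000                   # total documents
X = np.arange(1_000, 100_001)     # candidate numbers of clusters
cost = X + N / X                  # centroid comparisons + in-cluster comparisons
print(X[np.argmin(cost)])         # 10000, i.e. sqrt(N)
print(cost.min())                 # 20000 total comparisons
```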
With Tiptoe, the client downloads all centroids locally (for 100M documents, that's 10K centroids, or ~10MB). Of course, we might end up choosing a cluster that doesn't contain the most relevant document. This approach works well for conceptual queries ("knee pain"), though less so for exact matches ("123 Main Street"). Its search quality matches traditional tf-idf algorithms.
Here's how the full protocol works:
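Putting the pieces together, here is a toy end-to-end sketch of the flow. It is illustrative only, not the exo/Tiptoe implementation: random vectors stand in for embeddings, Paillier (via `phe`) stands in for the homomorphic scheme, and the extra machinery Tiptoe uses to hide which cluster is being queried is omitted.

```python
import numpy as np
from phe import paillier

rng = np.random.default_rng(0)
dim, n_docs, n_clusters = 8, 200, 14

# --- Server setup: embed documents, assign them to clusters, publish centroids.
docs = rng.normal(size=(n_docs, dim))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
labels = rng.integers(n_clusters, size=n_docs)            # toy "clustering"
centroids = np.array([docs[labels == c].mean(axis=0) for c in range(n_clusters)])

# --- Client: embed the query, pick a cluster using the local centroid copy,
#     and encrypt the query before anything leaves the device.
public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)
query = docs[123] + 0.05 * rng.normal(size=dim)
query /= np.linalg.norm(query)
cluster = int(np.argmax(centroids @ query))
encrypted_query = [public_key.encrypt(float(x)) for x in query]

# --- Server: encrypted inner products for that cluster's documents,
#     computed without ever seeing the query or the scores.
members = np.where(labels == cluster)[0]
encrypted_scores = [sum(q * float(d) for q, d in zip(encrypted_query, docs[i]))
                    for i in members]

# --- Client: decrypt the scores and pick the best match locally.
scores = [private_key.decrypt(s) for s in encrypted_scores]
print("best match:", int(members[np.argmax(scores)]))
```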
We have developed a full implementation of private search. The code is open source here. It's available today in the EXO Desktop app (early alpha, version 0.0.3-alpha) on the releases page of the exo GitHub repo.
We are taking signups for access to the API endpoint if you would like to integrate EXO private search into your app. Sign up for early access to the API here.
We can now have the best of both worlds: local models with true privacy guarantees and access to real-time information. Your phone can privately query massive databases without revealing anything about your searches.
The future of AI isn't just local - it's private too.