Edge-Verified Machine Learning - 12 days of EXO


On Day 5, we presented a distributed training algorithm (DiLoCo) that works even in settings with low bandwidth. We used it to train a model on 5 Mac Minis, demonstrating that models can be trained without a data center full of fast interconnects. To scale this up and train larger models, what if we crowdsourced compute across the internet and collectively trained an open-source model? Dark compute is all around us: Macs, laptops, Teslas, TVs and more.

Crowdsourcing compute could dramatically accelerate training, but it introduces a major challenge: trust. When strangers contribute their devices to train an open-source model, how do we ensure they're performing the training process correctly? If we don't verify what each device is doing, a malicious participant could deliberately worsen the model (poisoning) or fake their participation in exchange for a reward.

In this blog, we'll explore existing approaches to verifiable compute and introduce Edge-Verified Machine Learning, our solution for making decentralised training both secure and practical.

The Key Requirements

When picking a solution, we need to balance three key requirements:

First, we need strong correctness guarantees. Second, the network must have a low barrier to entry: the easier it is to join, the more compute power we can harness, whereas requiring participants to own special hardware or stake capital limits who can take part. Third, the system must be cost-efficient. Training a model can already cost millions of dollars, so even a 10x overhead would be extremely costly.

The ideal protocol would deliver these three properties: strong security guarantees, low barriers to entry, and small overhead. Let's examine how different approaches measure up against these goals.

Potential solutions

1. Consensus

The most intuitive idea is to have multiple nodes carry out each computation and then compare the results. The same batch of data can be sent to N different devices with instructions to carry out the training step (caveat: the computation may need to be deterministic for the results to be comparable). Then, when the results are submitted, they are compared. If more than half the nodes agree (a majority), we use that answer. Additionally, we might penalize the nodes with wrong answers (e.g., by labelling them dishonest or, if it happens repeatedly, kicking them out of the network). This approach works and is a key pillar behind many distributed systems.

For this approach, the barrier to entry is low, as no special requirements are needed. The overhead is proportional to N, the number of nodes: if we have 10 nodes, the same work is replicated 10 times. Thus, to keep costs to a minimum, N should be as low as possible. But this weakens the security of the network, which depends on the number of honest nodes. For the network to be secure, we need to satisfy the following:

Number of honest nodes > total number of nodes / 2

For N = 3, the network has a 3x overhead and poor security. All it takes is for 2 out of the 3 nodes to collude, and the data can be poisoned.
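To make this concrete, here is a minimal sketch in Python of how a coordinator might compare N replicated results and accept the majority answer. The function names and the idea of hashing serialized gradients to compare them are our own illustration, not a description of any particular system, and it assumes the training step is deterministic.

```python
import hashlib
from collections import Counter

def result_fingerprint(gradient_bytes: bytes) -> str:
    # Compare results by hash; this assumes the training step is deterministic,
    # so honest devices produce byte-identical gradients for the same batch.
    return hashlib.sha256(gradient_bytes).hexdigest()

def majority_verify(submissions: dict[str, bytes]) -> tuple[str | None, list[str]]:
    """submissions maps device_id -> serialized gradient for the same batch.

    Returns the winning fingerprint (or None if there is no majority) and the
    list of devices whose answer disagreed with the majority.
    """
    fingerprints = {dev: result_fingerprint(res) for dev, res in submissions.items()}
    counts = Counter(fingerprints.values())
    winner, votes = counts.most_common(1)[0]
    if votes <= len(submissions) // 2:  # require a strict majority
        return None, []
    dissenters = [dev for dev, fp in fingerprints.items() if fp != winner]
    return winner, dissenters

# Example: 3-way replication where one device submits a poisoned result.
subs = {"mac-mini-1": b"grad-A", "mac-mini-2": b"grad-A", "laptop-3": b"grad-B"}
accepted, cheaters = majority_verify(subs)
print(accepted is not None, cheaters)  # True ['laptop-3']
```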

2. Zero-Knowledge Proofs

Another method for verification is zero-knowledge proofs (ZKPs). This method relies on cryptography to guarantee the correctness of the execution. Each node carries out the required computation and then generates a proof of correct execution. Generating a proof is computationally expensive, but verifying the proof is fast and cheap. In our scenario, every computation required for training is sent to a single node, with no replication. The node carries out the computation and generates a proof that it was done correctly.

ZKPs don't require any capital to be staked and thus have a low barrier to entry. The security of a ZKP rests on cryptographic hash functions such as SHA-256, whose 256-bit output can take 2^256 possible values. This means that to fake a ZKP, an attacker would have to find the one correct output out of 2^256 possibilities, a number roughly on the order of the number of atoms in the observable universe.

1 / 2^256 ≈ 10^-77 ≈ 0
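As a quick sanity check of that number:

```python
# Probability of forging a proof by guessing the one valid 256-bit value.
p_forge = 2.0 ** -256
print(p_forge)  # ≈ 8.6e-78, i.e. on the order of 10^-77
```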

This level of security comes at a great cost. Modern systems for generating zero-knowledge proofs for machine learning (zkML) have an overhead often exceeding 1000x. zkLLM, an advanced ZK system specialized for large language models, has managed to prove inference on LLMs with up to 13 billion parameters, taking 1 to 15 minutes per inference. According to the paper, this outperforms all existing approaches. While ZKPs are theoretically a great solution, the overhead is tremendous, making them impractical.

3. Economic Security via Optimistic Machine Learning (opML)

Verifiability via Economic Security means preventing dishonest behaviour by imposing financial penalties on those who misbehave.

One approach is Optimistic Machine Learning (opML), which minimizes overhead by assuming nodes are honest by default, reducing the need for redundant computations. Every node must deposit funds (the stake) before participating in the opML network. Whenever a participant produces a result, it's assumed to be correct, and the result is not questioned unless a validator node accuses the participant of fraud. Accordingly, opML relies on financial incentives to motivate validators to check each result by re-doing them, since they get rewarded only if they're able to expose a participant for lying or misbehaving during the training process.

This approach has a high barrier to entry, as every node needs to stake capital. In terms of overhead, opML is usually cheaper than approaches like consensus. In consensus, every node has to redo the same computation every time; in opML, validators are not required to do so. They can instead use heuristics to strategically redo computations only when they think there is a high chance of the result being false. In the worst case, opML has an overhead of N, but in practice it's lower.

opML has many limitations. It usually produces a mixed-strategy Nash Equilibrium, which means that some fraudulent results will inevitably go undetected because it is sometimes worth lying despite the risk of getting caught. Increasing the validator's financial reward for identifying fraud will only solve this problem if it's enough to encourage multiple validators to re-do every computation, resulting in additional cost and overhead.
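A toy expected-value calculation (with made-up numbers, not parameters from any real opML deployment) makes the problem concrete: cheating stays rational whenever the payoff from an unchecked fraudulent result outweighs the expected loss of the stake.

```python
def cheating_is_rational(gain_if_uncaught: float, stake: float, p_checked: float) -> bool:
    """Toy model: a cheater keeps the gain with probability (1 - p_checked)
    and loses the stake with probability p_checked."""
    expected_payoff = (1 - p_checked) * gain_if_uncaught - p_checked * stake
    return expected_payoff > 0

# With a $100 stake and validators re-checking only 5% of results,
# faking a $10 unit of work is still profitable in expectation.
print(cheating_is_rational(gain_if_uncaught=10, stake=100, p_checked=0.05))  # True
# Re-checking 20% of results flips the incentive, at the cost of more redundant work.
print(cheating_is_rational(gain_if_uncaught=10, stake=100, p_checked=0.20))  # False
```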

Another approach based on economic security is proof of sampling. Although efficient, this approach also requires nodes to stake capital, introducing a high barrier to entry.

The above are the three main approaches for verifying computations in a network. ZKPs provide the highest level of security but come at a prohibitively high cost. Consensus and opML have lower overhead, but it is still significant for the task of training a model. Each approach has variations and teams working on improving it, but none of them are practical for use today. This motivated the development of evML.

Edge-Verified Machine Learning (evML)

To make decentralized training work, we need participants to compute honestly—but without requiring expensive deposits or specialized hardware. evML achieves this by leveraging the secure hardware built into common devices to make lying to the network both difficult and irrational, creating a similar solution to Economic Security but without the need for a financial deposit.

The premise of evML is that we can use the existing hardware-based capabilities of consumer devices, like Apple and Android devices, to verify whether the results they submit are correct. Instead of assuming that consumer devices are extremely resistant to hacking, we combine the natural cost-of-attack with the threat of being caught and banned from the network. This makes hacking the device expensive and irrational, creating high assurance that results are correct at a very low overhead of 5%, assuming that the cost of hacking the hardware is $2000.

TEEs vs SCs

Most modern phones come with a Trusted Execution Environment (TEE) — a hardware-isolated section of the processor that runs code completely separate from the rest of the device. This isolation is enforced at the silicon level, meaning even if your phone is hacked, the TEE remains secure. Through Remote Attestation, the TEE can prove three critical things to anyone: that the device is genuine, that it is running an approved operating system, and that a given result was produced by unmodified, authorized software.

However, mobile applications cannot run directly within the TEE. Instead, they operate within Secure Contexts (SCs) — protected runtime environments within the operating system that inherit security guarantees from the device's TEE.

SCs depend on the Secure Boot process to ensure that applications are running in a trusted and isolated environment, while also utilizing the secure key custody provided by the TEE to perform Remote Attestation.

Importantly for evML, the SC's ability to perform Remote Attestation allows anyone to verify that its submitted results are correct.

How Do SCs Work?

Secure Boot guarantees the integrity of the software running on the device, including the bootloader, operating system, and firmware, using cryptographic signatures. It verifies each component against public keys securely embedded in the TEE, creating a chain of trust that ensures only unmodified, authorized software is executed on the device.
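As a loose illustration of that chain of trust, here is a sketch in Python. Real Secure Boot uses vendor-specific image formats and asymmetric signatures verified against keys in the TEE; the simplified hash comparison below only captures the idea that every stage is measured and a single mismatch breaks the chain.

```python
import hashlib

# Hypothetical boot stages, measured in boot order.
BOOT_STAGES = ["bootloader", "operating_system", "firmware"]

# Values the TEE would hold for the authorized builds (illustrative only).
TRUSTED_HASHES = {
    stage: hashlib.sha256(f"authorized-{stage}-v1".encode()).hexdigest()
    for stage in BOOT_STAGES
}

def secure_boot(images: dict[str, bytes]) -> bool:
    # Each stage is measured and compared against the trusted value before
    # it is allowed to run; one mismatch breaks the whole chain of trust.
    for stage in BOOT_STAGES:
        measured = hashlib.sha256(images[stage]).hexdigest()
        if measured != TRUSTED_HASHES[stage]:
            return False
    return True

good_images = {s: f"authorized-{s}-v1".encode() for s in BOOT_STAGES}
tampered = dict(good_images, operating_system=b"rooted-os-build")
print(secure_boot(good_images))  # True
print(secure_boot(tampered))     # False
```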

Remote Attestation is achieved by SCs through the TEE's hardware root of trust—a unique cryptographic key embedded during manufacturing. This key is used to sign results produced by an application, allowing others to verify that the submitted results are correct.

Since the results can only be trusted when the SC is running authorized software on a genuine, secure device, the TEE is responsible for ensuring that Secure Boot has succeeded. This prevents malicious or compromised software from using the root-of-trust key to certify incorrect results, which would compromise the security of Remote Attestation.

Services like Play Integrity (Android) and DeviceAttest (Apple) facilitate Remote Attestation of data produced by mobile apps, which are computed within their own Secure Contexts. Before signing the software-generated results with the hardware root of trust, Play Integrity and DeviceAttest verify that the running software is indeed authentic.
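Here is a minimal sketch of what the signing side of Remote Attestation looks like, with an Ed25519 key pair standing in for the hardware root of trust. In reality the key never leaves the TEE, and Play Integrity and DeviceAttest wrap the signature in their own token formats.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Stand-in for the key burned into the TEE at manufacturing time.
device_root_key = Ed25519PrivateKey.generate()
device_public_key = device_root_key.public_key()

# A result produced by the training software inside the Secure Context.
result = b"batch=1234 grad_hash=ab12cd34"

# The TEE signs the result only after confirming Secure Boot succeeded.
attestation = device_root_key.sign(result)

# Anyone holding the device's public key can check the attestation.
try:
    device_public_key.verify(attestation, result)
    print("result attested by a genuine device")
except InvalidSignature:
    print("attestation invalid")
```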

How evML Uses SCs

When a device seeks to participate in the evML network, its Secure Context (SC) initiates a connection with the Network Operator (NO). Through Remote Attestation provided by DeviceAttest or Play Integrity, the SC proves that it is operating on genuine hardware with an approved operating system and running unmodified EXO training software. This attestation process leverages the hardware root of trust managed by the TEE, ensuring the integrity of the entire execution environment.

Upon successful attestation, the NO issues a cryptographic credential to the SC over an encrypted channel. This credential is securely stored and accessible only by the Training Software operating within the SC. The Training Software uses this credential to anonymously certify each computational result, guaranteeing that calculations are performed correctly within the secure and isolated environment provided by the SC. The Remote Attestation provided by the SC is linked to the unique hardware root of trust, enabling the NO to ensure it is not issuing more than one credential per device.
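A rough sketch of that flow in Python (the class and method names are ours, purely for illustration):

```python
import secrets

class NetworkOperator:
    """Toy sketch of the NO issuing one credential per attested device."""

    def __init__(self):
        self.credentials_by_device: dict[str, str] = {}

    def issue_credential(self, hardware_id: str, attestation_ok: bool) -> str | None:
        # Refuse devices that fail Remote Attestation.
        if not attestation_ok:
            return None
        # One credential per hardware root of trust, so a device cannot
        # register multiple identities.
        if hardware_id in self.credentials_by_device:
            return self.credentials_by_device[hardware_id]
        credential = secrets.token_hex(16)
        self.credentials_by_device[hardware_id] = credential
        return credential

    def revoke(self, credential: str) -> None:
        # Permanently ban a device caught submitting bad results.
        self.credentials_by_device = {
            hw: cred for hw, cred in self.credentials_by_device.items()
            if cred != credential
        }

no = NetworkOperator()
cred = no.issue_credential("device-root-abc", attestation_ok=True)
print(cred is not None)                                      # True
print(no.issue_credential("device-root-abc", True) == cred)  # True: still only one credential
print(no.issue_credential("jailbroken-device", False))       # None: attestation failed
```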

While it's technically possible to hack an SC or its associated TEE with significant investment, evML uses random spot-checks to make cheating economically irrational. A small percentage of results are verified by having other devices redo the calculation. If the results don't match, additional devices verify until there's agreement. Devices caught submitting incorrect results lose their credential permanently. Since checks are random, anyone who hacks their TEE to submit false results will eventually be caught, making their investment worthless. This creates a reliable training network that requires no upfront costs, works with existing popular devices, and maintains just 5% overhead, since only a small fraction of results need verification.
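The arithmetic behind those spot-checks is simple. With an illustrative 5% check rate, a device that cheats on every task is almost certain to be caught well before it can recoup the cost of compromising its TEE:

```python
# Probability that a device cheating on every task is caught within k rounds
# when each result is independently re-verified with probability p.
def p_caught(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

for k in (10, 50, 100):
    print(k, round(p_caught(0.05, k), 3))
# 10 0.401
# 50 0.923
# 100 0.994
```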

To learn more about evML, see the following GitHub repository containing a report and analysis code.

Final Thoughts

evML is a technique for verifying the results produced by a network of edge devices. It's cost-effective (only 5% overhead) and provides strong security guarantees. Additionally, it provides some privacy guarantees. Crucially, the barrier to joining an evML network is negligible.

Although it's not suitable for very high-stakes applications, like processing private medical data or financial information, it serves as a strong foundation for building secure hybrid solutions. Where greater security is required and a higher barrier-to-entry is acceptable, we may fortify evML with new techniques inspired by Economic Security to create a system of tiered security.

If you'd like to review the current version of the evML paper, get in touch. The preliminary report is available here. To keep up with our research, check out the Exo Discord!

