This post accompanies our academic paper describing Verde, a verification protocol for machine learning programs, and the underlying Reproducible Operators (RepOps) system that enables it. RepOps is a library that ensures bitwise reproducibility of ML workloads across diverse hardware.
The Gensyn network creates a global free market for ML compute. It allows anyone to provision resources, from data centers to MacBooks, while ensuring uniform execution across them. This allows developers to train ultra-large-scale models at low cost. It also unlocks new forms of collaborative training that were not possible before.
To realise this vision, we need a mechanism to verify the work of untrusted suppliers in a scalable way. A naive approach would be to use a trusted intermediary to either reproduce every task or manually onboard providers and create a whitelist of trusted parties. In either case, we quickly run into scaling limits: with the former, the replication overhead would be untenable; with the latter, we would cut off the long tail of suppliers who wish to participate.
More sophisticated approaches like cryptographic proof systems guarantee correctness, but they're prohibitively expensive for large-scale ML workloads (at least for now). Heuristic approaches, like Proof-of-Learning or Proof-of-Training Data, offer efficiency at the expense of strong security guarantees.
We instead turn to the idea of refereed delegation, which uses verifiers to check the work done by each supplier. If a verifier believes the supplier's output is incorrect, it can use an efficient dispute resolution game to convince a neutral referee of that fact. These techniques are the basis of the optimistic verification used in blockchain rollups like Arbitrum and Optimism, with the blockchain validators acting as the referee: disputes over entire blocks of transactions can be resolved by re-running just a single program step.
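To make the dispute game concrete, here is a minimal sketch of the bisection idea, assuming both parties have published a hash of the program state after every step; the function names and data layout are illustrative, not taken from any rollup or Verde implementation.

```python
import hashlib

def state_hash(state: bytes) -> bytes:
    # Commitment to a program state; in practice this would hash a full machine state.
    return hashlib.sha256(state).digest()

def find_first_disagreement(trainer_hashes, verifier_hashes):
    """Binary search for the earliest step whose claimed output hashes differ.

    Index 0 holds the hash of the agreed initial state; the final index is
    assumed to differ (otherwise there is no dispute).
    """
    lo, hi = 0, len(trainer_hashes) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if trainer_hashes[mid] == verifier_hashes[mid]:
            lo = mid   # parties still agree up to `mid`
        else:
            hi = mid   # disagreement occurs at or before `mid`
    return hi          # first step on which the claims diverge

def referee_resolves(agreed_state, step_fn, trainer_claim):
    # The referee re-executes only the single disputed step from the last
    # agreed-upon state and checks it against the trainer's claimed hash.
    return state_hash(step_fn(agreed_state)) == trainer_claim
```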
However, refereed delegation does not pair well with modern machine learning for two reasons. First, it was designed for CPU programs and does not efficiently translate to large-scale neural networks. Second, it assumes that honest servers will always compute the same result for the same program, which is often not true in machine learning if they are using different hardware.
Any viable mechanism therefore must solve these issues in a scalable way.
Introducing Verde
Today we’re excited to announce Verde, the first verification protocol purpose-built for modern machine learning within decentralized environments.
Verde consists of a lightweight dispute resolution protocol that pinpoints the first training step, and the first operator within the neural network's computational graph, on which the trainer and verifier disagree. Instead of re-running the entire task, the referee, which could be a smart contract or a jury of verifiers, recomputes only that single disputed operation. This drastically cuts verification overhead while guaranteeing the correct result as long as at least one verifier is honest. The overhead for compute suppliers is also light, as they only need to store and hash intermediate training checkpoints.
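As a rough illustration of the supplier-side cost, the sketch below hashes a model checkpoint in a fixed, name-sorted order using PyTorch; the helper is hypothetical, and the paper specifies the actual commitment scheme.

```python
import hashlib
import torch

def hash_checkpoint(model: torch.nn.Module) -> str:
    """Digest of the model state in a fixed, name-sorted order, so that two
    honest parties holding bitwise-identical weights produce the same hash."""
    h = hashlib.sha256()
    state = model.state_dict()
    for name in sorted(state):
        h.update(name.encode())
        h.update(state[name].detach().cpu().numpy().tobytes())
    return h.hexdigest()

# The trainer records one digest per training step; during a dispute, the
# referee only needs the digests surrounding the first disagreement, plus the
# single operator output contested inside that step.
```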

Reproducibility
For this system to work, we need ML programs to be reproducible across all hardware setups, such that different (honest) nodes compute the same output regardless of the device they are using. This is not the case by default, even for different devices from the same manufacturer (e.g. Nvidia A100s vs. H100s).
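The root cause is that floating-point arithmetic is not associative, so anything that changes accumulation order (different kernels, tiling, or parallel reduction strategies) can change the resulting bits. A tiny NumPy example of the effect:

```python
import numpy as np

x = np.float32(1e8)
y = np.float32(-1e8)
z = np.float32(0.1)

left_to_right = (x + y) + z   # 0.1: the large terms cancel first
right_to_left = x + (y + z)   # 0.0: the 0.1 is absorbed by -1e8 before cancelling
print(left_to_right == right_to_left)  # False
```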
To solve this, we’ve built Reproducible Operators (RepOps), a library that implements bitwise-reproducible versions of popular ML operators. It eliminates the hardware non-determinism described above by enforcing a fixed execution order for floating-point operations in functions like matrix multiplication, ensuring that honest providers always yield bitwise-identical outputs and allowing Verde’s dispute resolution protocol to operate reliably.
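As a toy illustration of the underlying idea, and assuming nothing about RepOps' actual implementation, the sketch below computes a matrix product with a fully specified accumulation order; fixing the order trades speed for bitwise determinism across devices.

```python
import numpy as np

def matmul_fixed_order(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Naive float32 matmul with a fully specified, left-to-right accumulation
    order. Optimized BLAS/GPU kernels may block or parallelize the reduction
    differently on different hardware, which changes rounding; spelling out the
    order removes that source of non-determinism (at a large performance cost
    in this unoptimized form)."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m), dtype=np.float32)
    for i in range(n):
        for j in range(m):
            acc = np.float32(0.0)
            for p in range(k):  # fixed left-to-right reduction order
                acc = np.float32(acc + a[i, p] * b[p, j])
            out[i, j] = acc
    return out
```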
Learn More
Verde provides a core building block for decentralized machine learning. It allows Gensyn to onboard every compute device in the world, from data centers to the edge, in a scalable and permissionless way.
To learn more about it, you can read the full paper here.
To see how reproducibility works using RepOps in a live demo, follow the instructions here.