Introduction
During COVID, Folding@home reached a major milestone. The research project amassed 2.4 exaFLOPS of compute, sourced from 2 million volunteer devices around the world. This represented fifteen times the processing power of the world’s biggest supercomputer at the time, enabling scientists to conduct simulations of COVID protein dynamics at massive scale. Their work advanced our understanding of the virus and its underlying pathology during the early days of the pandemic.
Folding@home built on a long history of volunteer computing, where projects crowdsource compute to solve large-scale problems. The idea gained prominence with SETI@home in the 1990s, which crowdsourced over 5 million volunteer computers in search of extraterrestrial life. It’s since been applied across various domains, including astrophysics, molecular biology, mathematics, cryptography, and games. In each case, the crowd extends the capacity of the individual project, far beyond what it could otherwise achieve. This drives progress and allows research to happen in a more open and collaborative way.
Many wonder whether we could apply this model to deep learning. In other words, could we train a large neural network over the crowd? Frontier model training is one of the most computationally-intensive endeavors in human history. Like many @home projects, the cost now exceeds the reach of all but the biggest players. This threatens to stall future progress, as we rely on an ever-shrinking group of companies to find new breakthroughs. It also concentrates control of our AI systems in the hands of a small few. Regardless of your views on the technology, this is a future that should give you pause.
Most critics dismiss the idea of decentralized training, on the basis that it does not pair well with current training techniques. This view is increasingly outdated. New techniques have emerged that minimize inter-node communication requirements, allowing efficient training over poorly-connected devices. These include works like DiLoCo, SWARM Parallelism, lo-fi, and Decentralized Training of Foundation Models in Heterogeneous Environments, among others. Many of these are fault tolerant and accommodate heterogeneous compute. Others have proposed new architectures that are natively designed for decentralized networks, including DiPaCo and Decentralized Mixture of Experts.
We’ve also seen various cryptographic primitives begin to mature, allowing networks to coordinate resources on a global scale. These enable use cases like digital money, cross-border payments, and prediction markets. Unlike earlier volunteer projects, these networks can amass staggering amounts of compute, often several orders of magnitude greater than the largest cloud training clusters currently being conceived.
Together, these pieces lay the foundation for a new model training paradigm. One that leverages all of the world’s compute, including the vast supply of edge devices that could be used if they were networked together. This would drive down costs by enabling new competition for most training workloads. It could also unlock new forms of training, where model development is collaborative and modular, rather than siloed and monolithic. Models could source compute and data from the crowd, continuing to learn in real time. Individuals could own a piece of the models they create. And researchers could once again openly share novel research, freed from the requirement to monetize their findings to recoup expensive compute budgets.
This report examines the current state of large model training and the various costs it imposes. It reviews prior distributed computing efforts - from SETI to Folding to BOINC - to draw inspiration for what an alternate path might look like. It raises historical challenges with decentralized training, before turning to recent breakthroughs that might help overcome them. It concludes with the opportunity ahead and the challenges that remain.
The State of Frontier Model Training
The cost of frontier model training has become prohibitive for all but the biggest players. This trend is not new but is empirically getting worse, as frontier labs continue to push the scaling hypothesis as far as it will go. OpenAI reportedly spent over $3B this year on training. Anthropic predicts we’ll have $10B training runs beginning in 2025, with $100B models not far behind.
This is driving concentration within the industry, as only a handful of companies can afford to participate. This raises core policy questions about the future - are we comfortable with a world where all of the leading AI systems are controlled by one or two companies? More immediately, it also limits the rate of progress. We see this today within the research community, as smaller labs cannot afford the compute they need to scale their experiments. We hear this repeatedly from industry leaders:
Meta’s Joe Spisak:
To truly understand the capabilities of the [model] architecture you kind of have to push the scale and I think that's what's missing right now in the ecosystem. If you look at academia – academia has a lot of absolutely brilliant people but they don't have a lot of access to compute and that's a problem because they have these great ideas but they have no way to truly execute them at the level that's needed.
Together’s Max Ryabinin:
The need for costly hardware weighs heavily on the research community. Most researchers cannot contribute to the development of large neural networks because conducting the necessary experiments would be too expensive for them. If we continue to increase the model size by scaling up, eventually the only labs that can conduct competitive research will be those with massive budgets.
Google’s Francois Chollet:
We know that LLMs fall short of AGI. And meanwhile progress towards AGI has stalled. The limitations that we are dealing with with LLMs are still the exact same we were dealing with five years ago. We need new ideas. We need new breakthroughs. My bet is that the next breakthrough will likely come from an outsider, while all the big labs are busy training bigger LLMs.
Some dismiss these concerns, believing that hardware improvements and cloud capex will solve the problem. This seems unlikely. On the one hand, FLOP counts on new-generation Nvidia chips will increase substantially by the end of the decade, likely 10x over today’s H100s, driving the price per FLOP down by 80-90%. Likewise, projected cloud capex will increase the aggregate FLOP supply by roughly 20x over the same period, while also driving improvements in networking and related infrastructure. All of this will enable more efficient training on a per-dollar basis.
At the same time, aggregate FLOP demand will also rise substantially, as labs look to push further on scale. If the current decade-long trend in training compute holds, we’ll see frontier training runs of around 2e29 FLOPs by 2030. Training at this scale would require roughly 20 million H100-equivalent GPUs, given current training run-times and utilization rates. Assuming multiple frontier labs remain in the arena, total demand will be several multiples higher still, since each will need its own share of the supply. In total, EpochAI projects we’ll need on the order of 100 million H100-equivalents by then, roughly 50x 2024 shipments. SemiAnalysis offers a similar forecast, with frontier training demand and GPU supply growing roughly in tandem over this period.
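For a rough sense of where the 20 million figure comes from, the back-of-the-envelope below uses assumed values: a low-precision peak of about 2e15 FLOP/s per H100, roughly 40% utilization, and a training run lasting around 150 days. None of these inputs come from the projections cited above; they are simply one plausible combination that lands in the same range.

```python
# Rough sanity check on the ~20 million H100-equivalent figure.
# All inputs below are illustrative assumptions, not sourced values.
target_flops = 2e29                  # projected frontier training run by 2030
h100_peak_flops_per_s = 2e15         # assumed low-precision peak per H100
utilization = 0.4                    # assumed, in line with reported MFU figures
run_seconds = 150 * 24 * 3600        # assumed ~150-day training run

flops_per_gpu = h100_peak_flops_per_s * utilization * run_seconds
gpus_needed = target_flops / flops_per_gpu
print(f"~{gpus_needed / 1e6:.0f} million H100-equivalents")  # ~19 million
```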
Capacity could get even tighter for various reasons. For example, if manufacturing bottlenecks delay projected ship cycles, as often happens. Or if we fail to produce enough energy to power the data centers, as many analysts project. Or if we struggle to connect these power sources to the grid, as is happening today. Or if the growing scrutiny on capex eventually catches up, causing the industry to scale back. And so on. At best, our current approach will equip a handful of companies to continue advancing the state of research. And it might not even be enough to do that.
It’s clear we need a new approach. One that does not require perpetual scaling of data centers, capex, and energy consumption in order to find the next breakthrough. But rather, one that utilizes our infrastructure efficiently, scaling elastically as demand ebbs and flows over time. This would allow more experimentation in research, since training runs would no longer need to underwrite ROI on hundred-million dollar compute budgets. Freed from this constraint, we could push beyond the current LLM paradigm, as many believe necessary to reach AGI.
To understand what this alternative might look like, we can draw inspiration from distributed computing efforts of the past.
Crowd Computing: A Brief History
SETI@home popularized the idea in 1999, allowing millions of participants to analyze radio signals in search of extraterrestrial intelligence. SETI would collect electromagnetic data from the Arecibo telescope, split it into batches and send it to users over the internet. Users would analyze the data in the background of their normal processing activities and send the results back. Users did not need to communicate with each other, and batches could be reviewed independently, making the effort highly parallelizable. At its peak, SETI@home had over 5 million participants and exceeded the processing power of the largest supercomputer at the time. It eventually shut down in March 2020, but its success inspired the volunteer compute movement that would follow.
Folding@home picked up the idea in 2000, leveraging edge compute to simulate protein folding in diseases like Alzheimer's, cancer, and Parkinson's. Volunteers ran protein simulations on their personal computers during idle time, helping researchers study how proteins misfold and contribute to disease. At various points in its history, Folding’s compute capacity exceeded that of the largest supercomputer of the day, including throughout the late 2000s and again during COVID, when it became the first distributed computing project to surpass one exaFLOPS of compute. Folding researchers have produced over 200 peer-reviewed scientific papers since the project’s inception, each enabled by volunteer compute.
Berkeley Open Infrastructure for Network Computing (BOINC) generalized the idea even further in 2002, providing a platform to crowdsource compute for various research projects. It supported projects like SETI@home and Folding@home, as well as new initiatives in areas such as astrophysics, molecular biology, mathematics, and cryptography. As of 2024, BOINC lists 30 active projects, and nearly 1,000 scientific papers produced using its compute network.
Outside of the research field, volunteer compute has been used to train engines for games like Go (LeelaZero, KataGo) and chess (Stockfish, LeelaChessZero). LeelaZero was trained from 2017 to 2021 using volunteer compute, allowing it to play over ten million games against itself, and creating one of the strongest Go engines available today. Similarly, Stockfish has been continuously training on a volunteer network since 2013, making it one of the most popular and powerful chess engines in existence.
Challenges with Deep Learning
But could we apply this model to deep learning? Could we network together the world's edge devices to create a low-cost, public training cluster? Consumer hardware, from Apple laptops to Nvidia gaming cards, is becoming increasingly performant at deep learning. In many cases, these devices even exceed the performance of data center cards on a per-dollar basis.
However, to utilize these resources in a distributed setting, we would need to overcome various challenges.
First, current distributed training techniques assume heavy inter-node communication.
Today’s frontier models have grown so large that training must be split across thousands of GPUs. This is done using a variety of parallelism techniques, splitting either the model, the dataset, or both across the available GPUs. This generally requires high-bandwidth, low-latency networking, otherwise nodes will sit idle waiting for data to arrive.
For example, distributed data parallelism (DDP) splits the dataset across the GPUs, with each one training the full model on its own shard of data and then sharing its gradient updates to produce the new model weights. This requires relatively limited communication overhead, since nodes only share updates after each backward pass, and collective communication can partly overlap with computation. However, it only works for smaller models, since it requires each GPU to hold the entire model weights, activations, and optimizer state in memory. For example, GPT-4 required over 10 terabytes of memory in training, while a single H100 has 80GB.
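To make the communication pattern concrete, here is a minimal sketch of data parallel training using PyTorch’s DistributedDataParallel, run as two CPU processes with the gloo backend. The tiny model and random data are placeholders; the point is that every rank holds the full model and gradients are all-reduced on each backward pass.

```python
# Minimal DDP sketch: two CPU workers, each training the full model on its own
# (random, stand-in) data shard; gradients are averaged across workers on every
# backward pass so all ranks keep identical weights.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(32, 1))                # every rank holds the full model
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(16, 32)                        # this rank's shard of data
        y = torch.randn(16, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()                                # DDP all-reduces gradients here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```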
To overcome this issue, we also split the model across GPUs using various techniques. For example, tensor parallelism splits individual weights within a single layer across GPUs, with each one performing the necessary operations and communicating the outputs to the others. This lowers the memory requirements for each GPU but requires constant communication between them, necessitating high-bandwidth, low-latency connections to maximize efficiency.
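The toy example below illustrates the idea without any real multi-GPU setup: a single linear layer’s weight matrix is split column-wise between two hypothetical workers, each computes its slice of the output, and the slices are stitched back together, which in a real system is a collective communication over a fast interconnect.

```python
# Toy tensor parallelism: split a weight matrix column-wise across two
# "workers" and recombine their partial outputs. The concatenation step stands
# in for the all-gather a real implementation performs on every layer.
import torch

d_in, d_out, batch = 8, 6, 4
x = torch.randn(batch, d_in)
w = torch.randn(d_in, d_out)

y_full = x @ w                                   # single-device reference

w0, w1 = w[:, : d_out // 2], w[:, d_out // 2 :]  # each worker holds half the columns
y0 = x @ w0                                      # computed on worker 0
y1 = x @ w1                                      # computed on worker 1
y_combined = torch.cat([y0, y1], dim=1)          # the communication step

assert torch.allclose(y_full, y_combined)
```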
Pipeline parallelism splits the layers of the model across GPUs, with each one performing its work and sharing its outputs with the next GPU in the pipeline. While this requires less communication than tensor parallelism, it can suffer from “bubbles” (i.e., idle periods) where GPUs in the later stages of the pipeline wait for data from the earlier GPUs in order to begin their work.
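The short simulation below prints a toy forward-only schedule for two stages and four microbatches, just to make the bubble visible; it performs no real computation.

```python
# Toy pipeline schedule: stage s processes microbatch m at time step s + m.
# The "." slots are bubbles: stage 1 is idle until stage 0 produces its first
# output, and stage 0 is idle while stage 1 finishes the last microbatches.
n_stages, n_microbatches = 2, 4
n_steps = n_stages + n_microbatches - 1

timeline = {s: [" . "] * n_steps for s in range(n_stages)}
for s in range(n_stages):
    for m in range(n_microbatches):
        timeline[s][s + m] = f"F{m} "
for s in range(n_stages):
    print(f"stage {s}: " + " ".join(timeline[s]))
# stage 0: F0  F1  F2  F3   .
# stage 1:  .  F0  F1  F2  F3
```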
To address these issues, various techniques have emerged. For example, ZeRO (Zero Redundancy Optimizer) is a memory optimization technique that trades higher communication for reduced memory, enabling larger models to train on a given set of devices. ZeRO reduces memory usage by sharding model parameters, gradients, and optimizer states across GPUs, but relies on heavy communication for devices to fetch the sharded data when needed. It underpins popular approaches like PyTorch’s fully sharded data parallelism (FSDP) and Microsoft’s DeepSpeed library.
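The quick accounting below shows why sharding matters, using the common estimate of roughly 16 bytes per parameter for mixed-precision training with Adam (2 bytes for half-precision weights, 2 for gradients, and 12 for the fp32 master weights and optimizer moments). The 7B-parameter model and 8-GPU setup are illustrative, and real implementations differ in the details.

```python
# Back-of-the-envelope per-GPU memory for a 7B-parameter model under plain
# data parallelism vs. the three ZeRO sharding stages. Treat as a sketch:
# byte counts follow the usual ~16 bytes/parameter accounting for Adam in
# mixed precision, and ignore activations and fragmentation.
params, n_gpus = 7e9, 8
weights, grads, optim = 2 * params, 2 * params, 12 * params

def gib(n_bytes):
    return n_bytes / 2**30

print(f"DDP (no sharding):    {gib(weights + grads + optim):6.1f} GiB per GPU")
print(f"ZeRO-1 (optimizer):   {gib(weights + grads + optim / n_gpus):6.1f} GiB per GPU")
print(f"ZeRO-2 (+ gradients): {gib(weights + (grads + optim) / n_gpus):6.1f} GiB per GPU")
print(f"ZeRO-3 (+ weights):   {gib((weights + grads + optim) / n_gpus):6.1f} GiB per GPU")
```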
These techniques are often combined in large model training to maximize utilization, in what is known as 3D parallelism. In this setup, tensor parallelism is typically used to split the weights across GPUs within a single server, since it requires heavy communication at every partitioned layer. Pipeline parallelism is then used to split the layers between GPUs in different servers (but within the same island in the data center), since it requires less communication. And data parallelism (or FSDP) is used to split the dataset between islands of servers, since it can tolerate longer network latency by sharing updates asynchronously and/or compressing gradients. Meta used this approach to train Llama 3.1, as depicted in the diagram below.
These methods pose a core challenge for decentralized training networks, which rely on devices connected over the (much slower and more variable) consumer internet. In this environment, communication costs can quickly exceed the benefit of aggregating edge compute, since devices are often sitting idle waiting for data to arrive. Using a simple example, distributed data parallel training of a 1B parameter model in half precision would require each GPU to share 2GB of data at each optimization step. With typical internet bandwidth (e.g. 1 gigabit per second), assuming non-overlapping compute and communication, communicating the gradient updates will take at least 16 seconds, leading to significant idle time. Techniques like tensor parallelism (which require heavier communication) would of course fare even worse.
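The arithmetic behind that 16-second figure is simple enough to spell out:

```python
# Time to send one full set of half-precision gradients for a 1B-parameter
# model over a 1 Gbit/s link, ignoring protocol overhead and any overlap
# between computation and communication.
params = 1e9
bytes_per_param = 2                       # fp16 / bf16 gradients
payload_bits = params * bytes_per_param * 8
link_bits_per_s = 1e9                     # 1 gigabit per second

print(f"{payload_bits / link_bits_per_s:.0f} seconds per step just to communicate")  # 16
```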
Second, current training techniques are not fault tolerant.
Like any distributed system, training clusters become more susceptible to failure as their scale increases. However this issue is magnified with training, since our current techniques are largely synchronous, meaning GPUs must work together in lockstep as they move through the model. Failure of a single GPU among thousands can halt the entire training run, forcing the other GPUs to restart from the last checkpoint. In some cases, GPUs don’t fail outright but slow down for various reasons, in turn slowing down thousands of other GPUs in the cluster. This can mean tens or hundreds of millions of dollars in added costs, given the scale of today’s clusters.
Meta detailed these issues in their Llama training run, where they experienced over 400 unexpected interruptions, or roughly 8 per day of training. The vast majority of these were attributed to hardware issues, such as GPU or host component failures. This contributed to their low GPU utilization rate (38-43%). OpenAI did even worse with GPT-4 (32-36%), also due to the high frequency of failures during the training run.
In other words, frontier labs routinely struggle to achieve 40% utilization, despite training within fully-optimized environments that include homogeneous, state-of-the-art hardware, networking, power, and cooling. Much of this owes to hardware failure and networking issues, which would only be magnified within edge training environments, where devices have uneven processing power, bandwidth, latency, and reliability. Not to mention, decentralized networks are susceptible to malicious actors, who may look to subvert the broader project, or cheat on their particular workload, for various reasons. Even SETI@home, a pure volunteer network, saw cheating from various participants.
And third, frontier model training requires massive scale.
While projects like SETI and Folding reached impressive scale, they are dwarfed by the compute requirements of today’s frontier training runs. GPT-4 was trained on a cluster of 20,000 A100s, with a peak throughput of 6.28 exaFLOPS at half precision. This is roughly three times the compute Folding@home amassed at its peak. Llama 405B was trained on 16,000 H100s, with a peak throughput of 15.8 exaFLOPS (7 times Folding’s peak). This gap will only continue to grow, as multiple labs plan to build 100k+ H100 clusters, each packing a staggering 99 exaFLOPS.
This makes sense, as the @home projects were volunteer-based. Contributors donated their memory and processor cycles and incurred all of the associated costs, which naturally caps the scale of such projects relative to for-profit alternatives.
Recent Advances
While these issues have historically plagued decentralized training efforts, they no longer appear insurmountable. New training techniques have emerged that minimize inter-node communication requirements, allowing efficient training over internet-connected devices. Many of these originated in the large labs themselves, which are looking to add greater scale to model training and therefore need communication-efficient techniques that can span data centers. We’ve also seen advances in fault-tolerant training methods, as well as cryptographic incentive systems, which together can enable larger-scale training within edge environments.
Communication Efficient Techniques
DiLoCo is a recent work from Google, which reduces communication overhead by performing local optimization updates before communicating updated model states across devices. Their method (which builds on earlier work in federated learning) shows comparable performance to traditional synchronous training with 500 times less communication between nodes. It has since been replicated by others and scaled to train even larger models (over 1B parameters). It’s also been extended into the asynchronous setting, meaning nodes can share gradient updates at different times, rather than all at once. This better accommodates edge hardware that might have different processing power and networking speeds.
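The sketch below runs the general DiLoCo-style loop in a single process: each worker takes a number of purely local optimization steps, then only the resulting change in weights is averaged and applied by an outer optimizer. The tiny model, random data, and hyperparameters are illustrative, not the paper’s setup.

```python
# Single-process sketch of local-update training in the DiLoCo style:
# H local AdamW steps per worker, then one communication round in which the
# averaged weight delta ("outer gradient") is applied by an outer optimizer.
import copy
import torch

global_model = torch.nn.Linear(16, 1)
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                            momentum=0.9, nesterov=True)   # illustrative values
H, n_workers, outer_rounds = 50, 4, 10

for _ in range(outer_rounds):
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]
    for _ in range(n_workers):
        worker = copy.deepcopy(global_model)               # start from shared weights
        inner_opt = torch.optim.AdamW(worker.parameters(), lr=1e-3)
        for _ in range(H):                                 # purely local steps
            x = torch.randn(32, 16)
            y = x.sum(dim=1, keepdim=True)                 # toy regression target
            loss = torch.nn.functional.mse_loss(worker(x), y)
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        for d, p_g, p_l in zip(deltas, global_model.parameters(), worker.parameters()):
            d += (p_g.detach() - p_l.detach()) / n_workers  # averaged outer gradient

    outer_opt.zero_grad()                                  # one communication round
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d
    outer_opt.step()
```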
Other data parallel methods like lo-fi and DisTrO aim to reduce communication costs even further. Lo-fi proposes completely local fine-tuning, meaning nodes train independently and only communicate weights to each other at the end (rather than every step). This method matches the baseline's performance when fine-tuning 1B+ language models while removing the communication requirement entirely. In a preliminary report, DisTrO claims to use a novel set of distributed optimizers that they believe can reduce communication requirements by four to five orders of magnitude (though their method is still to be confirmed).
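A lo-fi-style run is even simpler to sketch: workers fine-tune fully independent copies with no communication at all, and their weights are averaged once at the very end. Again, the model and data here are toy stand-ins.

```python
# Sketch of "communicate only at the end": independent fine-tuning on each
# worker, followed by a single weight-averaging step.
import copy
import torch

base = torch.nn.Linear(16, 1)
workers = [copy.deepcopy(base) for _ in range(4)]

for w in workers:                                # no communication during training
    opt = torch.optim.SGD(w.parameters(), lr=0.01)
    for _ in range(100):
        x = torch.randn(32, 16)
        y = x.sum(dim=1, keepdim=True)
        loss = torch.nn.functional.mse_loss(w(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

# the only communication event: average the final weights
avg_state = {k: torch.stack([w.state_dict()[k] for w in workers]).mean(dim=0)
             for k in base.state_dict()}
merged = torch.nn.Linear(16, 1)
merged.load_state_dict(avg_state)
```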
New model parallel approaches have emerged as well, which can enable even greater scale. DiPaCo (also from Google) splits the model into multiple blocks, each comprising distinct expert modules. The training data is then sharded by “paths”, which are sequences of experts that each data sample corresponds to. Given a shard, each worker can train a specific path almost independently, except for the communication required for modules shared across multiple paths, which is handled using DiLoCo. This architecture cuts the training time of a billion-parameter model by more than half.
SWARM Parallelism and Decentralized Training of Foundation Models in Heterogeneous Environments (DTFMHE) also propose model parallel methods to enable large model training within heterogeneous environments. SWARM finds that pipeline parallelism becomes less communication-constrained as model size increases, making it possible to train larger models efficiently with lower network bandwidth and higher latency.
To incorporate this idea in a heterogeneous environment, they use temporary “pipelines” between nodes that can be updated on the fly during each iteration. This allows nodes to send their outputs to any peer that serves the next pipeline stage. This means that if one peer is faster than others, or if any participant disconnects, outputs can be re-routed dynamically to ensure training continues, so long as there is at least one active participant per stage. They use this method to train a 1B+ parameter model on low-cost, heterogeneous GPUs with slow interconnect (as depicted below).
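A toy version of the routing logic might look like the following, where each stage has a pool of peers and a sender simply picks the lowest-latency peer that is still alive. All names and numbers here are made up.

```python
# Toy dynamic rerouting: pick any live peer serving the next pipeline stage,
# preferring lower latency. Training only stalls if a stage has no live peers.
stage_peers = {
    1: {"peer_a": 0.12, "peer_b": 0.45},   # stage -> {peer: measured latency, s}
    2: {"peer_c": 0.30, "peer_d": 0.25, "peer_e": 0.90},
}
alive = {"peer_a": False, "peer_b": True, "peer_c": True, "peer_d": True, "peer_e": True}

def route(stage):
    candidates = {p: lat for p, lat in stage_peers[stage].items() if alive[p]}
    if not candidates:
        raise RuntimeError(f"no live peers for stage {stage}")
    return min(candidates, key=candidates.get)

for stage in (1, 2):
    print(f"stage {stage} -> {route(stage)}")   # stage 1 -> peer_b, stage 2 -> peer_d
```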
DTFMHE similarly proposes a novel scheduling algorithm, along with pipeline and data parallelism, to train a large model across devices spanning 3 continents. Their method is only 1.7-3.5x slower than standard DeepSpeed inside a data center, even though their network is 100x slower. Like SWARM, DTFMHE shows that communication costs can effectively be hidden as models get bigger, even within a geographically-distributed network. This lets us circumvent weaker connectivity between nodes through various techniques, including increasing hidden layer sizes and adding more layers per pipeline stage.
Fault Tolerance
Many of the data parallel methods above are fault tolerant by default, since every node is storing the full model in memory. This redundancy generally means that nodes can continue working independently even if others fail. This is important for decentralized training, where nodes are often unreliable, heterogeneous, and potentially even malicious. However, pure data parallel methods only work for smaller models, as noted, and therefore model size is limited to the memory capacity of the smallest node in the network.
To overcome this, some have proposed fault tolerant techniques that extend to model parallel (or hybrid parallel) training as well. SWARM handles peer failures by prioritizing stable peers with lower latency and rerouting pipeline stages in the event of failures. Others like Oobleck use a similar idea, creating multiple “pipeline templates” to provide redundancy in the event that some portion of the nodes fail. Though tested in a data center, Oobleck’s approach provides strong reliability guarantees that can be extended to the decentralized setting as well.
We’ve also seen new model architectures (like Decentralized Mixture of Experts (DMoE)) that are designed to enable fault tolerant training within decentralized environments. Like traditional mixture of experts, DMoE consists of multiple independent “expert” sub-networks that are distributed over a pool of workers. DMoE uses a Distributed Hash Table to track and incorporate asynchronous updates in a decentralized way. This mechanism (also used in SWARM) is resilient to node failure, as it can exclude certain experts from averaging if those nodes fail or do not respond in time.
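The failure-handling idea can be illustrated with a toy gating step: a token’s output is averaged only over the selected experts that actually respond, so a dead worker simply drops out of the average. The setup below is entirely synthetic.

```python
# Toy fault-tolerant expert averaging: experts that fail or time out are
# excluded from the average rather than stalling the step.
import torch

d, n_experts = 16, 8
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
responded = [True, True, False, True, False, True, True, True]   # simulated failures

x = torch.randn(1, d)
selected = [0, 2, 5]                          # experts chosen by the gating function
outputs = [experts[i](x) for i in selected if responded[i]]
if outputs:                                   # expert 2 is down, so average 0 and 5
    y = torch.stack(outputs).mean(dim=0)
```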
Scale
Finally, cryptographic incentive systems like those used in Bitcoin and Ethereum can help aggregate the necessary scale. Both networks crowdsource compute by paying contributors in a native asset that can appreciate as adoption grows. This design incentivizes early contributors by granting them outsized rewards, which can then taper off once the network reaches minimum-viable scale.
To be sure, there are various pitfalls with this mechanism that must be avoided. Principal among them is over-incentivizing supply without generating corresponding demand. This design can also pose regulatory issues if the underlying network is not sufficiently decentralized. However, when designed properly, decentralized incentive systems can amass meaningful scale over sustained periods.
For example, Bitcoin currently consumes roughly 150 terawatt hours (TWh) of electricity annually. This is two orders of magnitude greater than the power consumption of the largest AI training clusters currently being conceived (100,000 H100s running at full capacity for a year). For reference, OpenAI’s GPT-4 was trained on 20,000 A100s. Meta’s flagship Llama 405B model was trained on 16,000 H100s. Likewise, at its peak, Ethereum consumed roughly 70 TWh, spread across several million GPUs. Even accounting for the massive ramp in AI data centers over the coming years, incentivized compute networks like these will continue to exceed their scale multiple times over.
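The rough arithmetic behind that comparison, using assumed figures for per-GPU power draw and data center overhead:

```python
# Back-of-the-envelope energy comparison. Board power and PUE are assumptions
# for illustration, not measured values.
gpus = 100_000
h100_watts = 700                 # assumed per-GPU board power
pue = 1.5                        # assumed data center overhead multiplier
hours_per_year = 8760

cluster_twh = gpus * h100_watts * pue * hours_per_year / 1e12   # Wh -> TWh
print(f"100k-H100 cluster: ~{cluster_twh:.1f} TWh/yr vs Bitcoin: ~150 TWh/yr "
      f"(~{150 / cluster_twh:.0f}x)")
```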
Of course, not all compute is fungible, and training has unique requirements relative to mining that must be considered. Nonetheless, these networks are instructive as to the scale that can be achieved using these mechanisms.
The Road Ahead
Connecting these pieces together, we can see the beginning of a new path forward.
Soon, new training techniques will allow us to move beyond the data center, as devices no longer need to be co-located to be useful. This will take time, as our current methods for decentralized training are still sub-scale, mostly topping out in the 1-2B parameter range – a few orders of magnitude smaller than models like GPT-4. We’ll need further breakthroughs to scale these methods without sacrificing key properties, like communication efficiency and fault tolerance. Or, we’ll need new model architectures that look different than today’s large monoliths - likely smaller and more modular, running on the edge instead of in the cloud.
In either case, it’s reasonable to bet on further progress in this direction. The costs of our current approach are unsustainable, providing a strong market incentive to innovate. We already see this with manufacturers like Apple building stronger edge devices to run more workloads locally rather than in the cloud. We also see growing support for open source solutions, even among companies like Meta, to enable more decentralized research and development. These trends will only accelerate over time.
In parallel, we also need new networking infrastructure to connect edge devices so they can be used in this way. These devices include laptops, gaming desktops, and eventually even phones with high-performance graphics cards and large memory capacity. This would allow us to build a “global cluster” of low-cost, always-on compute that we could parallelize training tasks across. This is a challenging problem as well, requiring further progress on multiple fronts.
We need better scheduling techniques for training within heterogeneous environments. No method currently exists to automatically parallelize a model optimally over a cluster of devices with heterogeneous hardware and interconnect, particularly where devices can drop or join at any time. This is a critical next step to optimize training without losing the scale benefits of edge-based networks.
We must also grapple with the complexities of decentralized networks in general. To maximize scale, the network should be built as an open protocol: a set of standards and instructions that specify how participants engage, like TCP/IP but for machine learning computation. This would allow it to network together any device that follows the specification, regardless of who owns it or where it’s located. It would also ensure the network remains neutral, allowing users to train whatever models they prefer.
While this maximizes scale, it also requires a mechanism to verify the correctness of all training tasks that does not rely on a single party. This is critical given the inherent incentive to cheat – e.g. claim you performed a training task to earn money without actually doing it. This is especially challenging given that different devices often compute machine learning operations differently, making it difficult to verify correctness using standard replication techniques. Solving this problem correctly requires deep work in cryptography and other disciplines.
Fortunately, we continue to see progress on all of these fronts. Unlike in years past, these challenges no longer appear insurmountable. They’re also quite small relative to the opportunity. Google summarized it best in their DiPaCo paper, noting the negative feedback loop that decentralized training has the potential to break:
[P]rogress in distributed training of ML models may allow simpler infrastructure buildouts, leading eventually to more available compute. As it is, infrastructure is designed around the standard approach to training large monoliths; and ML models are architected to take advantage of the current infrastructure and training approaches. This feedback loop may be leading the community to a spurious local minimum where compute is more constrained than it needs to be.
Perhaps most exciting of all, there is a growing enthusiasm in the research community toward solving these problems. Our team at Gensyn is building the networking infrastructure described above. Others like Hivemind and BigScience have implemented many of these techniques in practice. Projects like Petals, sahajBERT, and Bloom demonstrate their capabilities, as well as a growing appetite for community-based machine learning in general. Many others are contributing to the state of research as well, with a goal of building a more open and collaborative model training ecosystem. If this work interests you, please reach out to get involved.