SkipPipe: a communication-efficient method for decentralised training

SkipPipe is an academic paper on communication-efficient pipeline-parallel training. It introduces a scheduling algorithm that maximises throughput and fault tolerance whilst minimising the convergence impact of layer skips. The method reduces distributed training iteration time by up to 55% and is robust to a 50% node failure rate at inference time.

Recent advances in large language models have been driven by scale: bigger datasets and parameter counts have produced better models. While this trend has been shown to increase performance predictably, it has also pushed development costs to breaking point, as models must now be split across thousands of expensive, inter-connected nodes during training.

To solve this issue, we need new methods that limit inter-node communication during training. This unlocks training over geographically distributed hardware, removing a key bottleneck in today's hardware supply.

Most of the early research in this field has focused on data-parallel methods, where each node independently trains a copy of the model and only shares gradient updates at some infrequent interval. These methods are a good starting point, as they are inherently communication efficient. However, they do not scale well: every node must store the full model in memory, capping the model size at the memory capacity of the smallest participating node.
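
To make the communication pattern concrete, here is a minimal, hypothetical sketch in the style of local-SGD data parallelism: every node holds a full replica of the parameters and peers only exchange (average) them every few steps. The node count, sync interval, and update rule are illustrative, not taken from any particular method.

```python
import random

def local_step(params, lr=0.01):
    """One local update; a stand-in for a real forward/backward pass."""
    return [p - lr * random.gauss(0.0, 1.0) for p in params]

def all_reduce_average(replicas):
    """Every node ends up with the element-wise mean of all replicas."""
    n = len(replicas)
    mean = [sum(col) / n for col in zip(*replicas)]
    return [list(mean) for _ in replicas]

num_nodes, num_params, sync_every = 4, 8, 10
# The memory cap is visible here: every node stores the *full* parameter vector.
replicas = [[0.0] * num_params for _ in range(num_nodes)]

for step in range(1, 101):
    replicas = [local_step(p) for p in replicas]   # independent local work
    if step % sync_every == 0:                     # infrequent communication
        replicas = all_reduce_average(replicas)
```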

Introducing SkipPipe

Together with researchers from the University of Neuchâtel and TU Delft, we have developed SkipPipe to address this issue: a fault-tolerant, pipeline-parallel method that dynamically skips and reorders stages to optimise training in decentralised environments. SkipPipe shows a 55% reduction in training time compared with standard pipeline methods in these environments, with no degradation in convergence.

It is also highly fault tolerant, demonstrating robustness to a 50% node failure rate with only a 7% degradation in perplexity at inference time (i.e. when half of the pipeline nodes for a single model are unavailable, running inference through the now-sparse model costs us only 7% in perplexity).

Unlike existing data-parallel methods, SkipPipe can accommodate large-model training. Because it shards the model itself across nodes, rather than simply sharding the dataset, SkipPipe reduces the memory footprint on each individual node and removes the cap on model size: models are bounded by the aggregate memory of the network rather than by any single node, and can therefore be built across distributed, and decentralised, infrastructure.
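
A back-of-envelope calculation shows why sharding lifts the cap. The numbers below are illustrative (fp16 weights only, ignoring activations and optimiser state), not figures from the paper.

```python
def per_node_param_gib(params_billions, num_stages=1, bytes_per_param=2):
    """Approximate parameter memory per node, assuming fp16 weights and
    ignoring activations and optimiser state."""
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / num_stages / 2**30

# Data parallel: every node holds the whole model.
print(per_node_param_gib(70))                 # ~130 GiB, beyond a single GPU
# Pipeline parallel over 8 stages: each node holds only its shard.
print(per_node_param_gib(70, num_stages=8))   # ~16 GiB, commodity-hardware territory
```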

How it Works

SkipPipe builds on conventional pipeline parallelism by dynamically choosing which stages to execute for each microbatch, rather than sending every microbatch through every stage in order. In traditional pipelines, every microbatch travels through all layers of the model, so if one stage is delayed, all subsequent stages must wait. SkipPipe instead lets you define a skip ratio (k%), allowing certain layers to be bypassed for a microbatch if they would otherwise create a bottleneck.
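
As a rough illustration of the idea (our sketch, not the paper's algorithm), the routine below plans a route for one microbatch: given latency estimates per stage and a skip ratio k, it drops the slowest stages up to the skip budget. Always keeping stage 0 is an assumption of this sketch, on the grounds that the first stage typically holds the embeddings.

```python
import random

def plan_route(stage_delays, k=0.25):
    """Choose which stages a microbatch executes: skip the slowest stages,
    up to a budget of k * num_stages. Stage 0 is always kept (our assumption,
    not the paper's specification)."""
    num_stages = len(stage_delays)
    budget = int(k * num_stages)
    skippable = sorted(range(1, num_stages),
                       key=lambda s: stage_delays[s], reverse=True)
    skipped = set(skippable[:budget])
    return [s for s in range(num_stages) if s not in skipped]

random.seed(0)
stage_delays = [random.uniform(0.1, 1.0) for _ in range(8)]  # estimated per-stage latency
print(plan_route(stage_delays, k=0.25))  # the 2 slowest of 8 stages are skipped
```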

SkipPipe uses a novel scheduling algorithm to examine the available computational paths across the network and select the optimal route. This minimises GPU idle time while also enhancing fault tolerance by enabling the system to bypass failed or slow nodes.
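
In spirit, the scheduler solves a route-selection problem. The brute-force sketch below (names and cost model are ours; the paper's algorithm is far more efficient) scores each candidate path by estimated compute and link latency, excludes failed nodes entirely, and picks the cheapest route.

```python
import itertools

def path_cost(path, compute_time, link_time):
    """Estimated latency of a route: per-node compute plus hop-by-hop links."""
    return (sum(compute_time[n] for n in path)
            + sum(link_time[(a, b)] for a, b in zip(path, path[1:])))

def best_path(nodes, alive, num_stages, compute_time, link_time):
    """Exhaustive search over healthy nodes, purely to illustrate the objective."""
    healthy = [n for n in nodes if alive[n]]
    routes = itertools.permutations(healthy, num_stages)
    return min(routes, key=lambda p: path_cost(p, compute_time, link_time))

nodes = ["a", "b", "c", "d"]
alive = {"a": True, "b": True, "c": False, "d": True}   # node c has failed
compute_time = {"a": 1.0, "b": 0.8, "c": 0.5, "d": 1.2}
# Toy link costs: "closer" nodes are cheaper to talk to.
link_time = {(x, y): 0.1 * abs(ord(x) - ord(y))
             for x in nodes for y in nodes if x != y}
print(best_path(nodes, alive, 3, compute_time, link_time))  # bypasses node c
```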

Learn More

SkipPipe provides a core building block for distributed (and decentralised) training by bringing both communication efficiency and fault tolerance to pipeline parallelism, where existing work has focused solely on data parallelism. By targeting pipeline parallelism, we remove the cap on model size inherent in existing methods, allowing individual models to be extended over multiple distributed nodes rather than simply duplicated and trained in data-parallel fashion.

When combined with a coordination system and a trustless verification method, SkipPipe enables the efficient training of enormous frontier models across crowd-sourced compute.

To learn more, you can read the full paper here.

SkipPipe is fully open source, and we encourage the research community to build on top of the code.