RL Swarm: a framework for collaborative RL

This is open source code (MIT Licence) for peer-to-peer nodes that perform collaborative reinforcement learning over the internet, accessible to anyone on consumer or datacentre hardware.

We’ve long believed that the future of machine learning is both decentralised and fragmented. Today’s monolithic models will be replaced by sharded parameters distributed across every device in the world. In our research we explore different paths towards this future, and recently we’ve found that reinforcement learning (RL) post-training works particularly well when models train collaboratively, communicating and critiquing each other’s answers.

Specifically, we find that models training with RL learn faster when they train as a collective swarm than they do on their own. To understand why, read more here. You can also participate directly in our live demo to see it in practice.

How it works

In our setup, each swarm node runs the Qwen 2.5 1.5B model and works through math problems from GSM8K in three stages:

  • Stage 1 (Answering): Each model independently solves the problem, outputting its reasoning and answer in a specified format.
  • Stage 2 (Critiquing): Each model looks at the answers provided by its peers and provides its feedback.
  • Stage 3 (Resolving): Each model votes for the answer it expects the majority to consider best for each question, then produces a final revised answer to the prompt (a code sketch of one round follows this list).
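As a rough illustration of how one round of this three-stage game might be orchestrated for a single node, here is a minimal Python sketch. The names (Peer, run_round, the prompt strings) are hypothetical placeholders rather than the actual RL Swarm API, and dummy generators stand in for the Qwen model so the snippet runs as-is:

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    generate: callable  # text in, text out (e.g. a wrapped Qwen 2.5 1.5B model)


def run_round(peers: list[Peer], problem: str) -> dict[str, str]:
    # Stage 1 (Answering): every peer solves the problem independently.
    answers = {p.name: p.generate(f"Solve step by step:\n{problem}") for p in peers}

    # Stage 2 (Critiquing): every peer reviews the other peers' answers.
    critiques = {
        p.name: p.generate(
            "Critique these answers:\n"
            + "\n".join(f"{n}: {a}" for n, a in answers.items() if n != p.name)
        )
        for p in peers
    }

    # Stage 3 (Resolving): every peer votes for the answer it expects the
    # majority to prefer, then emits a final revised answer.
    return {
        p.name: p.generate(
            f"Problem: {problem}\nAnswers: {answers}\nCritiques: {critiques}\n"
            "Vote for the answer the majority will prefer, then give your final revised answer."
        )
        for p in peers
    }


if __name__ == "__main__":
    # Dummy generators stand in for real models so the sketch runs as written.
    echo = lambda prompt: f"(model output for: {prompt[:40]}...)"
    swarm = [Peer("node_a", echo), Peer("node_b", echo)]
    print(run_round(swarm, "Natalia sold 48 clips in April and half as many in May. How many in total?"))
```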

In our experiments we found that this system speeds up the learning process, allowing models to produce better answers on unseen test data with fewer training steps.

The plots above show TensorBoard results from a node in the swarm. In each plot we see cyclic behaviour, an artifact of the “resets” between rounds of the multi-stage game. All plots share time since joining the swarm on the x-axis. For the y-axes, from the leftmost plot to the rightmost, we see:

i) The “consensus correctness reward”, which is awarded when this swarm participant both correctly formatted its choice of best answer and chose an answer that was indeed mathematically correct;

ii) The total reward, which is a weighted sum of several rule-based rewards (i.e., checks on formatting and on the mathematical/logical correctness of responses; a toy example of such a sum follows this list);

iii) The training loss, which captures the feedback signal for reward maximisation being propagated to update the “base” LLM; and

iv) The response completion length for the model, which captures the number of tokens in the output response (this shows that the models learn to be more concise when critiqued by their peers).
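To make the “weighted sum of rule-based rewards” concrete, here is a toy example. The answer-tag format, the checks, and the weights are assumptions for illustration only, not the swarm’s actual reward configuration:

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response contains a final answer in the expected tag (assumed format)."""
    return 1.0 if re.search(r"<answer>.*?</answer>", response, re.DOTALL) else 0.0

def correctness_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the GSM8K ground truth."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str,
                 w_format: float = 0.5, w_correct: float = 1.0) -> float:
    """Weighted sum of the individual rule-based rewards (weights are illustrative)."""
    return (w_format * format_reward(response)
            + w_correct * correctness_reward(response, ground_truth))

print(total_reward("Reasoning... <answer>72</answer>", "72"))  # 1.5
```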


Join the swarm

To further demonstrate this, and to scale up our experiments, we're releasing a live demo you can join at home. It is a fully open source framework for creating RL training swarms over the internet. 

Running a swarm node lets you either launch a new swarm or join an existing one by connecting to a peer's public address. Each swarm performs RL reasoning as a group, with a gossiping system (built on Hivemind) for collaborative improvement between models. Running the bundled client allows you to connect to a swarm, listen for messages, and train your own model locally as part of the collective. In the future we'll launch more experiments for the swarm to run, and we'd love to see wide participation from the crowd.
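For a feel of the networking layer, here is a minimal sketch of bootstrapping or joining a peer-to-peer swarm with Hivemind's DHT. The listen and peer addresses below are placeholders, and the actual RL Swarm launch scripts wrap this configuration for you:

```python
import hivemind

# Start a new DHT node; the first node in a fresh swarm has no initial peers.
dht = hivemind.DHT(
    host_maddrs=["/ip4/0.0.0.0/tcp/0"],  # listen on any free TCP port (placeholder)
    start=True,
)

# Print the addresses other peers can use to join this swarm.
print("Share these addresses so peers can join:")
for addr in dht.get_visible_maddrs():
    print(addr)

# A second machine would join the existing swarm by pointing at a visible address:
# peer_dht = hivemind.DHT(
#     initial_peers=["/ip4/203.0.113.7/tcp/41337/p2p/Qm..."],  # placeholder address
#     start=True,
# )
```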

RL Swarm is fully open and permissionless, meaning you can run it on a basic consumer laptop at home or on a powerful GPU in the cloud.

***

RL Swarm is a look into the future of machine learning. It provides a framework for collaborative RL among peers, where intelligence draws on the wisdom of the crowd rather than being built inside a handful of closed labs. Scaling it further will require an open compute network that connects every device in the world, which we're excited to share more about soon.