# SwitchML: Switch-Based Training Acceleration for Machine Learning

SwitchML accelerates the Allreduce communication primitive commonly used by distributed Machine Learning frameworks. It uses a programmable switch dataplane to perform in-network computation, reducing the volume of exchanged data by aggregating vectors (e.g., model updates) from multiple workers in the network. Its end-host library can be integrated with ML frameworks to speed up training for a number of real-world benchmark models.
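
To make the Allreduce semantics concrete, here is a minimal, purely illustrative simulation in Python. It is not the SwitchML API: the function name `allreduce_sum` is hypothetical, and the sketch ignores packetization, transport, and the switch hardware entirely. It only shows the result that in-network aggregation is meant to deliver, namely that every worker receives the element-wise sum of all workers' vectors.

```python
from typing import List


def allreduce_sum(worker_vectors: List[List[float]]) -> List[List[float]]:
    """Simulate a sum-Allreduce: every worker ends up with the element-wise sum.

    Hypothetical helper for illustration only; not part of SwitchML.
    """
    if not worker_vectors:
        return []
    length = len(worker_vectors[0])
    if any(len(v) != length for v in worker_vectors):
        raise ValueError("all workers must contribute vectors of equal length")
    # The aggregation point (the switch, in SwitchML's case) sums the i-th
    # element contributed by every worker.
    aggregated = [sum(column) for column in zip(*worker_vectors)]
    # The same aggregated vector is then returned to each worker.
    return [list(aggregated) for _ in worker_vectors]


# Three workers, each holding a local model update of three elements.
updates = [
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
    [100.0, 200.0, 300.0],
]
results = allreduce_sum(updates)
print(results[0])  # [111.0, 222.0, 333.0]
```

In this model each worker sends its vector once and receives the aggregated vector once, instead of exchanging partial results with other workers.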

The switch hardware is programmed with a [P4 program](/dev_root/p4) for the [Tofino Native Architecture (TNA)](https://github.com/barefootnetworks/Open-Tofino) and managed at runtime through a [Python controller](/dev_root/controller) using BFRuntime. The [end-host library](/dev_root/client_lib) provides simple APIs to perform Allreduce operations over different transport protocols. Two transports are currently supported: UDP through DPDK, and RDMA UC. The library has already been integrated with ML frameworks as an [NCCL plugin](/dev_root/frameworks_integration/nccl_plugin).

## Getting started
To run SwitchML you need to:
- compile the P4 program and deploy it on the switch (see the [P4 code documentation](/dev_root/p4))
- run the Python controller (see the [controller documentation](/dev_root/controller))
- compile and run the end-host program using the end-host library (see the [library documentation](/dev_root/client_lib))

The [examples](/dev_root/examples) folder provides simple programs that show how to use the APIs.

## Repo organization
The SwitchML repository is organized as follows:

```
docs: project documentation
dev_root:
  ┣ p4: P4 code for TNA
  ┣ controller: switch controller program
  ┣ client_lib: end-host library
  ┣ examples: set of example programs
  ┣ benchmarks: programs used to test raw performance
  ┣ frameworks_integration: code to integrate with ML frameworks
  ┣ third_party: third party software
  ┣ protos: protobuf description for the interface between controller and end-host
  ┗ scripts: helper scripts
```

## Testing
The [benchmarks](/dev_root/benchmarks) folder contains the benchmark program that we used to measure SwitchML's performance.
In our experiments (see the benchmark documentation for details), we observed a speedup of more than 2x over NCCL when using RDMA. Moreover, unlike ring Allreduce, SwitchML's performance remains constant regardless of the number of workers.

![Benchmarks](/docs/img/benchmark.png)
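
For intuition about why the number of workers matters for ring Allreduce, a back-of-the-envelope cost model can help. The sketch below is not derived from the SwitchML benchmarks and does not predict measured performance. It assumes the textbook ring Allreduce (a reduce-scatter followed by an all-gather, `2 * (n - 1)` sequential steps in total) and an idealized in-network aggregation in which each worker sends its vector once and receives the result once. All function names are hypothetical.

```python
from typing import Tuple


def ring_allreduce_cost(num_workers: int, vector_bytes: int) -> Tuple[int, float]:
    """Return (sequential steps, bytes sent per worker) for a textbook ring Allreduce."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Reduce-scatter takes n - 1 steps, all-gather another n - 1 steps.
    steps = 2 * (num_workers - 1)
    # In every step each worker forwards one chunk of size vector_bytes / n.
    bytes_sent = steps * vector_bytes / num_workers
    return steps, bytes_sent


def in_network_cost(num_workers: int, vector_bytes: int) -> Tuple[int, float]:
    """Idealized in-network aggregation (assumption): one exchange with the switch."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Each worker sends its full vector once, independent of num_workers.
    return 1, float(vector_bytes)


vector_bytes = 1_000_000
for n in (2, 4, 8):
    print(n, ring_allreduce_cost(n, vector_bytes), in_network_cost(n, vector_bytes))
```

Under these assumptions the number of sequential steps in the ring grows linearly with the number of workers, while the idealized in-network exchange does not depend on it. Real deployments add effects this model leaves out, such as latency, packet loss, and switch resource limits.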

## Publication

> [Scaling Distributed Machine Learning with In-Network Aggregation
> A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D. R. K. Ports, P. Richtarik.
> In Proceedings of NSDI’21, Apr 2021.](https://www.usenix.org/conference/nsdi21/presentation/sapio)

## Contributing
This project welcomes contributions and suggestions.
To learn more about making a contribution to SwitchML, please see our [Contribution](/CONTRIBUTING.md) page.

## The Team
SwitchML is a project driven by the [P4.org](https://p4.org) community and is currently maintained by Amedeo Sapio, Omar Alama, Marco Canini, and Jacob Nelson.

## License
SwitchML is released under the Apache License 2.0, as found in the [LICENSE](/LICENSE) file.