Unlocking FedNL on arXiv

The paper Unlocking FedNL: Self-Contained Compute-Optimized Implementation has been published on arXiv: arXiv:2410.08760.

Existing work on Federated Newton Learn (FedNL) by Mher Safaryan, Rustem Islamov, Xun Qian, and Peter Richtárik was presented at the 39th International Conference on Machine Learning (ICML 2022).

In the presented research, Konstantin Burlachenko and Peter Richtárik made significant strides in improving the practicality of the FedNL algorithm family, achieving a 1000x improvement in the wall clock time over the original version, while also demonstrating superior performance compared to MOSEK, Apache Spark, Ray.

Abstract

Federated Learning (FL) is an emerging paradigm that enables intelligent agents to collaboratively train Machine Learning (ML) models in a distributed manner, eliminating the need for sharing their local data. The recent work (arXiv:2106.02969) introduces a family of Federated Newton Learn (FedNL) algorithms, marking a significant step towards applying second-order methods to FL and large-scale optimization. However, the reference FedNL prototype exhibits three serious practical drawbacks: (i) It requires 4.8 hours to launch a single experiment in a sever-grade workstation; (ii) The prototype only simulates multi-node setting; (iii) Prototype integration into resource-constrained applications is challenging. To bridge the gap between theory and practice, we present a self-contained implementation of FedNL, FedNL-LS, and FedNL-PP for single-node and multi-node settings. Our work resolves the aforementioned issues and reduces the wall clock time by x1000. With this FedNL outperforms alternatives for training logistic regression in a single-node - CVXPY (arXiv:1603.00943), and in a multi-node - Apache Spark (arXiv:1505.06807), Ray/Scikit-Learn (arXiv:1712.05889). Finally, we propose two practical-orientated compressors for FedNL - adaptive TopLEK and cache-aware RandSeqK, which fulfill the theory of FedNL.

Results

1. Compute Effectiveness. We addressed the challenge of executing computationally intensive multi-node simulations, achieving a remarkable x1000 improvement in wall-clock time in the same hardware.

Table 1: Single-node simulation, n=142, FedNL(B), r=1000, λ=0.001, α - option 2, FP64, 24 cores@3.3 GHz.
Client Compression	$$\\|\nabla f(x^{last})\\|$$	Total Time (seconds)
1. RandK[K=8d] (We)	$$3.00 \cdot 10^{-18}$$	$$18.84$$
2. RandK[k=8d] (Base)	$$3.00 \cdot 10^{-18}$$	$$17\,510.00$$
3. TopK[K=8d] (We)	$$2.80 \cdot 10^{-18}$$	$$18.72$$
4. TopK[k=8d] (Base)	$$2.80 \cdot 10^{-18}$$	$$19\,770.00$$
5. RandSeqK[K=8d] (We)	$$3.19 \cdot 10^{-18}$$	$$16.70$$
6. TopLEK[K=8d] (We)	$$3.45 \cdot 10^{-18}$$	$$18.48$$
7. Natural (We)	$$3.10 \cdot 10^{-18}$$	$$27.02$$
8. Ident (We)	$$2.46 \cdot 10^{-18}$$	$$24.12$$

2. Self-Contained Design. Our implementation facilitates seamless integration into resource-constrained systems:

No library dependencies. Sole reliance on OS interfaces and some C++ Runtime elements
Supports both single-node simulation and real multi-node execution
Compatible with Linux, macOS, and Windows OS
Buildable with MSVC (minimum 19.30), GNU GCC/G++ (minimum 11.4), LLVM CLANG (minimum 10.0)
Tested on various CPU architectures (AMD64/x86_64, ARM32 v7, ARM64 v8, PowerPC PPC64LE, RISC-V 64, IBM Z Series
Explicit (optional) utilization of processor supplementary instruction sets: SSE2, AVX2, AVX-512, ARM Neon
Modular design for handling complexity and ensuring code quality
Support for creating oracles for optimization problems using NVIDIA CUDA (Compute Capability 7.0 to 9.0)
Implementation can be utilized as native OS executable binaries, dynamic libraries, static libraries, and extension modules for languages supported by SWIG.

3/4. Two Practical Refinements of Existing Compressors. We introduced an extension of the TopK compression mechanism, termed Top-LEK, which performs adaptive compression based on theoretical limits. Additionally, we proposed a cache-aware version of the RandK sparsification compressor, named RandSeqK.

5. Outpefrom best-practice solutions. Our implementation significantly outperforms solvers from CVXPY capable of training logistic regression in a single-node scenario including commercial MOSEK. Below is the initialization and solve time, for extra metrics see Appendix F.

Table 2: Single-node simulation, n=142, FedNL-LS (B), FP64, 24 cores@3.3 GHz.
Solver	W8A $$d=301,n_i=350$$	A9A $$d=124,n_i=229$$	Phishing $$d=69,n_i=77$$
Initialization Time (seconds)
CVXPY	+2.54	+2.33	+2.28
FedNL	+0.939	+0.196	+0.081
Solving Time (seconds)
CLARABEL	19.24	10.83	2.50
ECOS	22.22	8.02	2.55
ECOS-BB	22.00	8.00	2.12
SCS	31.14	19.36	4.57
MOSEK	16.90	9.59	3.55
FedNL-LS/RandK[k=8d]	4.35	0.34	0.12
FedNL-LS/RandSeqK[k=8d]	3.34	0.29	0.06
FedNL-LS/TopK[k=8d]	4.49	0.46	0.10
FedNL-LS/TopLEK[k=8d]	4.79	0.34	0.61
FedNL-LS/Natural	3.13	0.17	0.08
FedNL-LS/Identical	0.63	0.09	0.06

Our implementation outperforms solvers from Apache Spark and Ray in a multi-node (number of nodes is 50) in the logistic regression benchmarks.

Table 3: Multi-node setting, n=50 clients, 1 master, FP64, 1 CPU core/node.
Solution	W8A $$d=301,n_i=994$$	A9A $$d=124,n_i=651$$	Phishing $$d=69,n_i=221$$
Initialization Time (seconds)
Ray	+52.0
Spark	+25.82
FedNL	+1.1
Solving Time (seconds)
Ray	116.17	28.13	11.54
Spark	36.65	33.59	33.14
FedNL/RandK[k=8d]	12.6	4.52	0.21
FedNL/RandSeqK[k=8d]	12.56	5.10	0.14
FedNL/TopK[k=8d]	12.20	5.79	5.23
FedNL/TopLEK[k=8d]	15.11	3.26	0.82
FedNL/Natural	5.75	1.56	0.14

6. First Robust Practical Implementation. We hold respect for contributors to the diverse landscape of training runtimes for FL. However, to the best of our knowledge, our implementation represents the first robust practical solution for training (strongly) convex objectives in Federated Learning settings (see Section 9.4 of the paper).

Broader Impact

The Role of Theory in Practice. The interplay between theory and practice remains a landscape still being explored. In our work, we explicitly identify the theoretical elements that facilitated our efforts to bridge the gap between theoretical concepts and practical implementation (see Appendix K.2).

Translating Theory into Practice. Our work serves as a guiding beacon for researchers navigating the complex process of translating theoretical algorithms into impactful implementations across various domains of Machine Learning. Even theoretically optimal algorithms may underperform in practical benchmarks that assess the actual time and memory. It’s often due to significant hidden implementation constants in layered designs. We highlight a comprehensive set of considerations that must be considered to successfully implement a scientific algorithm without sacrificing performance.

The Significance of Alternative Languages in ML Research. Our work emphasizes the multifaceted considerations involved in improving actual wall clock time. It challenges the predominant Python-centric design philosophy in Machine Learning and underscores the significance of considering alternative languages when prioritizing computational and memory efficiency.

Postscriptum

Much like the bonus levels in the video game Crash Bandicoot our current project required four essential components:

(a) Careful analysis and identification of systematic runtime problems (Discover the bonus level)
(b) Elimination of a large number of minor approximations or errors (Observe the collection of small apples in the bonus level)
(c) A cascade of meticulous modifications and measurements (Collect a large number of small apples in the bonus level)
(d) Perseverance (Avoid traps and navigate carefully in the bonus level with a limited visible environment)

Written on October 14, 2024

Konstantin Burlachenko, PhD

Unlocking FedNL on arXiv

Links

Abstract

Results

Broader Impact

Postscriptum