BurTorch (Backpropagation Ultrafast Runtime)


The new paper BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems presents the design choices behind one of the fastest and most memory-efficient backpropagation implementations on CPU.


Results

In the new paper, Konstantin Burlachenko and Peter Richtárik present BurTorch and compare it with industry best-practice solutions for automatic differentiation in Machine Learning, considering various operation modes, programming-language APIs, and usage across multiple desktop operating systems.

The list of frameworks and libraries includes the following:

- JAX (Bradbury et al., 2018)
- PyTorch (Paszke et al., 2019)
- TensorFlow (Abadi et al., 2016)
- Autograd (Maclaurin et al., 2015)
- Micrograd (Karpathy, 2020)
- Apple MLX (Hannun et al., 2023)

Experiments were conducted on physically distinct computational devices and across major desktop operating systems.

For small compute graphs, BurTorch outperforms best-practice solutions by up to x2000 in runtime and reduces memory consumption by up to x3500.

For a miniaturized GPT-3 model (Brown et al., 2020), BurTorch achieves up to a x20 speedup and reduces memory consumption by up to x80 compared to PyTorch on CPU.




Abstract

In this work, we introduce BurTorch, a compact, high-performance framework designed to optimize Deep Learning (DL) training on single-node workstations through an exceptionally efficient CPU-based backpropagation (Rumelhart et al., 1986; Linnainmaa, 1970) implementation. Although modern DL frameworks rely on compiler-like optimizations internally, BurTorch takes a different path. It adopts a minimalist design and demonstrates that, in these circumstances, classical compiled programming languages can play a significant role in DL research. By eliminating the overhead of large frameworks and making efficient implementation choices, BurTorch achieves orders-of-magnitude improvements in performance and memory efficiency when computing the gradient of a function on a CPU. BurTorch features a compact codebase designed to achieve two key goals simultaneously. First, it provides a user experience similar to script-based programming environments. Second, it dramatically minimizes runtime overheads. In large DL frameworks, the primary source of memory overhead for relatively small computation graphs is due to feature-heavy implementations. We benchmarked BurTorch against widely used DL frameworks in their execution modes: JAX (Bradbury et al., 2018), PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2016); and several standalone libraries: Autograd (Maclaurin et al., 2015), Micrograd (Karpathy, 2020), Apple MLX (Hannun et al., 2023). For small compute graphs, BurTorch outperforms best-practice solutions by up to x2000 in runtime and reduces memory consumption by up to x3500. For a miniaturized GPT-3 model (Brown et al., 2020), BurTorch achieves up to a x20 speedup and reduces memory consumption by up to x80 compared to PyTorch.
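To make the benchmark setting concrete, the sketch below shows what reverse-mode autodiff (backpropagation) on a small scalar compute graph looks like, in the spirit of Micrograd (Karpathy, 2020), one of the baselines above. This is purely illustrative and is not BurTorch's API; BurTorch itself is a compiled-language implementation, and the names here are invented for the example.

```python
class Value:
    """A scalar node in a compute graph, tracking its gradient.

    Illustrative Micrograd-style sketch; not the BurTorch API.
    """
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # upstream nodes
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, g in zip(v._parents, v._local_grads):
                p.grad += g * v.grad

# Example: f(a, b) = a * b + a  =>  df/da = b + 1 = 4, df/db = a = 2
a, b = Value(2.0), Value(3.0)
f = a * b + a
f.backward()
print(a.grad, b.grad)  # 4.0 2.0
```

Every `Value` created during the forward pass allocates a node and bookkeeping for the backward pass; in feature-heavy frameworks each such node carries far more per-node overhead, which is exactly what dominates runtime and memory on small graphs like this one.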



Written on March 19, 2025