Deck Chairs and Fiddles: DeepSpeed

Monday, November 8, 2021

DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU:

Extreme scale: Using the current generation of GPU clusters with hundreds of devices, DeepSpeed's 3D parallelism can efficiently train deep learning models with trillions of parameters.
Extremely memory efficient: With just a single GPU, DeepSpeed's ZeRO-Offload can train models with over 10B parameters, 10x bigger than the state of the art, democratizing multi-billion-parameter model training so that many deep learning scientists can explore bigger and better models.
Extremely long sequence length: DeepSpeed's sparse attention supports input sequences an order of magnitude longer and runs up to 6x faster than dense transformers.
Extremely communication efficient: 3D parallelism improves communication efficiency, allowing users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam and 1-bit LAMB reduce communication volume by up to 5x while achieving convergence efficiency similar to Adam and LAMB, which allows scaling to different types of GPU clusters and networks.
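To make the single-GPU memory claim more concrete, here is a back-of-the-envelope sketch of how much memory the model states alone take under mixed-precision Adam. It assumes the commonly cited accounting of 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state) and ignores activations and temporary buffers. The function name and the stage-by-stage partitioning rule are illustrative assumptions, not part of the DeepSpeed API.

```python
def model_state_bytes_per_gpu(num_params, num_gpus=1, zero_stage=0):
    """Rough per-GPU bytes for model states under mixed-precision Adam.

    Assumed accounting per parameter: 2 bytes of fp16 weights, 2 bytes of
    fp16 gradients, and 12 bytes of fp32 optimizer state (master weights,
    momentum, variance). Activations and buffers are not counted.
    """
    weights = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if zero_stage >= 1:  # stage 1: optimizer states are partitioned
        optim = optim / num_gpus
    if zero_stage >= 2:  # stage 2: gradients are partitioned as well
        grads = grads / num_gpus
    if zero_stage >= 3:  # stage 3: weights are partitioned as well
        weights = weights / num_gpus
    return weights + grads + optim


GB = 1e9
ten_billion = 10_000_000_000

# No partitioning, one GPU: 16 bytes per parameter.
print(model_state_bytes_per_gpu(ten_billion) / GB)  # 160.0

# Hypothetical 64-GPU cluster with optimizer states and gradients partitioned.
print(model_state_bytes_per_gpu(ten_billion, num_gpus=64, zero_stage=2) / GB)
```

Under these assumptions a 10B-parameter model needs about 160 GB for model states, far more than a single GPU offers. If the gradients and optimizer states were held in CPU memory instead, roughly 20 GB of fp16 weights would remain on the device, which is the kind of saving that makes a model of this size plausible on one GPU.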

Below we provide a brief feature list; see our detailed feature overview for descriptions and usage.

Distributed Training with Mixed Precision
- 16-bit mixed precision
- Single-GPU/Multi-GPU/Multi-Node
Model Parallelism
- Support for Custom Model Parallelism
- Integration with Megatron-LM
Pipeline Parallelism
- 3D Parallelism
The Zero Redundancy Optimizer (ZeRO)
- Optimizer State and Gradient Partitioning
- Activation Partitioning
- Constant Buffer Optimization
- Contiguous Memory Optimization
ZeRO-Offload
- Leverage both CPU and GPU memory for model training
- Support 10B model training on a single GPU
Ultra-fast dense transformer kernels
Sparse attention
- Memory- and compute-efficient sparse kernels
- Support 10x longer sequences than dense
- Flexible support for different sparse structures
1-bit Adam and 1-bit LAMB
- Custom communication collective
- Up to 5x communication volume saving
Additional Memory and Bandwidth Optimizations
- Smart Gradient Accumulation
- Communication/Computation Overlap
Training Features
- Simplified training API
- Gradient Clipping
- Automatic loss scaling with mixed precision
Training Optimizers
- Fused Adam optimizer and arbitrary torch.optim.Optimizer
- Memory bandwidth optimized FP16 Optimizer
- Large Batch Training with LAMB Optimizer
- Memory efficient Training with ZeRO Optimizer
- CPU-Adam
Training Agnostic Checkpointing
Advanced Parameter Search
- Learning Rate Range Test
- 1Cycle Learning Rate Schedule
Simplified Data Loader
Curriculum Learning
- A curriculum learning-based data pipeline that presents easier or simpler examples earlier during training
- Stable and 3.3x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate while maintaining token-wise convergence speed
- Complementary to many other DeepSpeed features
Performance Analysis and Debugging
Mixture of Experts (MoE)
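Several of the features above (mixed precision, ZeRO, ZeRO-Offload, gradient clipping, gradient accumulation) are switched on through a JSON configuration rather than through code changes. The sketch below builds such a configuration as a Python dictionary. The helper name `build_config` and all numeric values are hypothetical, and the key names follow the DeepSpeed configuration format as commonly documented, so they should be checked against the documentation for the installed version.

```python
import json


def build_config(micro_batch_per_gpu, grad_accum_steps, num_gpus):
    """Return a DeepSpeed-style config dict with illustrative values."""
    return {
        # Effective batch size = micro batch x accumulation steps x GPUs.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * num_gpus,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        # Gradient clipping threshold (illustrative value).
        "gradient_clipping": 1.0,
        # 16-bit mixed precision.
        "fp16": {"enabled": True},
        # Optimizer type and learning rate (illustrative values).
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        # ZeRO stage 2 with optimizer state offloaded to CPU memory.
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu"},
        },
    }


config = build_config(micro_batch_per_gpu=4, grad_accum_steps=8, num_gpus=2)
config_json = json.dumps(config, indent=2)
print(config_json)
```

The three batch-size fields are kept consistent by construction here, since the effective batch size is the product of the per-GPU micro batch, the accumulation steps, and the number of GPUs. The resulting JSON can be saved to a file and supplied to DeepSpeed when training is set up.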

https://github.com/microsoft/DeepSpeed

https://www.deepspeed.ai/
