DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU:
- Extreme scale: Using the current generation of GPU clusters with hundreds of devices, DeepSpeed's 3D parallelism can efficiently train deep learning models with trillions of parameters.
- Extremely memory efficient: With just a single GPU, DeepSpeed's ZeRO-Offload can train models with over 10B parameters, 10x bigger than the state of the art, democratizing multi-billion-parameter model training so that many deep learning scientists can explore bigger and better models.
- Extremely long sequence length: DeepSpeed's sparse attention supports input sequences an order of magnitude longer and runs up to 6x faster than dense transformers.
- Extremely communication efficient: 3D parallelism improves communication efficiency, allowing users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam and 1-bit LAMB reduce communication volume by up to 5x while achieving convergence efficiency similar to Adam and LAMB, allowing for scaling to different types of GPU clusters and networks.
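To make the single-GPU memory claim above concrete, here is a rough back-of-the-envelope sketch. It assumes the usual accounting for mixed-precision Adam training: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. It also assumes, as a simplification, that offloading keeps only the fp16 weights on the GPU. The function name `model_state_gb` is hypothetical, and activations and temporary buffers are ignored.

```python
# Rough model-state memory estimate for mixed-precision Adam training.
# Assumption: activations, buffers, and fragmentation are ignored.
BYTES_FP16_PARAMS = 2      # fp16 copy of the weights
BYTES_FP16_GRADS = 2       # fp16 gradients
BYTES_FP32_OPTIMIZER = 12  # fp32 weights + momentum + variance (4 bytes each)


def model_state_gb(num_params: float) -> dict:
    """Hypothetical helper: model-state memory in GB (1 GB = 1e9 bytes)."""
    gb = 1e9
    total = num_params * (BYTES_FP16_PARAMS + BYTES_FP16_GRADS + BYTES_FP32_OPTIMIZER) / gb
    # Simplifying assumption: with offloading, only fp16 weights stay on the GPU;
    # gradients and optimizer state live in CPU memory.
    gpu = num_params * BYTES_FP16_PARAMS / gb
    cpu = num_params * (BYTES_FP16_GRADS + BYTES_FP32_OPTIMIZER) / gb
    return {"total_gb": total, "gpu_gb_with_offload": gpu, "cpu_gb_with_offload": cpu}


estimate = model_state_gb(10e9)  # a 10B-parameter model
print(estimate)
```

Under these assumptions a 10B-parameter model needs far more model-state memory than a single GPU provides, while the portion left on the GPU after offloading is much smaller. That is the intuition behind using both CPU and GPU memory.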
Below is a brief feature list; see our detailed feature overview for descriptions and usage.
- Distributed Training with Mixed Precision
  - 16-bit mixed precision
  - Single-GPU/Multi-GPU/Multi-Node
- Model Parallelism
  - Support for Custom Model Parallelism
  - Integration with Megatron-LM
- Pipeline Parallelism
  - 3D Parallelism
- The Zero Redundancy Optimizer (ZeRO)
  - Optimizer State and Gradient Partitioning
  - Activation Partitioning
  - Constant Buffer Optimization
  - Contiguous Memory Optimization
- ZeRO-Offload
  - Leverage both CPU/GPU memory for model training
  - Support 10B model training on a single GPU
- Ultra-fast dense transformer kernels
- Sparse attention
  - Memory- and compute-efficient sparse kernels
  - Support 10x longer sequences than dense
  - Flexible support for different sparse structures
- 1-bit Adam and 1-bit LAMB
  - Custom communication collective
  - Up to 5x communication volume saving
- Additional Memory and Bandwidth Optimizations
  - Smart Gradient Accumulation
  - Communication/Computation Overlap
- Training Features
  - Simplified training API
  - Gradient Clipping
  - Automatic loss scaling with mixed precision
- Training Optimizers
  - Fused Adam optimizer and arbitrary torch.optim.Optimizer
  - Memory bandwidth optimized FP16 Optimizer
  - Large Batch Training with LAMB Optimizer
  - Memory efficient Training with ZeRO Optimizer
  - CPU-Adam
- Training Agnostic Checkpointing
- Advanced Parameter Search
  - Learning Rate Range Test
  - 1Cycle Learning Rate Schedule
- Simplified Data Loader
- Curriculum Learning
  - A curriculum learning-based data pipeline that presents easier or simpler examples earlier during training
  - Stable and 3.3x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate while maintaining token-wise convergence speed
  - Complementary to many other DeepSpeed features
- Performance Analysis and Debugging
- Mixture of Experts (MoE)
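Several of the features listed above are switched on through a DeepSpeed configuration rather than through code changes. The following is a minimal sketch of such a configuration, written as a Python dictionary and serialized to JSON. The specific values (batch size, learning rate, ZeRO stage) are illustrative assumptions, not recommendations, and the available keys can differ between DeepSpeed versions.

```python
import json

# Illustrative DeepSpeed-style configuration; all values are assumptions.
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,             # gradient clipping
    "fp16": {
        "enabled": True,                  # 16-bit mixed precision
        "loss_scale": 0,                  # 0 selects dynamic (automatic) loss scaling
    },
    "zero_optimization": {
        "stage": 2,                       # partition optimizer state and gradients
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload to CPU memory
    },
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-4},
    },
}

# Serialize so the same settings can be saved as a JSON config file.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

In a real training script, a dictionary like this is typically handed to `deepspeed.initialize` through its `config` argument, or saved as a JSON file and referenced at launch. Check the configuration reference for the installed version before relying on any particular key.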
https://github.com/microsoft/DeepSpeed