Resilient Distributed Training of Large Models

Oobleck provably guarantees fault tolerance for up to f concurrent failures without sacrificing training throughput, by creating pipeline templates that keep all available training resources in use at all times.

Large language models (LLMs) and generative AI (GenAI) are taking the world by storm. These large deep neural network (DNN) models are predominantly trained in large GPU clusters on ever-growing datasets using parallelism techniques such as data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP), often in combination. Synchronous distributed training of large DNNs experiences significant throughput loss upon even a single failure: all GPUs must idle until the failure has been mitigated. State-of-the-art solutions for GenAI training do not provide systematic fault tolerance guarantees. They also suffer from large throughput loss during either training or recovery, and they perform worse as model size grows. In this project, we aim to develop Oobleck to enable resilient distributed training of large GenAI models with consistently high throughput even in the presence of failures. Oobleck takes a planning-execution co-design approach: it will first generate a set of pipeline templates and instantiate (f + 1) logically equivalent, physically heterogeneous pipeline replicas to tolerate any f concurrent failures. During execution, it will then rely on model states that are already replicated across those replicas to provide fast recovery. We will evaluate Oobleck on large LLMs and GenAI models such as GPT-3 to show its effectiveness.
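To make the planning step concrete, the following Python sketch uses hypothetical names (PipelineTemplate, instantiate_pipelines), not Oobleck's actual API, and a simple greedy heuristic rather than Oobleck's real planner. It shows the two constraints the plan must satisfy: at least f + 1 pipeline replicas are instantiated so that f concurrent failures can be tolerated, and leftover nodes are assigned to further templates so that as few GPUs as possible sit idle.

from dataclasses import dataclass
from typing import List

@dataclass
class PipelineTemplate:
    num_nodes: int      # nodes this template spans
    num_stages: int     # pipeline stages in the template
    throughput: float   # estimated iteration throughput (samples/s)

def instantiate_pipelines(templates: List[PipelineTemplate],
                          available_nodes: int,
                          f: int) -> List[PipelineTemplate]:
    """Pick templates so that (a) at least f + 1 pipeline replicas exist,
    guaranteeing tolerance of f concurrent failures, and (b) remaining
    nodes go to the most throughput-efficient templates (greedy sketch)."""
    by_efficiency = sorted(templates,
                           key=lambda t: t.throughput / t.num_nodes,
                           reverse=True)
    smallest = min(templates, key=lambda t: t.num_nodes)
    if smallest.num_nodes * (f + 1) > available_nodes:
        raise ValueError("not enough nodes for f + 1 pipeline replicas")

    plan, nodes_left = [], available_nodes
    # First satisfy the fault-tolerance constraint with the smallest template.
    while len(plan) < f + 1:
        plan.append(smallest)
        nodes_left -= smallest.num_nodes
    # Then spend the remaining nodes on the most efficient templates.
    for t in by_efficiency:
        while t.num_nodes <= nodes_left:
            plan.append(t)
            nodes_left -= t.num_nodes
    return plan

if __name__ == "__main__":
    templates = [PipelineTemplate(2, 2, 8.0),
                 PipelineTemplate(3, 3, 13.0),
                 PipelineTemplate(4, 4, 18.0)]
    plan = instantiate_pipelines(templates, available_nodes=13, f=2)
    print([t.num_nodes for t in plan])  # e.g. [2, 2, 2, 4, 3]

In this toy run, three 2-node replicas satisfy f = 2, and the remaining 7 nodes are covered by a 4-node and a 3-node replica, so all 13 nodes contribute to training throughput.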

Other Researchers

Mosharaf Chowdhury (Computer Science and Engineering)