Scalable Pretraining of Large Mixture of Experts Models on Aurora

Executive Summary

Pretraining large language models (LLMs) from scratch is computationally demanding. Aurora, an exascale supercomputer, makes pretraining possible at a scale few systems can match. The Optimus library developed for this work supports efficient training of mixture-of-experts (MoE) models, with a reported 90% scaling efficiency at 12,288 GPU tiles.

The Architecture / Core Concept

The work pretrains MoE language models on Aurora, which provides 127,488 Intel PVC (Ponte Vecchio) GPU tiles. The `Optimus` library underpins this effort, supplying the training techniques and resource allocation strategies needed at this scale. In a mixture-of-experts architecture, a router sends each token to a small subset of expert sub-networks, so the total parameter count can grow without a proportional increase in compute per token.
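The routing idea can be sketched in a few lines. This is a minimal illustration of generic top-k routing, not code from Optimus: the function names (`route_top_k`, `moe_forward`), the toy scalar "experts", and the router scores are all assumptions made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for a scalar input x."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Hypothetical toy experts: each one is just a scalar function.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token
selected = route_top_k(logits, k=2)
output = moe_forward(3.0, logits, experts, k=2)
```

Only two of the four experts are evaluated for this token, which is why an MoE model can hold many more parameters than it uses on any single forward pass.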

Key Innovations

Optimus Training Library: A library built for scalable, reliable pretraining, incorporating custom GPU kernels and an EP-aware (expert-parallelism-aware) sharded optimizer.
Mixture of Experts (MoE) Architecture: Uses multiple expert sub-networks, only a few of which are activated per token, which keeps compute per token low relative to total model size.
Scalable Infrastructure: Tests on the Mula-220B-A10B model showed 90% scaling efficiency at 12,288 GPU tiles.
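For readers unfamiliar with the metric, scaling efficiency is usually the measured speedup divided by the ideal linear speedup. The sketch below assumes that definition; the article does not state the baseline configuration, so the 384-tile baseline and the throughput values are hypothetical numbers chosen only to make the arithmetic concrete.

```python
def scaling_efficiency(base_tiles, base_throughput, tiles, throughput):
    """Measured speedup divided by ideal (linear) speedup."""
    ideal_speedup = tiles / base_tiles
    measured_speedup = throughput / base_throughput
    return measured_speedup / ideal_speedup

# Hypothetical measurements, NOT figures from the article:
# a 384-tile baseline with normalized throughput 1.0, and a 12,288-tile run
# delivering 28.8x that throughput (ideal would be 32x).
eff = scaling_efficiency(base_tiles=384, base_throughput=1.0,
                         tiles=12288, throughput=28.8)
print(f"{eff:.0%}")  # prints "90%"
```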

Implementation Details

The core implementation relies on custom compute kernels and a sharded optimizer, both crucial for scaling MoE models effectively. Below is a synthesized example (not Optimus source code) illustrating the sharded optimizer logic:

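import torch

class EP_AwareShardedOptimizer:
    """Synthesized illustration; not the Optimus implementation."""
    def __init__(self, model_parameters, learning_rate, sharding_strategy):
        self.parameters = list(model_parameters)  # materialize the generator
        self.lr = learning_rate
        self.strategy = sharding_strategy

    @torch.no_grad()
    def step(self):
        # Single-process sketch: a real run would touch only this rank's shards
        for shard in self.strategy.shards(self.parameters):
            for p in shard:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-self.lr)  # plain SGD update

class RoundRobinSharding:  # hypothetical strategy, for illustration only
    def __init__(self, num_shards):
        self.num_shards = num_shards
    def shards(self, parameters):
        return [parameters[i::self.num_shards] for i in range(self.num_shards)]

# Usage example
model = torch.nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()  # populate .grad
opt = EP_AwareShardedOptimizer(model.parameters(), learning_rate=0.001, sharding_strategy=RoundRobinSharding(num_shards=2))
opt.step()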
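The article does not spell out what makes the optimizer "EP-aware". One plausible reading, assumed here, is that under expert parallelism the expert weights are replicated on fewer ranks than the dense weights, so optimizer state for the two kinds of parameters has to be sharded over different rank groups. The sketch below illustrates that bookkeeping only; the rank layout (`rank % ep_size` selects the expert slice) and the helper names `replica_groups` and `shard_bounds` are hypothetical.

```python
def replica_groups(world_size, ep_size):
    """Return (dense_group, expert_groups) as lists of rank ids.

    Assumed layout: rank r holds expert slice r % ep_size, so ranks that
    share r % ep_size hold identical expert weights.
    """
    if world_size % ep_size != 0:
        raise ValueError("world_size must be divisible by ep_size")
    dense_group = list(range(world_size))
    expert_groups = [list(range(s, world_size, ep_size)) for s in range(ep_size)]
    return dense_group, expert_groups

def shard_bounds(num_elements, group, rank):
    """Half-open [start, end) slice of a flat buffer owned by `rank` in `group`."""
    idx = group.index(rank)
    per_rank = -(-num_elements // len(group))  # ceiling division
    start = min(idx * per_rank, num_elements)
    end = min(start + per_rank, num_elements)
    return start, end

dense_group, expert_groups = replica_groups(world_size=8, ep_size=4)
# Rank 5 holds expert slice 1, which is replicated only on ranks 1 and 5.
my_expert_group = expert_groups[5 % 4]
dense_slice = shard_bounds(1000, dense_group, rank=5)        # (625, 750)
expert_slice = shard_bounds(1000, my_expert_group, rank=5)   # (500, 1000)
```

With 8 ranks and an expert-parallel size of 4, dense optimizer state is split eight ways while each expert slice's state is split only two ways. A sharded optimizer that ignored this distinction would assign expert state to ranks that never hold those weights.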

Engineering Implications

Training at this scale poses challenges in computational throughput, energy consumption, and fault tolerance. Custom kernels and sharding mechanisms help reduce latency and improve throughput, but the scale also demands careful coordination and error-handling protocols.
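Periodic checkpointing with resume-on-restart is one common building block for fault tolerance. The article does not describe how Optimus handles failures, so the following is a generic single-process sketch; the function names, the JSON state format, and the simulated failure are all assumptions.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write atomically so a crash mid-write never corrupts the last good file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss_sum": 0.0}

def train(path, total_steps, checkpoint_every, fail_at=None):
    """Run from the latest checkpoint; optionally simulate a failure."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["loss_sum"] += 1.0  # stand-in for one training step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt, total_steps=10, checkpoint_every=4, fail_at=6)
except RuntimeError:
    pass  # a job scheduler would restart the run here
resumed_from = load_checkpoint(ckpt)["step"]
final = train(ckpt, total_steps=10, checkpoint_every=4)
```

The failed run loses the two steps taken after the last checkpoint, which is the basic trade-off: checkpointing more often wastes less work on failure but spends more time on I/O.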

Trade-offs:

Scalability: High scalability enables rapid experimentation with very large models.
Cost: Running on an exascale machine is resource-intensive, which translates to higher operational costs.
Complexity: Managing the distributed system and keeping it reliable becomes more complex as the scale grows.

My Take

This work is a significant milestone in large-scale LLM pretraining, using Aurora to train at a scale and efficiency few projects have reached. The implications for AI research are substantial: researchers can explore larger and more complex architectures. The costs and complexity are high, but the potential gains in language modeling could justify the investment and pave the way for more scalable and efficient model designs. It also sets a precedent for future collaborations between supercomputing centers and AI research.

Written by James Geng
