AI is advancing at an unprecedented pace, with Mixture of Experts (MoE) models being one set of model architectures at the forefront of this revolution. These architectures enable breakthroughs in efficiency and scalability by leveraging a modular design where only a subset of specialized “expert” networks are activated for each input. MoE architectures have become a cornerstone in building ultra-large-scale models like GLaM and Switch Transformers.

Mixture of Experts (MoE) is a machine learning architecture that has recently gained significance, particularly for large language models (LLMs) and neural networks (NNs) more broadly. In talking with many people about AI, I’ve found that MoE comes up often, and that many folks either don’t understand what it is or have an incorrect understanding of it.

With recent announcements of trillion-parameter models, along with other announcements from Microsoft, OpenAI, and Google, understanding MoEs is more important than ever. More recently, DeepSeek-V3 is a great example of a model that uses MoE to achieve state-of-the-art performance: the language model has 671B total parameters, but only 37B are activated for each token.

In this post, I provide a high-level overview of MoEs: their core components, how they work, and their end-to-end workflow. I also include a simple toy implementation to help you grasp the core concepts.

Why Mixture of Experts?

The central motivation behind MoE stems from the tension between growing model size and limited computational resources. As we have seen in the recent past, increasing the parameter count of a model often yields better performance, especially in domains like natural language processing (NLP) and computer vision; however, it also drastically increases the cost of both training and inference. This computing cost is what MoEs address, by offering a different paradigm known as conditional computation. An MoE routes each input (or token) to only a small number of specialized sub-networks (called experts) rather than processing it through every parameter in the network. This helps with three key aspects:

  • Increased Model Capacity: Because only a few experts are activated at a time, MoE architectures can pack many parameters (experts) without proportionally increasing the computational cost per input.
  • Specialized Sub-networks: Different experts can learn specific patterns or token-level specializations, leading to better performance.
  • Efficient Usage of Compute: MoE optimizes resources (compute) by activating only a small fraction of the entire model, leading to significant efficiency gains.

This sparse activation strategy enables constructing models with billions (or even trillions) of parameters, making MoE a scalable solution for large-scale applications. Another benefit of MoE is its capacity for specialization. Different experts can learn distinct, context-dependent processing strategies, enabling the model to cover a broad set of input variations more effectively than a monolithic architecture.

Early MoE ideas trace back to model ensembling in classic machine learning. Still, MoE extends beyond ensembling by learning a parametric “router” (gating function) that dynamically decides which experts to use. Notable works like the Sparsely-Gated Mixture of Experts showed that MoE could massively scale model size while staying computationally efficient.

Core Components of an MoE System

The fundamental building blocks of an MoE system are:

  1. Experts: Specialized sub-models, typically feed-forward neural networks.
  2. Router or Gating Network: Determines which tokens are sent to which experts.
  3. Combiner: Aggregates the outputs from the selected experts.

Experts

Each expert is typically a neural sub-network replicated multiple times within the same model. Depending on the task, these experts are often implemented as independent feed-forward networks, MLP blocks, or convolutional layers. For large-scale language models, each expert usually mirrors the structure of the feed-forward component of a Transformer block.

Since multiple experts exist in parallel, each can potentially learn to handle different token types or data distributions. Contrary to a common misconception, experts do not necessarily correspond to human-like semantic domains (e.g., “Expert #3 = Physics”, or dedicated experts for “finance” and “medicine”). The learned expertise is usually more subtle: each expert ends up specializing in token-level or feature-level patterns that are discovered during training and are not necessarily interpretable in a straightforward semantic way. For instance, in NLP, one expert might end up handling certain syntactic structures, while another handles particular semantic relationships.

Sometimes, certain layers (like embeddings or attention blocks) are shared among all experts, and only feed-forward layers are duplicated (as in Switch Transformers). This partial sharing allows the model to keep some global representation while still having specialized processing in the experts.

Router

The router, often called the gating network, is a small module that determines which experts should handle a given input. In modern MoE designs, it is typically parameterized as a simple neural network, often a single linear layer followed by a SoftMax. The SoftMax output provides a probability distribution across the experts, indicating which should be “activated” for each input.

The router reads the input representation (e.g., the token embedding in an NLP model) and produces a probability distribution over the experts, typically with a SoftMax function. For each token, we then select the k experts with the highest probabilities (e.g., top-1 or top-2); this is known as Top-k gating.
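To make the routing step concrete, here is a minimal sketch of a top-k router in PyTorch. The class name `TopKRouter`, the bias-free linear projection, and the tensor sizes are assumptions chosen for illustration, not details of any particular model.

```python
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    """Hypothetical router: one linear layer + softmax, then top-k selection."""

    def __init__(self, d_model, num_experts, top_k=2):
        super().__init__()
        self.proj = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        # x: (num_tokens, d_model)
        probs = torch.softmax(self.proj(x), dim=-1)          # (num_tokens, num_experts)
        # Keep the k most probable experts for every token.
        top_probs, top_idx = torch.topk(probs, self.top_k, dim=-1)
        return probs, top_probs, top_idx

torch.manual_seed(0)
router = TopKRouter(d_model=16, num_experts=4, top_k=2)
probs, top_probs, top_idx = router(torch.randn(5, 16))
print(top_idx)  # which two experts each of the 5 tokens is routed to
```

Each row of `top_idx` lists the experts chosen for one token, ordered from most to least probable, and `top_probs` holds the matching gating probabilities.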

Practical implementations often limit the number of tokens an expert can process per batch (known as Expert Capacity). If too many tokens route to the same expert, some tokens may get dropped or rerouted, leading to training instability. This capacity limit helps prevent any single expert from monopolizing the model’s processing and prevents load imbalances.

The gating network is trained jointly with the experts through backpropagation. However, top-k selection introduces discrete decisions into the computational graph, which can hamper backpropagation. To counter this, additional techniques, such as adding a small amount of noise to the logits (Noisy Top-k Gating) or using Soft MoE, are employed to smooth out these discrete selections, keep training stable, provide smoother gradients, and encourage balanced expert utilization. As noted above, Expert Capacity limits how many tokens each expert can process, preventing load imbalances in which a single expert might receive most tokens.

Combiner

Once the chosen experts have computed their outputs for a given token, these outputs must be aggregated to produce a single vector that feeds into subsequent layers. The MoE architecture achieves this through a combiner, which typically performs a weighted sum of the experts’ outputs, using the gating probabilities as weights. In top-k gating, if $ k $ experts were activated, each expert’s output is multiplied by its corresponding probability from the router. The combiner then sums or otherwise fuses the results to form the token’s transformed representation. This consolidated output is passed into the rest of the model, such as attention blocks or additional Transformer layers.
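The weighted sum described above can be sketched in a few lines. The function name `combine` is hypothetical, and the optional `renormalize` flag (rescaling the k selected probabilities so they sum to 1) is an assumption about a common variation rather than something this post prescribes.

```python
import torch

def combine(expert_outputs, top_probs, renormalize=False):
    """Weighted sum of the selected experts' outputs.

    expert_outputs: (num_tokens, k, d_out) outputs of the k chosen experts
    top_probs:      (num_tokens, k) gating probabilities of those experts
    """
    if renormalize:
        # Assumed variation: make the k selected weights sum to 1 per token.
        top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)
    return (expert_outputs * top_probs.unsqueeze(-1)).sum(dim=1)

# One token, two selected experts, 2-dimensional outputs.
outs = torch.tensor([[[1.0, 1.0], [3.0, 3.0]]])
weights = torch.tensor([[0.5, 0.25]])
plain = combine(outs, weights)                     # 0.5 * 1 + 0.25 * 3 per dimension
rescaled = combine(outs, weights, renormalize=True)
```

Without renormalization the output shrinks when the selected experts hold only part of the probability mass, which is why some implementations rescale the weights.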

In “Soft MoE” variants, we might use a soft combination, letting tokens pass to all experts but with different fractional weights—alleviating some routing discontinuities at the cost of higher computation.

High-Level Workflow

  1. Input is fed into the gating network, which produces a probability distribution (or scores) over experts.
  2. The top-k experts are activated for each input (or token, in the case of language models).
  3. The selected experts process the input in parallel.
  4. A combiner fuses the experts’ outputs into a single vector.
  5. The model produces the final output, which can then feed into other layers or tasks.

Below is a simple flow diagram that shows how to visualize this:

flowchart TD
    Input_Tokens --> Gating_Network
    Gating_Network --> Expert_1
    Gating_Network --> Expert_2
    Gating_Network --> Expert_3
    Expert_2 --> Combiner
    Expert_3 --> Combiner
    Combiner --> Final_Output
  

The MoE module often appears in LLMs where a standard feed-forward layer normally would reside. For each layer designated as an MoE, the process begins by sending the hidden representations of each token to the router. The router computes gating probabilities across all experts, ranks the experts by these probabilities, and picks the top few. Each selected expert applies its transformation to the token’s hidden state. Finally, the combiner merges these expert outputs with a weighted sum.

During training, the experts and the router are updated jointly through backpropagation. However, discrete gating can make gradient flow tricky since the top-k selection is not inherently differentiable. In practice, noise injection (Noisy Top-k Gating) or methods like Soft MoE can help approximate continuous gradients, ensuring that even experts with lower gating probabilities receive occasional training signals.

In addition, to avoid a scenario where one or two experts monopolize all tokens, an auxiliary load-balancing loss is often introduced to encourage more uniform usage. This might take the form of a penalty term that grows when the variance in expert usage is high, incentivizing the router to distribute tokens more evenly.

This mechanism occurs for every token in every layer designated as an MoE layer, which is why load balancing is so critical—without careful design, certain experts can receive far more tokens than others.

What is the Difference between an MoE and an Ensemble?

It’s easy to confuse MoE with model ensembling - where multiple independently trained models vote or average predictions. MoE differs in a few critical ways:

  • Dynamic and data-dependent routing: In classical ensembles, each model sees the same input, and a meta-learner or a simple average produces the final result. MoE, in contrast, uses a router that decides which inputs (or tokens) each expert processes. This dynamic routing allows MoE to specialize in different patterns or tokens.
  • Single model training: MoE typically trains all experts jointly in one go. Each expert does not have a separate training pass; they’re part of the same computation graph, sharing some parameters (like embeddings) and learning together. This is in contrast to ensembles, where each model is trained independently.
  • Fine-grained token specialization: Different tokens in the same sentence might get routed to different experts, enabling extremely fine-grained specialization. This is impossible in a traditional ensemble, where each model sees the entire input.

Load Balancing Experts and Training Pitfalls

A significant challenge in training MoE models is ensuring the balanced utilization of experts. Certain experts may become overburdened without proper load balancing, while others remain underutilized, leading to inefficient training and suboptimal performance. Gating can become highly imbalanced early in training, favoring a few experts. Common solutions for Load Balancing:

  • Auxiliary Load Balancing Losses: Adding a regularization term to the loss function encourages the gating network to distribute inputs evenly across all experts.
  • Top-k Randomization: Instead of always selecting the top-k experts with the highest gating probabilities, randomizing the selection among the top candidates can prevent overloading.
  • Expert Capacity Constraints: Limiting the number of tokens an expert can process at a time can help ensure all experts are used during training.

Over time, load-balancing losses/techniques help the distribution even out, but the model can remain fragile without careful hyperparameter tuning. Expert capacity is another important design choice. Since top-k selection may route too many tokens to the most popular experts, a capacity limit ensures each expert processes no more than a certain maximum number of tokens in one forward pass. The remaining tokens must be dropped or re-routed to other experts if an expert is at capacity. Both approaches come with trade-offs: dropping tokens entirely can lead to data inefficiency, whereas re-routing can add complexity and undermine the sparsity benefits that MoE aims to provide.
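As a rough illustration of the capacity trade-off, the sketch below computes a per-expert capacity and drops tokens that arrive after an expert is full. The `capacity_factor` formula is an assumption borrowed from common practice (it is not defined in this post), and the helper names are hypothetical.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    # Assumed formula: an even share of the tokens, scaled by a slack factor.
    return math.ceil(capacity_factor * num_tokens / num_experts)

def assign_with_capacity(expert_ids, num_experts, capacity):
    """Keep tokens in arrival order until their chosen expert is full; drop the rest."""
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(expert_ids):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)
    return kept, dropped, load

# Top-1 choices for 8 tokens; expert 0 is heavily over-subscribed.
expert_ids = [0, 0, 0, 0, 1, 0, 2, 0]
cap = expert_capacity(len(expert_ids), num_experts=4, capacity_factor=1.0)
kept, dropped, load = assign_with_capacity(expert_ids, num_experts=4, capacity=cap)
print(cap, kept, dropped, load)
```

With a capacity of 2, expert 0 accepts its first two tokens and the remaining four tokens that chose it are dropped, which shows how quickly an unbalanced router wastes data.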

Load balancing - ensuring experts share the training load - remains one of the biggest technical hurdles with MoEs. Early in training, the gating network might discover that routing most tokens to the same expert (or a small number) might yield acceptable results, leaving other experts underutilized and effectively “dead.” This imbalance leads to suboptimal solutions - if only a few experts get trained on all tokens, you lose the advantages of specialization.

To mitigate this, MoE architectures often introduce additional techniques that nudge the model toward more even usage. One such technique is the auxiliary load-balancing loss. By monitoring how frequently each expert is selected, the model can be penalized if certain experts remain underutilized. This can be thought of as an additional penalty term that measures how evenly tokens (or batches) are distributed across experts.

Another common approach is Noisy Top-k Gating, which injects a small amount of learnable or fixed noise into the gating logits before the softmax, making the gating probabilities slightly more random. This randomness allows less-popular experts to occasionally receive tokens, which can help them develop more useful specializations over time. The gating network thus learns not only to minimize the primary task loss but also to spread tokens more uniformly.
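A minimal sketch of noisy top-k gating follows. It assumes the learnable-noise variant, where a second linear layer (here `w_noise`, a hypothetical name) predicts a per-expert noise scale passed through softplus; noise is added only in training mode, and experts outside the top k receive a gate of exactly zero.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Hypothetical gate: noisy logits -> keep top-k -> softmax over the kept logits."""

    def __init__(self, d_model, num_experts, top_k=2):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        logits = self.w_gate(x)
        if self.training:
            # Learnable, input-dependent noise scale (always positive via softplus).
            noise_std = F.softplus(self.w_noise(x))
            logits = logits + torch.randn_like(logits) * noise_std
        top_vals, top_idx = torch.topk(logits, self.top_k, dim=-1)
        # Set all non-selected logits to -inf so softmax assigns them zero weight.
        masked = torch.full_like(logits, float("-inf")).scatter(-1, top_idx, top_vals)
        return torch.softmax(masked, dim=-1), top_idx

torch.manual_seed(0)
gate = NoisyTopKGate(d_model=16, num_experts=4, top_k=2)
gates, top_idx = gate(torch.randn(5, 16))
```

Because the noise can reorder the logits, an expert that is slightly below the top k without noise will sometimes be selected and therefore receive a training signal.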

Sparsity in Mixture of Experts (MoE) Models

Sparsity is one of MoE’s most valuable contributions to model efficiency. By only activating a small fraction of the total parameters for each input, the model maintains a much lower compute footprint than a dense model of the same overall size. This efficiency is crucial for scaling; while a trillion-parameter dense model may be prohibitively expensive to train and deploy, a trillion-parameter MoE model that activates only 1% of those parameters for any given token becomes significantly more tractable.
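To see how total and active parameter counts diverge, here is a back-of-the-envelope sketch for a single MoE feed-forward layer. The layer shape (two-layer MLP experts with biases, a bias-free linear router) and the sizes used are assumptions for illustration only.

```python
def moe_ffn_params(d_model, d_hidden, num_experts, top_k):
    """Return (total, active) parameter counts for one MoE feed-forward layer.

    Assumes each expert is a two-layer MLP (d_model -> d_hidden -> d_model)
    with biases, and the router is a single linear layer without bias.
    """
    per_expert = (d_model * d_hidden + d_hidden) + (d_hidden * d_model + d_model)
    router = d_model * num_experts
    total = num_experts * per_expert + router
    active = top_k * per_expert + router   # only top_k experts run per token
    return total, active

total, active = moe_ffn_params(d_model=1024, d_hidden=4096, num_experts=64, top_k=2)
print(f"total={total:,}  active={active:,}  fraction={active / total:.2%}")

# The DeepSeek-V3 figures quoted earlier give a similar picture at model scale.
deepseek_fraction = 37 / 671
print(f"DeepSeek-V3 active fraction: {deepseek_fraction:.1%}")
```

In this hypothetical layer, roughly 3% of the parameters are used per token, while the full set of experts still has to be stored.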

That said, implementing sparsity at scale often requires specialized infrastructure. Frameworks like GShard or Mesh-TensorFlow are designed to handle data and model parallelism necessary for distributing the experts across GPU clusters. The overhead of routing tokens to the correct devices can become significant if the system is not carefully optimized. Researchers have also explored alternative gating mechanisms, such as Soft MoE, which approximates selection by routing every token to all experts in a soft, weighted fashion. While this approach can mitigate the fragility of discrete gates, it naturally increases computation since more experts perform computations at once.

Sparsity in MoE models offers several key advantages:

  • Computational Efficiency: Sparsity dramatically reduces the number of FLOPs required to process each input or token.
  • Scalability: The sparse activation of experts enables MoE models to scale to a large number of experts without a corresponding linear increase in per-token compute. Note that the memory needed to store parameters still grows with the number of experts.
  • Increased Model Capacity: Sparsity allows MoE models to increase their overall parameter count and model capacity without significantly increasing the computational cost during training or inference.
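  • Memory Efficiency: Because only the selected experts run for a given token, MoE models need less activation memory than a dense model of the same total size, although all expert parameters must still be stored.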
  • Specialized Processing: Sparsity enables the model to route different inputs to the most relevant experts, allowing for more specialized and efficient processing of diverse inputs.

Practical applications of MoE

MoE architectures have already demonstrated clear benefits in many areas. Microsoft’s Z-code model (Machine Translation), for instance, leverages MoE to handle multilingual translation tasks at a massive scale, and Google’s Switch Transformers showed that sparse activation can reach higher quality at lower training cost than dense baselines on benchmarks such as GLUE and SuperGLUE. In computer vision, MoE modules have been integrated into Vision Transformers (V-MoEs) to achieve better image classification and detection accuracy, with each expert focusing on different aspects of the image representation. In multimodal learning, the capacity to handle diverse data types—such as text, images, and audio—makes MoE a natural fit because experts can adapt to different modalities or different subproblems within a single modality.

In the context of LLMs, systems like ChatGPT, Claude, and Gemini can benefit from MoE by leveraging different experts for different topics or query types - though specifics are often proprietary and not shared. MoE is particularly suited to multi-modal tasks involving text, images, and audio, as experts can specialize in different modalities or sub-modalities. This is valuable for text-to-image generation or video understanding.

Several emerging directions continue to push MoE research forward. Soft MoE (Puigcerver et al., 2023) is an example aiming to produce a fully differentiable version of sparsely gated Transformers. Another is Parameter-Efficient Sparsity Crafting (PESC), which seeks to retrofit existing dense models into a sparse MoE design without retraining from scratch. These innovations reflect ongoing efforts to refine the balance between sparse efficiency, training stability, and model reliability.

In production, deploying a large MoE model requires carefully coordinating hardware resources, data pipelines, and load-balancing techniques. Training an MoE system may involve more hyperparameters than a comparable dense model, including the number of experts, gating softmax temperature, top-k value, load-balancing penalty weights, and expert capacity. These factors can significantly affect performance, convergence speed, and final accuracy. When scaling an MoE model across multiple GPUs, designers must pay attention to network communication overhead. Token-based routing leads to collective operations that can become bottlenecks if not carefully optimized.
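One way to keep these hyperparameters in one place is a small configuration object. The sketch below is purely illustrative: the class name `MoEConfig`, the field names, and every default value are assumptions, not recommendations from any framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical container for the MoE hyperparameters listed above."""
    num_experts: int = 8
    top_k: int = 2
    router_temperature: float = 1.0    # softmax temperature for the gate
    load_balance_weight: float = 0.01  # weight of the auxiliary balancing loss
    capacity_factor: float = 1.25      # slack on each expert's token budget

    def __post_init__(self):
        # Basic sanity checks so that invalid combinations fail early.
        if not 1 <= self.top_k <= self.num_experts:
            raise ValueError("top_k must be between 1 and num_experts")
        if self.router_temperature <= 0:
            raise ValueError("router_temperature must be positive")
        if self.capacity_factor <= 0:
            raise ValueError("capacity_factor must be positive")

cfg = MoEConfig(num_experts=16, top_k=2)
print(cfg)
```

Validating the values at construction time is a design choice that catches mistakes such as `top_k` exceeding the number of experts before any training run starts.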

Despite these complexities, MoE’s flexibility and computational cost savings make it a compelling choice for handling highly varied or large-scale tasks. However, fine-tuning MoE models can be more delicate than fine-tuning dense models because the gating distributions or specialized experts may not adapt smoothly to a new domain without carefully applied load-balancing strategies. There can also be interpretability challenges since the model’s internal “expert structure” does not always map neatly to skills.

Challenges and Considerations

We have touched on most of these in this blog post, but it is helpful to outline the key issues and considerations to be mindful of when using MoE-based models:

  • Complexity: MoE models are significantly more complex (compared to traditional neural networks) and require substantial computational resources for training and inference.
  • Training Instability: MoE models can suffer from training instability due to the discrete nature of expert selection.
  • Load Balancing: Proper load balancing among experts is crucial for efficiently using model capacity and optimal performance.
  • Computational Overhead: The gating mechanism introduces additional computational overhead, potentially impacting training and inference times.
  • Interpretability Issues: The dynamic routing of inputs makes interpreting how MoE models arrive at their decisions challenging.
  • Hyperparameter Sensitivity: MoE models have several hyperparameters that must be tuned for optimal performance.

With these considerations in mind, let’s define the components of an MoE mathematically.


Mathematical Formulation of MoE

Let $ x \in \mathbb{R}^d $ denote an input vector (a hidden representation from a preceding layer). We assume the system has $ N $ experts, with each expert $ E_i $ parameterized as a function $ E_i: \mathbb{R}^d \rightarrow \mathbb{R}^m $. The gating network $ G $ takes the same input $ x $ and outputs a vector in $\mathbb{R}^N$—essentially, a “score” or “weight” for each of the $ N $ experts. Finally, a combiner function $ C $ merges the experts’ outputs into a single output vector $ y \in \mathbb{R}^m $.

Formally, we can write:

  1. Input: $ x \in \mathbb{R}^d $
  2. Experts: $ E_i: \mathbb{R}^d \rightarrow \mathbb{R}^m \quad \text{for}\ i = 1, \ldots, N $
  3. Gating Network: $ G: \mathbb{R}^d \rightarrow \mathbb{R}^N $
  4. Combiner: $ C: \mathbb{R}^{N \times m} \rightarrow \mathbb{R}^m $

Top-k Gating

Most contemporary MoE implementations use top-k gating, which activates only the $ k $ experts with the highest gating scores. In this scenario, the summation is performed only over those top-$ k $ indices. If we denote $\text{top-k}(G(x))$ as the set of indices corresponding to the $ k $ largest values of $ G(x) $, then

$ y = C\Bigl(\sum_{i \in \text{top-k}(G(x))} G(x)_i \,\cdot\, E_i(x)\Bigr). $

By pruning all but the top-$ k $ experts per input, this design enforces sparse activation: each input (or token) only “touches” $ k $ out of $ N $ experts at a time. This approach significantly reduces the computational load relative to using all $ N $ experts for every input.

Dense Gating Formulation

This refers to a version of the MoE architecture in which all experts contribute to the final output for each input, rather than filtering out all but the top-k experts. Here, the gating network assigns continuous weights to every expert, aggregating each expert’s weighted output into a final result. There is no discrete selection to zero out certain experts based on gating. Fundamentally, this is the opposite of top-k gating.

In the simplest version of MoE, where the gating network’s output is a set of continuous weights, the forward pass for one input $ x $ can be written as:

$ y = C\Bigl(\sum_{i=1}^N G(x)_i \,\cdot\, E_i(x)\Bigr). $

Here, $ G(x)_i $ is the $i$-th component of the gating network’s output for $ x $ and represents the weight (or probability) assigned to expert $ i $. Intuitively, if $ G(x)_i $ is large, expert $ i $ contributes more to the final output $ y $. The function $ C $ often takes the form of a simple weighted sum or concatenation-and-projection step, depending on the specific design.
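The two formulations can be compared side by side on a single input. This is a sketch under simplifying assumptions: the experts and the gate are single linear layers, the combiner $ C $ is taken to be the identity, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, d, m, k = 4, 8, 3, 2                       # experts, input dim, output dim, top-k
x = torch.randn(d)
experts = [nn.Linear(d, m) for _ in range(N)]  # E_1 ... E_N
gate = nn.Linear(d, N)                         # G, before the softmax

g = torch.softmax(gate(x), dim=-1)             # G(x), a vector of N weights
outs = torch.stack([E(x) for E in experts])    # (N, m), one row per expert

# Dense gating: every expert contributes, weighted by G(x)_i.
y_dense = (g.unsqueeze(-1) * outs).sum(dim=0)

# Top-k gating: only the k experts with the largest G(x)_i are evaluated.
top_idx = torch.topk(g, k).indices.tolist()
y_topk = sum(g[i] * experts[i](x) for i in top_idx)

print(y_dense.shape, y_topk.shape)
```

Setting `k = N` makes the top-k sum identical to the dense one, which is a quick way to check that the two code paths agree.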

Load Balancing and Expert Monopolization

As we saw earlier, MoE architectures often introduce an *auxiliary load-balancing loss* to balance load across all experts. This can be thought of as an additional penalty term that measures how evenly tokens (or batches) are distributed across experts. One common strategy is penalizing the variance in expert usage or encouraging each expert to receive roughly an equal proportion of examples. The gating network thus learns not only to minimize the primary task loss but also to spread tokens more uniformly.

Mathematically, a typical load-balancing loss might look like:

$ \mathcal{L}_{\text{balance}} = \lambda \sum_{i=1}^N \Bigl(\frac{f_i}{\sum_j f_j} - \frac{1}{N}\Bigr)^2, $

where $ f_i $ is the total number of tokens assigned to expert $ i $ in a minibatch (or the sum of gating probabilities if you’re using a continuous measure), and $ \lambda $ is a hyperparameter controlling the strength of this penalty. This ensures the model is incentivized to explore and train all experts over time.
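The penalty above translates almost directly into code. In this sketch the function name `load_balance_loss` and the argument name `lam` (for $ \lambda $) are hypothetical; it accepts either hard token counts or summed gating probabilities as the usage vector $ f $.

```python
import torch

def load_balance_loss(f, lam=0.01):
    """lam * sum_i (f_i / sum_j f_j - 1/N)^2 for a length-N usage vector f."""
    f = f.float()
    share = f / f.sum()
    return lam * ((share - 1.0 / f.numel()) ** 2).sum()

# Hard counts: 4 tokens routed top-1 to experts [0, 0, 0, 1], with N = 4 experts.
counts = torch.bincount(torch.tensor([0, 0, 0, 1]), minlength=4)
skewed = load_balance_loss(counts, lam=1.0)

# Perfectly even usage gives zero penalty.
even = load_balance_loss(torch.tensor([2, 2, 2, 2]), lam=1.0)

# Soft variant: summing gating probabilities keeps the loss differentiable,
# so gradients can flow back into the router.
torch.manual_seed(0)
logits = torch.randn(6, 4, requires_grad=True)
soft_f = torch.softmax(logits, dim=-1).sum(dim=0)
soft = load_balance_loss(soft_f)
soft.backward()
```

For the skewed counts the usage shares are 0.75, 0.25, 0, and 0, so the penalty evaluates to 0.375 with `lam=1.0`. Hard counts carry no gradient, which is why the continuous measure is the one that can actually train the gate.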


Example PyTorch Implementation

Here’s a simplified Python implementation of an MoE model using PyTorch:

import torch
import torch.nn as nn

class Expert(nn.Module):
    def __init__(self, input_size, output_size):
        super(Expert, self).__init__()
        self.layer = nn.Linear(input_size, output_size)
    
    def forward(self, x):
        return self.layer(x)

class GatingNetwork(nn.Module):
    def __init__(self, input_size, num_experts):
        super(GatingNetwork, self).__init__()
        self.layer = nn.Linear(input_size, num_experts)
        self.softmax = nn.Softmax(dim=1)
    
    def forward(self, x):
        return self.softmax(self.layer(x))

class MixtureOfExperts(nn.Module):
    def __init__(self, input_size, output_size, num_experts, top_k=1):
        super(MixtureOfExperts, self).__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.output_size = output_size  # stored so forward() does not rely on a global
        self.experts = nn.ModuleList([Expert(input_size, output_size) for _ in range(num_experts)])
        self.gating = GatingNetwork(input_size, num_experts)
    
    def forward(self, x):
        batch_size = x.size(0)
        output_size = self.output_size
        # Gating probabilities over all experts: (batch_size, num_experts)
        gating_probs = self.gating(x)
        # Probabilities and indices of the top-k experts per input: (batch_size, top_k)
        topk_vals, topk_inds = torch.topk(gating_probs, self.top_k, dim=1)
        expert_outputs = torch.zeros(batch_size, self.top_k, output_size, device=x.device)
        for i in range(self.top_k):
            inds = topk_inds[:, i]
            # Run each input through its i-th ranked expert (simple but slow per-sample loop)
            outputs_for_expert = torch.stack([self.experts[inds[b]](x[b].unsqueeze(0)) for b in range(batch_size)])
            expert_outputs[:, i, :] = outputs_for_expert.squeeze(1)
        # Weight each expert output by its gating probability and sum over the k experts
        topk_vals_expanded = topk_vals.unsqueeze(-1)
        weighted_sum = expert_outputs * topk_vals_expanded
        combined_output = weighted_sum.sum(dim=1)
        return combined_output

# Example Usage
input_size = 10
output_size = 5
num_experts = 4
top_k = 2
model = MixtureOfExperts(input_size, output_size, num_experts, top_k)
sample_input = torch.randn(8, input_size)
output = model(sample_input)
print("Output shape:", output.shape)

When you run this code, you should see the output shape printed as torch.Size([8, 5]): one 5-dimensional vector per input, produced by routing each input to its top-2 experts and combining their weighted outputs.
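The toy implementation loops over every sample, which is easy to read but slow. As an alternative sketch, the function below loops over experts instead and processes all tokens assigned to an expert in one batched call. The function name `moe_forward_batched` is hypothetical, and the experts are assumed to be plain `nn.Linear` layers so that `out_features` is available.

```python
import torch
import torch.nn as nn

def moe_forward_batched(x, experts, gate, top_k):
    """Top-k MoE forward pass that batches all tokens routed to the same expert."""
    probs = torch.softmax(gate(x), dim=-1)                     # (batch, num_experts)
    topk_vals, topk_inds = torch.topk(probs, top_k, dim=-1)    # (batch, top_k)
    out = x.new_zeros(x.size(0), experts[0].out_features)
    for e, expert in enumerate(experts):
        # Rows (inputs) that selected expert e, and the top-k slot where it appears.
        rows, slots = (topk_inds == e).nonzero(as_tuple=True)
        if rows.numel() == 0:
            continue  # no input chose this expert in this batch
        weights = topk_vals[rows, slots].unsqueeze(-1)
        # Accumulate the weighted expert output into the matching rows.
        out.index_add_(0, rows, weights * expert(x[rows]))
    return out

torch.manual_seed(0)
experts = nn.ModuleList([nn.Linear(10, 5) for _ in range(4)])
gate = nn.Linear(10, 4)
x = torch.randn(8, 10)
y = moe_forward_batched(x, experts, gate, top_k=2)
print("Output shape:", y.shape)
```

The result matches the per-sample loop numerically; the difference is that each expert runs once per batch rather than once per input, which is closer to how larger implementations dispatch tokens.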

Code Explanation

This code defines a simple MoE model with a single gating network and multiple experts. The Expert class represents a single expert sub-module: a linear layer that transforms the input and forms the backbone of the MoE architecture. The expert takes an input of size input_size and transforms it to output_size using a fully connected (Linear) layer. In real models, each expert is typically a larger feed-forward block, and the experts come to specialize in different transformations through training.

The GatingNetwork class is the router that computes gating probabilities for each expert and determines which experts to activate for a given input. It takes the input size and the number of experts as input and outputs a probability distribution over the experts (num_experts) using a linear layer followed by a Softmax function. Higher probability values indicate that the corresponding expert is more relevant to the input.

The MixtureOfExperts class combines the experts’ outputs based on the gating probabilities and returns the final output. It takes the input size, output size, number of experts, and the top-k value as input. The forward method computes the gating probabilities, selects the top-k experts based on these probabilities, and computes the weighted sum of the expert outputs to produce the final output. The top_k parameter controls how many experts are activated for each input.

  • The input x is passed through the gating network to produce a probability distribution over the experts: gating_probs = self.gating(x)
  • torch.topk then returns the probabilities and indices of the top_k experts with the highest probabilities: topk_vals, topk_inds = torch.topk(gating_probs, self.top_k, dim=1)
  • For each of the selected experts, process the input:
    expert_outputs = torch.zeros(batch_size, self.top_k, output_size, device=x.device)
    for i in range(self.top_k):
        inds = topk_inds[:, i]
        outputs_for_expert = torch.stack([self.experts[inds[b]](x[b].unsqueeze(0)) for b in range(batch_size)])
        expert_outputs[:, i, :] = outputs_for_expert.squeeze(1)
    
  • Multiply each expert’s output by its gating probability and sum them to form the final output:
    topk_vals_expanded = topk_vals.unsqueeze(-1)
    weighted_sum = expert_outputs * topk_vals_expanded
    combined_output = weighted_sum.sum(dim=1)
    
    • The topk_vals (probabilities) are weights for the corresponding expert outputs.

The model processes a batch of inputs and returns the combined output with shape (batch_size, output_size), which is torch.Size([8, 5]) for the example above.

Code Dependencies

To run this code, you’ll need the following dependencies:

  1. Python 3.6 or later (preferably 3.10 or higher)
  2. PyTorch 1.0 or later

Save the code to a file, e.g., moe_example.py, and run it:

python moe_example.py

I used Conda, which I prefer for managing Python environments. If you’re using a virtual environment, you can adapt the installation commands accordingly. For Conda, you can create a new environment and install PyTorch using the following steps.

Start by creating a new conda environment with PyTorch dependencies. In your terminal, execute the following commands:

# Create a new conda environment named "moe_example"
conda create -n moe_example python=3.10 -y

# Activate the environment
conda activate moe_example

Install PyTorch and the necessary dependencies; adjust the CUDA version based on your system’s GPU configuration. You can omit the pytorch-cuda package if you are using a CPU-only setup.

# For CPU-only:
conda install pytorch torchvision torchaudio cpuonly -c pytorch -y

# For GPU (use appropriate CUDA version, e.g., 11.7):
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia -y

# Install additional dependencies if needed
pip install numpy

This example demonstrates the basic structure of an MoE model, including the experts, gating network, and the MoE module that combines them. Of course, this is a toy version that helps you understand the basic construct; it does not include advanced features such as load balancing or expert capacity limits.

Conclusion

Mixture of Experts offers a compelling framework for scaling neural networks and managing the trade-offs between model size, computational cost, and performance. By selectively activating only a subset of parameters for each input, MoE allows researchers and practitioners to build models with enormous capacity without incurring a proportionate computational penalty. Given the growing interest in inference optimization for LLMs, and for Transformer-based architectures more broadly, we expect to see further innovations and applications for MoE.


References