AI is a field drowning in buzzwords. As soon as you think you’re up to date, a new nonsense-sounding term comes along. The latest one is mixture of experts or MoE. It’s an AI architecture that allows large language models to run more efficiently, though training them takes more effort.
I’ll dig into the nitty gritty in a moment, but it’s worth noting that the majority of the most powerful open models like DeepSeek V3 and DeepSeek R1, Meta Llama 4 Maverick and Scout, and Qwen 3 235B now use a mixture-of-experts architecture, and there have been persistent rumors that OpenAI’s GPT models have used it since GPT-4.
Unfortunately, corporate secrecy being what it is, we have no confirmation from the likes of OpenAI, Anthropic, and Google that they’re using the architecture. But it seems incredibly likely that they are, or soon will be.
With that caveat out of the way, let’s look at mixture of experts—what it is, what it’s good for, and what the downsides are.
What is mixture of experts?
A mixture-of-experts (MoE) model is any machine learning model composed of multiple smaller, specialized models (experts) and a gating or routing network that decides which expert (or experts) handles any given input.
Most of the time, if you see MoE, it’s referring to large language models (LLMs) and multimodal models, just because they’re the most popular models right now. But any AI model can theoretically use the structure.
An easy way to think of it is that a regular (or dense) AI model has one super-intelligent expert, while a MoE model is a team of more specialized experts, plus a manager who decides which experts tackle which problems.
Now let’s dig a little more into how mixture of experts works.
How a mixture-of-experts model works
To understand how a MoE model works, it’s easiest to first look at how a dense model (the other option) works.
Dense models, explained
In a typical LLM like Llama 3 70B, every token in your prompt activates every parameter in the model. Say you enter a short prompt like "what is a llama?" That's five tokens, so all 70 billion parameters are activated five times. Each parameter only requires a small amount of compute and a small amount of electricity, but it adds up when there are 70 billion of them firing for every token.
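To put rough numbers on that, a common rule of thumb is that a transformer forward pass costs on the order of 2 FLOPs per parameter per token. Here's a quick back-of-envelope sketch in Python (illustrative math, not a benchmark):

```python
# Back-of-envelope cost of a dense forward pass, using the rough rule of
# thumb of ~2 FLOPs per parameter per token. Illustrative, not a benchmark.

DENSE_PARAMS = 70e9              # Llama 3 70B: every parameter is active
FLOPS_PER_PARAM_PER_TOKEN = 2    # rough rule of thumb for a forward pass

def forward_flops(num_params: float, num_tokens: int) -> float:
    """Approximate compute to push num_tokens through a dense model."""
    return num_params * FLOPS_PER_PARAM_PER_TOKEN * num_tokens

# "what is a llama?" is about five tokens
print(f"{forward_flops(DENSE_PARAMS, 5):.1e} FLOPs")  # ~7.0e11 FLOPs
```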
The catch is that the vast majority of those parameters are giving the AI equivalent of a shrug. All the parameters to do with complex math, generating computer code, and French are getting activated, but they just aren’t relevant to the prompt, so they don’t do very much.
Dense models, as they're called, make sense for smaller models because they're relatively simple. There's no additional computational overhead in deciding which groups of parameters to activate or how to divide up the training load. But as dense models get larger, the number of parameters being activated unnecessarily grows. Llama 3 405B costs more to run because every token activates almost six times as many parameters as Llama 3 70B, all of which have to run on GPUs in a data center.
Mixture-of-experts, explained
Mixture-of-experts models have multiple sub-models or experts, though the exact architecture can vary quite a bit.
The whole model is seldom split into separate systems. Instead, there are a number of mixture-of-experts layers in the neural network, and each of these is split into a number of experts. For example, Llama 4 Scout has 16 experts per layer, while Llama 4 Maverick has 128 experts per layer.
Instead of activating every parameter for every token, a routing function determines which expert (or experts) each token gets sent to in the mixture-of-experts layers. So, although Llama 4 Maverick has 400 billion total parameters and Llama 4 Scout has 109 billion, they both only activate 17 billion parameters for each token. This is called sparsity, and you'll sometimes see MoE models described as "sparse mixture-of-experts models" as a result.
While this might sound confusing (and it is), it gets a lot simpler when you realize that the "experts" aren't math or science or Shakespearean literature geniuses. Instead, each expert deals with things like punctuation, verbs, visual descriptions, proper names, and conjunctions. If you think about our prompt, "what is a llama?" it makes sense that one expert would deal with the "?" while another would handle the word llama.
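If code is clearer to you than analogies, here's a deliberately tiny sketch of what the routing inside one MoE layer looks like. The sizes, weights, and names are all made up for illustration, and real models learn these parameters inside transformer blocks, but the top-k routing logic is roughly this:

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 8        # hidden dimension of each token representation (toy size)
NUM_EXPERTS = 4   # experts in this MoE layer (toy size)
TOP_K = 2         # how many experts each token is routed to

# Each "expert" here is just a small feed-forward weight matrix.
experts = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(NUM_EXPERTS)]
# The router is a learned linear layer that scores every expert for a token.
router = rng.normal(size=(HIDDEN, NUM_EXPERTS))

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token to its top-k experts and mix their outputs."""
    scores = token @ router                       # one score per expert
    chosen = np.argsort(scores)[-TOP_K:]          # indices of the k best experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()                      # softmax over the chosen experts
    # Only the selected experts run; the rest stay idle (that's the sparsity)
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=HIDDEN)    # stand-in for one token's hidden state
print(moe_layer(token).shape)      # (8,) -- same shape as the input
```

The important part is the top-k selection: only the chosen experts do any work for a given token, which is exactly the sparsity described above.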
Compared with dense models, mixture-of-experts models are theoretically able to provide better performance at lower compute costs. There are bucketloads of qualifications that come with that (and I'll get to them in a moment), but with everything as equal as it can be, Llama 4 Maverick is both cheaper to run than Llama 3 70B and more powerful.
Mixture of experts pros and cons
As with literally every AI development, there are pros and cons to the mixture-of-experts architecture. The two main things to consider are the training of the models and the running of the models.
Training mixture-of-experts models
The big problem with mixture-of-experts models is that they’re significantly more complicated to train than dense models. While they have all the same data requirements and similarly rely on transformers, pre-training, fine-tuning, and all the usual AI development strategies, there are a few complicating factors:
- Mixture-of-experts models tend to be larger. Llama 4 Behemoth has two trillion total parameters, and even the smallest Llama MoE, Scout, has 109 billion parameters. Training models that large takes a huge amount of compute and time.
- It's challenging to train the routing function and ensure that each expert is trained appropriately. There are now widely accepted strategies for this (like the auxiliary load-balancing loss sketched below), but there's a reason that MoE models are only just becoming more widely used.
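To give a flavor of those strategies, here's a simplified sketch of the kind of auxiliary load-balancing loss popularized by work like Google's Switch Transformer. It gets added to the main training loss so the router doesn't collapse onto a couple of favorite experts and leave the rest undertrained. The function and variable names are mine, not from any specific library:

```python
import numpy as np

def load_balancing_loss(router_probs: np.ndarray,
                        expert_assignments: np.ndarray,
                        num_experts: int) -> float:
    """Auxiliary loss that nudges the router to spread tokens across experts.

    router_probs:       (num_tokens, num_experts) softmax outputs of the router
    expert_assignments: (num_tokens,) index of the expert each token was sent to
    """
    # f_i: fraction of tokens actually dispatched to each expert
    tokens_per_expert = np.bincount(expert_assignments, minlength=num_experts)
    f = tokens_per_expert / len(expert_assignments)
    # p_i: average router probability given to each expert
    p = router_probs.mean(axis=0)
    # Minimized (value 1.0) when both are uniform across the experts
    return float(num_experts * np.sum(f * p))

# Toy example: both tokens go to expert 0, so the loss is well above 1.0
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.6, 0.2, 0.1, 0.1]])
assignments = np.array([0, 0])
print(load_balancing_loss(probs, assignments, num_experts=4))  # 2.6
```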
Running mixture-of-experts models
Running (or inferencing) mixture-of-experts models is faster and cheaper than inferencing dense models, with a couple of caveats.
For two models with the same total parameter count, inferencing the MoE model will take less compute than the dense model as fewer parameters are active. For two models with the same active parameter count, inferencing the MoE model will likely be slower because of the additional overhead of the routing function—but the MoE model will be far more powerful.
While that sounds great, the tradeoff is that the whole model has to be stored in memory even though it isn't all activated for every token. For models like Llama 4 Maverick and Scout, that means server-class GPU clusters with hundreds of GB of VRAM, despite the fact that they only have 17 billion active parameters. Llama 3 70B, on the other hand, can run on a high-end workstation or even a specced-out MacBook Pro.
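You can see why with some quick arithmetic: a rough lower bound on memory is just the total parameter count times bytes per parameter. The sketch below assumes 16-bit weights; quantization can shrink this, and real deployments also need headroom for the KV cache and activations:

```python
def min_weight_memory_gb(total_params: float, bytes_per_param: float = 2) -> float:
    """Memory needed just to hold the weights, ignoring KV cache, activations,
    and runtime overhead. bytes_per_param=2 assumes 16-bit weights."""
    return total_params * bytes_per_param / 1e9

# Every parameter has to live in memory, even though only ~17B are active per token.
print(f"Llama 4 Maverick: ~{min_weight_memory_gb(400e9):.0f} GB")  # ~800 GB
print(f"Llama 4 Scout:    ~{min_weight_memory_gb(109e9):.0f} GB")  # ~218 GB
# ~140 GB at 16-bit; 4-bit quantization brings it closer to workstation territory
print(f"Llama 3 70B:      ~{min_weight_memory_gb(70e9):.0f} GB")
```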
Read more about mixture-of-experts models
I’ve tried to give a sensible and coherent overview of mixture-of-experts models for the AI-curious, but this is far from a deep dive.
If you want to get deep into the details and understand how dense and MoE models are actually developed and designed, here are some resources that are worth checking out.
More MoE to come?
While mixture-of-experts models have been around for a while, they're only now starting to have their moment. As AI models get more powerful, the tradeoffs involved in developing a model with a MoE architecture start to make more sense, especially since those models can be cheaper to run. The additional upfront training workload is offset by their inference efficiency.
I suspect that open models will continue to shift to mixture of experts, and that the companies behind proprietary models will start to confirm that they're using it too. The only situation MoE models don't make sense for is when you're running a model locally on a low-powered device, like a smartphone or laptop. In that case, small language models are going to be far more useful.