Mixture of Experts: The Biggest AI Models Are Actually a Bunch of Smaller Models in a Trench Coat

GPT-4 reportedly has 1.8 trillion parameters. That's a number so large it stops meaning anything - like hearing that the sun is 93 million miles away. Okay, sure. But here's the part that doesn't get enough attention: OpenAI has never confirmed the figure or the architecture, but if the reports are right, most of those 1.8 trillion parameters sit there doing absolutely nothing when you ask GPT-4 a question. Only a small fraction activates for any given input. This isn't a bug. It's the whole point.

Welcome to Mixture of Experts (MoE), the architecture trick that lets AI labs build absurdly large models without needing absurdly large compute budgets to run them.

The Basic Idea Is Surprisingly Simple

A standard neural network (called a "dense" model) activates every parameter for every input. Ask it about cooking, quantum physics, or Taylor Swift's discography - every neuron fires every time. This is wildly inefficient, like turning on every light in your house to find the bathroom at 3 AM.

A Mixture of Experts model breaks the network into sub-networks called "experts." A small routing network (the "router," also called the gating network) looks at each token and decides which experts to activate. As a rough intuition: for a cooking question, maybe experts 3, 17, and 42 fire up. For a physics question, it's experts 8, 23, and 56. Most experts stay dormant.

The result: the model can have trillions of total parameters (giving it massive knowledge capacity) while only using a fraction of them for each token (keeping compute costs manageable). You get the knowledge of a huge model with the speed of a smaller one. It's the architectural equivalent of having a hospital with 200 specialists on call but only paging the three you need.
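
To make the routing idea concrete, here is a minimal toy sketch in plain Python. Everything in it is an assumption made for illustration: the class name ToyMoELayer, the sizes (8 experts, 2 active, 4-dimensional tokens), and the choice to make each expert a single random linear map. Real MoE layers use learned feed-forward blocks and batched tensor math, but the control flow is similar: score every expert, keep the top few, run only those, and blend the results.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class ToyMoELayer:
    """Hypothetical top-k MoE layer; each expert is one random linear map."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.router = rand_matrix(num_experts, dim)  # one score row per expert
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def forward(self, token):
        # 1. The router scores every expert for this token.
        scores = matvec(self.router, token)
        # 2. Keep only the top_k highest-scoring experts.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = ranked[: self.top_k]
        # 3. Renormalise the chosen scores into mixing weights.
        weights = softmax([scores[i] for i in chosen])
        # 4. Run only the chosen experts and blend their outputs.
        output = [0.0] * len(token)
        for weight, idx in zip(weights, chosen):
            expert_out = matvec(self.experts[idx], token)
            output = [o + weight * e for o, e in zip(output, expert_out)]
        return output, chosen

layer = ToyMoELayer(dim=4, num_experts=8, top_k=2)
output, chosen = layer.forward([0.5, -1.0, 0.25, 2.0])
print(f"experts used: {chosen} out of {len(layer.experts)}")
```

Six of the eight expert matrices are never touched for this token, which is where the compute savings come from. The router itself is cheap: one small matrix multiply per token.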

A Brief History of Not Using All Your Neurons

The MoE concept dates to 1991 (Jacobs et al.), but nobody had models large enough for it to matter. Google researchers revived it in 2017 with the sparsely-gated MoE layer (Shazeer et al.), and the Switch Transformer paper in 2021 simplified routing to one expert per token, making it practical at scale.

Then Mixtral 8x7B from Mistral AI dropped in December 2023, telling the open-source community "hey, MoE works outside the big labs." Mixtral has about 47 billion total parameters but activates only about 13 billion per token. Mistral reported it matching or beating Llama 2 70B, a dense model with roughly five times as many active parameters, on most benchmarks (Jiang et al.).
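
If "8x7B" makes you expect 56 billion parameters, the arithmetic below shows roughly where the 47 billion and 13 billion figures come from. The hyperparameters are assumptions that do not appear in this article: they are the commonly published Mixtral 8x7B configuration values as best recalled here, and the count ignores small terms like normalization weights. Treat it as a back-of-the-envelope sketch, not an official accounting.

```python
# Assumed Mixtral 8x7B hyperparameters (not stated in this article).
HIDDEN = 4096            # model width
FFN_HIDDEN = 14336       # width inside each expert's feed-forward block
LAYERS = 32
EXPERTS = 8
EXPERTS_PER_TOKEN = 2
VOCAB = 32000
KV_DIM = 1024            # grouped-query attention: 8 KV heads x 128 dims

expert_params = 3 * HIDDEN * FFN_HIDDEN                       # gate, up, down projections
attention_params = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q and o, plus k and v
router_params = HIDDEN * EXPERTS                              # one score per expert
embedding_params = 2 * VOCAB * HIDDEN                         # input embeddings + output head

def model_params(experts_counted):
    """Parameter count when `experts_counted` experts per layer are included."""
    per_layer = attention_params + router_params + experts_counted * expert_params
    return LAYERS * per_layer + embedding_params

total = model_params(EXPERTS)             # everything stored in memory
active = model_params(EXPERTS_PER_TOKEN)  # what a single token actually uses

print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

Only the feed-forward experts are replicated eight times. Attention and embeddings are shared, which is why the total lands below 8 x 7B and the active count lands above 2/8 of the total.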

The Router Is the Whole Ballgame

The most interesting part of MoE architectures isn't the experts - it's the gating network that decides which experts to use. This router is trained alongside the experts, and it has to solve a surprisingly tricky optimization problem.

The obvious failure mode is "expert collapse" - where the router learns to send everything to the same two or three experts while the rest collect dust. If expert 7 happens to be slightly better early in training, it gets more data, gets better, gets even more data, and becomes the model's entire personality. In a 64-expert model, the other 63 atrophy into expensive paperweights.

To prevent this, researchers use load-balancing losses - penalty terms that punish the router for playing favorites. The details vary by implementation, but the goal is always the same: spread the work around so all experts develop useful specializations.
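
As one concrete example, the sketch below follows the auxiliary loss described in the Switch Transformer paper (Fedus et al.): alpha * N * sum over experts of f_i * P_i, where f_i is the fraction of tokens sent to expert i and P_i is the average router probability for expert i. The function name, the toy inputs, and the alpha value are assumptions for illustration. Other implementations use different penalties.

```python
def load_balancing_loss(router_probs, assignments, num_experts, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i(f_i * P_i).

    router_probs: one list of per-expert probabilities for each token
    assignments:  index of the expert each token was sent to (top-1 routing)
    alpha:        assumed small coefficient that scales the penalty
    """
    num_tokens = len(assignments)
    # f_i: fraction of tokens actually dispatched to expert i
    fraction = [assignments.count(i) / num_tokens for i in range(num_experts)]
    # P_i: average probability the router gave expert i across all tokens
    mean_prob = [
        sum(probs[i] for probs in router_probs) / num_tokens
        for i in range(num_experts)
    ]
    return alpha * num_experts * sum(f * p for f, p in zip(fraction, mean_prob))

# Balanced router: four tokens spread evenly over four experts.
balanced = load_balancing_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    num_experts=4,
)

# Collapsed router: every token goes to expert 0.
collapsed = load_balancing_loss(
    router_probs=[[0.97, 0.01, 0.01, 0.01]] * 4,
    assignments=[0, 0, 0, 0],
    num_experts=4,
)

print(f"balanced:  {balanced:.4f}")
print(f"collapsed: {collapsed:.4f}")
```

The collapsed router pays nearly four times the penalty of the balanced one. The loss is smallest when both the dispatch fractions and the probabilities are uniform, so adding it to the training objective nudges the router away from playing favorites.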

What the Experts Actually Specialize In

Nobody fully understands what individual experts learn. Early work suggested they'd specialize by topic - one expert for science, one for code. In practice, the specialization is messier. Experts tend to specialize by syntactic patterns and token-level properties rather than clean semantic categories. There's no neat "expert 12 is the history expert" mapping. Each expert develops a grab bag of overlapping competencies, and the router figures out useful combinations.

Why This Matters (and the Tradeoffs)

MoE is one of the main reasons AI models keep getting more capable without requiring proportionally more compute. Running every parameter of a trillion-parameter dense model for every token would cost far more per query. With MoE, you scale knowledge capacity somewhat independently from inference cost.

The downsides: MoE models are harder to train, fine-tune, and deploy. Every expert has to sit in memory even when most are idle, so the memory footprint tracks the total parameter count, not the active one. And at equal total parameter counts, MoE models tend to underperform dense models, which can show up on deep reasoning tasks - each token only ever passes through a fraction of the weights, so the routing becomes a bottleneck when the model needs all its knowledge at once.
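
The memory side of that tradeoff is easy to put numbers on. The sketch below uses the Mixtral-sized figures from earlier in this article (47 billion total, 13 billion active) and compares them to a hypothetical 13-billion-parameter dense model. Two assumptions are baked in: weights are stored in 16 bits, and a forward pass costs roughly 2 floating-point operations per active parameter per token, a common rule of thumb and not an exact figure.

```python
BYTES_PER_PARAM = 2  # assumption: 16-bit weights

def weight_memory_gb(total_params):
    """Memory needed just to hold the weights, in gigabytes."""
    return total_params * BYTES_PER_PARAM / 1e9

def forward_flops_per_token(active_params):
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter."""
    return 2 * active_params

moe_total, moe_active = 47e9, 13e9   # Mixtral-sized MoE, per the figures above
dense_params = 13e9                  # hypothetical dense model of the same active size

moe_memory = weight_memory_gb(moe_total)
dense_memory = weight_memory_gb(dense_params)
moe_flops = forward_flops_per_token(moe_active)
dense_flops = forward_flops_per_token(dense_params)

print(f"MoE:   {moe_memory:.0f} GB of weights, {moe_flops:.1e} FLOPs per token")
print(f"Dense: {dense_memory:.0f} GB of weights, {dense_flops:.1e} FLOPs per token")
```

Under these assumptions, the two models do about the same work per token, but the MoE model needs about 94 GB for its weights against 26 GB for the dense one. You pay for the idle experts in memory, not in compute.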

But the efficiency wins are too large to ignore, and every major lab is betting on some version of this approach. The trench coat isn't coming off anytime soon.

References

Jacobs RA, et al. Adaptive Mixtures of Local Experts. Neural Computation. 1991. DOI: 10.1162/neco.1991.3.1.79
Shazeer N, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR. 2017. arXiv: 1701.06538
Fedus W, Zoph B, Shazeer N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR. 2022. arXiv: 2101.03961
Jiang AQ, et al. Mixtral of Experts. Mistral AI. 2024. arXiv: 2401.04088