andersch.dev

<2025-03-10 Mon>
[ ai ]

Mixture of Experts (MoE)

Mixture of Experts (MoE) is a neural network architecture that aims to divide complex problems into specialized tasks handled by different expert models.

A gating network dynamically selects and combines the outputs of these experts based on input data, allowing the model to allocate computational resources efficiently.

Active parameters is a metric for the number of parameters actually used when processing a single token. In a sparse MoE this is much smaller than the total parameter count, since only a few experts run per token.
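
As a rough back-of-the-envelope sketch (hypothetical layer sizes, ignoring attention, embedding and bias parameters; not taken from any specific model), a top-2-of-8 MoE layer stores eight expert FFNs but only runs two of them per token:

  # illustrative Python; the sizes below are made up
  d_model   = 4096      # hidden size
  d_ff      = 14336     # FFN inner size
  n_experts = 8         # experts per MoE layer
  top_k     = 2         # experts activated per token

  params_per_expert = 2 * d_model * d_ff        # up- and down-projection weights
  total_params      = n_experts * params_per_expert
  active_params     = top_k * params_per_expert

  print(f"per layer: {total_params/1e6:.0f}M total, {active_params/1e6:.0f}M active per token")
  # per layer: 940M total, 235M active per token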

Examples

The 2 Main Elements of an MoE

Sparse MoE layers

  • Used instead of dense feed-forward network (FFN) layers.
  • Each MoE layer contains a set number of "experts" (e.g. 8).
  • Each expert is itself a neural network.
  • In practice experts are usually FFNs, but they can also be more complex networks or even MoEs themselves (i.e. hierarchical MoEs); see the sketch after this list.
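
A minimal sketch of this element in PyTorch (illustrative default sizes, not tied to any particular model): each expert is an ordinary FFN, and the sparse MoE layer simply holds several of them in place of one dense FFN.

  import torch
  import torch.nn as nn

  class ExpertFFN(nn.Module):
      """An ordinary Transformer feed-forward block; one 'expert'."""
      def __init__(self, d_model: int, d_ff: int):
          super().__init__()
          self.up   = nn.Linear(d_model, d_ff)
          self.down = nn.Linear(d_ff, d_model)

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          return self.down(torch.relu(self.up(x)))

  class SparseMoELayer(nn.Module):
      """Replaces one dense FFN layer with n_experts independent FFNs."""
      def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 8):
          super().__init__()
          self.experts = nn.ModuleList([ExpertFFN(d_model, d_ff) for _ in range(n_experts)])
          # which expert(s) each token actually uses is decided by the gate
          # network / router, described in the next section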

A gate network or router

  • Determines which tokens are sent to which expert.
  • Composed of learned parameters and is pretrained together with the rest of the network.
  • E.g., the token "More" is sent to the 2nd expert and the token "Parameters" to the 1st (see Figure 1).
  • A token can be sent to more than one expert.
  • How to route tokens is one of the key design decisions of an MoE; see the routing sketch after this list.
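
A minimal sketch of top-2 routing, reusing the SparseMoELayer from the previous sketch (an illustrative formulation, not the exact scheme of any specific paper): a learned linear gate scores each token against every expert, the two highest-scoring experts process the token, and their outputs are summed, weighted by the renormalized gate probabilities.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class Top2Router(nn.Module):
      """Learned gate: scores every token against every expert."""
      def __init__(self, d_model: int, n_experts: int):
          super().__init__()
          self.gate = nn.Linear(d_model, n_experts, bias=False)

      def forward(self, x: torch.Tensor):
          probs = F.softmax(self.gate(x), dim=-1)                # (n_tokens, n_experts)
          weights, expert_idx = probs.topk(2, dim=-1)            # keep the 2 best experts per token
          weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the chosen experts
          return weights, expert_idx

  def moe_forward(x, router, experts):
      """Send each token to its top-2 experts and sum the weighted expert outputs."""
      weights, expert_idx = router(x)
      out = torch.zeros_like(x)
      for slot in range(expert_idx.shape[-1]):           # 1st choice, 2nd choice
          for e, expert in enumerate(experts):
              mask = expert_idx[:, slot] == e            # tokens routed to expert e in this slot
              if mask.any():
                  out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
      return out

  # usage: route 4 token embeddings (e.g. "More", "Parameters", ...) through the layer
  layer  = SparseMoELayer(d_model=512, d_ff=2048, n_experts=8)
  router = Top2Router(d_model=512, n_experts=8)
  tokens = torch.randn(4, 512)
  y = moe_forward(tokens, router, layer.experts)         # shape (4, 512)

The per-expert loop is written for clarity, not speed; real implementations also add load-balancing mechanisms so that tokens spread evenly across experts instead of collapsing onto a few.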

[Figure: mixture-of-experts-switch-transformer.png]

Figure 1: We replace the dense feed forward network (FFN) layer present in the Transformer with a sparse Switch FFN layer (light blue). The layer operates independently on the tokens in the sequence. We diagram two tokens (x1 = "More" and x2 = "Parameters" below) being routed (solid lines) across four FFN experts, where the router independently routes each token. The switch FFN layer returns the output of the selected FFN multiplied by the router gate value (dotted line).

Resources