Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Categories: llm, research paper

Author: Santosh Sawant

Published: April 4, 2024

The Transformer FLOPs equation, or FLOPs per token, is one of the key quantities in setting the compute budget for any transformer-based LLM. In practice, not all tokens and sequences require the same amount of compute to make an accurate prediction. And yet, transformer-based language models spread FLOPs uniformly across input sequences.
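As a rough, illustrative sketch (a standard back-of-the-envelope approximation, not a formula from the MoD paper), forward-pass FLOPs per token are often estimated as about twice the parameter count plus an attention term that grows with context length:

```python
# Back-of-the-envelope forward-pass FLOPs per token for a dense transformer
# (standard approximation, not taken from the MoD paper; values are illustrative).

def flops_per_token(n_params: int, n_layers: int, d_model: int, ctx_len: int) -> float:
    dense = 2 * n_params                          # ~2 FLOPs per parameter (multiply + add)
    attention = 2 * n_layers * ctx_len * d_model  # QK^T scores and attention-weighted values
    return dense + attention

# Example: a ~1B-parameter model, 24 layers, d_model = 2048, 4k context
print(f"{flops_per_token(1_000_000_000, 24, 2048, 4096):.2e} FLOPs per token")
```

Under this estimate every token at every position costs the same, which is exactly the uniformity MoD sets out to relax.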

So can transformers learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimizing the allocation along the sequence for different layers across the model depth?

This is what researchers from Google are trying to address with the Mixture-of-Depths (MoD) Transformer. MoD is similar to mixture-of-experts (MoE) transformers, in which a router chooses among potential computational paths. But unlike in MoE transformers, the possible choices are a standard block’s computation (i.e., self-attention and MLP) or a residual connection (i.e., skipping the block entirely).

In general, the MoD method enforces a total compute budget by capping the number of tokens (𝑘) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-𝑘 routing mechanism. Since 𝑘 is defined a priori, this simple procedure uses a static computation graph with known tensor sizes. Nevertheless, since the identities of the 𝑘 tokens are fluid, the MoD method can expend FLOPs non-uniformly across the time and model-depth dimensions. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently.
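A minimal sketch of what such a layer could look like, assuming a PyTorch-style implementation with a per-token scalar router and a fixed capacity fraction; names such as `MoDBlock` and `capacity` are illustrative and not from the paper:

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Minimal Mixture-of-Depths sketch (illustrative; simplified vs. the paper).

    A scalar router scores every token; the top-k tokens per sequence go through
    the wrapped transformer block, all other tokens take the residual path.
    """

    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block                    # f(x): self-attention + MLP update
        self.router = nn.Linear(d_model, 1)   # one routing weight per token
        self.capacity = capacity              # fraction of tokens given full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity * T))     # k is fixed a priori -> static graph

        scores = self.router(x).squeeze(-1)    # (B, T) router weights
        vals, idx = scores.topk(k, dim=-1)     # top-k tokens per sequence

        batch = torch.arange(B, device=x.device).unsqueeze(-1)
        selected = x[batch, idx]               # (B, k, D): only these see the block,
        update = self.block(selected)          # so attention/MLP cost scales with k, not T

        # Scale the update by the router weight so routing stays on the gradient path,
        # then write it back; unselected tokens keep their residual value unchanged.
        gate = torch.sigmoid(vals).unsqueeze(-1)
        out = x.clone()
        out[batch, idx] = selected + gate * update
        return out
```

Scaling the block output by the router weight is what lets the routing decision be trained with the rest of the network, since the router would otherwise receive no gradient.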

In evaluation, Mixture-of-Depths (MoD) transformer models not only match baseline performance for equivalent training FLOPs and wall-clock time, but also require a fraction of the FLOPs per forward pass and can be upwards of 50% faster to step during post-training sampling.

Paper : https://arxiv.org/pdf/2404.02258.pdf