We propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables ...
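The idea of treating attention heads as experts can be illustrated with a minimal NumPy sketch. The snippet above is truncated and does not describe the routing, so everything here is an assumption borrowed from typical MoE layers: a linear router (`w_router`), top-k selection per token, and softmax gates over the selected heads. The function name `routed_heads_attention` and all parameter names are hypothetical. For clarity the sketch computes every head densely and then zeroes the unselected ones; an efficient implementation would skip them.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf map to exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def routed_heads_attention(x, w_q, w_k, w_v, w_router, top_k):
    """Hypothetical head-routing sketch (not the paper's exact method).

    x:        (tokens, d_model) input sequence
    w_q/k/v:  (heads, d_model, d_head) per-head projections
    w_router: (d_model, heads) assumed linear router
    top_k:    number of heads each token activates
    """
    n_tokens = x.shape[0]
    n_heads, _, d_head = w_q.shape

    # Standard scaled dot-product attention, computed per head.
    q = np.einsum("td,hde->hte", x, w_q)
    k = np.einsum("td,hde->hte", x, w_k)
    v = np.einsum("td,hde->hte", x, w_v)
    scores = np.einsum("hte,hse->hts", q, k) / np.sqrt(d_head)
    head_out = np.einsum("hts,hse->hte", softmax(scores), v)

    # Router: each token scores every head and keeps its top_k.
    logits = x @ w_router                                # (tokens, heads)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]    # (tokens, top_k)
    mask = np.zeros(logits.shape, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=-1)

    # Gates: softmax over selected heads only, zero elsewhere.
    gates = softmax(np.where(mask, logits, -np.inf))     # (tokens, heads)

    # Scale each head's output by its gate, then concatenate heads.
    weighted = gates.T[:, :, None] * head_out            # (heads, tokens, d_head)
    out = weighted.transpose(1, 0, 2).reshape(n_tokens, n_heads * d_head)
    return out, gates


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(4, 8, 2)) for _ in range(3))
w_router = rng.normal(size=(8, 4))
out, gates = routed_heads_attention(x, w_q, w_k, w_v, w_router, top_k=2)
print(out.shape, (gates > 0).sum(axis=-1))
```

Each row of `gates` sums to 1 and has exactly `top_k` nonzero entries, so every token draws on its own subset of heads.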