Reading Notes: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Motivations
A model's performance is closely related to its number of parameters: the bigger the model, the more performant it tends to be. However, the computational cost also grows with model size. To mitigate this problem, various forms of conditional computation have been proposed to increase model performance without a proportional increase in computational cost.[1]
Today I would like to share the Sparsely-Gated Mixture-of-Experts (MoE) layer as proposed in this paper.[1]

MoE architecture
There are $n$ experts in the MoE layer (denoted $E_1, E_2, \dots, E_n$), and they are controlled by a gating network $G$. The output of the gating network $G$ is a vector of length $n$, and the layer's output is the gate-weighted combination of the expert outputs: $y = \sum_{i=1}^{n} G(x)_i \, E_i(x)$.
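To make the structure concrete, here is a minimal sketch of this gate-and-combine idea in PyTorch. It is an illustrative assumption, not the paper's implementation: the class name `MoELayer`, the choice of small feed-forward blocks as experts, and the parameters `d_model` and `n_experts` are mine, and the gate here is a plain dense softmax (the paper's layer additionally sparsifies the gate so that only a few experts are evaluated).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Dense (non-sparse) Mixture-of-Experts sketch: a gating network
    produces a length-n weight vector, and the layer output is the
    weighted sum of the n expert outputs."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        # Each expert is a small feed-forward block (an assumption made
        # here for illustration).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model))
            for _ in range(n_experts)
        ])
        # Gating network G: maps the input to a length-n weight vector.
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # G(x): softmax over experts, shape (batch, n_experts)
        gate_weights = F.softmax(self.gate(x), dim=-1)
        # Expert outputs stacked to shape (batch, n_experts, d_model)
        expert_outputs = torch.stack([E(x) for E in self.experts], dim=1)
        # y = sum_i G(x)_i * E_i(x)
        return (gate_weights.unsqueeze(-1) * expert_outputs).sum(dim=1)

# Usage: a batch of 4 inputs routed through 8 experts.
layer = MoELayer(d_model=16, n_experts=8)
y = layer(torch.randn(4, 16))
print(y.shape)  # torch.Size([4, 16])
```

Note that this dense version evaluates every expert on every input, so it saves no computation; the savings in the paper come from making $G(x)$ sparse, which we will get to next.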