Introduction
Quote
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation. (Attention Is All You Need, Section 3.4, Embeddings and Softmax)
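The weight sharing described in this quote can be sketched in a few lines of PyTorch: the token embedding and the pre-softmax linear layer point to one and the same parameter matrix, so the model learns a single vocabulary representation for both reading tokens in and projecting hidden states back to logits. This is only a minimal sketch; the class and dimension names below (TiedLM, vocab_size, d_model) are illustrative assumptions, not something taken from the article.

```python
# Minimal sketch of tying the embedding and the pre-softmax projection,
# assuming a PyTorch-style decoder. Names are illustrative, not from the article.
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)           # token embedding, shape (vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)   # pre-softmax linear layer, weight shape (vocab_size, d_model)
        self.proj.weight = self.embed.weight                     # tie: both layers share one weight matrix
        self.scale = d_model ** 0.5                              # the paper also scales embeddings by sqrt(d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(token_ids) * self.scale    # (batch, seq, d_model); a real model would run Transformer blocks here
        return self.proj(h)                       # logits over the vocabulary, (batch, seq, vocab_size)

model = TiedLM()
logits = model(torch.randint(0, 32000, (2, 8)))
print(logits.shape)                               # torch.Size([2, 8, 32000])
print(model.proj.weight is model.embed.weight)    # True: a single shared parameter
```

Because the two layers hold the same tensor object, gradients from both the input side and the output side update the same matrix, which cuts the parameter count and usually does not hurt quality.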
What is multi-head attention?
How to understand the Transformer self-attention formula
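For reference, the self-attention formula that title points to is the scaled dot-product attention defined in the paper:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$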
Info