An Embedding Layer Is Hidden in the Cross-Entropy Loss

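This post takes PyTorch's torch.nn.Embedding as the reference implementation of an Embedding layer. It first verifies in code an operation equivalent to the Embedding layer, then shows by derivation that the cross-entropy loss (together with the classification layer) contains a hidden Embedding layer.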

An equivalent operation for the Embedding layer

An Embedding layer is equivalent to multiplying the one-hot encoding of the input by the layer's weight matrix. The code below verifies this.

import torch
import torch.nn as nn
import torch.nn.functional as F


if __name__ == '__main__':
    num_embeddings, embedding_dim = 100, 128
    embedding = nn.Embedding(num_embeddings, embedding_dim)

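    # Ten random indices in [0, num_embeddings)
    indices = torch.randint(0, num_embeddings, (10,))
    # Standard lookup: one row of the weight matrix per index
    output = embedding(indices)

    # Same result via one-hot vectors multiplied by the weight matrix
    one_hot_input = F.one_hot(indices, num_embeddings)
    output_2 = one_hot_input.type(torch.float) @ embedding.weight
    print(torch.allclose(output, output_2))  # prints True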

The Embedding operation inside the cross-entropy loss

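Let a sample have feature vector $\vec{x}$ and label $y$, whose one-hot encoding is written $\vec{e}_y$, and let the classifier weight be $W$ (one row per class, so that $W\vec{x}$ is the vector of logits).

The cross-entropy loss of this sample can be rewritten as follows: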

$$\begin{aligned} -\vec{e}_y^T\log^\circ \operatorname{softmax} (W\vec{x}) &= -\vec{e}_y^T \left(W\vec{x} - \operatorname{LSE}\left(W\vec{x}\right) \vec{1}\right) \\ &= -\vec{e}_y^T W\vec{x} + \vec{e}_y^T \left(\operatorname{LSE}\left(W\vec{x}\right) \vec{1}\right) \\ &= -\vec{e}_y^T W\vec{x} + \operatorname{LSE}\left(W\vec{x}\right) \\ \end{aligned}$$

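The derivation above uses $\vec{e}_y^T \vec{1} = 1$ together with the following identity, where $\log^\circ$ and $\exp^\circ$ are applied element-wise, $\operatorname{LSE}$ is the log-sum-exp function, and $\vec{1}$ is the all-ones vector: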
$$\begin{aligned} \log^\circ \operatorname{softmax} (W\vec{x}) &= W\vec{x} - \log{\left(\vec{1}^T \exp^\circ \left(W\vec{x}\right)\right)}\vec{1} \\ &= W\vec{x} - \operatorname{LSE}\left(W\vec{x}\right) \vec{1} \end{aligned}$$
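
The log-softmax identity can be checked numerically. The snippet below is a minimal sketch, assuming a randomly initialized weight and feature vector; the sizes `num_classes` and `feature_dim` are illustrative choices that do not come from the derivation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes (assumptions, not from the original post)
num_classes, feature_dim = 5, 8
W = torch.randn(num_classes, feature_dim)  # classifier weight, one row per class
x = torch.randn(feature_dim)               # feature vector of one sample

logits = W @ x                              # W x, shape (num_classes,)
lse = torch.logsumexp(logits, dim=0)        # scalar LSE(W x)
manual_log_softmax = logits - lse           # broadcasting plays the role of LSE(W x) * 1

# Compare against PyTorch's built-in log-softmax
identity_holds = torch.allclose(
    manual_log_softmax, F.log_softmax(logits, dim=0), atol=1e-6
)
print(identity_holds)
```

Subtracting the scalar `lse` from every logit is the code counterpart of subtracting $\operatorname{LSE}\left(W\vec{x}\right)\vec{1}$.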

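The term $\vec{e}_y^T W$ in the result is the product of the one-hot vector of $y$ with the classification-layer weight $W$, which has the same form as the equivalent operation of the Embedding layer. So $\vec{e}_y^T W$ can be implemented as an Embedding layer whose weight is the weight of the classification layer. In summary, the cross-entropy loss (strictly speaking, the cross-entropy loss together with the classification layer) effectively contains an Embedding operation.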
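
As a numerical check of this conclusion, the sketch below computes the loss as $-\vec{e}_y^T W\vec{x} + \operatorname{LSE}\left(W\vec{x}\right)$, with the product $\vec{e}_y^T W$ carried out by `F.embedding`, and compares it against `F.cross_entropy`. It is only an illustration under assumed settings: the batch size, class count, feature dimension, and random data are all hypothetical, and the classification layer is assumed to have no bias.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes (assumptions, not from the original post)
batch_size, num_classes, feature_dim = 4, 10, 16
W = torch.randn(num_classes, feature_dim)         # classifier weight, no bias
x = torch.randn(batch_size, feature_dim)          # one feature vector per row
y = torch.randint(0, num_classes, (batch_size,))  # integer class labels

# Reference: classification layer followed by the built-in cross-entropy
logits = x @ W.T                                  # (batch_size, num_classes)
reference = F.cross_entropy(logits, y, reduction="none")

# Rewritten form: look up row y of W as an embedding, then dot it with x
class_vectors = F.embedding(y, W)                 # (batch_size, feature_dim)
target_logit = (class_vectors * x).sum(dim=1)     # e_y^T W x for each sample
rewritten = -target_logit + torch.logsumexp(logits, dim=1)

losses_match = torch.allclose(reference, rewritten, atol=1e-4)
print(losses_match)
```

Here `F.embedding(y, W)` uses the classifier weight directly as the embedding table, so no one-hot vector is ever materialized.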

Revision history

20240403 Derived the formulas and published

Copyright notice

Attribution-NonCommercial-ShareAlike 4.0 International License