sklearn 中 GMM 的 predict_proba 输出的是什么概率

引子: 这是今天下班时一位同事问的问题, 一时不知怎么作答, 遂读了一下 sklearn 的源码, 写下这篇文章.

在 sklearn 中 GMM (高斯混合模型) 是用 GaussianMixture 实现的, 其方法 predict_proba 可以得到样本属于每个分量的概率, 而我们知道通过 GMM 只能得到样本在每个分量中的概率密度, 那每个分量的概率又是什么?

于是阅读代码, 定位到 BaseMixture.predict_proba (BaseMixture 是 GaussianMixture 的基类) 这个函数, 其实现如下 (sklearn 版本为 0.23.1)

def predict_proba(self, X):
    """Predict posterior probability of each component given the data.
    """
    check_is_fitted(self)
    X = _check_X(X, None, self.means_.shape[1])
    _, log_resp = self._estimate_log_prob_resp(X)
    return np.exp(log_resp)

进而定位到 BaseMixture._estimate_log_prob_resp, 其实现如下:

def _estimate_log_prob_resp(self, X):
    """Estimate log probabilities and responsibilities for each sample.
    """
    weighted_log_prob = self._estimate_weighted_log_prob(X)
    log_prob_norm = logsumexp(weighted_log_prob, axis=1)
    with np.errstate(under='ignore'):
        # ignore underflow
        log_resp = weighted_log_prob - log_prob_norm[:, np.newaxis]
    return log_prob_norm, log_resp

其中 weighted_log_prob 是样本每个分量的的加权对数概率密度 (具体可见下一节). 为了讨论和推导简便, 只考虑一个样本, 并设 weighted_log_prob 为 $\vec{x}$, 则 log_prob_norm 为 $\operatorname{LSE}(\vec{x}) = \log \left( {\sum\nolimits_{i = 1}^n \exp \left(x_i \right) } \right)$, log_resp 为 $\vec{x} - \operatorname{LSE}(\vec{x}) \vec{1}$.

于是 BaseMixture.predict_proba 返回结果为 $\exp^\circ \left(\vec{x} - \operatorname{LSE}(\vec{x}) \vec{1}\right)$. (一些符号约定: $f^\circ(\cdot)$ 表示一元函数 $f(\cdot)$ 的按元素版本, 如 $\exp^\circ(\vec{x}) = [\exp(x_1); \cdots;\exp(x_n)]$, ; 表示矩阵元素纵向连接.) 下面对该式进行化简:

$$\begin{aligned} \exp^\circ \left(\vec{x} - \operatorname{LSE}(\vec{x}) \vec{1}\right) &= \frac{\exp^\circ \left(\vec{x}\right)}{\exp \left(\operatorname{LSE}(\vec{x})\right)}\\ &= \frac{\exp^\circ \left(\vec{x}\right)}{\exp \left(\log \left( \sum\nolimits_{i = 1}^n \exp \left(x_i \right) \right)\right)}\\ &= \frac{[\exp(x_1); \cdots;\exp(x_n)]}{\sum\nolimits_{i = 1}^n \exp \left(x_i \right) }\\ &= \operatorname{softmax}(\vec{x}) \end{aligned} $$

所以: GaussianMixture.predict_proba 输出的概率是样本每个分量的的加权对数概率密度经过 softmax 函数归一化后的结果.

`weighted_log_prob` 的计算过程

weighted_log_prob 是通过 BaseMixture._estimate_weighted_log_prob 函数得到的, 其实现如下:

def _estimate_weighted_log_prob(self, X):
    """Estimate the weighted log-probabilities, log P(X | Z) + log weights.
    """
    return self._estimate_log_prob(X) + self._estimate_log_weights()

其中
1) GaussianMixture._estimate_log_prob 得到: 每个分量的对数概率密度 (计算较复杂, 不再赘述).
2) GaussianMixture._estimate_log_weights 得到: 每个分量的权重的对数.

sklearn 中 GMM 的 predict_proba 输出的是什么概率

weighted_log_prob 的计算过程

`weighted_log_prob` 的计算过程