混淆矩阵及其衍生指标的数学公式和代码实现

1. 混淆矩阵及其衍生指标

混淆矩阵 (confusion matrix) 可以用来总结一个分类器的预测结果, 对于 $K$ 元分类, 其为一个 $K\times K$ 的矩阵, 不妨记为 $CM$. 其元素 $CM_{ij}$ 表示类别为 $i$ 的样本被预测为类别 $j$ 的数目.

由混淆矩阵, 我们可以得到很多关于测试集和分类器的基本信息或评估指标.

1.1. 基本信息

样本总数: $\operatorname{num\_samples} = N = \sum_{i=1}^{K} \sum_{j=1}^{K} CM_{ij}$

第 $i$ 类的 true positives 数目: $\operatorname{num\_true\_positives}_i = TP_i = CM_{ii}$

第 $i$ 类的 actual positives 数目: $\operatorname{num\_actual\_positives}_i = AP_i = \sum_{j=1}^{K} CM_{ij}$

第 $i$ 类的 predicted positives 数目: $\operatorname{num\_predicted\_positives}_i = PP_i = \sum_{j=1}^{K} CM_{ji}$

结论: $N = \sum_{i=1}^{K} AP_i = \sum_{i=1}^{K} PP_i$

证明:
$\sum_{i=1}^{K} AP_i = \sum_{i=1}^{K} \sum_{j=1}^{K} CM_{ij} = N$,
$\sum_{i=1}^{K} PP_i = \sum_{i=1}^{K} \sum_{j=1}^{K} CM_{ji} = N$

1.2 归一化混淆矩阵

按所在行的实际样本数归一化的混淆矩阵元素: $\frac{CM_{ij}}{AP_i}$

按所在列的预测样本数归一化的混淆矩阵元素: $\frac{CM_{ij}}{PP_j}$

按样本总数归一化的混淆矩阵元素: $\frac{CM_{ij}}{N}$

1.3. 准确率

准确率的定义比较特殊, 所以单独拎出来:

$\operatorname{accuracy} = \frac{\sum_{i=1}^K TP_i}{N}$

1.4. 各类的评估指标

第 $i$ 类的 recall: $\operatorname{recall}_i = \frac{TP_i}{AP_i}$

第 $i$ 类的 precision: $\operatorname{precision}_i = \frac{TP_i}{PP_i}$

第 $i$ 类的 f1 score: $\operatorname{f1\_score}_i = \frac{2 TP_i}{AP_i + PP_i}$

第 $i$ 类的 fbeta score: $\operatorname{f\beta\_score}_i = \frac{ (\beta^2 + 1) TP_i}{\beta^2AP_i + PP_i}$

第 $i$ 类的 jaccard index: $\operatorname{jaccard}_i = \frac{TP_i}{AP_i + PP_i - TP_i}$

注意到: 不考虑常数的话, 它们的分子均为 $TP_i$.

1.5. 各类评估指标的微平均

各类 recall 的微平均: $\operatorname{micro\_recall} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K AP_i}$

各类 precision 的微平均: $\operatorname{micro\_precision} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K PP_i}$

各类 f1 score 的微平均: $\operatorname{micro\_f1\_score} = \frac{\sum_{i=1}^K 2 TP_i}{\sum_{i=1}^K{AP_i + PP_i}}$

各类 fbeta score 的微平均: $\operatorname{micro\_f\beta\_score} = \frac{ \sum_{i=1}^K (\beta^2 + 1) TP_i}{\sum_{i=1}^K {\beta^2AP_i + PP_i}}$

各类 jaccard index 的微平均: $\operatorname{micro\_jaccard} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K {AP_i + PP_i - TP_i}}$

结论: $\operatorname{micro\_recall} = \operatorname{micro\_precision} = \operatorname{micro\_f1\_score} = \operatorname{micro\_f\beta\_score} = \operatorname{accuracy}$

证明: 因为 $N = \sum_{i=1}^{K} AP_i = \sum_{i=1}^{K} PP_i$, 所以:

$\operatorname{micro\_recall} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K AP_i} = \frac{\sum_{i=1}^K TP_i}{N}$

$\operatorname{micro\_precision} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K PP_i} = \frac{\sum_{i=1}^K TP_i}{N}$

$\operatorname{micro\_f1\_score} = \frac{\sum_{i=1}^K 2 TP_i}{\sum_{i=1}^K{AP_i + PP_i}}= \frac{\sum_{i=1}^K 2TP_i}{2N} = \frac{\sum_{i=1}^K TP_i}{N}$

$\operatorname{micro\_f\beta\_score} = \frac{ \sum_{i=1}^K (\beta^2 + 1) TP_i}{\sum_{i=1}^K {\beta^2AP_i + PP_i}} = \frac{ \sum_{i=1}^K (\beta^2 + 1) TP_i}{(\beta^2 + 1) N} = \frac{\sum_{i=1}^K TP_i}{N}$

1.6. 各类评估指标的宏平均

各类 recall 的宏平均: $\operatorname{macro\_recall} = \frac{1}{K} \sum_{i=1}^K \operatorname{recall}_i = \frac{1}{K} \sum_{i=1}^K \frac{TP_i}{AP_i}$

各类 precision 的宏平均: $\operatorname{macro\_precision} = \frac{1}{K} \sum_{i=1}^K \operatorname{precision}_i = \frac{1}{K} \sum_{i=1}^K \frac{ TP_i}{ PP_i}$

各类 f1 score 的宏平均: $\operatorname{macro\_f1\_score} = \frac{1}{K} \sum_{i=1}^K \operatorname{f1\_score}_i = \frac{1}{K} \sum_{i=1}^K \frac{2 TP_i}{AP_i + PP_i}$

各类 fbeta score 的宏平均: $\operatorname{macro\_f\beta\_score} = \frac{1}{K} \sum_{i=1}^K \operatorname{f\beta\_score}_i = \frac{1}{K} \sum_{i=1}^K \frac{ (\beta^2 + 1) TP_i}{\beta^2AP_i + PP_i}$

各类 jaccard index 的宏平均: $\operatorname{macro\_jaccard} = \frac{1}{K} \sum_{i=1}^K \operatorname{jaccard}_i = \frac{1}{K} \sum_{i=1}^K \frac{TP_i}{AP_i + PP_i - TP_i}$

1.7. 各类评估指标的加权宏平均

第 $i$ 类的样本频率: $\operatorname{frequency}_i = F_i = \frac{AP_i}{N}$

各类 recall 的加权宏平均: $\operatorname{weighted\_macro\_recall} = \sum_{i=1}^K F_i\operatorname{recall}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{TP_i}{AP_i}$

各类 precision 的加权宏平均: $\operatorname{weighted\_macro\_precision} = \sum_{i=1}^K F_i\operatorname{precision}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{ TP_i}{ PP_i}$

各类 f1 score 的加权宏平均: $\operatorname{weighted\_macro\_f1\_score} = \sum_{i=1}^K F_i\operatorname{f1\_score}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{2 TP_i}{AP_i + PP_i}$

各类 fbeta score 的加权宏平均: $\operatorname{weighted\_macro\_f\beta\_score} = \sum_{i=1}^K F_i\operatorname{f\beta\_score}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{ (\beta^2 + 1) TP_i}{\beta^2AP_i + PP_i}$

各类 jaccard index 的加权宏平均: $\operatorname{weighted\_macro\_jaccard} = \sum_{i=1}^K F_i\operatorname{jaccard}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{TP_i}{AP_i + PP_i - TP_i}$

结论: $\operatorname{weighted\_macro\_recall} = \operatorname{accuracy}$

证明: $\operatorname{weighted\_macro\_recall} = \sum_{i=1}^K \frac{AP_i}{N}\frac{TP_i}{AP_i} = \sum_{i=1}^K \frac{TP_i}{N} = \operatorname{accuracy}$

2. 混淆矩阵的代码实现

笔者实现了五个版本的混淆矩阵计算函数, 验证了与 sklearn 的一致性, 并比较了计算效率.

v1, v2 是直接实现, 显式使用了 for 循环, 效率较低.
v3 利用了混淆矩阵是二维直方图的性质, 效率居中.
v4 参考自 sklearn, 效率居中.
v5 参考自 torchvision, 效率较高, 推荐使用.

import khandy
import numpy as np
import sklearn
import sklearn.metrics
from scipy.sparse import coo_matrix


def get_confusion_matrix_v1(y_true, y_pred, num_classes):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_true = y_true.flatten()
    y_pred = y_pred.flatten()
    confusion_matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    for i, j in zip(y_true, y_pred):
        confusion_matrix[int(i), int(j)] += 1
    return confusion_matrix


def get_confusion_matrix_v2(y_true, y_pred, num_classes):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_true = y_true.flatten()
    y_pred = y_pred.flatten()
    confusion_matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    for i in range(num_classes):
        for j in range(num_classes):
             confusion_matrix[i, j] = np.count_nonzero( (y_true == i) & (y_pred == j))
    return confusion_matrix


def get_confusion_matrix_v3(y_true, y_pred, num_classes):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_true = y_true.flatten()
    y_pred = y_pred.flatten()
    return np.histogram2d(y_true, y_pred, num_classes)[0]


def get_confusion_matrix_v4(y_true, y_pred, num_classes):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_true = y_true.flatten()
    y_pred = y_pred.flatten()
    sample_weight = np.ones(y_true.shape[0], dtype=np.int64)
    confusion_matrix = coo_matrix((sample_weight, (y_true, y_pred)),
                                   shape=(num_classes, num_classes), dtype=np.int64,
                                   ).toarray()
    return confusion_matrix


def get_confusion_matrix_v5(y_true, y_pred, num_classes):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_true = y_true.flatten()
    y_pred = y_pred.flatten()
    confusion_matrix = np.bincount(num_classes * y_true + y_pred, minlength=num_classes**2)
    return confusion_matrix.reshape(num_classes, num_classes)


if __name__ == '__main__':
    num_classes, num_samples = np.random.randint(1, 100), 1000000
    y_true = np.random.randint(num_classes, size=num_samples)
    y_pred = np.random.randint(num_classes, size=num_samples)

    with khandy.ContextTimer(name='sklearn.metrics.confusion_matrix', use_log=False) as ct:
        confusion_matrix = sklearn.metrics.confusion_matrix(y_true, y_pred, labels=np.arange(num_classes))
    with khandy.ContextTimer(name='get_confusion_matrix_v1', use_log=False):
        confusion_matrix_v1 = get_confusion_matrix_v1(y_true, y_pred, num_classes)
    with khandy.ContextTimer(name='get_confusion_matrix_v2', use_log=False):
        confusion_matrix_v2 = get_confusion_matrix_v2(y_true, y_pred, num_classes)
    with khandy.ContextTimer(name='get_confusion_matrix_v3', use_log=False):
        confusion_matrix_v3 = get_confusion_matrix_v3(y_true, y_pred, num_classes)
    with khandy.ContextTimer(name='get_confusion_matrix_v4', use_log=False):
        confusion_matrix_v4 = get_confusion_matrix_v4(y_true, y_pred, num_classes)
    with khandy.ContextTimer(name='get_confusion_matrix_v5', use_log=False):
        confusion_matrix_v5 = get_confusion_matrix_v5(y_true, y_pred, num_classes)
    print(np.allclose(confusion_matrix, confusion_matrix_v1))
    print(np.allclose(confusion_matrix, confusion_matrix_v2))
    print(np.allclose(confusion_matrix, confusion_matrix_v3))
    print(np.allclose(confusion_matrix, confusion_matrix_v4))
    print(np.allclose(confusion_matrix, confusion_matrix_v5))

3. 混淆矩阵衍生指标的实现

在下面代码中, 直接通过混淆矩阵实现了各种衍生指标计算, 并验证了与 sklearn 的一致性 (sklearn 中, 当 precision, recall, f1 score 的分母为 0 时, 会报出警告, 提示其分母为 0 时, 其值设置为 0. 为了保证与 sklearn 的一致性, 笔者也参考了该做法. 在其他地方的实现中可能直接会忽略这样的值).

import numpy as np
import sklearn.metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import jaccard_score


def check(results1, results2, label, print_val=True):
    if print_val:
        print('{:<35}: {}, {:.5f}, {:.5f}'.format(label, np.allclose(results1, results2), results1, results2))
    else:
        print('{:<35}: {}'.format(label, np.allclose(results1, results2)))


if __name__ == '__main__':
    num_classes = np.random.randint(1, 100)
    num_samples = np.random.randint(50, 10000)
    labels = np.arange(num_classes)
    y_true = np.random.randint(num_classes, size=num_samples)
    y_pred = np.random.randint(num_classes, size=num_samples)
    confusion_matrix = sklearn.metrics.confusion_matrix(y_true, y_pred, labels=labels)

    # 样本总数
    num_samples = np.sum(confusion_matrix)
    # 各类的 true positives 数目
    num_true_positives = num_tp = np.diag(confusion_matrix)
    # 各类的 actual positives 数目
    num_actual_positives = num_ap = np.sum(confusion_matrix, axis=1)
    # 各类的 predicted positives 数目
    num_predicted_positives = num_pp = np.sum(confusion_matrix, axis=0)

    beta = np.random.uniform(0.1, 10)
    eps = np.finfo(np.float32).eps
    accuracy = np.sum(num_tp) / (num_samples + eps)

    # 每类的指标 (注意到它们的分子均有 num_tp)
    recalls = num_tp / (num_ap + eps)
    precisions = num_tp / (num_pp + eps)
    f1_scores = 2 * num_tp / (num_pp + num_ap + eps)
    fbeta_scores = (beta**2 + 1) * num_tp / (beta**2 * num_ap + num_pp + eps)
    jaccards = num_tp / (num_pp + num_ap - num_tp + eps)

    # 每类指标的微平均
    micro_recall = np.sum(num_tp) / (np.sum(num_ap) + eps)
    micro_precision = np.sum(num_tp) / (np.sum(num_pp) + eps)
    micro_f1_score = 2 * np.sum(num_tp) / (np.sum(num_pp + num_ap) + eps)
    micro_fbeta_score = (beta**2 + 1) * np.sum(num_tp) / (np.sum(beta**2 * num_ap + num_pp) + eps)
    micro_jaccard = np.sum(num_tp) / (np.sum(num_pp + num_ap - num_tp) + eps)

    # 每类指标的宏平均
    macro_recall = np.mean(recalls)
    macro_precision = np.mean(precisions)
    macro_f1_score = np.mean(f1_scores)
    macro_fbeta_score = np.mean(fbeta_scores)
    macro_jaccard = np.mean(jaccards)

    # 每类指标的加权宏平均
    freq = num_ap / (num_samples + eps)
    weighted_macro_recall = np.sum(freq * recalls)
    weighted_macro_precision = np.sum(freq * precisions)
    weighted_macro_f1_score = np.sum(freq * f1_scores)
    weighted_macro_fbeta_score = np.sum(freq * fbeta_scores)
    weighted_macro_jaccard = np.sum(freq * jaccards)

    # # 每类指标的加权宏平均 (等效实现)
    # weighted_macro_recall = np.average(recalls, weights=num_ap)
    # weighted_macro_precision = np.average(precisions, weights=num_ap)
    # weighted_macro_f1_score = np.average(f1_scores, weights=num_ap)
    # weighted_macro_fbeta_score = np.average(fbeta_scores, weights=num_ap)
    # weighted_macro_jaccard = np.average(jaccards, weights=num_ap)

    check(accuracy_score(y_true, y_pred), accuracy, 'accuracy')
    print('=====================================')
    check(recall_score(y_true, y_pred, labels=labels, average=None), 
          recalls, 'recalls', print_val=False)
    check(recall_score(y_true, y_pred, labels=labels, average='macro'), 
          macro_recall, 'macro_recall')
    check(recall_score(y_true, y_pred, labels=labels, average='micro'), 
          micro_recall, 'micro_recall')
    check(recall_score(y_true, y_pred, labels=labels, average='weighted'), 
          weighted_macro_recall, 'weighted_macro_recall')
    print('=====================================')
    check(precision_score(y_true, y_pred, labels=labels, average=None), 
          precisions, 'precisions', print_val=False)
    check(precision_score(y_true, y_pred, labels=labels, average='macro'), 
          macro_precision, 'macro_precision')
    check(precision_score(y_true, y_pred, labels=labels, average='micro'), 
          micro_precision, 'micro_precision')
    check(precision_score(y_true, y_pred, labels=labels, average='weighted'),
          weighted_macro_precision, 'weighted_macro_precision')
    print('=====================================')
    check(f1_score(y_true, y_pred, labels=labels, average=None), 
          f1_scores, 'f1_scores', print_val=False)
    check(f1_score(y_true, y_pred, labels=labels, average='macro'), 
          macro_f1_score, 'macro_f1_score')
    check(f1_score(y_true, y_pred, labels=labels, average='micro'), 
          micro_f1_score, 'micro_f1_score')
    check(f1_score(y_true, y_pred, labels=labels, average='weighted'), 
          weighted_macro_f1_score, 'weighted_macro_f1_score')
    print('=====================================')
    check(fbeta_score(y_true, y_pred, beta=beta, labels=labels, average=None), 
          fbeta_scores, 'fbeta_scores', print_val=False)
    check(fbeta_score(y_true, y_pred, beta=beta, labels=labels, average='macro'), 
          macro_fbeta_score, 'macro_fbeta_score')
    check(fbeta_score(y_true, y_pred, beta=beta, labels=labels, average='micro'), 
          micro_fbeta_score, 'micro_fbeta_score')
    check(fbeta_score(y_true, y_pred, beta=beta, labels=labels, average='weighted'), 
          weighted_macro_fbeta_score, 'weighted_macro_fbeta_score')
    print('=====================================')
    check(jaccard_score(y_true, y_pred, labels=labels, average=None), 
          jaccards, 'jaccards', print_val=False)
    check(jaccard_score(y_true, y_pred, labels=labels, average='macro'), 
          macro_jaccard, 'macro_jaccard')
    check(jaccard_score(y_true, y_pred, labels=labels, average='micro'), 
          micro_jaccard, 'micro_jaccard')
    check(jaccard_score(y_true, y_pred, labels=labels, average='weighted'), 
          weighted_macro_jaccard, 'weighted_macro_jaccard')
    print('=====================================')
    check(accuracy, micro_recall, 'accuracy == micro_recall')
    check(accuracy, micro_precision, 'accuracy == micro_precision')
    check(accuracy, micro_f1_score, 'accuracy == micro_f1_score')
    check(accuracy, micro_fbeta_score, 'accuracy == micro_fbeta_score')
    check(accuracy, weighted_macro_recall, 'accuracy == weighted_macro_recall')