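1. The confusion matrix and its derived metrics
A confusion matrix summarizes the predictions of a classifier. For $K$-class classification it is a $K\times K$ matrix, denoted $CM$ here, whose element $CM_{ij}$ is the number of samples of class $i$ that were predicted as class $j$.
From the confusion matrix we can read off basic facts about the test set and the classifier, as well as many evaluation metrics.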
1.1. Basic quantities
Total number of samples: $\operatorname{num\_samples} = N = \sum_{i=1}^{K} \sum_{j=1}^{K} CM_{ij}$
Number of true positives for class $i$: $\operatorname{num\_true\_positives}_i = TP_i = CM_{ii}$
Number of actual positives for class $i$ (row sum): $\operatorname{num\_actual\_positives}_i = AP_i = \sum_{j=1}^{K} CM_{ij}$
Number of predicted positives for class $i$ (column sum): $\operatorname{num\_predicted\_positives}_i = PP_i = \sum_{j=1}^{K} CM_{ji}$
Claim: $N = \sum_{i=1}^{K} AP_i = \sum_{i=1}^{K} PP_i$
Proof:
$\sum_{i=1}^{K} AP_i = \sum_{i=1}^{K} \sum_{j=1}^{K} CM_{ij} = N$, 
$\sum_{i=1}^{K} PP_i = \sum_{i=1}^{K} \sum_{j=1}^{K} CM_{ji} = N$
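The quantities above map directly onto NumPy reductions. The following is a minimal sketch on a made-up 3-class matrix (the values in `cm` are illustrative, not taken from any dataset):

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are true classes,
# columns are predicted classes.
cm = np.array([[5, 1, 0],
               [2, 3, 1],
               [0, 2, 6]])

n = cm.sum()           # N: total number of samples
tp = np.diag(cm)       # TP_i: diagonal entries
ap = cm.sum(axis=1)    # AP_i: row sums
pp = cm.sum(axis=0)    # PP_i: column sums

print(n, tp, ap, pp)
print(ap.sum() == n, pp.sum() == n)
```

Both row sums and column sums add up to the same total, which is the claim proved above.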
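1.2. Normalized confusion matrices
Element normalized by the number of actual samples in its row: $\frac{CM_{ij}}{AP_i}$
Element normalized by the number of predicted samples in its column: $\frac{CM_{ij}}{PP_j}$
Element normalized by the total number of samples: $\frac{CM_{ij}}{N}$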
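The three normalizations can be sketched with broadcasting. This example assumes every row and every column of the (made-up) matrix is non-empty, so no division by zero occurs:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true, columns: predicted).
cm = np.array([[5, 1, 0],
               [2, 3, 1],
               [0, 2, 6]])

# keepdims=True keeps the sums as a column / row vector so that
# broadcasting divides each row / column by its own total.
row_norm = cm / cm.sum(axis=1, keepdims=True)   # CM_ij / AP_i
col_norm = cm / cm.sum(axis=0, keepdims=True)   # CM_ij / PP_j
all_norm = cm / cm.sum()                        # CM_ij / N

print(row_norm.sum(axis=1))   # each row sums to 1
print(col_norm.sum(axis=0))   # each column sums to 1
print(all_norm.sum())         # the whole matrix sums to 1
```

For comparison, `sklearn.metrics.confusion_matrix` exposes the same three choices through its `normalize` parameter (`'true'`, `'pred'`, `'all'`).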
1.3. Accuracy
Accuracy is defined over the whole matrix rather than per class, so it is listed separately:
$\operatorname{accuracy} = \frac{\sum_{i=1}^K TP_i}{N}$
1.4. Per-class metrics
Recall of class $i$: $\operatorname{recall}_i = \frac{TP_i}{AP_i}$
Precision of class $i$: $\operatorname{precision}_i = \frac{TP_i}{PP_i}$
F1 score of class $i$: $\operatorname{f1\_score}_i = \frac{2 TP_i}{AP_i + PP_i}$
F-beta score of class $i$: $\operatorname{f\beta\_score}_i = \frac{ (\beta^2 + 1) TP_i}{\beta^2AP_i + PP_i}$
Jaccard index of class $i$: $\operatorname{jaccard}_i = \frac{TP_i}{AP_i + PP_i - TP_i}$
Note: up to a constant factor, the numerator of every one of these metrics is $TP_i$.
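As a sanity check, the count-based formulas above can be compared with the more familiar precision/recall forms. This is a sketch on a made-up matrix with no empty rows or columns; `beta = 2.0` is an arbitrary illustrative choice:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true, columns: predicted).
cm = np.array([[5, 1, 0],
               [2, 3, 1],
               [0, 2, 6]])
tp = np.diag(cm)
ap = cm.sum(axis=1)
pp = cm.sum(axis=0)
beta = 2.0  # arbitrary value, for illustration only

recall = tp / ap
precision = tp / pp
f1 = 2 * tp / (ap + pp)
fbeta = (beta**2 + 1) * tp / (beta**2 * ap + pp)
jaccard = tp / (ap + pp - tp)

# F1 is the harmonic mean of precision and recall.
print(np.allclose(f1, 2 * precision * recall / (precision + recall)))
# F-beta in its precision/recall form.
print(np.allclose(fbeta, (1 + beta**2) * precision * recall
                  / (beta**2 * precision + recall)))
# The Jaccard index is a function of F1 alone: J = F1 / (2 - F1).
print(np.allclose(jaccard, f1 / (2 - f1)))
```

All three comparisons hold because substituting $\operatorname{precision}_i = TP_i / PP_i$ and $\operatorname{recall}_i = TP_i / AP_i$ cancels one factor of $TP_i$.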
1.5. Micro-averages of the per-class metrics
Micro-averaged recall: $\operatorname{micro\_recall} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K AP_i}$
Micro-averaged precision: $\operatorname{micro\_precision} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K PP_i}$
Micro-averaged F1 score: $\operatorname{micro\_f1\_score} = \frac{\sum_{i=1}^K 2 TP_i}{\sum_{i=1}^K (AP_i + PP_i)}$
Micro-averaged F-beta score: $\operatorname{micro\_f\beta\_score} = \frac{ \sum_{i=1}^K (\beta^2 + 1) TP_i}{\sum_{i=1}^K (\beta^2AP_i + PP_i)}$
Micro-averaged Jaccard index: $\operatorname{micro\_jaccard} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K (AP_i + PP_i - TP_i)}$
Claim: $\operatorname{micro\_recall} = \operatorname{micro\_precision} = \operatorname{micro\_f1\_score} = \operatorname{micro\_f\beta\_score} = \operatorname{accuracy}$
Proof: since $N = \sum_{i=1}^{K} AP_i = \sum_{i=1}^{K} PP_i$, we have:
$\operatorname{micro\_recall} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K AP_i} = \frac{\sum_{i=1}^K TP_i}{N}$
$\operatorname{micro\_precision} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K PP_i} = \frac{\sum_{i=1}^K TP_i}{N}$
$\operatorname{micro\_f1\_score} = \frac{\sum_{i=1}^K 2 TP_i}{\sum_{i=1}^K (AP_i + PP_i)}= \frac{\sum_{i=1}^K 2TP_i}{2N} = \frac{\sum_{i=1}^K TP_i}{N}$
$\operatorname{micro\_f\beta\_score} = \frac{ \sum_{i=1}^K (\beta^2 + 1) TP_i}{\sum_{i=1}^K (\beta^2AP_i + PP_i)} = \frac{ \sum_{i=1}^K (\beta^2 + 1) TP_i}{(\beta^2 + 1) N} = \frac{\sum_{i=1}^K TP_i}{N}$
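The micro-averaged Jaccard index is defined in this section but is not part of the claim. A short derivation, using only $N = \sum_{i=1}^{K} AP_i = \sum_{i=1}^{K} PP_i$, suggests why: it is a function of accuracy, but not equal to it.
$\operatorname{micro\_jaccard} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K (AP_i + PP_i - TP_i)} = \frac{\sum_{i=1}^K TP_i}{2N - \sum_{i=1}^K TP_i} = \frac{\operatorname{accuracy}}{2 - \operatorname{accuracy}}$
So $\operatorname{micro\_jaccard} = \operatorname{accuracy}$ only when accuracy is 0 or 1; for any value strictly in between, $\operatorname{micro\_jaccard} < \operatorname{accuracy}$.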
1.6. Macro-averages of the per-class metrics
Macro-averaged recall: $\operatorname{macro\_recall} = \frac{1}{K} \sum_{i=1}^K \operatorname{recall}_i = \frac{1}{K} \sum_{i=1}^K \frac{TP_i}{AP_i}$
Macro-averaged precision: $\operatorname{macro\_precision} = \frac{1}{K} \sum_{i=1}^K \operatorname{precision}_i = \frac{1}{K} \sum_{i=1}^K \frac{ TP_i}{ PP_i}$
Macro-averaged F1 score: $\operatorname{macro\_f1\_score} = \frac{1}{K} \sum_{i=1}^K \operatorname{f1\_score}_i = \frac{1}{K} \sum_{i=1}^K \frac{2 TP_i}{AP_i + PP_i}$
Macro-averaged F-beta score: $\operatorname{macro\_f\beta\_score} = \frac{1}{K} \sum_{i=1}^K \operatorname{f\beta\_score}_i = \frac{1}{K} \sum_{i=1}^K \frac{ (\beta^2 + 1) TP_i}{\beta^2AP_i + PP_i}$
Macro-averaged Jaccard index: $\operatorname{macro\_jaccard} = \frac{1}{K} \sum_{i=1}^K \operatorname{jaccard}_i = \frac{1}{K} \sum_{i=1}^K \frac{TP_i}{AP_i + PP_i - TP_i}$
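Because the macro-average gives every class the same weight $\frac{1}{K}$, it can differ sharply from accuracy on imbalanced data. A minimal sketch, using a made-up 2-class matrix for a classifier that always predicts class 0:

```python
import numpy as np

# Hypothetical matrix: 90 samples of class 0, 10 samples of class 1,
# and every sample is predicted as class 0.
cm = np.array([[90, 0],
               [10, 0]])
tp = np.diag(cm)
ap = cm.sum(axis=1)

accuracy = tp.sum() / cm.sum()
recalls = tp / ap              # per-class recall
macro_recall = recalls.mean()  # unweighted mean over classes

print(accuracy, recalls, macro_recall)
```

Here accuracy is 0.9 while macro recall is only 0.5, since the minority class, with recall 0, counts as much as the majority class.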
1.7. Weighted macro-averages of the per-class metrics
Sample frequency of class $i$: $\operatorname{frequency}_i = F_i = \frac{AP_i}{N}$
Weighted macro-averaged recall: $\operatorname{weighted\_macro\_recall} = \sum_{i=1}^K F_i\operatorname{recall}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{TP_i}{AP_i}$
Weighted macro-averaged precision: $\operatorname{weighted\_macro\_precision} = \sum_{i=1}^K F_i\operatorname{precision}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{ TP_i}{ PP_i}$
Weighted macro-averaged F1 score: $\operatorname{weighted\_macro\_f1\_score} = \sum_{i=1}^K F_i\operatorname{f1\_score}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{2 TP_i}{AP_i + PP_i}$
Weighted macro-averaged F-beta score: $\operatorname{weighted\_macro\_f\beta\_score} = \sum_{i=1}^K F_i\operatorname{f\beta\_score}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{ (\beta^2 + 1) TP_i}{\beta^2AP_i + PP_i}$
Weighted macro-averaged Jaccard index: $\operatorname{weighted\_macro\_jaccard} = \sum_{i=1}^K F_i\operatorname{jaccard}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{TP_i}{AP_i + PP_i - TP_i}$
Claim: $\operatorname{weighted\_macro\_recall} = \operatorname{accuracy}$
Proof: $\operatorname{weighted\_macro\_recall} = \sum_{i=1}^K \frac{AP_i}{N}\frac{TP_i}{AP_i} = \sum_{i=1}^K \frac{TP_i}{N} = \operatorname{accuracy}$
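The cancellation of $AP_i$ in the proof is specific to recall. The sketch below, on a made-up matrix, weights the per-class values by $AP_i$ with `np.average` and shows that weighted recall matches accuracy while weighted precision, in this example, does not:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true, columns: predicted).
cm = np.array([[5, 1, 0],
               [2, 3, 1],
               [0, 2, 6]])
tp = np.diag(cm)
ap = cm.sum(axis=1)
pp = cm.sum(axis=0)

accuracy = tp.sum() / cm.sum()
recalls = tp / ap
precisions = tp / pp

# np.average normalizes the weights, so weights=ap is the same as F_i = AP_i / N.
weighted_recall = np.average(recalls, weights=ap)
weighted_precision = np.average(precisions, weights=ap)

print(accuracy, weighted_recall, weighted_precision)
```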
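2. Implementing the confusion matrix
The author implemented five versions of a confusion-matrix function, verified that each agrees with sklearn, and compared their running times.
- v1 and v2 are direct implementations with explicit for loops; they are slow.
- v3 uses the fact that a confusion matrix is a 2D histogram; its speed is moderate.
- v4 is adapted from sklearn; its speed is moderate.
- v5 is adapted from torchvision; it is the fastest and is the recommended version.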
import khandy
import numpy as np
import sklearn
import sklearn.metrics
from scipy.sparse import coo_matrix
def get_confusion_matrix_v1(y_true, y_pred, num_classes):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_true = y_true.flatten()
    y_pred = y_pred.flatten()
    confusion_matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    for i, j in zip(y_true, y_pred):
        confusion_matrix[int(i), int(j)] += 1
    return confusion_matrix
def get_confusion_matrix_v2(y_true, y_pred, num_classes):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_true = y_true.flatten()
    y_pred = y_pred.flatten()
    confusion_matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    for i in range(num_classes):
        for j in range(num_classes):
            confusion_matrix[i, j] = np.count_nonzero((y_true == i) & (y_pred == j))
    return confusion_matrix
def get_confusion_matrix_v3(y_true, y_pred, num_classes):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_true = y_true.flatten()
    y_pred = y_pred.flatten()
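    # Fix the bin edges explicitly: by default they are inferred from the data
    # range, which misplaces counts when the lowest or highest class is absent.
    hist = np.histogram2d(y_true, y_pred, bins=num_classes,
                          range=[[0, num_classes], [0, num_classes]])[0]
    # histogram2d returns floats; cast to match the other versions.
    return hist.astype(np.int64)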
def get_confusion_matrix_v4(y_true, y_pred, num_classes):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_true = y_true.flatten()
    y_pred = y_pred.flatten()
    sample_weight = np.ones(y_true.shape[0], dtype=np.int64)
    confusion_matrix = coo_matrix((sample_weight, (y_true, y_pred)),
                                   shape=(num_classes, num_classes), dtype=np.int64,
                                   ).toarray()
    return confusion_matrix
def get_confusion_matrix_v5(y_true, y_pred, num_classes):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_true = y_true.flatten()
    y_pred = y_pred.flatten()
    confusion_matrix = np.bincount(num_classes * y_true + y_pred, minlength=num_classes**2)
    return confusion_matrix.reshape(num_classes, num_classes)
if __name__ == '__main__':
    num_classes, num_samples = np.random.randint(1, 100), 1000000
    y_true = np.random.randint(num_classes, size=num_samples)
    y_pred = np.random.randint(num_classes, size=num_samples)
    with khandy.ContextTimer(name='sklearn.metrics.confusion_matrix', use_log=False) as ct:
        confusion_matrix = sklearn.metrics.confusion_matrix(y_true, y_pred, labels=np.arange(num_classes))
    with khandy.ContextTimer(name='get_confusion_matrix_v1', use_log=False):
        confusion_matrix_v1 = get_confusion_matrix_v1(y_true, y_pred, num_classes)
    with khandy.ContextTimer(name='get_confusion_matrix_v2', use_log=False):
        confusion_matrix_v2 = get_confusion_matrix_v2(y_true, y_pred, num_classes)
    with khandy.ContextTimer(name='get_confusion_matrix_v3', use_log=False):
        confusion_matrix_v3 = get_confusion_matrix_v3(y_true, y_pred, num_classes)
    with khandy.ContextTimer(name='get_confusion_matrix_v4', use_log=False):
        confusion_matrix_v4 = get_confusion_matrix_v4(y_true, y_pred, num_classes)
    with khandy.ContextTimer(name='get_confusion_matrix_v5', use_log=False):
        confusion_matrix_v5 = get_confusion_matrix_v5(y_true, y_pred, num_classes)
    print(np.allclose(confusion_matrix, confusion_matrix_v1))
    print(np.allclose(confusion_matrix, confusion_matrix_v2))
    print(np.allclose(confusion_matrix, confusion_matrix_v3))
    print(np.allclose(confusion_matrix, confusion_matrix_v4))
    print(np.allclose(confusion_matrix, confusion_matrix_v5))
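3. Implementing the derived metrics
The code below computes the derived metrics directly from the confusion matrix and verifies that they agree with sklearn. When the denominator of precision, recall, or F1 score is 0, sklearn emits a warning and sets the value to 0. To stay consistent with sklearn, the author follows the same convention: a small `eps` is added to every denominator, so a 0/0 term evaluates to 0. Other implementations may simply ignore such values.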
import numpy as np
import sklearn.metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import jaccard_score
def check(results1, results2, label, print_val=True):
    if print_val:
        print('{:<35}: {}, {:.5f}, {:.5f}'.format(label, np.allclose(results1, results2), results1, results2))
    else:
        print('{:<35}: {}'.format(label, np.allclose(results1, results2)))
if __name__ == '__main__':
    num_classes = np.random.randint(1, 100)
    num_samples = np.random.randint(50, 10000)
    labels = np.arange(num_classes)
    y_true = np.random.randint(num_classes, size=num_samples)
    y_pred = np.random.randint(num_classes, size=num_samples)
    confusion_matrix = sklearn.metrics.confusion_matrix(y_true, y_pred, labels=labels)
    # Total number of samples
    num_samples = np.sum(confusion_matrix)
    # Number of true positives per class
    num_true_positives = num_tp = np.diag(confusion_matrix)
    # Number of actual positives per class
    num_actual_positives = num_ap = np.sum(confusion_matrix, axis=1)
    # Number of predicted positives per class
    num_predicted_positives = num_pp = np.sum(confusion_matrix, axis=0)
    beta = np.random.uniform(0.1, 10)
    eps = np.finfo(np.float32).eps
    accuracy = np.sum(num_tp) / (num_samples + eps)
    # Per-class metrics (note that every numerator contains num_tp)
    recalls = num_tp / (num_ap + eps)
    precisions = num_tp / (num_pp + eps)
    f1_scores = 2 * num_tp / (num_pp + num_ap + eps)
    fbeta_scores = (beta**2 + 1) * num_tp / (beta**2 * num_ap + num_pp + eps)
    jaccards = num_tp / (num_pp + num_ap - num_tp + eps)
    # Micro-averages of the per-class metrics
    micro_recall = np.sum(num_tp) / (np.sum(num_ap) + eps)
    micro_precision = np.sum(num_tp) / (np.sum(num_pp) + eps)
    micro_f1_score = 2 * np.sum(num_tp) / (np.sum(num_pp + num_ap) + eps)
    micro_fbeta_score = (beta**2 + 1) * np.sum(num_tp) / (np.sum(beta**2 * num_ap + num_pp) + eps)
    micro_jaccard = np.sum(num_tp) / (np.sum(num_pp + num_ap - num_tp) + eps)
    # Macro-averages of the per-class metrics
    macro_recall = np.mean(recalls)
    macro_precision = np.mean(precisions)
    macro_f1_score = np.mean(f1_scores)
    macro_fbeta_score = np.mean(fbeta_scores)
    macro_jaccard = np.mean(jaccards)
    # Weighted macro-averages of the per-class metrics
    freq = num_ap / (num_samples + eps)
    weighted_macro_recall = np.sum(freq * recalls)
    weighted_macro_precision = np.sum(freq * precisions)
    weighted_macro_f1_score = np.sum(freq * f1_scores)
    weighted_macro_fbeta_score = np.sum(freq * fbeta_scores)
    weighted_macro_jaccard = np.sum(freq * jaccards)
    # # Weighted macro-averages of the per-class metrics (equivalent implementation)
    # weighted_macro_recall = np.average(recalls, weights=num_ap)
    # weighted_macro_precision = np.average(precisions, weights=num_ap)
    # weighted_macro_f1_score = np.average(f1_scores, weights=num_ap)
    # weighted_macro_fbeta_score = np.average(fbeta_scores, weights=num_ap)
    # weighted_macro_jaccard = np.average(jaccards, weights=num_ap)
    check(accuracy_score(y_true, y_pred), accuracy, 'accuracy')
    print('=====================================')
    check(recall_score(y_true, y_pred, labels=labels, average=None), 
          recalls, 'recalls', print_val=False)
    check(recall_score(y_true, y_pred, labels=labels, average='macro'), 
          macro_recall, 'macro_recall')
    check(recall_score(y_true, y_pred, labels=labels, average='micro'), 
          micro_recall, 'micro_recall')
    check(recall_score(y_true, y_pred, labels=labels, average='weighted'), 
          weighted_macro_recall, 'weighted_macro_recall')
    print('=====================================')
    check(precision_score(y_true, y_pred, labels=labels, average=None), 
          precisions, 'precisions', print_val=False)
    check(precision_score(y_true, y_pred, labels=labels, average='macro'), 
          macro_precision, 'macro_precision')
    check(precision_score(y_true, y_pred, labels=labels, average='micro'), 
          micro_precision, 'micro_precision')
    check(precision_score(y_true, y_pred, labels=labels, average='weighted'),
          weighted_macro_precision, 'weighted_macro_precision')
    print('=====================================')
    check(f1_score(y_true, y_pred, labels=labels, average=None), 
          f1_scores, 'f1_scores', print_val=False)
    check(f1_score(y_true, y_pred, labels=labels, average='macro'), 
          macro_f1_score, 'macro_f1_score')
    check(f1_score(y_true, y_pred, labels=labels, average='micro'), 
          micro_f1_score, 'micro_f1_score')
    check(f1_score(y_true, y_pred, labels=labels, average='weighted'), 
          weighted_macro_f1_score, 'weighted_macro_f1_score')
    print('=====================================')
    check(fbeta_score(y_true, y_pred, beta=beta, labels=labels, average=None), 
          fbeta_scores, 'fbeta_scores', print_val=False)
    check(fbeta_score(y_true, y_pred, beta=beta, labels=labels, average='macro'), 
          macro_fbeta_score, 'macro_fbeta_score')
    check(fbeta_score(y_true, y_pred, beta=beta, labels=labels, average='micro'), 
          micro_fbeta_score, 'micro_fbeta_score')
    check(fbeta_score(y_true, y_pred, beta=beta, labels=labels, average='weighted'), 
          weighted_macro_fbeta_score, 'weighted_macro_fbeta_score')
    print('=====================================')
    check(jaccard_score(y_true, y_pred, labels=labels, average=None), 
          jaccards, 'jaccards', print_val=False)
    check(jaccard_score(y_true, y_pred, labels=labels, average='macro'), 
          macro_jaccard, 'macro_jaccard')
    check(jaccard_score(y_true, y_pred, labels=labels, average='micro'), 
          micro_jaccard, 'micro_jaccard')
    check(jaccard_score(y_true, y_pred, labels=labels, average='weighted'), 
          weighted_macro_jaccard, 'weighted_macro_jaccard')
    print('=====================================')
    check(accuracy, micro_recall, 'accuracy == micro_recall')
    check(accuracy, micro_precision, 'accuracy == micro_precision')
    check(accuracy, micro_f1_score, 'accuracy == micro_f1_score')
    check(accuracy, micro_fbeta_score, 'accuracy == micro_fbeta_score')
    check(accuracy, weighted_macro_recall, 'accuracy == weighted_macro_recall')
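The zero-denominator convention described at the start of this section can also be written without `eps`, by dividing only where the denominator is non-zero. This is a sketch of one alternative, not the approach used in the code above; `safe_divide` is a hypothetical helper name.

```python
import numpy as np

def safe_divide(numerator, denominator):
    """Element-wise division that yields 0 wherever the denominator is 0."""
    numerator = np.asarray(numerator, dtype=np.float64)
    denominator = np.asarray(denominator, dtype=np.float64)
    out = np.zeros(np.broadcast(numerator, denominator).shape, dtype=np.float64)
    # Positions where the condition is False keep their initial value of 0.
    np.divide(numerator, denominator, out=out, where=denominator != 0)
    return out

# Hypothetical matrix in which class 1 is never predicted, so PP_1 = 0.
cm = np.array([[90, 0],
               [10, 0]])
tp = np.diag(cm)
pp = cm.sum(axis=0)

precisions = safe_divide(tp, pp)
print(precisions)
```

Unlike the `eps` approach, this leaves non-degenerate values exactly unchanged. In sklearn itself, the value used in this situation can be chosen with the `zero_division` parameter of functions such as `precision_score`.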
4. References
- sklearn.metrics.confusion_matrix
- sklearn.metrics.accuracy_score
- sklearn.metrics.recall_score
- sklearn.metrics.precision_score
- sklearn.metrics.f1_score
- sklearn.metrics.fbeta_score
- sklearn.metrics.jaccard_score
5. Changelog
- 20211020: published
- 20211021: added "Normalized confusion matrices"
6. Copyright notice
Free to redistribute, non-commercial, no derivatives, attribution required (Creative Commons 3.0 license)