1. 混淆矩阵及其衍生指标
混淆矩阵 (confusion matrix) 可以用来总结一个分类器的预测结果, 对于 $K$ 元分类, 其为一个 $K\times K$ 的矩阵, 不妨记为 $CM$. 其元素 $CM_{ij}$ 表示类别为 $i$ 的样本被预测为类别 $j$ 的数目.
由混淆矩阵, 我们可以得到很多关于测试集和分类器的基本信息或评估指标.
1.1. 基本信息
样本总数: $\operatorname{num\_samples} = N = \sum_{i=1}^{K} \sum_{j=1}^{K} CM_{ij}$
第 $i$ 类的 true positives 数目: $\operatorname{num\_true\_positives}_i = TP_i = CM_{ii}$
第 $i$ 类的 actual positives 数目: $\operatorname{num\_actual\_positives}_i = AP_i = \sum_{j=1}^{K} CM_{ij}$
第 $i$ 类的 predicted positives 数目: $\operatorname{num\_predicted\_positives}_i = PP_i = \sum_{j=1}^{K} CM_{ji}$
结论: $N = \sum_{i=1}^{K} AP_i = \sum_{i=1}^{K} PP_i$
证明:
$\sum_{i=1}^{K} AP_i = \sum_{i=1}^{K} \sum_{j=1}^{K} CM_{ij} = N$,
$\sum_{i=1}^{K} PP_i = \sum_{i=1}^{K} \sum_{j=1}^{K} CM_{ji} = N$
1.2 归一化混淆矩阵
按所在行的实际样本数归一化的混淆矩阵元素: $\frac{CM_{ij}}{AP_i}$
按所在列的预测样本数归一化的混淆矩阵元素: $\frac{CM_{ij}}{PP_j}$
按样本总数归一化的混淆矩阵元素: $\frac{CM_{ij}}{N}$
1.3. 准确率
准确率的定义比较特殊, 所以单独拎出来:
$\operatorname{accuracy} = \frac{\sum_{i=1}^K TP_i}{N}$
1.4. 各类的评估指标
第 $i$ 类的 recall: $\operatorname{recall}_i = \frac{TP_i}{AP_i}$
第 $i$ 类的 precision: $\operatorname{precision}_i = \frac{TP_i}{PP_i}$
第 $i$ 类的 f1 score: $\operatorname{f1\_score}_i = \frac{2 TP_i}{AP_i + PP_i}$
第 $i$ 类的 fbeta score: $\operatorname{f\beta\_score}_i = \frac{ (\beta^2 + 1) TP_i}{\beta^2AP_i + PP_i}$
第 $i$ 类的 jaccard index: $\operatorname{jaccard}_i = \frac{TP_i}{AP_i + PP_i - TP_i}$
注意到: 不考虑常数的话, 它们的分子均为 $TP_i$.
1.5. 各类评估指标的微平均
各类 recall 的微平均: $\operatorname{micro\_recall} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K AP_i}$
各类 precision 的微平均: $\operatorname{micro\_precision} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K PP_i}$
各类 f1 score 的微平均: $\operatorname{micro\_f1\_score} = \frac{\sum_{i=1}^K 2 TP_i}{\sum_{i=1}^K{AP_i + PP_i}}$
各类 fbeta score 的微平均: $\operatorname{micro\_f\beta\_score} = \frac{ \sum_{i=1}^K (\beta^2 + 1) TP_i}{\sum_{i=1}^K {\beta^2AP_i + PP_i}}$
各类 jaccard index 的微平均: $\operatorname{micro\_jaccard} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K {AP_i + PP_i - TP_i}}$
结论: $\operatorname{micro\_recall} = \operatorname{micro\_precision} = \operatorname{micro\_f1\_score} = \operatorname{micro\_f\beta\_score} = \operatorname{accuracy}$
证明: 因为 $N = \sum_{i=1}^{K} AP_i = \sum_{i=1}^{K} PP_i$, 所以:
$\operatorname{micro\_recall} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K AP_i} = \frac{\sum_{i=1}^K TP_i}{N}$
$\operatorname{micro\_precision} = \frac{\sum_{i=1}^K TP_i}{\sum_{i=1}^K PP_i} = \frac{\sum_{i=1}^K TP_i}{N}$
$\operatorname{micro\_f1\_score} = \frac{\sum_{i=1}^K 2 TP_i}{\sum_{i=1}^K{AP_i + PP_i}}= \frac{\sum_{i=1}^K 2TP_i}{2N} = \frac{\sum_{i=1}^K TP_i}{N}$
$\operatorname{micro\_f\beta\_score} = \frac{ \sum_{i=1}^K (\beta^2 + 1) TP_i}{\sum_{i=1}^K {\beta^2AP_i + PP_i}} = \frac{ \sum_{i=1}^K (\beta^2 + 1) TP_i}{(\beta^2 + 1) N} = \frac{\sum_{i=1}^K TP_i}{N}$
1.6. 各类评估指标的宏平均
各类 recall 的宏平均: $\operatorname{macro\_recall} = \frac{1}{K} \sum_{i=1}^K \operatorname{recall}_i = \frac{1}{K} \sum_{i=1}^K \frac{TP_i}{AP_i}$
各类 precision 的宏平均: $\operatorname{macro\_precision} = \frac{1}{K} \sum_{i=1}^K \operatorname{precision}_i = \frac{1}{K} \sum_{i=1}^K \frac{ TP_i}{ PP_i}$
各类 f1 score 的宏平均: $\operatorname{macro\_f1\_score} = \frac{1}{K} \sum_{i=1}^K \operatorname{f1\_score}_i = \frac{1}{K} \sum_{i=1}^K \frac{2 TP_i}{AP_i + PP_i}$
各类 fbeta score 的宏平均: $\operatorname{macro\_f\beta\_score} = \frac{1}{K} \sum_{i=1}^K \operatorname{f\beta\_score}_i = \frac{1}{K} \sum_{i=1}^K \frac{ (\beta^2 + 1) TP_i}{\beta^2AP_i + PP_i}$
各类 jaccard index 的宏平均: $\operatorname{macro\_jaccard} = \frac{1}{K} \sum_{i=1}^K \operatorname{jaccard}_i = \frac{1}{K} \sum_{i=1}^K \frac{TP_i}{AP_i + PP_i - TP_i}$
1.7. 各类评估指标的加权宏平均
第 $i$ 类的样本频率: $\operatorname{frequency}_i = F_i = \frac{AP_i}{N}$
各类 recall 的加权宏平均: $\operatorname{weighted\_macro\_recall} = \sum_{i=1}^K F_i\operatorname{recall}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{TP_i}{AP_i}$
各类 precision 的加权宏平均: $\operatorname{weighted\_macro\_precision} = \sum_{i=1}^K F_i\operatorname{precision}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{ TP_i}{ PP_i}$
各类 f1 score 的加权宏平均: $\operatorname{weighted\_macro\_f1\_score} = \sum_{i=1}^K F_i\operatorname{f1\_score}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{2 TP_i}{AP_i + PP_i}$
各类 fbeta score 的加权宏平均: $\operatorname{weighted\_macro\_f\beta\_score} = \sum_{i=1}^K F_i\operatorname{f\beta\_score}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{ (\beta^2 + 1) TP_i}{\beta^2AP_i + PP_i}$
各类 jaccard index 的加权宏平均: $\operatorname{weighted\_macro\_jaccard} = \sum_{i=1}^K F_i\operatorname{jaccard}_i = \sum_{i=1}^K \frac{AP_i}{N}\frac{TP_i}{AP_i + PP_i - TP_i}$
结论: $\operatorname{weighted\_macro\_recall} = \operatorname{accuracy}$
证明: $\operatorname{weighted\_macro\_recall} = \sum_{i=1}^K \frac{AP_i}{N}\frac{TP_i}{AP_i} = \sum_{i=1}^K \frac{TP_i}{N} = \operatorname{accuracy}$
2. 混淆矩阵的代码实现
笔者实现了五个版本的混淆矩阵计算函数, 验证了与 sklearn 的一致性, 并比较了计算效率.
- v1, v2 是直接实现, 显式使用了 for 循环, 效率较低.
- v3 利用了混淆矩阵是二维直方图的性质, 效率居中.
- v4 参考自 sklearn, 效率居中.
- v5 参考自 torchvision, 效率较高, 推荐使用.
import khandy
import numpy as np
import sklearn
import sklearn.metrics
from scipy.sparse import coo_matrix
def get_confusion_matrix_v1(y_true, y_pred, num_classes):
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
y_true = y_true.flatten()
y_pred = y_pred.flatten()
confusion_matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
for i, j in zip(y_true, y_pred):
confusion_matrix[int(i), int(j)] += 1
return confusion_matrix
def get_confusion_matrix_v2(y_true, y_pred, num_classes):
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
y_true = y_true.flatten()
y_pred = y_pred.flatten()
confusion_matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
for i in range(num_classes):
for j in range(num_classes):
confusion_matrix[i, j] = np.count_nonzero( (y_true == i) & (y_pred == j))
return confusion_matrix
def get_confusion_matrix_v3(y_true, y_pred, num_classes):
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
y_true = y_true.flatten()
y_pred = y_pred.flatten()
return np.histogram2d(y_true, y_pred, num_classes)[0]
def get_confusion_matrix_v4(y_true, y_pred, num_classes):
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
y_true = y_true.flatten()
y_pred = y_pred.flatten()
sample_weight = np.ones(y_true.shape[0], dtype=np.int64)
confusion_matrix = coo_matrix((sample_weight, (y_true, y_pred)),
shape=(num_classes, num_classes), dtype=np.int64,
).toarray()
return confusion_matrix
def get_confusion_matrix_v5(y_true, y_pred, num_classes):
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
y_true = y_true.flatten()
y_pred = y_pred.flatten()
confusion_matrix = np.bincount(num_classes * y_true + y_pred, minlength=num_classes**2)
return confusion_matrix.reshape(num_classes, num_classes)
if __name__ == '__main__':
num_classes, num_samples = np.random.randint(1, 100), 1000000
y_true = np.random.randint(num_classes, size=num_samples)
y_pred = np.random.randint(num_classes, size=num_samples)
with khandy.ContextTimer(name='sklearn.metrics.confusion_matrix', use_log=False) as ct:
confusion_matrix = sklearn.metrics.confusion_matrix(y_true, y_pred, labels=np.arange(num_classes))
with khandy.ContextTimer(name='get_confusion_matrix_v1', use_log=False):
confusion_matrix_v1 = get_confusion_matrix_v1(y_true, y_pred, num_classes)
with khandy.ContextTimer(name='get_confusion_matrix_v2', use_log=False):
confusion_matrix_v2 = get_confusion_matrix_v2(y_true, y_pred, num_classes)
with khandy.ContextTimer(name='get_confusion_matrix_v3', use_log=False):
confusion_matrix_v3 = get_confusion_matrix_v3(y_true, y_pred, num_classes)
with khandy.ContextTimer(name='get_confusion_matrix_v4', use_log=False):
confusion_matrix_v4 = get_confusion_matrix_v4(y_true, y_pred, num_classes)
with khandy.ContextTimer(name='get_confusion_matrix_v5', use_log=False):
confusion_matrix_v5 = get_confusion_matrix_v5(y_true, y_pred, num_classes)
print(np.allclose(confusion_matrix, confusion_matrix_v1))
print(np.allclose(confusion_matrix, confusion_matrix_v2))
print(np.allclose(confusion_matrix, confusion_matrix_v3))
print(np.allclose(confusion_matrix, confusion_matrix_v4))
print(np.allclose(confusion_matrix, confusion_matrix_v5))
3. 混淆矩阵衍生指标的实现
在下面代码中, 直接通过混淆矩阵实现了各种衍生指标计算, 并验证了与 sklearn 的一致性 (sklearn 中, 当 precision, recall, f1 score 的分母为 0 时, 会报出警告, 提示其分母为 0 时, 其值设置为 0. 为了保证与 sklearn 的一致性, 笔者也参考了该做法. 在其他地方的实现中可能直接会忽略这样的值).
import numpy as np
import sklearn.metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import jaccard_score
def check(results1, results2, label, print_val=True):
if print_val:
print('{:<35}: {}, {:.5f}, {:.5f}'.format(label, np.allclose(results1, results2), results1, results2))
else:
print('{:<35}: {}'.format(label, np.allclose(results1, results2)))
if __name__ == '__main__':
num_classes = np.random.randint(1, 100)
num_samples = np.random.randint(50, 10000)
labels = np.arange(num_classes)
y_true = np.random.randint(num_classes, size=num_samples)
y_pred = np.random.randint(num_classes, size=num_samples)
confusion_matrix = sklearn.metrics.confusion_matrix(y_true, y_pred, labels=labels)
# 样本总数
num_samples = np.sum(confusion_matrix)
# 各类的 true positives 数目
num_true_positives = num_tp = np.diag(confusion_matrix)
# 各类的 actual positives 数目
num_actual_positives = num_ap = np.sum(confusion_matrix, axis=1)
# 各类的 predicted positives 数目
num_predicted_positives = num_pp = np.sum(confusion_matrix, axis=0)
beta = np.random.uniform(0.1, 10)
eps = np.finfo(np.float32).eps
accuracy = np.sum(num_tp) / (num_samples + eps)
# 每类的指标 (注意到它们的分子均有 num_tp)
recalls = num_tp / (num_ap + eps)
precisions = num_tp / (num_pp + eps)
f1_scores = 2 * num_tp / (num_pp + num_ap + eps)
fbeta_scores = (beta**2 + 1) * num_tp / (beta**2 * num_ap + num_pp + eps)
jaccards = num_tp / (num_pp + num_ap - num_tp + eps)
# 每类指标的微平均
micro_recall = np.sum(num_tp) / (np.sum(num_ap) + eps)
micro_precision = np.sum(num_tp) / (np.sum(num_pp) + eps)
micro_f1_score = 2 * np.sum(num_tp) / (np.sum(num_pp + num_ap) + eps)
micro_fbeta_score = (beta**2 + 1) * np.sum(num_tp) / (np.sum(beta**2 * num_ap + num_pp) + eps)
micro_jaccard = np.sum(num_tp) / (np.sum(num_pp + num_ap - num_tp) + eps)
# 每类指标的宏平均
macro_recall = np.mean(recalls)
macro_precision = np.mean(precisions)
macro_f1_score = np.mean(f1_scores)
macro_fbeta_score = np.mean(fbeta_scores)
macro_jaccard = np.mean(jaccards)
# 每类指标的加权宏平均
freq = num_ap / (num_samples + eps)
weighted_macro_recall = np.sum(freq * recalls)
weighted_macro_precision = np.sum(freq * precisions)
weighted_macro_f1_score = np.sum(freq * f1_scores)
weighted_macro_fbeta_score = np.sum(freq * fbeta_scores)
weighted_macro_jaccard = np.sum(freq * jaccards)
# # 每类指标的加权宏平均 (等效实现)
# weighted_macro_recall = np.average(recalls, weights=num_ap)
# weighted_macro_precision = np.average(precisions, weights=num_ap)
# weighted_macro_f1_score = np.average(f1_scores, weights=num_ap)
# weighted_macro_fbeta_score = np.average(fbeta_scores, weights=num_ap)
# weighted_macro_jaccard = np.average(jaccards, weights=num_ap)
check(accuracy_score(y_true, y_pred), accuracy, 'accuracy')
print('=====================================')
check(recall_score(y_true, y_pred, labels=labels, average=None),
recalls, 'recalls', print_val=False)
check(recall_score(y_true, y_pred, labels=labels, average='macro'),
macro_recall, 'macro_recall')
check(recall_score(y_true, y_pred, labels=labels, average='micro'),
micro_recall, 'micro_recall')
check(recall_score(y_true, y_pred, labels=labels, average='weighted'),
weighted_macro_recall, 'weighted_macro_recall')
print('=====================================')
check(precision_score(y_true, y_pred, labels=labels, average=None),
precisions, 'precisions', print_val=False)
check(precision_score(y_true, y_pred, labels=labels, average='macro'),
macro_precision, 'macro_precision')
check(precision_score(y_true, y_pred, labels=labels, average='micro'),
micro_precision, 'micro_precision')
check(precision_score(y_true, y_pred, labels=labels, average='weighted'),
weighted_macro_precision, 'weighted_macro_precision')
print('=====================================')
check(f1_score(y_true, y_pred, labels=labels, average=None),
f1_scores, 'f1_scores', print_val=False)
check(f1_score(y_true, y_pred, labels=labels, average='macro'),
macro_f1_score, 'macro_f1_score')
check(f1_score(y_true, y_pred, labels=labels, average='micro'),
micro_f1_score, 'micro_f1_score')
check(f1_score(y_true, y_pred, labels=labels, average='weighted'),
weighted_macro_f1_score, 'weighted_macro_f1_score')
print('=====================================')
check(fbeta_score(y_true, y_pred, beta=beta, labels=labels, average=None),
fbeta_scores, 'fbeta_scores', print_val=False)
check(fbeta_score(y_true, y_pred, beta=beta, labels=labels, average='macro'),
macro_fbeta_score, 'macro_fbeta_score')
check(fbeta_score(y_true, y_pred, beta=beta, labels=labels, average='micro'),
micro_fbeta_score, 'micro_fbeta_score')
check(fbeta_score(y_true, y_pred, beta=beta, labels=labels, average='weighted'),
weighted_macro_fbeta_score, 'weighted_macro_fbeta_score')
print('=====================================')
check(jaccard_score(y_true, y_pred, labels=labels, average=None),
jaccards, 'jaccards', print_val=False)
check(jaccard_score(y_true, y_pred, labels=labels, average='macro'),
macro_jaccard, 'macro_jaccard')
check(jaccard_score(y_true, y_pred, labels=labels, average='micro'),
micro_jaccard, 'micro_jaccard')
check(jaccard_score(y_true, y_pred, labels=labels, average='weighted'),
weighted_macro_jaccard, 'weighted_macro_jaccard')
print('=====================================')
check(accuracy, micro_recall, 'accuracy == micro_recall')
check(accuracy, micro_precision, 'accuracy == micro_precision')
check(accuracy, micro_f1_score, 'accuracy == micro_f1_score')
check(accuracy, micro_fbeta_score, 'accuracy == micro_fbeta_score')
check(accuracy, weighted_macro_recall, 'accuracy == weighted_macro_recall')
4. 参考
- sklearn.metrics.confusion_matrix
- sklearn.metrics.accuracy_score
- sklearn.metrics.recall_score
- sklearn.metrics.precision_score
- sklearn.metrics.f1_score
- sklearn.metrics.fbeta_score
- sklearn.metrics.jaccard_score
5. 更新记录
- 20211020, 发布
- 20211021, 添加 "归一化混淆矩阵"
6. 版权声明
自由转载-非商用-非衍生-保持署名 (创意共享3.0许可证)