Classification


This chapter takes a quick look at a classification example, starting from the classic handwritten digit recognition task.

Loading the dataset

from sklearn.datasets import fetch_mldata
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

mnist = fetch_mldata("MNIST original")
mnist

{'DESCR': 'mldata.org dataset: mnist-original',
'COL_NAMES': ['label', 'data'],
'target': array([0., 0., 0., ..., 9., 9., 9.]),
'data': array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=uint8)}

sklearn offers many useful datasets; those retrieved with fetch share a similar structure: DESCR, data, target.

X, y = mnist['data'], mnist['target']
X.shape, y.shape
# each handwritten digit has 28×28 = 784 pixels

((70000, 784), (70000,))

some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28) # note: reshape back into a 2-D matrix
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, 
           interpolation='nearest')
plt.text(2, 4, "label = {}".format(y[36000]), size = 15) 
plt.show()

[figure: the digit at index 36000 rendered with its label]

Splitting the dataset

sp = 60000
X_train, X_test, y_train, y_test = X[:sp], X[sp:], y[:sp], y[sp:]
shuffle_index = np.random.permutation(sp) # shuffle the training set
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

A Binary Classifier

Start from a simple binary task: is the digit a 5 or not?
Use stochastic gradient descent for the binary classifier.

from sklearn.linear_model import SGDClassifier

# recast the labels as a 2-class problem: 5 vs not-5
y_train_5 = y_train == 5
y_test_5 = y_test == 5
sgd_clf = SGDClassifier(random_state=42, max_iter=100, tol=None)
sgd_clf.fit(X_train, y_train_5)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', max_iter=100, n_iter=None,
n_jobs=1, penalty='l2', power_t=0.5, random_state=42, shuffle=True,
tol=None, verbose=0, warm_start=False)

sgd_clf.predict([some_digit])

array([ True])

Performance Evaluation

Measure accuracy with a hand-rolled cross-validation:

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)

    print(n_correct / len(y_pred))

The results are not bad:

0.9646
0.934
0.9517

from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

This agrees with sklearn's own cross-validation computation:

array([0.9646, 0.934 , 0.9517])

It looks like predicting 5s is highly accurate, but note that our metric is simply the number of correct predictions divided by the total.

Now look at how a "never 5" classifier performs:

from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)
never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

array([0.9062 , 0.90985, 0.9129 ])

As you might guess, predicting False for everything still reaches about $90\%$ accuracy, because only a small fraction of the samples are positive. This calls for other evaluation metrics.
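
A quick sanity check of the base rate (a one-line sketch using the boolean array built above):

1 - y_train_5.mean()  # fraction of non-5s in the training set, about 0.9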

Confusion Matrix

Entry $(i,j)$ of the matrix counts how many instances of class $i$ were classified as class $j$.

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3) # cross_val_predict returns the concrete predictions
y_train_pred

array([False, False, False, ..., False, False, False])

Compute the confusion matrix by comparing y_train_5 with y_train_pred:

from sklearn.metrics import confusion_matrix

_confusion_matrix = confusion_matrix(y_train_5, y_train_pred) # confusion_matrix(true labels, predictions)

_confusion_matrix

array([[52421, 2158],
[ 836, 4585]])

$(0,0)=52421$: non-5 classified as non-5, true negatives ($TN$)
$(0,1)=2158$: non-5 classified as 5, false positives ($FP$)
$(1,0)=836$: 5 classified as non-5, false negatives ($FN$)
$(1,1)=4585$: 5 classified as 5, true positives ($TP$)

More concise metrics

  1. Precision looks only at the positive predictions, rather than the whole sample:
    $\displaystyle precision = \frac{TP}{TP+FP}$
    the probability that a positive call is actually correct
  2. Recall is the fraction of actual positives the classifier correctly identifies: $\displaystyle recall = \frac{TP}{TP+FN}$
    $\begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix}$

Precision and Recall

from sklearn.metrics import precision_score, recall_score

print(precision_score(y_train_5, y_train_pred)) # sklearn's precision
print(_confusion_matrix[1][1] / (_confusion_matrix[1][1] + _confusion_matrix[0][1])) # TP / (TP + FP) by hand

0.6799644075337387
0.6799644075337387

print(recall_score(y_train_5, y_train_pred)) # sklearn's recall
print(_confusion_matrix[1][1] / (_confusion_matrix[1][1] + _confusion_matrix[1][0])) # TP / (TP + FN) by hand

0.845784910533112
0.845784910533112

The $F_1$ score is defined as the harmonic mean of precision and recall:
$\displaystyle F_1 = \frac{2}{\frac{1}{precision}+\frac{1}{recall}}=2 \times \frac{precision \times recall}{precision + recall}$

from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)

0.7538638605721802

The $F_1$ score blends precision and recall, but we often want to weight them differently; each project has its own preference.
The reality is that improving one side cuts into the other. This is called the $precision/recall\ tradeoff$.
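
When one side should weigh more, sklearn's fbeta_score generalizes $F_1$; a minimal sketch (the beta value here is illustrative):

from sklearn.metrics import fbeta_score

# beta < 1 favors precision, beta > 1 favors recall; beta = 1 recovers F1
fbeta_score(y_train_5, y_train_pred, beta=0.5)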

The classifier scores each prediction with a $decision\ function$: if the score is above a threshold the instance goes to the positive class, and vice versa.
Choosing a different $decision\ threshold$ changes the resulting $precision$ and $recall$:
in general, raising the threshold raises precision and lowers recall, while lowering it does the opposite.
sklearn does not expose the threshold directly, but we can call decision_function() ourselves and compare against any threshold.

y_scores = sgd_clf.decision_function([some_digit])
y_scores

array([10971.98721069])

threshold = 0
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

array([ True])

threshold = 20000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

array([False])

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")  
# method="decision_function" makes cross_val_predict return the raw scores
y_scores

array([-44655.48002787, -22669.22859416, -51679.93816945, ...,
-21362.51058543, -25077.11036414, -12669.96904769])

Based on these scores we can tune the threshold to reach different precision/recall operating points.

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores) 
# from the true labels and the cross-validated scores, get precision and recall at every threshold
# then plot the precision and recall curves against the threshold
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g--", label="Recall", linewidth=2)
    plt.xlabel("Threshold", fontsize=16)
    plt.ylim([0, 1])
    plt.legend(loc="upper left", fontsize=16)
plt.figure(figsize=(10, 4))
plt.xlim([-60000, 60000])
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

[figure: precision and recall vs decision threshold]

Try plotting precision against recall directly:

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.xlim([0,1])
    plt.ylim([0,1])
    #  plt.axis([0, 1, 0, 1]) # merges the two lines above into one
plot_precision_vs_recall(precisions, recalls)
plt.show()

[figure: precision vs recall]

Suppose we want to pick the threshold that yields 90% precision.
Reading it off the precision and recall vs decision threshold plot gives roughly 5000.

y_train_pred_90 = (y_scores > 5000)
precision_score(y_train_5, y_train_pred_90)

0.840681576144835

recall_score(y_train_5, y_train_pred_90)

0.8378527946873271
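
Instead of eyeballing the plot, the threshold can be read off the arrays returned by precision_recall_curve; a short sketch (note precisions has one more entry than thresholds):

# index of the first threshold whose precision reaches 90%
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
y_train_pred_90 = (y_scores >= threshold_90_precision)
precision_score(y_train_5, y_train_pred_90)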

ROC Curve

$receiver\ operating\ characteristic\ curve$
It reflects how well the model separates the classes, plotting $TPR$ against $FPR$:
$\displaystyle TPR(true\ positive\ rate) = \frac{TP}{TP+FN}$ , $\displaystyle FPR(false\ positive\ rate) = \frac{FP}{FP+TN}$

from sklearn.metrics import roc_curve
fpr, tpr ,thresholds = roc_curve(y_train_5, y_scores)
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0,1],[0,1],'k--')
    plt.axis([0,1,0,1])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
plot_roc_curve(fpr, tpr) # plot the ROC curve
plt.show()

[figure: ROC curve for the SGD classifier]

from sklearn.metrics import roc_auc_score # compute the area under the ROC curve (AUC)
roc_auc_score(y_train_5, y_scores)

0.9618734957617371

When to use ROC vs PR

Choose the PR curve when positives are rare, or when you care more about $FP$ than $FN$;
otherwise choose the ROC curve.

Random Forest classifier
Note that its mechanism exposes no decision_function() scores; it uses the predict_proba() method instead.
Every sklearn classifier has at least one of these methods for reporting estimator scores.

from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_proba_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, 
                                   method="predict_proba")
y_proba_forest # two columns: the probability of each label (not-5, 5)

array([[1., 0.],
[1., 0.],
[1., 0.],
...,
[1., 0.],
[1., 0.],
[1., 0.]])
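
To put the forest on the same ROC axes, its positive-class probability can serve as a score; a short sketch reusing the plot_roc_curve helper defined above:

y_scores_forest = y_proba_forest[:, 1] # probability of the positive class, used as the score
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, label="Random Forest")
plt.legend(loc="lower right")
plt.show()
roc_auc_score(y_train_5, y_scores_forest)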

Multiclass Classification

Multiclass classifiers
Random forests, naive Bayes classifiers, and the like handle multiple classes directly; others, such as support vector machines and linear classifiers, are strictly binary.
There are still strategies for getting multiclass predictions out of binary models.

One is called one-versus-all ($OvA$): for the digits, combine ten binary classifiers,
take the decision score each classifier assigns the image, and predict the class with the highest score.
The other is one-versus-one ($OvO$): train one classifier per pair of digits, $\displaystyle\frac{N(N-1)}{2}$ classifiers in total,
and poll them all at prediction time. The advantage of this strategy is that each classifier only needs to train on the part of the data covering its two classes.
Some algorithms, such as support vector machine classifiers, scale poorly with large training sets, so training many small OvO models can pay off there.
For most machine learning algorithms, OvA is preferred.

sgd_clf.fit(X_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', max_iter=100, n_iter=None,
n_jobs=1, penalty='l2', power_t=0.5, random_state=42, shuffle=True,
tol=None, verbose=0, warm_start=False)

sgd_clf.predict([some_digit]) # with a multiclass y_train, sklearn automatically runs OvA (except for SVMs)

array([5.])

some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores # under the hood sklearn actually trained 10 classifiers

array([[ -7116.15452063, -28102.04823134, -16873.94800459,
-5744.56357699, -25889.4638665 , 10971.98721069,
-37587.90818143, -27744.26015609, -36375.67981573,
-40091.83133088]])

np.argmax(some_digit_scores)

5

sgd_clf.classes_ # the classes are stored in sorted order

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

sgd_clf.classes_[5]

5.0
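
Putting the two together, the highest-scoring index maps straight back to the predicted label:

sgd_clf.classes_[np.argmax(some_digit_scores)] # 5.0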

To force $OvO$ for a linear multiclass problem,
construct a OneVsOneClassifier (or OneVsRestClassifier) around the estimator yourself:

import warnings 
warnings.filterwarnings('ignore') # forcing OvO raises FutureWarnings; silence them here!
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])

array([5.])

len(ovo_clf.estimators_) # 10 * 9 / 2 = 45

45

# random forests use neither OvO nor OvA; they classify multiple classes natively
forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])

array([5.])

forest_clf.predict_proba([some_digit])

array([[0.1, 0. , 0. , 0. , 0. , 0.8, 0. , 0.1, 0. , 0. ]])

forest_clf.classes_

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
# array([0.87567487, 0.86234312, 0.87038056])

Try feature scaling as an optimization

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

array([0.91261748, 0.91069553, 0.91113667])

Error Analysis

Suppose we already have a few reasonably good models; now let's analyze what causes their errors.

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
y_train_pred # returns predicted labels; method="decision_function" would return scores instead

array([4., 6., 2., ..., 9., 7., 7.])

conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx
# the main diagonal holds the correct predictions

array([[5701, 1, 17, 8, 12, 46, 41, 6, 86, 5],
[ 1, 6471, 47, 25, 6, 47, 4, 11, 117, 13],
[ 36, 32, 5337, 110, 69, 28, 84, 54, 194, 14],
[ 37, 32, 129, 5351, 1, 237, 29, 49, 179, 87],
[ 15, 22, 43, 12, 5350, 7, 42, 28, 129, 194],
[ 49, 28, 37, 175, 62, 4638, 88, 22, 241, 81],
[ 33, 23, 49, 2, 40, 92, 5600, 6, 72, 1],
[ 19, 17, 66, 29, 58, 13, 4, 5773, 39, 247],
[ 30, 96, 57, 120, 8, 148, 43, 15, 5251, 83],
[ 30, 31, 29, 83, 146, 34, 2, 182, 139, 5273]])

Look at the confusion matrix rendered as a grayscale image.

Brighter cells mean larger counts.

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

[figure: grayscale plot of the confusion matrix]

Describe the image with error rates rather than absolute error counts.

This makes the rows comparable and visualizes how likely each digit is to be classified as each class.

row_sums = conf_mx.sum(axis=1, keepdims=True) # number of images per actual class
norm_conf_mx = conf_mx / row_sums # convert counts to rates
np.fill_diagonal(norm_conf_mx, 0) # zero the diagonal so only the errors stand out
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

[figure: normalized confusion matrix with the diagonal zeroed]

Rows are the actual classes, columns the predicted ones.
The error sources stand out: the $8$ row and the $8$ column, plus the pairs $(5,3)$, $(3,5)$, $(7,9)$, $(9,7)$.
The confusion matrix points later optimization at the critical spots: improve how 8 and 9 are separated from the other digits, and rework the 3-vs-5 prediction.

Looking at individual 3s and 5s

def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = matplotlib.cm.binary, **options)
    plt.axis("off")
cl_a, cl_b = 3, 5
# the four cells of a hand-built 3-vs-5 confusion matrix
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

# the default subplot command is subplot(111)
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.show()

png

The left column shows digits predicted as a=3, the right column digits predicted as b=5.

Some of these samples would be hard even for a human to tell apart. Such poor data can either be removed, or preprocessing can amplify the difference between 3 and 5, for example by straightening the digits; see the sketch below.
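
One plausible preprocessing step along those lines, sketched here as an assumption rather than taken from these notes: recenter each digit on its center of mass with scipy.ndimage (an all-zero image would need a guard):

from scipy.ndimage import center_of_mass, shift

def center_digit(flat_img):
    img = flat_img.reshape(28, 28)
    cy, cx = center_of_mass(img) # intensity-weighted center of the digit
    return shift(img, [14 - cy, 14 - cx], cval=0).reshape(-1) # move it to the image center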

Multilabel Classification

A single-label classifier assigns each training instance to one class; a multilabel classifier assigns each instance to several classes at once.
Construct two labels: >= 7, and odd.

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
print("y_train_large: ", y_train_large)
print("y_train_odd: ", y_train_odd)
print("y_multilabel: \n", y_multilabel)

y_train_large: [False False False ... True True True]
y_train_odd: [False False False ... True True True]
y_multilabel:
[[False False]
[False False]
[False False]
...
[ True True]
[ True True]
[ True True]]

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')

knn_clf.predict([some_digit]) # some_digit is a 5: not >= 7 (False), odd (True)

array([[False, True]])

Evaluating the multilabel classifier

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro") 
# average="macro" treats every label as equally important; average="weighted" weights each label by its support

0.96084403616737991

Multioutput Classification

Simulate a system that takes a noisy image as input and outputs a clean image.

The classifier outputs 28×28 = 784 pixels, each ranging over 0-255, so this is a multilabel, multioutput classifier.

# generate random noise
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise

y_train_mod = X_train
y_test_mod = X_test
def plot_digit(data):
    image = data.reshape(28, 28) # don't forget to reshape back to (28, 28)
    plt.imshow(image, cmap = plt.cm.binary,
               interpolation="nearest")
    plt.axis("off")
some_index = 5500
plt.subplot(121); plot_digit(X_test_mod[some_index])
plt.subplot(122); plot_digit(y_test_mod[some_index])
plt.show()

[figure: noisy input (left) and clean target (right)]

knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)

[figure: the cleaned digit predicted by KNN]

Extra material

A dummy classifier provides a baseline for comparison.

from sklearn.dummy import DummyClassifier
dmy_clf = DummyClassifier()
y_probas_dmy = cross_val_predict(dmy_clf, X_train, y_train_5, cv=3, 
                                 method="predict_proba")
y_scores_dmy = y_probas_dmy[:, 1] # positive-class probability as the score

fprr, tprr, thresholdsr = roc_curve(y_train_5, y_scores_dmy)
plot_roc_curve(fprr, tprr)

KNN classifier
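
A minimal sketch of a plain KNN fit on the full task, assuming default hyperparameters (purely illustrative):

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)
knn_clf.predict([some_digit])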

Exercises

1. KNN hyperparameter tuning

KNN performs well on this kind of problem; try a grid search for hyperparameter optimization.
The main hyperparameters of the KNN classifier are weights and n_neighbors; a sketch follows below.
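
A minimal grid-search sketch (the parameter values are illustrative, not tuned):

from sklearn.model_selection import GridSearchCV

param_grid = [{"weights": ["uniform", "distance"], "n_neighbors": [3, 4, 5]}]
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)
grid_search.best_params_, grid_search.best_score_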

2. Data augmentation

Try shifting the original digits to augment/expand the dataset.

from scipy.ndimage.interpolation import shift
def shift_image(img, dx, dy):
    img = img.reshape(28, 28)
    shifted_image = shift(img, [dy, dx], cval=0, mode="constant") # translate by (dx, dy), padding with zeros
    return shifted_image.reshape([-1])
image_set = X_train[233]
print(y_train[233])
shifted_image_down = shift_image(image_set, dx=0, dy=5)
shifted_image_left = shift_image(image_set, dx=-5, dy=0)

plt.figure(figsize=(5,3))

plt.subplot(131)
plt.title("Original", fontsize=14)
plt.imshow(image_set.reshape(28,28), interpolation="nearest", cmap="Greys")

plt.subplot(132)
plt.title("Shifted down", fontsize=14)
plt.imshow(shifted_image_down.reshape(28, 28), interpolation="nearest", cmap="Greys")

plt.subplot(133)
plt.title("Shifted left", fontsize=14)
plt.imshow(shifted_image_left.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.show()

0.0

png

X_train_augmented = [img for img in X_train]
y_train_augmented = [label for label in y_train]

for dx, dy in ((1,0), (-1,0), (0,1), (0,-1)):
    for image, label in zip(X_train, y_train):
        X_train_augmented.append(shift_image(image, dx=dx, dy=dy))
        y_train_augmented.append(label)

X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)
knn_clf = KNeighborsClassifier()
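
A sketch of the remaining steps, fitting on the augmented set; scoring on the held-out test set is an assumption about where the exercise goes (a tuned model from exercise 1 would normally be reused):

knn_clf.fit(X_train_augmented, y_train_augmented)
knn_clf.score(X_test, y_test) # accuracy on the untouched test set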
