3_Classification

In [1]:
from sklearn.datasets import fetch_mldata
import matplotlib
import matplotlib.pylab as plt
import numpy as np
%matplotlib inline

mnist = fetch_mldata("MNIST original")
mnist

Out[1]:
{'DESCR': 'mldata.org dataset: mnist-original',
'COL_NAMES': ['label', 'data'],
'target': array([0., 0., 0., ..., 9., 9., 9.]),
'data': array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=uint8)}
In [2]:
X, y = mnist['data'], mnist['target']
X.shape, y.shape
# 对于每个手写体28×28=784

Out[2]:
((70000, 784), (70000,))
In [6]:
some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28) # 还原成二位维矩阵
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary,
interpolation='nearest')
# plt.axis("off")
plt.text(2, 4, "label = {}".format(y[36001]), size = 15)
plt.show()

In [5]:
sp = 60000
X_train, X_test, y_train, y_test = X[:sp], X[sp:], y[:sp], y[sp:]

In [6]:
import numpy as np

shuffle_index = np.random.permutation(sp)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]


### 二分类器¶

In [7]:
# 调整数据的label为2class
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

In [8]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42,max_iter=100, tol=None)

sgd_clf.fit(X_train, y_train_5)

Out[8]:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', max_iter=100, n_iter=None,
n_jobs=1, penalty='l2', power_t=0.5, random_state=42, shuffle=True,
tol=None, verbose=0, warm_start=False)
In [9]:
sgd_clf.predict([some_digit])

Out[9]:
array([ True])

### 性能评估¶

#### 手动完成交叉验证¶

In [10]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
clone_clf = clone(sgd_clf)
X_train_folds = X_train  [train_index]
y_train_folds = y_train_5[train_index]
X_test_fold   = X_train  [test_index ]
y_test_fold   = y_train_5[test_index ]

clone_clf.fit(X_train_folds, y_train_folds)
y_pred = clone_clf.predict(X_test_fold)
n_correct = sum(y_pred == y_test_fold)

print(n_correct / len(y_pred))

0.96785
0.9548
0.9629

In [11]:
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Out[11]:
array([0.96785, 0.9548 , 0.9629 ])

#### 看起来预测出是5的准确性很高，但是注意我们评估的方法是拿 预测对的除以总数¶

In [12]:
from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):
def fit(self, X, y=None):
pass
def predict(self, X):
return np.zeros((len(X), 1), dtype=bool)

In [13]:
never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Out[13]:
array([0.9078 , 0.9093 , 0.91185])

### 混淆矩阵¶

In [14]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3) # cross_val_predict输出具体预测的结果

In [15]:
y_train_pred

Out[15]:
array([ True,  True, False, ..., False, False, False])

### 通过比较y_train_5和y_train_pred计算混淆矩阵¶

In [16]:
from sklearn.metrics import confusion_matrix

_confusion_matrix = confusion_matrix(y_train_5, y_train_pred) # confusion_matrix(真实label, 预测)

In [17]:
_confusion_matrix

Out[17]:
array([[54247,   332],
[ 1957,  3464]])
$(0,0)=54055$ 非$5$被识别成非$5$ 真反例 $true$positives$ $(0,1)=524$ 非$5$被识别成5 假反例 $false$ positives$
$(1,0)=1449$ $5$被识别成非$5$ 假反例 $false$negatives$ $(1,1)=3972$ $5$被识别成$5$ 真正例 $true$ positives$

#### 更简洁的测度¶

1. 正例的精确度，考量的范围不再是全体样本
$$displaystyle precosion =$frac{TP}{TP+FP}$
做一次正例的判断保证预测正确的概率
2. 被分类器正确预测为正例的比例 $$displaystyle recall =$frac{TP}{TP+FN}$
$$begin{bmatrix} TN & FP$$FN & TP$end{bmatrix}$

### Precision and Recall¶

In [18]:
from sklearn.metrics import precision_score, recall_score

print(precision_score(y_train_5, y_train_pred))
print(_confusion_matrix[1][1] / (_confusion_matrix[1][1] + _confusion_matrix[0][1]))

0.9125395152792413
0.9125395152792413

In [19]:
print(recall_score(y_train_5, y_train_pred))
print(_confusion_matrix[1][1] / (_confusion_matrix[1][1] + _confusion_matrix[1][0]))

0.6389964951116031
0.6389964951116031

$F_1$量度定义为准确和查全率的调和平均数
$$displaystyle F_1 =$frac{2}{$frac{1}{precision}+$frac{1}{recall}}=2 $times$frac{precision $times recall}{precision + recall}$ In [20]: from sklearn.metrics import f1_score f1_score(y_train_5, y_train_pred)  Out[20]: 0.751654551372464 $F_1$量度综合考虑了准确度和查全率，但往往我们需要给他们分配权重，在不同的项目中，我们有自己的偏好。 但实情是当我们提高一方面会削减另一个方面。这称为$precision/recall$ tradeoff$

sklearn不提供直接修改阈值，但我们可以通过调节descision_function()达到目的

In [21]:
y_scores = sgd_clf.decision_function([some_digit])
y_scores

Out[21]:
array([3981.37772237])
In [22]:
threshold = 0
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

Out[22]:
array([ True])
In [23]:
threshold = 20000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

Out[23]:
array([False])
In [24]:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")  # 输出具体的得分
y_scores

Out[24]:
array([ 14344.87633729,   7567.82140676, -22506.58001148, ...,
-15625.04015326, -56374.90287504, -22612.94742342])

In [25]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
# 预测结果,交叉验证得分来 得到在不同阈值设置下查全率和查准率

In [26]:
# 对于不同的阈值，绘制precisions和recalls曲线
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2) # 查准率
plt.plot(thresholds, recalls[:-1],"g--", label="Recall", linewidth=2) # 查全率
plt.xlabel("Threshold", fontsize=16)
plt.ylim([0, 1])
plt.legend(loc="upper left", fontsize=16)

In [27]:
plt.figure(figsize=(10, 4))
plt.xlim([-60000, 60000])
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

In [28]:
#### 试试直接绘制precision vs recall
def plot_precision_vs_recall(precisions, recalls):
plt.plot(recalls, precisions, "b-", linewidth=2)
plt.xlabel("Recall", fontsize=16)
plt.ylabel("Precision", fontsize=16)
plt.xlim([0,1])
plt.ylim([0,1])
#  plt.axis([0, 1, 0, 1]) # 合并为一句

In [29]:
plot_precision_vs_recall(precisions, recalls)
plt.show()

In [30]:
y_train_pred_90 = (y_scores > 5000)
precision_score(y_train_5, y_train_pred_90)

Out[30]:
0.9553962564715253
In [31]:
recall_score(y_train_5, y_train_pred_90)

Out[31]:
0.4425382770706512

$receiver$operating$ characteristic$curve$ 反应模型的泛化能力，绘制$TPR$ against$FPR$ $$displaystyle TPR(true$positive$ rate) = $frac{TP}{TP+FN}$ , $$displaystyle FPR(false$positive$ rate) = $frac{FP}{FP+TN}$ In [32]: from sklearn.metrics import roc_curve fpr, tpr ,thresholds = roc_curve(y_train_5, y_scores)  In [33]: def plot_roc_curve(fpr, tpr, label=None): plt.plot(fpr, tpr, linewidth=2, label=label) plt.plot([0,1],[0,1],'k--') plt.axis([0,1,0,1]) plt.xlabel("False Positive Rate") plt.ylabel("True Positive Rate")  In [34]: plot_roc_curve(fpr, tpr) # 绘制ROC plt.show()  In [35]: from sklearn.metrics import roc_auc_score # 计算AUC ROC 面积 roc_auc_score(y_train_5, y_scores)  Out[35]: 0.9642974532846397 #### 何时使用ROC与PR¶ 当正例很少或者我们关心$FP$超过$FN$时选用PR curve 想想 反之选择ROC curve 随机森林分类器 注意其工作方式并没与decision_function()显示分数，其使用dict_proba()方法 对于sklearn的分类器都有其中一种可以显示估计器得分的方法 In [36]: from sklearn.ensemble import RandomForestClassifier forest_clf = RandomForestClassifier(random_state=42) y_proba_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")  In [37]: y_proba_forest # 输出两个 label  Out[37]: array([[0. , 1. ], [0.7, 0.3], [0.9, 0.1], ..., [1. , 0. ], [1. , 0. ], [1. , 0. ]]) ### Multiclass Classification¶ 多类别分类器 随机森林，朴素贝叶斯分类器等都能直接处理多类别，其他的例如支持向量机和线性模型是严格二分类器 然而仍有方法去实现线性模型个多分类。 一种称为$(one-versus-all)OvA$, 对手写体做十个二分类器组合预测 从每个分类器得到该图像的decision score取得分最高作为预测结果 另一种称为(one-versus-one)$OvO$ ，对每对数字做一个分类器,共计需要$$displaystyle$frac{N*(N-1)}{2}$个分类器 预测时从这些分类器中查询结果，这种策略的好处在于每个分类器只需在部分训练数据上跑模型。 对于有些算法例如支持向量机分类器在大数据下表现不优，此时就可以采取OvO方式训练出很多小模型。 对于大部分机器学习算法优先采用OvA方式 In [38]: sgd_clf.fit(X_train, y_train)  Out[38]: SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=100, n_iter=None, n_jobs=1, penalty='l2', power_t=0.5, random_state=42, shuffle=True, tol=None, verbose=0, warm_start=False) In [39]: sgd_clf.predict([some_digit]) # sklearn 对于y_train多class 自动OvA多类分类(除了SVM)  Out[39]: array([5.]) In [40]: some_digit_scores = sgd_clf.decision_function([some_digit])  In [41]: some_digit_scores # 实际上sklearn 底下训练了10组分类器  Out[41]: array([[ -8767.1583592 , -28434.23639605, -15375.96926209, -15053.8624407 , -26689.81636005, 3981.37772237, -43076.93474333, -25584.21102096, -29794.24807625, -31170.63859883]]) In [42]: np.argmax(some_digit_scores)  Out[42]: 5 In [43]: sgd_clf.classes_ # 按序存储class  Out[43]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]) In [44]: sgd_clf.classes_[5]  Out[44]: 5.0 强制使用$OvO$跑线性多分类器 OneVsOneClassifierOneVsRestClassifier 类自己构造即可 In [45]: import warnings warnings.filterwarnings('ignore') # 强用ovo会报Further Warning这里忽略下！ from sklearn.multiclass import OneVsOneClassifier ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42)) ovo_clf.fit(X_train, y_train) ovo_clf.predict([some_digit])  Out[45]: array([5.]) In [46]: len(ovo_clf.estimators_) # 10 * 9 / 2 = 45  Out[46]: 45 In [47]: # 随机森林并没有使用ovo or ova，其可以直接多分类 forest_clf.fit(X_train, y_train) forest_clf.predict([some_digit])  Out[47]: array([5.]) In [48]: forest_clf.predict_proba([some_digit])  Out[48]: array([[0.3, 0. , 0. , 0.1, 0. , 0.6, 0. , 0. , 0. , 0. ]]) In [49]: forest_clf.classes_  Out[49]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]) In [50]: cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy") # array([0.87567487, 0.86234312, 0.87038056])  Out[50]: array([0.87747451, 0.87954398, 0.88413262]) 尝试特征缩放 优化 In [51]: from sklearn.preprocessing import StandardScaler scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))  In [52]: cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy") # array([0.91261748, 0.91069553, 0.91113667])  Out[52]: array([0.91276745, 0.91219561, 0.91293694]) ### 分析错误¶ 假设我们有了几个较好的模型，现在来分析其错误是什么造成的 In [53]: y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)  In [54]: y_train_pred # 返回预测结果score 返回预测得分  Out[54]: array([5., 5., 0., ..., 9., 2., 1.]) In [55]: conf_mx = confusion_matrix(y_train, y_train_pred)  In [56]: conf_mx # 主对角线位为正确的预测  Out[56]: array([[5694, 1, 16, 8, 11, 50, 44, 6, 88, 5], [ 1, 6462, 45, 27, 5, 43, 4, 11, 129, 15], [ 39, 33, 5341, 99, 77, 26, 76, 46, 208, 13], [ 38, 30, 132, 5360, 2, 222, 29, 50, 179, 89], [ 17, 19, 44, 13, 5343, 8, 50, 28, 121, 199], [ 46, 27, 37, 185, 60, 4622, 91, 23, 240, 90], [ 30, 23, 45, 2, 45, 94, 5599, 7, 72, 1], [ 23, 15, 71, 30, 50, 9, 4, 5795, 37, 231], [ 30, 94, 52, 116, 6, 128, 43, 14, 5281, 87], [ 29, 28, 30, 83, 146, 38, 2, 192, 140, 5261]]) In [57]: plt.matshow(conf_mx, cmap=plt.cm.gray) plt.show()  In [58]: #### 使用错误率而非绝对错误的数量描述图像 row_sums = conf_mx.sum(axis=1, keepdims=True) norm_conf_mx = conf_mx / row_sums  In [59]: np.fill_diagonal(norm_conf_mx, 0) plt.matshow(norm_conf_mx, cmap=plt.cm.gray) plt.show()  行为实际的class，列为预测的class 观察错误来源$(8, *), (*, 8), (5, 3), (3, 5), (7, 9), (9, 7)$ 通过混淆矩阵帮我们后期优化致命方向，应该改善8,9与其他数字的区别，调整3,5的预测方式 个体分析3,5 In [60]: def plot_digits(instances, images_per_row=10, **options): size = 28 images_per_row = min(len(instances), images_per_row) images = [instance.reshape(size,size) for instance in instances] n_rows = (len(instances) - 1) // images_per_row + 1 row_images = [] n_empty = n_rows * images_per_row - len(instances) images.append(np.zeros((size, size * n_empty))) for row in range(n_rows): rimages = images[row * images_per_row : (row + 1) * images_per_row] row_images.append(np.concatenate(rimages, axis=1)) image = np.concatenate(row_images, axis=0) plt.imshow(image, cmap = matplotlib.cm.binary, **options) plt.axis("off")  In [61]: cl_a, cl_b = 3, 5 # 手工混淆矩阵 X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)] X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)] X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)] X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)] # 默认子图命令是subplot(111) plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5) plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5) plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5) plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5) plt.show()  In [62]: # 左侧为预测为a=3, 右侧为预测为b=5  ### Multilabel Classification 多标签分类器¶ 单标签分类器把训练数据扔进一个class中，多标签分类器对任务要扔到多个class中 In [63]: y_train_large = (y_train >= 7) y_train_odd =(y_train % 2 == 1) y_multiabel = np.c_[y_train_large, y_train_odd]  In [64]: print("y_train_large: ", y_train_large) print("y_train_odd: ", y_train_odd) print("y_multiabel:$n", y_multiabel)

y_train_large:  [False False  True ...  True False False]
y_train_odd:  [ True  True False ...  True False  True]
y_multiabel:
[[False  True]
[False  True]
[ True False]
...
[ True  True]
[False False]
[False  True]]

In [65]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multiabel)

Out[65]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
In [66]:
knn_clf.predict([some_digit]) # some_digit = 5

Out[66]:
array([[False,  True]])
In [72]:
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3)

In [73]:
import pickle
with open("../y_train_knn_pred.pk", "rb") as f:
y_train_knn_pred = pickle.load(f)

In [74]:
f1_score(y_train, y_train_knn_pred, average="macro") # average=macro label同等重要， weighted加权

Out[74]:
0.09778545692602647
In [75]:
import pickle

with open('../y_train_knn_pred.pk', 'wb') as f:
#     np.ndarray.dump(y_train_knn_pred, f)
pickle.dump(y_train_knn_pred, f)


### Multioutput Classifiction 多输出分类器¶

#### 模拟一个系统输入混乱的图片输出干净的图片¶

In [76]:
### 产生噪声

noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise

y_train_mod = X_train
y_test_mod = X_test

In [77]:
def plot_digit(data):
image = data.reshape(28, 28) # 别忘了reshape成(28*28)
plt.imshow(image, cmap = plt.cm.binary,
interpolation="nearest")
plt.axis("off")

In [78]:
some_index = 5500
plt.subplot(121); plot_digit(X_test_mod[some_index])
plt.subplot(122); plot_digit(y_test_mod[some_index])
plt.show()

In [79]:
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)


### Extra material¶

#### 伪分类器 提供基线参考¶

In [80]:
from sklearn.dummy import DummyClassifier # 随机分类器
dmy_clf = DummyClassifier()
y_probas_dmy = cross_val_predict(dmy_clf, X_train, y_train, cv=3, method="predict_proba")
y_scores_dmy = y_probas_dmy[:,1]

In [81]:
fprr, tprr, thresholdsr = roc_curve(y_train_5, y_scores_dmy)
plot_roc_curve(fprr, tprr)


### KNN classifie¶

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_jobs=-1, weights="distance", n_neighbors=4)
knn_clf.fit(X_train, y_train)


## Exercises¶

### 1. Knn超参优化¶

knn在此类问题山表现优秀，尝试用网格搜索做超参优化
knn分类器的主要超参：weights和n_neighbors

### 2. 数据增强 data augmentation¶

In [ ]:
from scipy.ndimage.interpolation import shift

def shift_image(img, dx, dy):
img = img.reshape(28, 28)
shifted_image = shift(img, [dy, dx], cval=0, mode="constant")
return shifted_image.reshape([-1])

In [ ]:
image_set = X_train[233]
print(y_train[233])
shifted_image_down = shift_image(image_set, dx=0, dy=5)
shifted_image_left = shift_image(image_set, dx=-5, dy=0)

plt.figure(figsize=(5,3))

plt.subplot(131)
plt.title("Original", fontsize=14)
plt.imshow(image_set.reshape(28,28), interpolation="nearest", cmap="Greys")

plt.subplot(132)
plt.title("Shifted down", fontsize=14)
plt.imshow(shifted_image_down.reshape(28, 28), interpolation="nearest", cmap="Greys")

plt.subplot(133)
plt.title("Shifted left", fontsize=14)
plt.imshow(shifted_image_left.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.show()

In [ ]:
X_train_augmented = [img for img in X_train]
y_train_augmented = [img for img in y_train]

In [ ]:
for dx, dy in ((1,0), (-1,0), (0,1), (0,-1)):
for image, label in zip(X_train, y_train):
X_train_augmented.append(shift_image(image, dx=dx, dy=dy))
y_train_augmented.append(label)

X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)

In [ ]:
knn_clf = KNeighborsClassifier()


This site uses Akismet to reduce spam. Learn how your comment data is processed.