### Data Preparation

1. Look at the big picture
2. Get the data
3. Explore and visualize the data
4. Prepare the data for machine learning
5. Select a model and train it
6. Fine-tune the model
7. Present the solution
8. Launch, monitor, and maintain the system

### Frame the Problem

$$\mathrm{RMSE}(\mathbf{X},h)=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)})-y^{(i)}\right)^2}$$

68% of the values fall within $1\sigma$ of the mean,
95% within $2\sigma$,
and 99.7% within $3\sigma$.
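These proportions are easy to verify empirically; a self-contained sketch (illustrative, not part of the original notes) using a large standard normal sample:

```python
import numpy as np

rng = np.random.RandomState(42)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

# fraction of samples within k standard deviations of the mean
for k in (1, 2, 3):
    frac = np.mean(np.abs(samples) <= k)
    print(f"within {k} sigma: {frac:.4f}")
```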

$$\mathrm{MAE}(\mathbf{X},h)=\frac{1}{m}\sum_{i=1}^{m}\left|h(\mathbf{x}^{(i)})-y^{(i)}\right|$$

Norms and loss functions: RMSE corresponds to the Euclidean distance between vectors, the $\ell_2$ norm, written $\|\cdot\|_2$ or simply $\|\cdot\|$. MAE corresponds to the Manhattan distance, the $\ell_1$ norm, written $\|\cdot\|_1$. More generally, the $\ell_k$ norm of a vector is defined as $\|\mathbf{v}\|_k=\left(|\nu_0|^k+|\nu_1|^k+\cdots+|\nu_n|^k\right)^{\frac{1}{k}}$.
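A quick numeric sketch of these norms (the helper `lk_norm` is mine, not from the notes); the results agree with numpy's `np.linalg.norm`:

```python
import numpy as np

def lk_norm(v, k):
    """The l_k norm: (sum_i |v_i|^k)^(1/k)."""
    return np.sum(np.abs(v) ** k) ** (1.0 / k)

v = np.array([3.0, -4.0])
print(lk_norm(v, 1))  # l_1, Manhattan distance (MAE's norm): 7.0
print(lk_norm(v, 2))  # l_2, Euclidean distance (RMSE's norm): 5.0
print(np.linalg.norm(v, ord=1), np.linalg.norm(v, ord=2))  # numpy agrees
```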

import os
import tarfile
import urllib.request
import pandas as pd
import hashlib
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

# data directories
# local relative path for the dataset
HOUSING_PATH = 'datasets/housing/'
HOUSING_ABSOLUTE_PATH = '/home/hu/ml/Handson-ml/datasets/housing'
# file names
HOUSING_TGZ = 'housing.tgz'
HOUSING_CSV = 'housing.csv'
# build the download URL (the handson-ml repository hosts the dataset)
DOWNLOAD_ROOT = 'https://raw.githubusercontent.com/ageron/handson-ml/master/'
HOUSING_URL = DOWNLOAD_ROOT + 'datasets/housing/' + HOUSING_TGZ

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):  # create the data directory if needed
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, HOUSING_TGZ)
    urllib.request.urlretrieve(housing_url, tgz_path)  # download the tgz file

    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)  # extract into the data directory
    housing_tgz.close()

# fetch_housing_data()

# load the CSV data
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, HOUSING_CSV)  # build the file path
    return pd.read_csv(csv_path)

housing = load_housing_data()  # load the CSV into a DataFrame


# inspect the DataFrame summary
housing.info()
# note total_bedrooms: 20433 non-null float64
# -> there are missing values
# note ocean_proximity: 20640 non-null object
# -> there is a non-numeric column


<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

housing['ocean_proximity'].value_counts() # value counts for one categorical attribute


<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64

housing.describe() # summary statistics of the numeric columns


# get a feel for the distributions with histograms
housing.hist(bins=100, figsize=(20,15)) # bins: number of histogram bins; figsize: figure size
plt.show()


1. Median income has been scaled, to the range $[0.4999, 15.0001]$
2. Housing median age and median house value are capped, so a trained model's predictions will be capped too
3. The features are on very different scales (feature scaling will be needed)
4. Many histograms are tail-heavy; training data closer to a bell shape is preferable
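Capped attributes are easy to spot programmatically, because many samples sit exactly at the boundary value; a small self-contained sketch (toy prices, not the housing data):

```python
import pandas as pd

prices = pd.Series([150000.0, 230000.0, 500001.0, 500001.0, 500001.0, 87000.0])
top = prices.max()
capped_share = (prices == top).mean()  # share of samples exactly at the cap
print(top, capped_share)  # a large share at the max suggests a capped attribute
```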

print('median_income max:',housing['median_income'].max())
print('median_income min:', housing['median_income'].min())


median_income max: 15.0001
median_income min: 0.4999

print('median_house_value max:', housing['median_house_value'].max())
print('median_house_value min:', housing['median_house_value'].min())


median_house_value max: 500001.0
median_house_value min: 14999.0

### Creating a Test Set

# naive random split
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]  # iloc: positional indexing; loc: label indexing

train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), 'train +', len(test_set), 'test')


16512 train + 4128 test
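One problem with this split, sketched below on toy data: without fixing the random seed, every run draws a fresh permutation, so the test set changes from run to run and previously held-out rows leak into training.

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio):
    # same naive split as above: a fresh permutation on every call
    shuffled = np.random.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    return data.iloc[shuffled[test_size:]], data.iloc[shuffled[:test_size]]

data = pd.DataFrame({"x": range(100)})
_, test_a = split_train_test(data, 0.2)
_, test_b = split_train_test(data, 0.2)
# the two test sets almost surely differ
print(len(test_a), len(test_b), set(test_a.index) == set(test_b.index))
```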

#### Drawbacks of the naive random split

# hash each instance's identifier so the test set stays stable across runs
def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]  # loc by label; returns train set, test set

housing_with_id = housing.reset_index() # rebuild an integer row index; the first column becomes 'index'
housing_with_id.info()


<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 11 columns):
index 20640 non-null int64
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
dtypes: float64(9), int64(1), object(1)
memory usage: 1.7+ MB

train_set, test_set = split_train_by_id(housing_with_id, 0.2, 'index')
print(len(train_set), 'train +', len(test_set), 'test')


16362 train + 4278 test

#### Besides reset_index, you can construct a unique identifier yourself, e.g. by combining longitude and latitude

housing_with_id['id'] = housing['longitude'] * 1000 + housing['latitude']
housing_with_id.info()
train_set, test_set = split_train_by_id(housing_with_id, 0.2, 'id')


<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 12 columns):
index 20640 non-null int64
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
id 20640 non-null float64
dtypes: float64(10), int64(1), object(1)
memory usage: 1.9+ MB

# split train/test with sklearn's built-in helper
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2,random_state=42)
print(len(train_set), 'train +', len(test_set), 'test')


16512 train + 4128 test

### Stratified Sampling

Purely random sampling works when the dataset is large, but on a small dataset it carries a real risk of sampling bias.

housing["median_income"].hist() # histogram of a single attribute


housing['income_cat'] = np.ceil(housing['median_income'] / 1.5)
housing['income_cat'].where(housing['income_cat'] < 5, 5.0, inplace=True) # merge everything above 5 into category 5; don't create too many strata
# where() parameters:
# 1. cond: keep the value where cond is True, replace where False
# 2. other: the replacement value
# 3. inplace: default False; whether to modify in place
# 4. axis=None: alignment axis
# 5. level=None: alignment level

housing['income_cat'].hist() # the income-category histogram is now much closer to a bell shape


from sklearn.model_selection import StratifiedShuffleSplit

# stratified shuffled split
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# StratifiedShuffleSplit parameters:
# 1. n_splits: number of re-shuffling/splitting iterations (default=10)
# 2. test_size=None, train_size=None: split sizes
# 3. random_state=None
for train_index, test_index in split.split(housing, housing['income_cat']):
    stra_train_set, stra_test_set = housing.loc[train_index], housing.loc[test_index]

for set_ in (stra_train_set, stra_test_set):
    set_.drop(['income_cat'], axis=1, inplace=True)  # remove the income_cat column again

stra_test_set.describe()


housing['income_cat'].value_counts() / len(housing['income_cat'])


3.0 0.350581
2.0 0.318847
4.0 0.176308
5.0 0.114438
1.0 0.039826
Name: income_cat, dtype: float64

housing['income_cat'].value_counts()


3.0 7236
2.0 6581
4.0 3639
5.0 2362
1.0 822
Name: income_cat, dtype: int64
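On synthetic data the effect is easy to see: a stratified test fold reproduces the category proportions of the full dataset almost exactly (illustrative sketch; the categories and proportions are made up, not the housing data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.RandomState(42)
df = pd.DataFrame({"cat": rng.choice([1, 2, 3], size=1000, p=[0.1, 0.3, 0.6])})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(df, df["cat"]))

overall = df["cat"].value_counts(normalize=True).sort_index()
strat = df.loc[test_idx, "cat"].value_counts(normalize=True).sort_index()
print((strat - overall).abs().max())  # near zero: category proportions are preserved
```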

### Visualizing the Data

housing = stra_train_set.copy()
housing.plot(kind='scatter', x='longitude', y='latitude') # visualize the geographic data


housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1) # add transparency with alpha


housing.plot(kind='scatter', x='longitude', y='latitude',
             alpha=0.4, s=housing['population']/100,
             label='population', c='median_house_value',
             cmap=plt.get_cmap('jet'), colorbar=True)
# plot parameters:
# kind: plot type
# alpha: transparency
# s: marker size (circle radius)
# label: legend label for the circles
# c: the values mapped to colors
# cmap: colormap to use
# colorbar: show the color bar
plt.legend() # draw the legend


### Looking for Correlations

corr_matrix = housing.corr() # pairwise correlation matrix

corr_matrix['median_house_value'].sort_values(ascending=False)
# median_income correlates most strongly


median_house_value 1.000000
median_income 0.687160
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population -0.026920
longitude -0.047432
latitude -0.142724
Name: median_house_value, dtype: float64

# another approach: a scatter matrix
from pandas.plotting import scatter_matrix

# pick a few features likely to be correlated
attributes = ['median_house_value', 'median_income', 'total_rooms', 'housing_median_age']
scatter_matrix(housing[attributes], figsize=(12, 8))
# again, median_income stands out


# zoom in on the most promising feature: median income
housing.plot(kind='scatter', x='median_income', y='median_house_value', alpha=0.1)
# the points cluster along an upward trend, with a visible cap
# look closely: horizontal lines of repeated values could mislead the model; consider removing these data quirks


### Experimenting with Attribute Combinations

# combine attributes into new features
housing['rooms_per_household'] = housing['total_rooms'] / housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms'] / housing['total_rooms']
housing['population_per_household'] = housing['population'] / housing['households']

corr_matrix = housing.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)


median_house_value 1.000000
median_income 0.687160
rooms_per_household 0.146285
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population_per_household -0.021985
population -0.026920
longitude -0.047432
latitude -0.142724
bedrooms_per_room -0.259984
Name: median_house_value, dtype: float64

### Prepare the Data for Machine Learning Algorithms

# separate the labels from the training set
housing = stra_train_set.drop('median_house_value', axis=1) # drop returns a copy
housing_labels = stra_train_set['median_house_value'].copy()
housing.info()
housing_labels.describe()


<class ‘pandas.core.frame.DataFrame’>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 9 columns):
longitude 16512 non-null float64
latitude 16512 non-null float64
housing_median_age 16512 non-null float64
total_rooms 16512 non-null float64
total_bedrooms 16354 non-null float64
population 16512 non-null float64
households 16512 non-null float64
median_income 16512 non-null float64
ocean_proximity 16512 non-null object
dtypes: float64(8), object(1)
memory usage: 1.3+ MB
count 16512.000000
mean 206990.920724
std 115703.014830
min 14999.000000
25% 119800.000000
50% 179500.000000
75% 263900.000000
max 500001.000000
Name: median_house_value, dtype: float64

### Data Cleaning

1. Drop the offending rows: housing.dropna(subset=["total_bedrooms"])
2. Drop the whole attribute: housing.drop("total_bedrooms", axis=1)
3. Fill in a specific value: housing["total_bedrooms"].fillna(median)
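The three options on a toy DataFrame (self-contained sketch, not the housing data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"total_bedrooms": [2.0, np.nan, 4.0], "other": [1, 2, 3]})

opt1 = df.dropna(subset=["total_bedrooms"])   # 1. drop the rows with missing values
opt2 = df.drop("total_bedrooms", axis=1)      # 2. drop the whole attribute
median = df["total_bedrooms"].median()        # 3. fill with the column median
opt3 = df.copy()
opt3["total_bedrooms"] = opt3["total_bedrooms"].fillna(median)

print(len(opt1), list(opt2.columns), opt3["total_bedrooms"].tolist())
```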

# handle the missing values
# total_bedrooms 16354 non-null float64
from sklearn.preprocessing import Imputer

imputer = Imputer(strategy='median') # a transformer that fills missing values with each column's median
housing_num = housing.drop('ocean_proximity', axis=1) # drop the non-numeric column first
imputer.fit(housing_num) # computes the medians and stores them in statistics_
imputer.statistics_ # the median of each numeric column


array([ -118.51, 34.26, 29.,2119.5, 433.,1164., 408., 3.5409])

housing_num.median().values # the same per-column medians, computed directly


array([ -118.51, 34.26, 29.,2119.5, 433.,1164., 408., 3.5409])

X = imputer.transform(housing_num) # apply the fitted imputer to the numeric columns; returns an ndarray
print(type(X)) # the data type has changed
housing_tr = pd.DataFrame(X, columns=housing_num.columns) # back to a DataFrame
housing_tr.info()


<class ‘numpy.ndarray’>
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 16512 entries, 0 to 16511
Data columns (total 8 columns):
longitude 16512 non-null float64
latitude 16512 non-null float64
housing_median_age 16512 non-null float64
total_rooms 16512 non-null float64
total_bedrooms 16512 non-null float64
population 16512 non-null float64
households 16512 non-null float64
median_income 16512 non-null float64
dtypes: float64(8)
memory usage: 1.0 MB
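Note that in recent scikit-learn versions (0.22+) Imputer has been removed; sklearn.impute.SimpleImputer is the drop-in replacement. An equivalent sketch on a toy DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, 30.0]})
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(df)  # returns an ndarray, like the old Imputer
print(imputer.statistics_)          # the per-column medians
print(filled[1])                    # the missing value replaced by the median
```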

### Handling Text and Categorical Attributes

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder() # text label encoder
housing_cat = housing['ocean_proximity']
housing_cat_encoder = encoder.fit_transform(housing_cat) # fit, then transform
housing_cat_encoder # the text categories encoded as integers


array([0, 0, 4, …, 1, 0, 3])

encoder.classes_ # inspect the category-to-integer mapping


array([‘<1H OCEAN’, ‘INLAND’, ‘ISLAND’, ‘NEAR BAY’, ‘NEAR OCEAN’], dtype=object)

from sklearn.preprocessing import OneHotEncoder # one-hot encoding: one independent binary attribute per category

print('original shape: ', housing_cat_encoder.shape)
print('reshaped shape: ', housing_cat_encoder.reshape(-1,1).shape)
print('original array:', housing_cat_encoder)
print('after reshape:\n', housing_cat_encoder.reshape(-1,1))

encoder = OneHotEncoder() # one-hot encoder; note that fit_transform expects a 2D array
housing_cat_1hot = encoder.fit_transform(housing_cat_encoder.reshape(-1,1)) # returns a sparse matrix
# the -1 in reshape lets numpy infer that dimension
housing_cat_1hot


original shape:  (16512,)
reshaped shape:  (16512, 1)
original array: [0 0 4 …, 1 0 3]
after reshape:
 [[0]
 [0]
 [4]
 …,
 [1]
 [0]
 [3]]
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

housing_cat_1hot.toarray() # sparse matrix to ndarray


array([[ 1., 0., 0., 0., 0.],
 [ 1., 0., 0., 0., 0.],
 [ 0., 0., 0., 0., 1.],
 …,
 [ 0., 1., 0., 0., 0.],
 [ 1., 0., 0., 0., 0.],
 [ 0., 0., 0., 1., 0.]])

# use sklearn's label binarization: text to one-hot in a single step
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer(sparse_output=False) # controls the output type, sparse matrix vs array (default=False)
housing_cat_1hot = encoder.fit_transform(housing_cat) # accepts 1D input, returns an array
housing_cat_1hot


array([[1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0],
 [0, 0, 0, 0, 1],
 …,
 [0, 1, 0, 0, 0],
 [1, 0, 0, 0, 0],
 [0, 0, 0, 1, 0]])

### Custom Transformers

Because scikit-learn relies on duck typing, it is easy to write your own transformers: polymorphism without inheritance. Create a class that implements three methods: fit(), transform(), and fit_transform(). You get fit_transform() for free by inheriting from TransformerMixin, and inheriting from BaseEstimator additionally provides get_params() and set_params(), which are useful for automated hyperparameter tuning. Below, the attribute combinations from earlier are reimplemented as a custom transformer.

tmp = housing.values
# 1-D array slicing: [start:end:step]
# higher-dimensional slicing: [:,:,:], [:, ...]

# numpy.c_[] concatenates its slice arguments along the second (column) axis
print(np.c_[np.array([1, 2, 3]), np.array([4, 5, 6])])
print(np.c_[np.array([[1, 2], [3, 4], [4, 5]]), np.array([[6, 7], [8, 9], [10, 11]])])
print(np.c_[np.array([[1, 2, 3]]), np.array([[0]]), np.array([[0]]), np.array([[4, 5, 6]])])
print(np.c_[np.array([[1, 2, 3]]), 0, 0, np.array([[4, 5, 6]])])


[[1 4]
 [2 5]
 [3 6]]
[[ 1 2 6 7]
 [ 3 4 8 9]
 [ 4 5 10 11]]
[[1 2 3 0 0 4 5 6]]
[[1 2 3 0 0 4 5 6]]

from sklearn.base import BaseEstimator, TransformerMixin

room_ix, bedroom_ix, population_ix, household_ix = 3, 4, 5, 6 # indices of the four features we want to combine

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    # a custom transformer via duck typing
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        rooms_per_household = X[:, room_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:  # optionally add bedrooms_per_room as well
            bedrooms_per_room = X[:, bedroom_ix] / X[:, room_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False) # this flag is a hyperparameter for testing whether the combined attribute helps
housing_extra_attribs = attr_adder.transform(housing.values)
print(housing.values)
print("")
print(housing_extra_attribs)
housing_extra_attribs = pd.DataFrame(housing_extra_attribs,
                                     columns=list(housing.columns) + ["rooms_per_household", "population_per_household"])
housing_extra_attribs.head()


[[-121.89 37.29 38.0 …, 339.0 2.7042 '<1H OCEAN']
 [-121.93 37.05 14.0 …, 113.0 6.4214 '<1H OCEAN']
 [-117.2 32.77 31.0 …, 462.0 2.8621 'NEAR OCEAN']
 …,
 [-116.4 34.09 9.0 …, 765.0 3.2723 'INLAND']
 [-118.01 33.82 31.0 …, 356.0 4.0625 '<1H OCEAN']
 [-122.45 37.77 52.0 …, 639.0 3.575 'NEAR BAY']]

[[-121.89 37.29 38.0 …, '<1H OCEAN' 4.625368731563422 2.094395280235988]
 [-121.93 37.05 14.0 …, '<1H OCEAN' 6.008849557522124 2.7079646017699117]
 [-117.2 32.77 31.0 …, 'NEAR OCEAN' 4.225108225108225 2.0259740259740258]
 …,
 [-116.4 34.09 9.0 …, 'INLAND' 6.34640522875817 2.742483660130719]
 [-118.01 33.82 31.0 …, '<1H OCEAN' 5.50561797752809 3.808988764044944]
 [-122.45 37.77 52.0 …, 'NEAR BAY' 4.843505477308295 1.9859154929577465]]

### Feature Scaling

Two common ways to scale (normalize) features:

1. Min-Max scaling (normalization): $$z=\frac{x_i-\min}{\max-\min},\quad z\in [0,1]$$ sklearn's MinMaxScaler provides a feature_range hyperparameter if you want a range other than 0–1.
2. Standardization: $$z=\frac{x_i-\mu}{\sigma}$$ ($\mu$: mean, $\sigma$: standard deviation)

sklearn provides StandardScaler for standardization.

### Transformation Pipelines

The Pipeline constructor takes a list of name/estimator pairs. All but the last estimator must be transformers (i.e., implement fit_transform()); the last one only needs a fit() method.
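A minimal runnable sketch of this chaining (toy data; assumes a modern sklearn with SimpleImputer in place of the notebook's Imputer): each step's fit_transform() output feeds the next step.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # intermediate step: a transformer
    ("scaler", StandardScaler()),                   # last step: fit()/transform() is enough
])
X = np.array([[1.0], [np.nan], [3.0]])
out = pipe.fit_transform(X)  # impute -> standardize, in one call
print(out.ravel())
```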

list(housing_num)
print(list({"one":"1", "two":"2"}))
# list() on a dict returns the collection of its keys


[‘one’, ‘two’]

class DataFrameSelector(BaseEstimator, TransformerMixin):  # selects columns and returns a numeric ndarray
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):  # nothing to learn; fit_transform() then effectively just calls transform()
        return self
    def transform(self, X):
        return X[self.attribute_names].values

from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer

class LabelBinarizerPipelineFriendly(LabelBinarizer):
    # LabelBinarizer's fit/fit_transform take only (X), which clashes with
    # Pipeline's (X, y) calling convention; this thin wrapper fixes that.
    def fit(self, X, y=None):
        return super().fit(X)
    def transform(self, X, y=None):
        return super().transform(X)
    def fit_transform(self, X, y=None):
        return super().fit(X).transform(X)

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

num_pipeline = Pipeline([  # pipeline for the numeric columns
    ('selector', DataFrameSelector(num_attribs)),  # select the numeric DataFrame columns, return an ndarray
    ('imputer', Imputer(strategy='median')),       # fill missing values with the medians
    ('std_scaler', StandardScaler()),              # standardize the ndarray
])

cat_pipeline = Pipeline([  # pipeline for the text column
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizerPipelineFriendly()),  # plain LabelBinarizer clashes with Pipeline's argument passing
])

full_pipeline = FeatureUnion(transformer_list=[  # run both pipelines and concatenate their outputs
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared


array([[-1.15604281, 0.77194962, 0.74333089, …, 0. ,
0. , 0. ],
[-1.17602483, 0.6596948 , -1.1653172 , …, 0. ,
0. , 0. ],
[ 1.18684903, -1.34218285, 0.18664186, …, 0. ,
0. , 1. ],
…,
[ 1.58648943, -0.72478134, -1.56295222, …, 0. ,
0. , 0. ],
[ 0.78221312, -0.85106801, 0.18664186, …, 0. ,
0. , 0. ],
[-1.43579109, 0.99645926, 1.85670895, …, 0. ,
1. , 0. ]])

### Select and Train a Model

#### Training and evaluating on the training set

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

some_data = housing.iloc[:5] # take the first five instances for a quick check
some_label = housing_labels[:5]
some_data


some_label


17606 286600.0
18632 340600.0
14650 196900.0
3230 46300.0
3555 254500.0
Name: median_house_value, dtype: float64

some_data_prepared = full_pipeline.transform(some_data)
print("predictions:\t", list(lin_reg.predict(some_data_prepared)))
print("labels:\t\t", list(some_label))


predictions: [210644.60459285544, 317768.80697210797, 210956.43331178252, 59218.98886849088, 189747.55849878537]
labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

#### Evaluating the linear model with RMSE

from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)  # true labels, predictions
lin_rmse = np.sqrt(lin_mse)
lin_rmse


68628.198198489234

#### The error is far too large: underfitting

Possible causes: the features don't carry enough information, or the model is too weak. Here we first try a more powerful model.

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)


DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
    max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0,
    presort=False, random_state=None, splitter='best')

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse


0.0

#### ?! The model has almost certainly overfit

While the model is not yet fully tuned, never evaluate on the test set; instead, validate on a split carved out of the training set.

#### Initial model evaluation with cross-validation

Using $K$-fold cross-validation, here with the default of 10 folds.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring='neg_mean_squared_error', cv=10)
rmse_scores = np.sqrt(-scores) # sklearn returns a utility (higher is better) rather than a cost (lower is better), hence the minus sign

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard:", scores.std())  # the standard deviation indicates how precise the estimate is

display_scores(rmse_scores) # the decision tree overfits badly


Scores: [ 70463.07466897 67558.00185896 71538.29332368 68422.0970912
72103.20690142 75299.00463171 71091.17969343 71141.84507942
75366.45554865 69272.26090227]
Mean: 71225.54197
Standard: 2456.4252039

# 10-fold cross-validation for the linear model
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring='neg_mean_squared_error', cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores) # also poor, which confirms the decision tree was overfitting


Scores: [ 66782.73843989 66960.118071 70347.95244419 74739.57052552
68031.13388938 71193.84183426 64969.63056405 68281.61137997
71552.91566558 67665.10082067]
Mean: 69052.4613635
Standard: 2731.6740018

#### Trying a random forest

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels) # training set, labels
forest_predictions = forest_reg.predict(housing_prepared)

forest_mse = mean_squared_error(housing_labels, forest_predictions) # labels, predictions
forest_rmse = np.sqrt(forest_mse)
forest_rmse


22234.779816704962

# 10-fold cross-validation for the random forest
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)


#### Serializing the model
# use Python's built-in pickle or sklearn's joblib
from sklearn.externals import joblib

# joblib.dump(lin_reg, 'lin_reg_model.pkl')


### Fine-Tune Model

#### Automated hyperparameter search with sklearn's GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = [
{'n_estimators': [3, 10, 30], 'max_features':[2, 4, 6, 8]},
{'bootstrap':[False], 'n_estimators':[3, 10], 'max_features':[2, 3, 4]},
]
forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, scoring='neg_mean_squared_error', cv=5)

grid_search.fit(housing_prepared, housing_labels)

grid_search.best_params_


{‘max_features’: 8, ‘n_estimators’: 30}
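It is worth counting how much work the grid implies; sklearn's ParameterGrid enumerates the same candidates GridSearchCV will try (self-contained sketch):

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]
combos = list(ParameterGrid(param_grid))
print(len(combos))  # 3*4 + 1*2*3 = 18 candidates; with cv=5 that is 90 training runs
```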

grid_search.best_estimator_


RandomForestRegressor(bootstrap=True, criterion=’mse’, max_depth=None,
max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=30, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
print(np.sqrt(-mean_score), params)


64664.8910255 {‘max_features’: 2, ‘n_estimators’: 3}
55594.9053281 {‘max_features’: 2, ‘n_estimators’: 10}
53321.8104358 {‘max_features’: 2, ‘n_estimators’: 30}
60714.3850386 {‘max_features’: 4, ‘n_estimators’: 3}
52964.4514158 {‘max_features’: 4, ‘n_estimators’: 10}
50342.166786 {‘max_features’: 4, ‘n_estimators’: 30}
59055.4435408 {‘max_features’: 6, ‘n_estimators’: 3}
52197.7377391 {‘max_features’: 6, ‘n_estimators’: 10}
50057.0471926 {‘max_features’: 6, ‘n_estimators’: 30}
58806.8780071 {‘max_features’: 8, ‘n_estimators’: 3}
51818.2362215 {‘max_features’: 8, ‘n_estimators’: 10}
49810.215544 {‘max_features’: 8, ‘n_estimators’: 30}
62377.358573 {‘bootstrap’: False, ‘max_features’: 2, ‘n_estimators’: 3}
54367.9428372 {‘bootstrap’: False, ‘max_features’: 2, ‘n_estimators’: 10}
60106.5084041 {‘bootstrap’: False, ‘max_features’: 3, ‘n_estimators’: 3}
52682.5567174 {‘bootstrap’: False, ‘max_features’: 3, ‘n_estimators’: 10}
59067.1945483 {‘bootstrap’: False, ‘max_features’: 4, ‘n_estimators’: 3}
51862.7182375 {‘bootstrap’: False, ‘max_features’: 4, ‘n_estimators’: 10}

#### Randomized Search

# when there are many hyperparameter combinations, randomized search scales better
from sklearn.model_selection import RandomizedSearchCV
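A hedged sketch of how it could be wired up (synthetic regression data stands in for the housing set; the parameter ranges are illustrative): instead of exhausting a grid, a fixed number of candidates is sampled from distributions.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=8, random_state=42)
param_distribs = {
    "n_estimators": randint(3, 30),   # sampled instead of enumerated
    "max_features": randint(2, 8),
}
rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42), param_distribs,
    n_iter=5, cv=3, scoring="neg_mean_squared_error", random_state=42)
rnd_search.fit(X, y)
print(rnd_search.best_params_)
```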


#### Ensemble methods that combine models are covered later


### Analyze the Best Model and Its Errors

feature_importances = grid_search.best_estimator_.feature_importances_ # per-feature importances
feature_importances

extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedroom_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)


### Evaluate the System on the Test Set

final_model = grid_search.best_estimator_

X_test = stra_test_set.drop(["median_house_value"], axis=1)
y_test = stra_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test) # at test time call transform(), never fit_transform()
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse


### Launch, Monitor, and Maintain the System

#### A complete pipeline covering data preparation and prediction

full_pipeline_with_predictor = Pipeline([
    ("preparation", full_pipeline),
    ("final_model", grid_search.best_estimator_)
])
full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(housing)


### Exercises
