Machine Learning Case Notes: Handwritten Digit Recognition with SVM

mac2024-05-20 38

1. Imports

import os
import time
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from sklearn import svm
from sklearn.model_selection import train_test_split, GridSearchCV

%matplotlib inline
warnings.filterwarnings('ignore')  # suppress warnings

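%matplotlib inline is only needed in Jupyter. It is an IPython magic command that makes figures produced by matplotlib.pyplot (for example by plot(), or whenever a figure canvas is created) render directly in the notebook output. It is not valid syntax in a plain Python script.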

2. Data processing

2.1 Reading the file

# Read the CSV data file
data = pd.read_csv('E:/Digit_Recognizer/train.csv')
print("Train Data Shape is: ", data.shape)

Train Data Shape is: (42000, 785)

label = data.label                 # take the 'label' column of data and call it label
data = data.drop('label', axis=1)  # drop(labels, axis=0 for rows / 1 for columns); inplace=True would modify data itself
print("Data Shape: ", data.shape)
print("Label Shape: ", label.shape)

Data Shape: (42000, 784)
Label Shape: (42000,)

data.columns  # the .columns and .index attributes return the column index and the row index of the dataset

Index(['pixel0', 'pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5', 'pixel6', 'pixel7', 'pixel8', 'pixel9', ... 'pixel774', 'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779', 'pixel780', 'pixel781', 'pixel782', 'pixel783'], dtype='object', length=784)

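2.2 Use reshape to turn each one-dimensional row into a two-dimensional 28x28 array so the grayscale image can be printed and viewed. (I did not fully understand this part.)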

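for digit in range(0, 4):
    train_0 = data[label == digit]   # boolean mask: the subset of rows whose label equals digit
    data_new = []
    for idx in train_0.index:
        # .loc[idx] selects the row whose index label is idx (just as a.loc['one'] selects the row labelled 'one');
        # reshape that row of 784 values to (28, 28)
        val = train_0.loc[idx].values.reshape(28, 28)
        data_new.append(val)         # collect the reshaped image
    plt.figure(figsize=(25, 25))
    for k in range(1, 5):            # 1 to 4, because subplot positions are numbered from 1
        ax1 = plt.subplot(1, 20, k)
        ax1.imshow(data_new[k], cmap='gray')  # shows images 1 to 4 of this digit; data_new[0] is skipped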
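The reshape step can be checked without the dataset. Below is a minimal pure-Python sketch, assuming a row of 784 pixel values stored row by row. The helper name to_grid is hypothetical; NumPy's reshape(28, 28) does the same thing in one call.

```python
def to_grid(flat, rows=28, cols=28):
    """Split a flat, row-major list of pixels into a list of rows."""
    if len(flat) != rows * cols:
        raise ValueError("expected %d values, got %d" % (rows * cols, len(flat)))
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# A fake "image": pixel k simply holds the value k.
flat = list(range(784))
grid = to_grid(flat)

print(len(grid), len(grid[0]))   # number of rows, number of columns
print(grid[1][0])                # first pixel of the second row, i.e. flat[28]
```

Each group of 28 consecutive values becomes one image row, which is why a 784-column CSV row can be drawn as a 28x28 picture.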

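plt.figure() creates a new figure; figsize sets its width and height in inches.
plt.subplot(rows, columns, index) creates a subplot; here it returns the cell at the given index (1, 2, 3, 4) of a grid with 1 row and 20 columns.
imshow() displays a 2-D array as an image; the element of data_new passed to it is the stored 28x28 image.
cmap='gray' renders the image in grayscale.

2.3 Split the dataset into an 80% training set and a 20% test set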

train, test, train_labels, test_labels = train_test_split(data, label, train_size=0.8, random_state=42)  # random_state acts as the random seed, like random.seed()
print("Train Data Shape: ", train.shape)
print("Train Label Shape: ", train_labels.shape)
print("Test Data Shape: ", test.shape)
print("Test Label Shape: ", test_labels.shape)

Train Data Shape: (33600, 784)
Train Label Shape: (33600,)
Test Data Shape: (8400, 784)
Test Label Shape: (8400,)
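To see why a fixed random_state gives the same split every run, here is a rough standard-library sketch of a seeded 80/20 split. The helper split_indices is hypothetical and does not reproduce the exact permutation scikit-learn produces; it only illustrates the idea of shuffling indices with a seed and cutting them.

```python
import random

def split_indices(n_samples, train_size=0.8, seed=42):
    """Shuffle row indices with a fixed seed and cut them into train/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # same seed -> same shuffle every run
    n_train = int(n_samples * train_size)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(42000)
print(len(train_idx), len(test_idx))

# Calling it again with the same seed returns the identical split.
again_train, again_test = split_indices(42000)
print(train_idx == again_train)
```

The two parts never overlap and together cover every row, which is what keeps the test set unseen during training.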

3.SVM

i = 5000          # number of training samples used in each case
score = []
fittime = []
scoretime = []
clf = svm.SVC(random_state=42)
print("Default Parameters are: \n", clf.get_params())  # get the default parameters of svm.SVC

A detailed explanation of the parameters of svm.SVC(): [https://blog.csdn.net/weixin_41990278/article/details/93137009]

Case 1 - Grayscale Images

start_time = time.time()
clf.fit(train[:i], train_labels[:i].values.ravel())
fittime = time.time() - start_time   # elapsed training time in seconds
print("Time consumed to fit model: ", time.strftime("%H:%M:%S", time.gmtime(fittime)))  # %H:%M:%S = hours:minutes:seconds
start_time = time.time()
score = clf.score(test, test_labels)
print("Accuracy for grayscale: ", score)
scoretime = time.time() - start_time
print("Time consumed to score: ", time.strftime("%H:%M:%S", time.gmtime(scoretime)))
case1 = [score, fittime, scoretime]

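time.time(): returns the current time as a timestamp (seconds since the epoch).
clf.fit(x, y): x is the input data (here the first 5000 rows) and y is the labels. See also: the model.fit() function.
.ravel(): flattens a multi-dimensional array to one dimension, returning a view rather than a copy whenever possible. See also: usage and differences of ravel(), flatten(), and squeeze().
time.strftime(): takes a format string and a time tuple and returns the time as a readable string. See also: the time strftime() method.
time.gmtime(): converts seconds into a struct_time in UTC (time zone 0). Applied to a wall-clock timestamp, the result differs from the local time shown on the computer by the UTC offset (+8 hours in UTC+8). Here it is applied to an elapsed duration, so UTC is what we want: the duration is split into hours, minutes, and seconds with no time-zone shift.
clf.score(): score(self, X, y, sample_weight=None) is the estimator's default evaluation method. For a classifier it returns the mean accuracy of the trained model on the given test set, between 0 and 1, where 1 is best.
Why is the accuracy so low? It really is that low. A likely cause, assuming an older scikit-learn in which SVC defaults to gamma='auto' (1/n_features): with unscaled pixel values from 0 to 255, squared distances between images are huge, so the RBF kernel is close to 0 for almost every pair of images and the model cannot separate the classes.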
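The timing and scoring helpers described above can be tried in isolation. This is a small sketch using only the standard library; format_elapsed and accuracy are hypothetical helper names, and accuracy only mirrors what score() reports for a classifier (the fraction of correct predictions).

```python
import time

def format_elapsed(seconds):
    """Format a duration in seconds as HH:MM:SS (wraps around after 24 hours)."""
    return time.strftime("%H:%M:%S", time.gmtime(seconds))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

print(format_elapsed(3725.4))                 # 1 hour, 2 minutes, 5 seconds
print(accuracy([3, 1, 4, 1], [3, 1, 4, 7]))   # 3 of 4 predictions are correct
```

Because gmtime() works in UTC, a duration of 0 seconds formats as 00:00:00 regardless of the machine's time zone; time.localtime() would add the local offset and give a wrong duration.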

Case 2 - Binary Images

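Put simply, replace every value greater than 0 with 1 to convert the images from grayscale to black and white. Then use reshape to turn the 1-D arrays into 2-D 28x28 arrays so the binary images can be plotted and viewed.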
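The thresholding rule is simple enough to show on a single row. A minimal sketch, assuming one row of pixel values in a plain list; the helper binarize is hypothetical and returns a new list instead of modifying its input.

```python
def binarize(pixels, threshold=0):
    """Return a new list with 1 where the pixel is above the threshold, else 0."""
    return [1 if p > threshold else 0 for p in pixels]

row = [0, 12, 255, 0, 87]
binary_row = binarize(row)

print(binary_row)
print(row)          # the input row is left unchanged
```

After this step every feature is either 0 or 1, so all pixels are on the same scale and faint strokes count as much as bright ones.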

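test_b = test.copy()     # make real copies; plain assignment (test_b = test) would only alias the same data,
train_b = train.copy()   # and binarizing it would also overwrite the grayscale train/test used in Case 3
test_b[test_b > 0] = 1   # set every value greater than 0 to 1
train_b[train_b > 0] = 1
for digit in range(0, 4):                 # same processing as in the grayscale case
    train_0 = train_b[train_labels == digit]
    data_new = []
    for idx in train_0.index:
        val = train_0.loc[idx].values.reshape(28, 28)
        data_new.append(val)
    plt.figure(figsize=(25, 25))
    for k in range(1, 5):
        ax1 = plt.subplot(1, 20, k)       # ax1 is the k-th subplot in a grid of 1 row and 20 columns
        ax1.imshow(data_new[k], cmap='binary')   # switch the colormap to binary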
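One thing to keep in mind in this step: in Python, plain assignment such as b = a binds a second name to the same object rather than copying it, and pandas DataFrames behave the same way (DataFrame.copy() makes an independent copy). A minimal sketch with nested lists standing in for the data:

```python
import copy

original = [[0, 12], [255, 0]]

alias = original                        # a second name for the same object
independent = copy.deepcopy(original)   # a separate object with the same contents

independent[0][1] = 1                   # does not touch original
print(original[0][1])

alias[0][1] = 1                         # changes original as well
print(original[0][1])

print(alias is original, independent is original)
```

If the binarized data is only an alias of the grayscale data, any later step that is meant to run on grayscale pixels silently receives binary ones.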

The model training code is the same as above.

start_time = time.time()
clf.fit(train_b[:i], train_labels[:i].values.ravel())
fittime = time.time() - start_time
print("Time consumed to fit model: ", time.strftime("%H:%M:%S", time.gmtime(fittime)))
start_time = time.time()
score = clf.score(test_b, test_labels)
print("Accuracy for binary: ", score)
scoretime = time.time() - start_time
print("Time consumed to score: ", time.strftime("%H:%M:%S", time.gmtime(scoretime)))
case2 = [score, fittime, scoretime]

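Compared with the earlier grayscale case, training is much faster, and the accuracy of Case 2 (91%) is far higher than that of Case 1 (9.3%). However, the high dimensionality of the data still makes computation slow, so the next step uses PCA (principal component analysis) to reduce the number of dimensions.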

Case 3 - Grayscale + Dimensionality Reduction (PCA, principal component analysis)

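Standardize the data before running PCA. The steps are: import the packages >> standardize the data >> fit PCA >> transform the data >> compute the explained-variance ratios and their cumulative sum >> choose the number of principal components >> refit with that number and reduce the data.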

from sklearn.preprocessing import StandardScaler      # sklearn.preprocessing: data preprocessing
from sklearn.decomposition import PCA as sklearnPCA   # PCA for dimensionality reduction

# Standardize the data first
sc = StandardScaler().fit(train)     # sc stores the mean and variance computed from train
X_std_train = sc.transform(train)    # use that mean and variance to standardize train
X_std_test = sc.transform(test)      # apply the same training statistics to test

# If n_components is not set, all components are kept
sklearn_pca = sklearnPCA().fit(X_std_train)      # sklearnPCA is the import alias for PCA; fit it on the standardized training set
train_pca = sklearn_pca.transform(X_std_train)   # project the training data onto the components
test_pca = sklearn_pca.transform(X_std_test)

# Fraction of variance explained by each component.
# With n_components unset, all components are kept and the ratios sum to 1.0
var_per = sklearn_pca.explained_variance_ratio_               # explained-variance ratio of each kept component
cum_var_per = sklearn_pca.explained_variance_ratio_.cumsum()  # running total of those ratios

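StandardScaler().fit(): computes the mean and variance of the training data; transform() then uses that mean and variance to standardize data. See also: the differences between fit, fit_transform, and transform and how to use them.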
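The fit-then-transform split can be sketched for a single feature column with the standard library. The helpers fit_scaler and scale are hypothetical names; the sketch assumes the population standard deviation, which is what StandardScaler uses, and replaces a zero standard deviation by 1.0 to avoid dividing by zero.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0          # constant feature: leave it centred but unscaled
    return mean, std

def scale(column, mean, std):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mean) / std for v in column]

train_col = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test_col = [5.0, 9.0]

mean, std = fit_scaler(train_col)        # statistics come from the training data only
print(mean, std)
print(scale(train_col, mean, std))
print(scale(test_col, mean, std))        # the test data reuses the training statistics
```

Fitting on the training data and only transforming the test data keeps information about the test set out of the preprocessing step.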

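PCA methods
.fit(X, y=None): fit(X) trains the PCA model on the data X. fit() is a common method across scikit-learn; every algorithm that needs training has one, and it is the "training" step of the algorithm. Because PCA is an unsupervised algorithm, y is simply None here.
.transform(X): converts X into its reduced-dimension representation. Once the model is fitted, any new input data can be reduced with transform.
See also: PCA overview, attributes, and other methods.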

# Keep about 90% of the information by keeping the components whose cumulative ratio stays within 0.90
n_comp = len(cum_var_per[cum_var_per <= 0.90])   # count the components whose cumulative explained variance is still <= 0.90
print("Keeping 90% Info with ", n_comp, " components")   # print the result
sklearn_pca = sklearnPCA(n_components=n_comp)
train_pca = sklearn_pca.fit_transform(X_std_train)   # fit PCA on X_std_train and return the reduced data
test_pca = sklearn_pca.transform(X_std_test)         # project X_std_test with the already fitted PCA (no refitting)
print("Shape before PCA for Train: ", X_std_train.shape)
print("Shape after PCA for Train: ", train_pca.shape)
print("Shape before PCA for Test: ", X_std_test.shape)
print("Shape after PCA for Test: ", test_pca.shape)

sklearnPCA is sklearn.decomposition.PCA, whose signature begins PCA(n_components=None, copy=True, whiten=False). n_components is the number of principal components to keep, that is, the number of features retained after reduction. (See also: other attributes.)
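How the number of components follows from the cumulative ratios can be checked on a toy example. The ratios below are made up for illustration, and both helper names are hypothetical. The first helper mirrors the `<= 0.90` count used in the code above; the second returns the smallest number of components that actually reaches the target.

```python
from itertools import accumulate

def components_within(ratios, target=0.90):
    """Count components whose cumulative ratio is still <= target."""
    return sum(1 for c in accumulate(ratios) if c <= target)

def components_to_reach(ratios, target=0.90):
    """Smallest number of components whose cumulative ratio is >= target."""
    for k, c in enumerate(accumulate(ratios), start=1):
        if c >= target:
            return k
    return len(ratios)

ratios = [0.5, 0.3, 0.15, 0.05]          # made-up explained-variance ratios

print(components_within(ratios))         # stops just below the target
print(components_to_reach(ratios))       # first count that meets the target
```

With these numbers the first rule keeps slightly less than 90% of the variance, so the two counts differ by one component. scikit-learn's PCA also accepts a float between 0 and 1 as n_components and then chooses the number of components needed to explain that fraction of the variance.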

# Use the same number of samples, now dimension-reduced, to compute the score and compare accuracy
start_time = time.time()
clf.fit(train_pca[:i], train_labels[:i].values.ravel())
fittime = time.time() - start_time
print("Time consumed to fit model: ", time.strftime("%H:%M:%S", time.gmtime(fittime)))
start_time = time.time()
score = clf.score(test_pca, test_labels)
print("Accuracy for grayscale: ", score)
scoretime = time.time() - start_time
print("Time consumed to score model: ", time.strftime("%H:%M:%S", time.gmtime(scoretime)))
case3 = [score, fittime, scoretime]

The code is the same as above. It runs quickly and reaches an accuracy of 91.87%.

Case 4 - Binary Images + Dimensionality Reduction (PCA)

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA as sklearnPCA

# Standardize the binary data with statistics from the training set
sc = StandardScaler().fit(train_b)
X_std_train = sc.transform(train_b)
X_std_test = sc.transform(test_b)

# Fit PCA with all components to get the explained-variance ratios
sklearn_pca = sklearnPCA().fit(X_std_train)
#train_pca_b = sklearn_pca.transform(X_std_train)
#test_pca_b = sklearn_pca.transform(X_std_test)
var_per = sklearn_pca.explained_variance_ratio_
cum_var_per = sklearn_pca.explained_variance_ratio_.cumsum()

# Refit with the number of components whose cumulative ratio stays within 0.90
n_comp = len(cum_var_per[cum_var_per <= 0.90])
print("Keeping 90% Info with ", n_comp, " components")
sklearn_pca = sklearnPCA(n_components=n_comp)
train_pca_b = sklearn_pca.fit_transform(X_std_train)
test_pca_b = sklearn_pca.transform(X_std_test)
print("Shape before PCA for Train: ", X_std_train.shape)
print("Shape after PCA for Train: ", train_pca_b.shape)
print("Shape before PCA for Test: ", X_std_test.shape)
print("Shape after PCA for Test: ", test_pca_b.shape)

# Train and score on the reduced binary data
start_time = time.time()
clf.fit(train_pca_b[:i], train_labels[:i].values.ravel())
fittime = time.time() - start_time
print("Time consumed to fit model: ", time.strftime("%H:%M:%S", time.gmtime(fittime)))
start_time = time.time()
score = clf.score(test_pca_b, test_labels)
print("Accuracy for binary + PCA: ", score)
scoretime = time.time() - start_time
print("Time consumed to score model: ", time.strftime("%H:%M:%S", time.gmtime(scoretime)))
case4 = [score, fittime, scoretime]

The code is the same as above, with the input datasets changed to train_b and test_b and the reduced outputs named train_pca_b and test_pca_b.

Comparing the four cases

Print the results of all four cases side by side for comparison.

head = ["Accuracy", "FittingTime", "ScoringTime"]   # row labels of the table
print("\t\t case1 \t\t\t case2 \t\t\t case3 \t\t\t case4")   # \t is a tab character, used to line the columns up without a real table
for h, c1, c2, c3, c4 in zip(head, case1, case2, case3, case4):
    print("{}\t{}\t{}\t{}\t{}".format(h, c1, c2, c3, c4))

Conclusions:

Simplifying the problem in Case 2 (by converting the images to binary) raised accuracy from 9% to 91% for the chosen number of samples.

Reducing the dimensionality in Cases 3 and 4 greatly shortened training time.

4. Training set size versus accuracy, fitting time, and scoring time

See how the size of the training set affects accuracy.

from tqdm import tqdm   # tqdm is a fast, extensible progress bar for Python

fit_time = []
score = []
score_time = []
for j in tqdm(range(1000, 31000, 5000)):   # 1000, 6000, ..., 26000 in steps of 5000 (the stop value 31000 is excluded)
    start_time = time.time()
    clf.fit(train_pca_b[:j], train_labels[:j].values.ravel())
    fit_time.append(time.time() - start_time)
    start_time = time.time()
    score.append(clf.score(test_pca_b, test_labels))   # score the PCA-reduced test set and append the result to score
    score_time.append(time.time() - start_time)
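The sample sizes this loop visits, and how the slices behave, can be checked directly with built-in Python. The list rows is a made-up stand-in for the training data.

```python
sizes = list(range(1000, 31000, 5000))
print(sizes)          # the stop value 31000 itself is never produced
print(len(sizes))     # number of models trained by the loop

rows = list(range(10))    # stand-in for a dataset with 10 rows
print(rows[:3])           # the first 3 rows
print(rows[:50])          # a slice past the end just returns everything
```

Since range() excludes its stop value, the largest training size actually used is 26000, not 31000.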

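tqdm: a fast, extensible progress bar for Python. I do not fully understand its usage yet; in this loop it simply wraps the range and shows progress as the loop runs. See also: an introduction to tqdm and its common methods.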

x = list(range(1000, 31000, 5000))   # x = [1000, 6000, 11000, 16000, 21000, 26000]
plt.figure(figsize=[20, 5])          # a figure 20 inches wide and 5 inches tall
ax1 = plt.subplot(1, 2, 1)           # ax1 is the first subplot in a grid of 1 row and 2 columns
ax1.plot(x, score, '-o')             # solid line with filled circle markers
plt.xlabel('Number of Training Samples')   # x-axis label
plt.ylabel('Accuracy')                     # y-axis label
ax2 = plt.subplot(1, 2, 2)
ax2.plot(x, score_time, '-o')
ax2.plot(x, fit_time, '-o')
plt.xlabel('Number of Training Samples')
plt.ylabel('Time to Compute Score/Fit (sec)')
plt.legend(['score_time', 'fitting_time'])   # plt.legend adds a legend to the plot

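.plot(x, y, format_string): x-axis data, y-axis data, and a format string that controls the look of the curve (built from a colour character, a line-style character, and a marker character). See also: details of the plt.plot() function.
Plotting functions used here:
plt.figure(): creates a figure
plt.subplot(): creates a subplot
plt.xlabel(): sets the x-axis label
plt.ylabel(): sets the y-axis label
plt.legend(): adds a legend to the plot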

5. Choosing SVM parameters with GridSearchCV

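Of the parameters below, we tune gamma and C. gamma is the parameter of the Gaussian (RBF) kernel, which handles non-linear classification; C is the parameter of the soft-margin cost function, also known as the misclassification cost. A large C gives low bias and high variance, and a small C gives the reverse.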
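To get a feel for what gamma does, here is a small sketch of the RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) in plain Python. The vectors are made up, and gamma = 1/784 is an assumption chosen to match the older gamma='auto' default of 1/n_features; rbf_kernel is a hypothetical helper, not scikit-learn's function.

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: 1.0 for identical points, towards 0 for distant ones."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

n_features = 784
gamma = 1.0 / n_features            # assumption: the old gamma='auto' value

blank = [0] * n_features
gray = [255] * 100 + [0] * 684      # 100 bright pixels on the 0-255 scale
binary = [1] * 100 + [0] * 684      # the same image after binarization

print(rbf_kernel(blank, blank, gamma))    # identical inputs
print(rbf_kernel(blank, gray, gamma))     # raw grayscale: similarity collapses to 0
print(rbf_kernel(blank, binary, gamma))   # binary: a usable similarity value
```

A larger gamma makes the similarity fall off faster with distance, so each training point influences only its close neighbours. For a fixed gamma, the scale of the features matters just as much, which is one plausible reason the raw grayscale case and the binary case behave so differently.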

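To find the parameter combination that maximizes accuracy, use GridSearchCV from the scikit-learn library. GridSearchCV performs an exhaustive search over the specified parameter values for an estimator. Store the values to pass to GridSearchCV in parameters, set the number of cross-validation folds to 3, and use the support vector machine as the estimator.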

parameters = {'gamma': [1, 0.1, 0.01, 0.001],   # a dictionary of candidate values
              'C': [1000, 100, 10, 1]}
p = GridSearchCV(clf, param_grid=parameters, cv=3)

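GridSearchCV()
estimator (clf): the classifier to tune (here the SVM).
param_grid: the candidate values of the parameters to optimize, given as a dict or a list of dicts.
cv: the cross-validation setting. An integer sets the number of folds; the default None means 5-fold cross-validation in scikit-learn 0.22 and later (3-fold in earlier versions). It can also be an iterable that yields train/test splits.
See also: parameters, methods, and examples of GridSearchCV (grid search).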
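What "exhaustive search" means for this grid can be sketched with itertools.product. This only enumerates the candidates and counts the fits; it does not train anything, and the order shown here is not necessarily the order scikit-learn uses internally.

```python
from itertools import product

parameters = {'gamma': [1, 0.1, 0.01, 0.001],
              'C': [1000, 100, 10, 1]}
cv = 3

names = list(parameters)
grid = [dict(zip(names, values))
        for values in product(*(parameters[n] for n in names))]

print(len(grid))         # number of candidate combinations
print(len(grid) * cv)    # model fits during the search, before the final refit
print(grid[0])           # one candidate
```

Every extra value in either list multiplies the number of fits, which is why the search above is run on 5000 samples rather than on the full training set.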

# Tune the parameters on top of Case 4
X = train_pca_b[:i]                    # the first 5000 rows of the PCA-reduced training set
y = train_labels[:i].values.ravel()
start_time = time.time()
p.fit(X, y)
elapsed_time = time.time() - start_time
print("Time consumed to fit model: ", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
print("Scores for all Parameter Combination: \n", p.cv_results_['mean_test_score'])   # 'mean_test_score' is one of the keys of cv_results_
print("\nOptimal C and Gamma Combination: ", p.best_params_)
print("\nMaximum Accuracy achieved on LeftOut Data: ", p.best_score_)

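p.cv_results_: the results of the cross-validated search, as a dictionary with many entries. Note that cv_results_ belongs to the fitted GridSearchCV object p, not to clf.
.best_params_: the best parameter combination.
.best_score_: the best mean cross-validated score.
Other attributes are covered in the linked reference.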

# To verify, pass the best parameters to the classifier and check the score
C = p.best_params_['C']
gamma = p.best_params_['gamma']
clf = svm.SVC(C=C, gamma=gamma, random_state=42)
start_time = time.time()
clf.fit(train_pca_b[:i], train_labels[:i].values.ravel())
elapsed_time = time.time() - start_time
print("Time consumed to fit model: ", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
print("Accuracy for binary: ", clf.score(test_pca_b, test_labels))

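Result: for the chosen training samples, the accuracy for Case 2 (binary images) with the best parameters rose from 91% to 93.7%. Now use all the training examples: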

start_time = time.time()
clf.fit(train_pca_b, train_labels.values.ravel())   # fit on the full PCA-reduced training set
elapsed_time = time.time() - start_time
print("Time consumed to fit model: ", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
print("Accuracy for binary: ", clf.score(test_pca_b, test_labels))

The accuracy improved a lot.

Phew, finally finished organizing these notes ୧(๑•̀◡•́๑)૭
