For more machine learning notes, see: https://blog.csdn.net/weixin_45316122/article/details/109854595
Trick: pure demo. Where your heart is, that is where the results will be.
# -*- coding: utf-8 -*-
# Author : szy
# Create Date : 2019/10/15
# An in-depth but accessible look at model selection with sklearn in Python

notebook = """
I. Main capabilities:
1. Classification
2. Regression
3. Clustering
4. Dimensionality reduction
5. Model selection
6. Preprocessing

II. Main modules:
1. sklearn.base: Base classes and utility functions
2. sklearn.cluster: Clustering
3. sklearn.cluster.bicluster: Biclustering
4. sklearn.covariance: Covariance Estimators
5. sklearn.model_selection: Model Selection
6. sklearn.datasets: Datasets
7. sklearn.decomposition: Matrix Decomposition
8. sklearn.dummy: Dummy estimators
9. sklearn.ensemble: Ensemble Methods
10. sklearn.exceptions: Exceptions and warnings
11. sklearn.feature_extraction: Feature Extraction
12. sklearn.feature_selection: Feature Selection
13. sklearn.gaussian_process: Gaussian Processes
14. sklearn.isotonic: Isotonic regression
15. sklearn.kernel_approximation: Kernel Approximation
16. sklearn.kernel_ridge: Kernel Ridge Regression
17. sklearn.discriminant_analysis: Discriminant Analysis
18. sklearn.linear_model: Generalized Linear Models
19. sklearn.manifold: Manifold Learning
20. sklearn.metrics: Metrics
21. sklearn.mixture: Gaussian Mixture Models
22. sklearn.multiclass: Multiclass and multilabel classification
23. sklearn.multioutput: Multioutput regression and classification
24. sklearn.naive_bayes: Naive Bayes
25. sklearn.neighbors: Nearest Neighbors
26. sklearn.neural_network: Neural network models
27. sklearn.calibration: Probability Calibration
28. sklearn.cross_decomposition: Cross decomposition
29. sklearn.pipeline: Pipeline
30. sklearn.preprocessing: Preprocessing and Normalization
31. sklearn.random_projection: Random projection
32. sklearn.semi_supervised: Semi-Supervised Learning
33. sklearn.svm: Support Vector Machines
34. sklearn.tree: Decision Trees
35. sklearn.utils: Utilities
"""
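The module overview reflects the scikit-learn release the author used, and some entries (sklearn.cluster.bicluster, for example) may not be importable in other releases. The sketch below is one way to check which listed modules exist in the local installation. The helper name `check_modules` and the `LISTED_MODULES` subset are illustrative choices, not part of the original script.

```python
import importlib

# A subset of the module names from the overview (chosen here for brevity).
LISTED_MODULES = [
    "sklearn.base",
    "sklearn.cluster",
    "sklearn.cluster.bicluster",
    "sklearn.model_selection",
    "sklearn.preprocessing",
    "sklearn.svm",
    "sklearn.tree",
]


def check_modules(names):
    """Return a dict mapping each module name to True if it imports, else False."""
    availability = {}
    for name in names:
        try:
            importlib.import_module(name)
            availability[name] = True
        except ImportError:
            # Covers both a missing package and a missing submodule.
            availability[name] = False
    return availability


if __name__ == "__main__":
    for name, ok in check_modules(LISTED_MODULES).items():
        print(f"{name}: {'available' if ok else 'missing'}")
```

Any name reported as missing is worth looking up in the documentation for the installed version before relying on it.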
# Data loading
import io

import numpy as np
import requests

# URL of the dataset (the file may no longer be hosted here; substitute a local copy if needed)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file and fail early on an HTTP error
raw_data = requests.get(url)
raw_data.raise_for_status()
# load the CSV text as a numpy matrix (np.loadtxt needs a file-like object, not a Response)
dataset = np.loadtxt(io.StringIO(raw_data.text), delimiter=",")
# separate the features (columns 0-7) from the target attribute (column 8)
X = dataset[:, 0:8]
y = dataset[:, 8]
"----------------------------------------------------------------------------------------"

# Data normalization
"""
The gradient-based methods used by most machine learning algorithms are sensitive to the
scale of the data, so normalize or standardize the features before running an algorithm.
scikit-learn provides both: normalize() rescales each sample to unit norm, and scale()
standardizes each feature to zero mean and unit variance.
"""
from sklearn import preprocessing

normalized_X = preprocessing.normalize(X)
standardized_X = preprocessing.scale(X)
"------------------------------------------------------------------------------------"

# Feature selection
"""
When solving a real problem, the ability to choose suitable features or to construct them is
especially important. This is called feature selection or feature engineering.
Feature selection is a creative process that relies heavily on intuition and domain knowledge,
and there are many ready-made algorithms that help with it.
The tree-based algorithm below estimates the importance of each feature:
"""
from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)
print(model.feature_importances_)
"-----------------------------------------------------------------------------"

# Algorithm summary
# 01 Logistic regression
# Many problems can be reduced to binary classification. An advantage of this algorithm
# is that it outputs the probability of each class for a sample.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions (on the training data, so the scores below are optimistic)
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
"------------------------------------------------------------------------------"

# 02 Naive Bayes
# This method estimates the distribution density of the training samples
# and works well for multiclass classification.
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
"------------------------------------------------------------------------------"

# 03 k-nearest neighbors
from sklearn.neighbors import KNeighborsClassifier

# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
"------------------------------------------------------------------------------"

# 04 Decision trees
# The Classification and Regression Trees (CART) algorithm is commonly used for
# classification or regression problems whose features carry categorical information,
# and it is well suited to multiclass settings.
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
"------------------------------------------------------------------------------"

# 05 Support vector machines
# SVM is a very popular machine learning algorithm, used mainly for classification.
# Like logistic regression, it can handle multiclass problems with a one-vs-rest scheme.
# SVC is for classification, SVR for regression.
from sklearn.svm import SVC

model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
"------------------------------------------------------------------------------"

# How to tune algorithm parameters
"""
A harder task is building an effective way to choose the right parameters; this calls for
a search over candidate values. scikit-learn provides functions for that purpose.
The example below selects the regularization parameter of a ridge regression.
"""
import numpy as np
from sklearn.linear_model import Ridge
# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed)
from sklearn.model_selection import GridSearchCV

# prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model,
                    param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

# Sometimes sampling parameter values at random from a given range is an effective approach:
# evaluate the algorithm with each sampled value, then keep the best one.
import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
# RandomizedSearchCV also lives in sklearn.model_selection
from sklearn.model_selection import RandomizedSearchCV

# prepare a uniform distribution on [0, 1) to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
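The classifiers above are scored on the same rows they were trained on, which tends to overstate accuracy. As a minimal sketch of one alternative, the block below combines scaling, a classifier, and a grid search in a single pipeline, then scores on a held-out split. It uses synthetic data as a stand-in for the diabetes file so that it runs without a download. The names `X_demo`, `y_demo`, and the grid of `C` values are assumptions made for illustration, not part of the original script.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 300 samples, 8 features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
noise = rng.normal(scale=0.5, size=300)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

# Hold out 25% of the rows; they are never seen during fitting or tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Scaling inside the pipeline means it is re-fit on each training fold only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
search.fit(X_train, y_train)

# score() uses the best estimator found by the search (accuracy for classifiers).
test_accuracy = search.score(X_test, y_test)
print("best C:", search.best_params_["logisticregression__C"])
print("cross-validated accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(test_accuracy, 3))
```

To try this on the real data, replace `X_demo` and `y_demo` with the `X` and `y` arrays loaded earlier. The held-out score is the number to report, since the cross-validated score was itself used to pick `C`.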