sklearn.model

mac2025-06-08  49

In machine learning, the raw data is usually split by some ratio into a training set and a test set. The train_test_split function in sklearn.model_selection is the usual tool for this.

Note: older versions of scikit-learn exposed train_test_split through sklearn.cross_validation. That module has since been deprecated, so train_test_split should now be imported from sklearn.model_selection.

For full usage details, see the official documentation for sklearn.model_selection.train_test_split.

sklearn.model_selection.train_test_split(*arrays, **options)

Parameters:

*arrays: sequence of indexables with the same length / shape[0].

Allowed inputs are lists, NumPy arrays, SciPy sparse matrices, or pandas DataFrames.
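As a small illustration of passing several inputs at once, the sketch below splits a NumPy array and two plain lists in a single call. The names features, labels and weights are made up for this example; the function returns one train/test pair per input, in input order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Three inputs with the same length (6); the names are illustrative only.
features = np.arange(12).reshape((6, 2))   # NumPy array, row i is [2i, 2i+1]
labels = [0, 1, 0, 1, 0, 1]                # plain Python list
weights = [1.0, 0.5, 1.0, 0.5, 1.0, 0.5]   # another list

# Every input is split with the same row indices, so rows stay aligned
# across the returned pieces.
parts = train_test_split(features, labels, weights, test_size=2, random_state=0)
f_train, f_test, l_train, l_test, w_train, w_test = parts

print(len(parts))                    # two pieces per input
print(f_train.shape, f_test.shape)
print(len(l_train), len(l_test))
```

Inputs of different lengths are rejected, which is why all three toy inputs have six entries.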

test_size: float, int or None, optional (default=None). Size of the test set.

(1) If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split.
(2) If int, it represents the absolute number of test samples.
(3) If None, the value is set to the complement of train_size. If train_size is also None, it is set to 0.25.
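The three cases for test_size can be checked on a toy array. This is a minimal sketch; the 10-sample array X is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))  # 10 samples of toy data

# float: a proportion of the dataset (0.3 of 10 samples)
_, X_test_float = train_test_split(X, test_size=0.3, random_state=0)

# int: an absolute number of test samples
_, X_test_int = train_test_split(X, test_size=2, random_state=0)

# None, with train_size also None: the 0.25 default applies
X_train_default, X_test_default = train_test_split(X, random_state=0)

print(len(X_test_float))   # 3
print(len(X_test_int))     # 2
print(len(X_train_default), len(X_test_default))
```

With 10 samples the 0.25 default does not give a whole number of rows, so the last line shows how the library rounds the test count in that case.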

train_size: float, int or None (default=None). Size of the training set.

(1) If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split.
(2) If int, it represents the absolute number of train samples.
(3) If None, the value is automatically set to the complement of test_size.
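A short sketch of how train_size interacts with test_size, again on an assumed 10-sample toy array. When both are given and do not cover the whole dataset, the leftover samples end up in neither subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))  # 10 samples of toy data

# Only train_size given: the test set is its complement.
X_train, X_test = train_test_split(X, train_size=0.5, random_state=0)
print(len(X_train), len(X_test))  # 5 5

# Both given as absolute counts that sum to 7: three samples are unused.
X_train_small, X_test_small = train_test_split(
    X, train_size=5, test_size=2, random_state=0
)
print(len(X_train_small), len(X_test_small))  # 5 2
```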

random_state: int, RandomState instance or None, optional (default=None). State of the random number generator.

(1) If int, random_state is the seed used by the random number generator.
(2) If RandomState instance, random_state is the random number generator.
(3) If None, the random number generator is the RandomState instance used by np.random.
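The practical effect of random_state is reproducibility. The sketch below, on an assumed toy array, shows that the same integer seed yields the same split, and that a freshly created RandomState with that seed behaves the same way.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))  # 10 samples of toy data

# Same integer seed: identical split on every call.
_, test_a = train_test_split(X, test_size=0.3, random_state=42)
_, test_b = train_test_split(X, test_size=0.3, random_state=42)
print(np.array_equal(test_a, test_b))  # True

# A RandomState instance created with the same seed reproduces that split.
rng = np.random.RandomState(42)
_, test_c = train_test_split(X, test_size=0.3, random_state=rng)
print(np.array_equal(test_a, test_c))  # True
```

Reusing the same RandomState instance for a second call would generally give a different split, because the generator's internal state advances each time it is used.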

shuffle: boolean, optional (default=True). Whether or not to shuffle the data before splitting.

If shuffle=False, then stratify must be None.
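To make the shuffle=False behaviour concrete, here is a minimal sketch on an ordered toy array. The labels list is hypothetical and only serves to trigger the error for the disallowed combination.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10)  # samples 0..9, in order

# Without shuffling the split is a simple cut: the leading samples form
# the training set and the trailing samples form the test set.
X_train, X_test = train_test_split(X, test_size=0.3, shuffle=False)
print(X_train)  # [0 1 2 3 4 5 6]
print(X_test)   # [7 8 9]

# Combining shuffle=False with stratify is rejected with a ValueError.
labels = [0, 1] * 5  # hypothetical class labels
try:
    train_test_split(X, test_size=0.3, shuffle=False, stratify=labels)
except ValueError as err:
    print("ValueError:", err)
```

An unshuffled split like this is mostly useful when the row order carries meaning, for example time-ordered data.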

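stratify: array-like or None (default=None). Class labels used for stratified splitting.

(1) If None, the class proportions in the resulting training and test sets are left to chance.
(2) If not None, the data is split in a stratified fashion, using this array as the class labels. The training and test sets then keep the same class proportions as the input array, which is useful for imbalanced datasets.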
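The effect of stratify is easiest to see on imbalanced labels. The sketch below uses made-up data with an 80/20 class ratio and counts the classes in each subset.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 16 samples of class 0 and 4 of class 1 (80% / 20%).
y = np.array([0] * 16 + [1] * 4)
X = np.arange(40).reshape((20, 2))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Both subsets keep the 80/20 class ratio of the full label array.
print(Counter(y_train.tolist()))  # 12 samples of class 0, 3 of class 1
print(Counter(y_test.tolist()))   # 4 samples of class 0, 1 of class 1
```

Without stratify, a 5-sample test set drawn from this data could easily contain no class 1 samples at all.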

Common usage: X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(train_data, train_target, test_size=0.4, random_state=0, stratify=train_target)

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(30).reshape((10, 3)), range(10)

print(X)
# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]
#  [12 13 14]
#  [15 16 17]
#  [18 19 20]
#  [21 22 23]
#  [24 25 26]
#  [27 28 29]]

print(y)
# range(0, 10)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20, shuffle=True)

print(X_train)
# [[15 16 17]
#  [ 0  1  2]
#  [ 6  7  8]
#  [18 19 20]
#  [27 28 29]
#  [12 13 14]
#  [ 9 10 11]]

print(X_test)
# [[21 22 23]
#  [ 3  4  5]
#  [24 25 26]]

print(y_train)
# [5, 0, 2, 6, 9, 4, 3]

print(y_test)
# [7, 1, 8]