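Continuing from the previous post: "The most basic classification algorithm, k-nearest neighbors (kNN): introduction, basic Jupyter implementation, and Python implementation".
Looking back at this diagram: what is machine learning? A training data set is fed to a machine learning algorithm; in the kNN example above, that means passing the feature matrix X_train and the labels Y_train to the algorithm. The algorithm fits (fit) a model, and new samples are then passed to that model to predict (predict) a result.
For kNN, however, the model is simply the training data set itself, so kNN can be described as an algorithm that needs no training process.
The k-nearest neighbors algorithm is a special case: it can be regarded as an algorithm without a model.
To keep it consistent with other algorithms, the training data set can be treated as the model itself.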
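To make the "training data is the model" idea concrete, here is a minimal sketch, not taken from the original post. The class name MemorizingModel and its attributes are hypothetical, and it only handles the 1-nearest-neighbor special case. Its fit method computes nothing and merely stores the data; all the work happens in predict.

```python
import numpy as np


class MemorizingModel:
    """Hypothetical illustration: 'training' only stores the data."""

    def fit(self, X_train, y_train):
        # No parameters are learned; the training set itself is kept as the "model".
        self.X_ = np.asarray(X_train, dtype=float)
        self.y_ = np.asarray(y_train)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Compare every query row with every stored sample (Euclidean distance).
        diffs = X[:, np.newaxis, :] - self.X_[np.newaxis, :, :]
        distances = np.sqrt((diffs ** 2).sum(axis=2))
        # Return the label of the single closest stored sample (1-NN).
        return self.y_[distances.argmin(axis=1)]


model = MemorizingModel().fit([[1.0, 1.0], [5.0, 5.0]], [0, 1])
print(model.predict([[1.2, 0.9], [4.8, 5.3]]))
```

Because fit does no computation, the cost of this kind of algorithm moves to prediction time, where each query is compared against the whole training set.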
Using the kNN implementation in scikit-learn
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training data: ten samples with two features each
raw_data_X = [[3.393533211, 2.331273381],
              [3.110073483, 1.781539638],
              [1.343808831, 3.368360954],
              [3.582294042, 4.679179110],
              [2.280362439, 2.866990263],
              [7.423436942, 4.696522875],
              [5.745051997, 3.533989803],
              [9.172168622, 2.511101045],
              [7.792783481, 3.424088941],
              [7.939820817, 0.791637231]
             ]
# Class label of each sample
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X_train = np.array(raw_data_X)
Y_train = np.array(raw_data_y)

# Create a classifier that votes among the 6 nearest neighbors
kNN_classifier = KNeighborsClassifier(n_neighbors=6)
kNN_classifier.fit(X_train, Y_train)

# x is not defined in this section; it is assumed to be the 1-D sample
# to classify from the previous post. predict expects a 2-D array,
# so x is reshaped into a matrix with a single row.
X_predict = x.reshape(1, -1)
y_predict = kNN_classifier.predict(X_predict)

In [1]: y_predict
Out[1]: array([1])

In [2]: y_predict[0]
Out[2]: 1
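The snippet above relies on a sample x from the previous post. As a self-contained sketch, the version below substitutes a hypothetical sample (the values 8.0 and 3.4 are an assumption, not the author's data) and also calls kneighbors and predict_proba, two KNeighborsClassifier methods, to show which training samples took part in the vote.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[3.393533211, 2.331273381],
                    [3.110073483, 1.781539638],
                    [1.343808831, 3.368360954],
                    [3.582294042, 4.679179110],
                    [2.280362439, 2.866990263],
                    [7.423436942, 4.696522875],
                    [5.745051997, 3.533989803],
                    [9.172168622, 2.511101045],
                    [7.792783481, 3.424088941],
                    [7.939820817, 0.791637231]])
Y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Hypothetical sample to classify (an assumption, not from the original post)
x = np.array([8.0, 3.4])
X_predict = x.reshape(1, -1)  # one row, two features

kNN_classifier = KNeighborsClassifier(n_neighbors=6)
kNN_classifier.fit(X_train, Y_train)

y_predict = kNN_classifier.predict(X_predict)

# Distances to, and row indices of, the 6 nearest training samples
distances, indices = kNN_classifier.kneighbors(X_predict)
# Share of the neighbor votes received by each class
proba = kNN_classifier.predict_proba(X_predict)

print(y_predict)
print(indices[0], Y_train[indices[0]])
print(proba[0])
```

With the default uniform weights, predict_proba is simply the fraction of the 6 neighbors that belong to each class, which makes the majority vote behind predict visible.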
Following scikit-learn's fit-predict pattern, we rewrite our Python implementation of kNN with simple custom fit and predict methods.
import numpy as np
from math import sqrt
from collections import Counter


class KNNClassifier:

    def __init__(self, k):
        '''Initialize the kNN classifier'''
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._Y_train = None

    def fit(self, X_train, Y_train):
        '''Train the kNN classifier on the training set X_train and Y_train'''
        assert X_train.shape[0] == Y_train.shape[0], \
            "the size of X_train must be equal to the size of Y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k"
        # kNN has no real training step: the data is simply stored
        self._X_train = X_train
        self._Y_train = Y_train
        return self

    def predict(self, X_predict):
        '''Given the data set X_predict to classify, return the vector of predicted results'''
        assert self._X_train is not None and self._Y_train is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the feature number of X_predict must equal to X_train"
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        '''Given a single sample x, return the predicted result for x'''
        # The logic here is the same as in the code before the rewrite
        # Euclidean distance: square each difference, sum, then take the root
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._X_train]
        # Indices of the training samples, sorted from nearest to farthest
        nearest = np.argsort(distances)
        topk_y = [self._Y_train[i] for i in nearest[:self.k]]
        # Majority vote among the k nearest labels
        votes = Counter(topk_y)
        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "KNN(k=%d)" % self.k
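A usage sketch for the class follows. To keep the block runnable without other code, it repeats the class in condensed form (with a vectorized distance computation, which is a swapped-in variation rather than the author's loop), and the sample x is a hypothetical value, not data from the original post.

```python
from collections import Counter

import numpy as np


class KNNClassifier:
    """Condensed copy of the class above so this block runs independently."""

    def __init__(self, k):
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._Y_train = None

    def fit(self, X_train, Y_train):
        assert X_train.shape[0] == Y_train.shape[0]
        assert self.k <= X_train.shape[0]
        self._X_train, self._Y_train = X_train, Y_train
        return self

    def predict(self, X_predict):
        assert self._X_train is not None, "must fit before predict!"
        return np.array([self._predict(x) for x in X_predict])

    def _predict(self, x):
        # Square each coordinate difference, sum per row, then take the root
        distances = np.sqrt(np.sum((self._X_train - x) ** 2, axis=1))
        nearest = np.argsort(distances)
        votes = Counter(self._Y_train[nearest[:self.k]].tolist())
        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "KNN(k=%d)" % self.k


X_train = np.array([[3.393533211, 2.331273381],
                    [3.110073483, 1.781539638],
                    [1.343808831, 3.368360954],
                    [3.582294042, 4.679179110],
                    [2.280362439, 2.866990263],
                    [7.423436942, 4.696522875],
                    [5.745051997, 3.533989803],
                    [9.172168622, 2.511101045],
                    [7.792783481, 3.424088941],
                    [7.939820817, 0.791637231]])
Y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Hypothetical sample to classify (an assumption, not from the original post)
x = np.array([8.0, 3.4])

knn_clf = KNNClassifier(k=6)
print(knn_clf.fit(X_train, Y_train))   # fit returns self, so this prints the repr
y_predict = knn_clf.predict(x.reshape(1, -1))
print(y_predict)
```

Since fit returns self, construction, fitting, and prediction can also be chained in one expression, the same way scikit-learn estimators are commonly used.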