Last time I wrote up NYU homework 4 test 1; this time I will briefly work through NYU homework 4 test 2. In short, this test uses linear regression to predict Boston house prices. OK, let us begin.
First, we load the Boston data and look at the correlation coefficient matrix between the different features in the dataset.
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()
Y = boston.target
boston = pd.DataFrame(boston.data)
print(boston.corr())

From the printed result we can see that there are 13 features. This may look different from the official data (https://www.kaggle.com/c/boston-housing): the medv attribute has been removed here because it is the prediction target, stored separately in boston.target. For what each feature means, see the official attribute descriptions at the link above; sklearn also bundles the same descriptions, as the aside below shows. After that, since plain integer column names like 1, 2, ... are not very readable, we will convert the column names to the corresponding attribute names (which also labels the rows of the correlation matrix) and print again.
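The aside, a minimal sketch and not part of the assignment itself: the dataset object returned by load_boston carries the feature names and the full attribute descriptions, so you can read them without leaving the notebook.

from sklearn.datasets import load_boston

data = load_boston()
print(data.feature_names)  # the 13 feature names (CRIM, ZN, ..., LSTAT)
print(data.DESCR)          # the full text description of every attribute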
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()
Y = boston.target
boston = pd.DataFrame(boston.data,
                      columns=['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age',
                               'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat'])
print(boston.corr())

Then we plot the correlation matrix:
import matplotlib.pyplot as plt

plt.matshow(boston.corr(), cmap=plt.cm.jet)
plt.show()

OK, now that we have a feel for the dataset, the next steps are splitting it into training and test sets, then building and training the model. First, the train/test split and the standardization of the data; min-max scaling is used here. As for why standardization matters, you can look up the related material yourself, and if I find time I will write up a summary as well. Before the split code, though, here is a quick look at what min-max scaling actually computes.
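This is only a toy illustration of the formula the scaler applies, x' = (x - min) / (max - min) per column; the small array X is my own example, not part of the assignment:

import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 40.0]])
# Min-max scaling: subtract each column's minimum, divide by its range.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)  # every column now lies in [0, 1]

MinMaxScaler does the same thing under the hood, while also remembering the training-set minimum and range so they can be reused on the test set.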
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(boston, Y, random_state=0, test_size=0.20)

# Fit the scalers on the training set only and reuse them on the test set,
# so that no test-set statistics leak into the preprocessing.
min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(X_train)
X_test = min_max_scaler.transform(X_test)
y_scaler = preprocessing.MinMaxScaler()
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1))
y_test = y_scaler.transform(y_test.reshape(-1, 1))

With the dataset split, the next step is building the linear regression model. Straight to the code:
# 6. Then, please predict new values using the test set.
#    Please give the coefficient for your model.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
print(lr.score(X_test, y_test))

# 7. The sign of a regression coefficient tells you whether there is a positive
#    or negative correlation between each independent variable and the
#    dependent variable. What does a positive coefficient and a negative
#    coefficient indicate respectively?
weight = lr.coef_
bias = lr.intercept_
print(weight)
print(bias)

# 8. Finally, to gain an understanding of how your model is performing, please
#    score the model against three metrics: R squared, mean squared error, and
#    mean absolute error. Write the lines of code to get your output; and
#    answer the questions:
#    a) Google R Squared, Mean Squared Error, and Mean Absolute Error. What do
#       these metrics mean? What are the numbers telling you?
score = r2_score(y_test, lr_y_predict)
mse_test = np.sum((lr_y_predict - y_test) ** 2) / len(y_test)
mae_test = np.sum(np.absolute(lr_y_predict - y_test)) / len(y_test)
print(score)
print(mse_test)
print(mae_test)

At this point, if nothing went wrong, the R² score should be around 0.6. With a score that low, how could you possibly have the nerve to hand in the homework? Remember what I said in my previous NYU homework 4 task 1 post? Skip the preprocessing and you might as well not bother; no optimization, no mastery. Now for the optimization. The method used here is very simple: keep only the features that correlate most strongly with the price (RM positively; PTRATIO and LSTAT negatively) and drop the rest. The sketch below shows how to read these off.
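This sketch is my own addition, not part of the assignment: it rebuilds the feature DataFrame, appends the target as a MEDV column (a name I chose to match the Kaggle description), and sorts the features by the magnitude of their correlation with the price. LSTAT, RM, and PTRATIO should come out on top.

import pandas as pd
from sklearn.datasets import load_boston

data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MEDV'] = data.target
# Correlation of each feature with the target, sorted by absolute value.
corr_with_target = df.corr()['MEDV'].drop('MEDV')
order = corr_with_target.abs().sort_values(ascending=False).index
print(corr_with_target.reindex(order))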
# b) What do you think could improve the model? Try the possible improved
#    model in coding lines as a bonus.
# Improved model one: train on only the most strongly correlated features.
from numpy import delete

dataset = load_boston()
x_data = dataset.data
y_data = dataset.target
name_data = dataset.feature_names

# Prices recorded as exactly 50 are capped values in this dataset, so
# remember their row indices and remove those rows as outliers.
i_ = []
for i in range(len(y_data)):
    if y_data[i] == 50:
        i_.append(i)
x_data = delete(x_data, i_, axis=0)
y_data = delete(y_data, i_, axis=0)

# Keep only RM, PTRATIO, and LSTAT; remember the indices of the other
# features and delete those columns.
j_ = []
for i in range(13):
    if name_data[i] == 'RM' or name_data[i] == 'PTRATIO' or name_data[i] == 'LSTAT':
        continue
    j_.append(i)
x_data = delete(x_data, j_, axis=1)

X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, random_state=0, test_size=0.20)
min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(X_train)
X_test = min_max_scaler.transform(X_test)
y_scaler = preprocessing.MinMaxScaler()
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1))
y_test = y_scaler.transform(y_test.reshape(-1, 1))

lr = LinearRegression()
lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
score = r2_score(y_test, lr_y_predict)
print(score)

And just like that, this very simple bit of dataset preprocessing raises the score by about 20 percentage points. Happy, right? I certainly am. Finally, let us visualize the results.
def show_res(y_test, y_predict):
    plt.figure()
    x = np.arange(0, len(y_predict))
    plt.plot(x, y_test, marker='*')
    plt.plot(x, y_predict, marker='o')
    plt.title('the predicted price and the real price of the Boston houses')
    plt.xlabel('x')
    plt.ylabel('house price')
    plt.legend(['real price', 'predicted price'])
    plt.show()

show_res(y_test, lr_y_predict)

Here is the result.
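One closing note on question 8: rather than hand-rolling MSE and MAE as I did above, you can cross-check the numbers against sklearn's built-in metrics. A minimal sketch, assuming the y_test and lr_y_predict arrays from the model above:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# These should agree with the manual np.sum(...) / len(...) computations.
print(r2_score(y_test, lr_y_predict))
print(mean_squared_error(y_test, lr_y_predict))
print(mean_absolute_error(y_test, lr_y_predict))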
Life is short; I use Python.