Kaggle in Practice: Modeling and Prediction on the Rain in Australia Dataset

mac, 2022-06-30

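Contents

Data details
Univariate analysis
    Categorical values
    Continuous values
Modeling
    Logistic regression
        Model evaluation
    Random forest
        Random forest tuning
    Naive Bayes
    Artificial neural network

Dataset source: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package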

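Dataset and notebook location

Data details

The dataset contains daily weather observations over a period of time. The goal is to predict whether it will rain the next day.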

Date: The date of observation
Location: The common name of the location of the weather station
MinTemp: The minimum temperature in degrees Celsius
MaxTemp: The maximum temperature in degrees Celsius
Rainfall: The amount of rainfall recorded for the day in mm
Evaporation: The so-called Class A pan evaporation (mm) in the 24 hours to 9am
Sunshine: The number of hours of bright sunshine in the day
WindGustDir: The direction of the strongest wind gust in the 24 hours to midnight
WindGustSpeed: The speed (km/h) of the strongest wind gust in the 24 hours to midnight
WindDir9am: Direction of the wind at 9am
WindDir3pm: Direction of the wind at 3pm
WindSpeed9am: Wind speed (km/h) averaged over 10 minutes prior to 9am
WindSpeed3pm: Wind speed (km/h) averaged over 10 minutes prior to 3pm
Humidity9am: Humidity (percent) at 9am
Humidity3pm: Humidity (percent) at 3pm
Pressure9am: Atmospheric pressure (hPa) reduced to mean sea level at 9am
Pressure3pm: Atmospheric pressure (hPa) reduced to mean sea level at 3pm
Cloud9am: Fraction of sky obscured by cloud at 9am. This is measured in "oktas", which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A 0 measure indicates completely clear sky whilst an 8 indicates that it is completely overcast.
Cloud3pm: Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm. See Cloud9am for a description of the values
Temp9am: Temperature (degrees C) at 9am
Temp3pm: Temperature (degrees C) at 3pm
RainToday: Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0
RISK_MM: The amount of next day rain in mm. Used to create response variable RainTomorrow. A kind of measure of the "risk".
RainTomorrow: The target variable. Did it rain tomorrow?

Feature details:

Date: the day of the observation
Location: the city where the observation was made
MinTemp: minimum temperature of the day (degrees Celsius)
MaxTemp: maximum temperature of the day (degrees Celsius); both temperature columns are float64
Rainfall: rainfall for the day (mm)
Evaporation: evaporation from a pan of water (mm) in the 24 hours to 9am
Sunshine: hours of sunshine in the day
WindGustDir: direction of the strongest wind gust in the 24 hours to midnight
WindGustSpeed: speed of the strongest wind gust (km/h) in the 24 hours to midnight
WindDir9am: wind direction at 9am
WindDir3pm: wind direction at 3pm
WindSpeed9am: average wind speed over the ten minutes before 9am, i.e. 8:50 to 9:00 (km/h)
WindSpeed3pm: average wind speed over the ten minutes before 3pm, i.e. 14:50 to 15:00 (km/h)
Humidity9am: humidity at 9am
Humidity3pm: humidity at 3pm
Pressure9am: atmospheric pressure at 9am (hPa)
Pressure3pm: atmospheric pressure at 3pm (hPa)
Cloud9am: cloud cover at 9am, in [0, 8] in steps of 1; 0 means the sky is almost clear and 8 means it is almost completely covered
Cloud3pm: cloud cover at 3pm
Temp9am: temperature at 9am (degrees Celsius)
Temp3pm: temperature at 3pm (degrees Celsius)
RainToday: whether it rained today
RISK_MM: the amount of rain recorded for the next day in mm (a feature created by the data provider). Reminder from the data provider: "Note: You should exclude the variable Risk-MM when training a binary classification model. Not excluding it will leak the answers to your model and reduce its predictability." In other words, drop this feature before modeling.
RainTomorrow: the label

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    dataset = pd.read_csv('./weatherAUS.csv')
    dataset.shape

    (142193, 24)

    dataset.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 142193 entries, 0 to 142192
    Data columns (total 24 columns):
    Date             142193 non-null object
    Location         142193 non-null object
    MinTemp          141556 non-null float64
    MaxTemp          141871 non-null float64
    Rainfall         140787 non-null float64
    Evaporation      81350 non-null float64
    Sunshine         74377 non-null float64
    WindGustDir      132863 non-null object
    WindGustSpeed    132923 non-null float64
    WindDir9am       132180 non-null object
    WindDir3pm       138415 non-null object
    WindSpeed9am     140845 non-null float64
    WindSpeed3pm     139563 non-null float64
    Humidity9am      140419 non-null float64
    Humidity3pm      138583 non-null float64
    Pressure9am      128179 non-null float64
    Pressure3pm      128212 non-null float64
    Cloud9am         88536 non-null float64
    Cloud3pm         85099 non-null float64
    Temp9am          141289 non-null float64
    Temp3pm          139467 non-null float64
    RainToday        140787 non-null object
    RISK_MM          142193 non-null float64
    RainTomorrow     142193 non-null object
    dtypes: float64(17), object(7)
    memory usage: 26.0+ MB

    dataset.isnull().sum() / len(dataset)

    Date             0.000000
    Location         0.000000
    MinTemp          0.004480
    MaxTemp          0.002265
    Rainfall         0.009888
    Evaporation      0.427890
    Sunshine         0.476929
    WindGustDir      0.065615
    WindGustSpeed    0.065193
    WindDir9am       0.070418
    WindDir3pm       0.026570
    WindSpeed9am     0.009480
    WindSpeed3pm     0.018496
    Humidity9am      0.012476
    Humidity3pm      0.025388
    Pressure9am      0.098556
    Pressure3pm      0.098324
    Cloud9am         0.377353
    Cloud3pm         0.401525
    Temp9am          0.006358
    Temp3pm          0.019171
    RainToday        0.009888
    RISK_MM          0.000000
    RainTomorrow     0.000000
    dtype: float64

Univariate analysis

Categorical values

    catgorical = [cat for cat in dataset.columns if dataset[cat].dtype == 'O']
    catgorical

    ['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']

    for i in catgorical:
        print(i)
        print(len(dataset[i].unique()))
        print()

    Date          3436
    Location      49
    WindGustDir   17
    WindDir9am    17
    WindDir3pm    17
    RainToday     3
    RainTomorrow  2
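RainToday reports 3 unique values even though it only takes the values No and Yes, and the wind direction columns report 17 for 16 compass points. The likely reason is that `unique()` counts the missing value as its own entry. A minimal sketch on a made-up column (the toy Series is an assumption, not data from the post):

```python
import numpy as np
import pandas as pd

# Toy column that mimics RainToday: two real categories plus a missing value.
rain_today = pd.Series(["No", "Yes", np.nan, "No"])

# unique() keeps NaN as its own entry, so the count is one higher than the
# number of real categories.
n_with_nan = len(rain_today.unique())

# nunique() ignores NaN by default; dropna=False reproduces the count above.
n_categories = rain_today.nunique()
n_including_nan = rain_today.nunique(dropna=False)

print(n_with_nan, n_categories, n_including_nan)
```

Using `nunique()` gives the number of real categories directly, which is the figure that matters when estimating how many one-hot columns a feature will produce.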

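Date has a very large number of unique values, which makes it a high-cardinality feature.

Extract the year, month and day from it as separate features.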

    dataset['Date'] = pd.to_datetime(dataset['Date'])
    dataset['Date']

    0        2008-12-01
    1        2008-12-02
    2        2008-12-03
    3        2008-12-04
    4        2008-12-05
                ...
    142188   2017-06-20
    142189   2017-06-21
    142190   2017-06-22
    142191   2017-06-23
    142192   2017-06-24
    Name: Date, Length: 142193, dtype: datetime64[ns]

    dataset['Date'].dt.year.unique()

    array([2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2007], dtype=int64)

    year = dataset['Date'].dt.year
    month = dataset['Date'].dt.month
    day = dataset['Date'].dt.day
    dataset.drop(labels=['Date'], axis=1, inplace=True)
    dataset['year'] = year
    dataset['month'] = month
    dataset['day'] = day
    dataset.shape

    (142193, 26)

    # 'Date' has been dropped, so restrict the check to the categorical columns that remain
    remaining_cat = [c for c in catgorical if c in dataset.columns]
    dataset.loc[:, remaining_cat].isnull().sum() / len(dataset)

    Location        0.000000
    WindGustDir     0.065615
    WindDir9am      0.070418
    WindDir3pm      0.026570
    RainToday       0.009888
    RainTomorrow    0.000000
    dtype: float64

    # Only a small share of each categorical column is missing, so fill with the mode
    dataset_ = dataset.copy()
    fill_list = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
    fill_dict = {key: dataset_[key].mode().values[0] for key in fill_list}
    fill_dict

    # The following does not work: mode() returns a Series, and fillna aligns it by index
    # for j in fill_list:
    #     dataset_[j].fillna(dataset_[j].mode(), inplace=True)
    dataset_.fillna(value=fill_dict, inplace=True)

    cat = [_ for _ in dataset_.columns if dataset[_].dtype == 'O']
    dataset_.loc[:, cat].isnull().sum()

    Location        0
    WindGustDir     0
    WindDir9am      0
    WindDir3pm      0
    RainToday       0
    RainTomorrow    0
    dtype: int64
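The commented-out loop above fails for a reason that is easy to miss: `mode()` returns a Series, and `fillna` with a Series matches on index labels instead of broadcasting one value. A small sketch on a made-up wind column (the toy data is an assumption):

```python
import numpy as np
import pandas as pd

wind = pd.Series(["W", np.nan, "W", "SE", np.nan])

# mode() returns a Series indexed from 0, not a scalar.
mode_series = wind.mode()

# fillna with a Series aligns on the index, so only a missing value at
# label 0 could be filled; the NaNs at labels 1 and 4 are left untouched.
still_missing = wind.fillna(mode_series).isnull().sum()

# Taking the first mode as a scalar fills every missing entry.
filled = wind.fillna(mode_series.iloc[0])

print(still_missing, filled.isnull().sum())
```

The dictionary of scalar modes used in the post sidesteps the alignment problem in the same way as `mode_series.iloc[0]` does here.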

Continuous values

    numerical = [_ for _ in dataset_.columns if dataset[_].dtype != 'O']
    # Drop the last four columns: RISK_MM, which the data provider says distorts the model,
    # and the year, month and day columns created earlier
    numerical = numerical[:-4]
    numerical

    ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm',
     'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm']

    numerical_features = dataset_.loc[:, numerical]
    numerical_features.isnull().sum() / numerical_features.shape[0]

    MinTemp          0.004480
    MaxTemp          0.002265
    Rainfall         0.009888
    Evaporation      0.427890
    Sunshine         0.476929
    WindGustSpeed    0.065193
    WindSpeed9am     0.009480
    WindSpeed3pm     0.018496
    Humidity9am      0.012476
    Humidity3pm      0.025388
    Pressure9am      0.098556
    Pressure3pm      0.098324
    Cloud9am         0.377353
    Cloud3pm         0.401525
    Temp9am          0.006358
    Temp3pm          0.019171
    dtype: float64

    numerical_features.describe()

    (shown with one row per feature)
    feature         count    mean         std        min    25%     50%     75%     max
    MinTemp         141556   12.186400    6.403283   -8.5   7.6     12.0    16.8    33.9
    MaxTemp         141871   23.226784    7.117618   -4.8   17.9    22.6    28.2    48.1
    Rainfall        140787   2.349974     8.465173   0.0    0.0     0.0     0.8     371.0
    Evaporation     81350    5.469824     4.188537   0.0    2.6     4.8     7.4     145.0
    Sunshine        74377    7.624853     3.781525   0.0    4.9     8.5     10.6    14.5
    WindGustSpeed   132923   39.984292    13.588801  6.0    31.0    39.0    48.0    135.0
    WindSpeed9am    140845   14.001988    8.893337   0.0    7.0     13.0    19.0    130.0
    WindSpeed3pm    139563   18.637576    8.803345   0.0    13.0    19.0    24.0    87.0
    Humidity9am     140419   68.843810    19.051293  0.0    57.0    70.0    83.0    100.0
    Humidity3pm     138583   51.482606    20.797772  0.0    37.0    52.0    66.0    100.0
    Pressure9am     128179   1017.653758  7.105476   980.5  1012.9  1017.6  1022.4  1041.0
    Pressure3pm     128212   1015.258204  7.036677   977.1  1010.4  1015.2  1020.0  1039.6
    Cloud9am        88536    4.437189     2.887016   0.0    1.0     5.0     7.0     9.0
    Cloud3pm        85099    4.503167     2.720633   0.0    2.0     5.0     7.0     9.0
    Temp9am         141289   16.987509    6.492838   -7.2   12.3    16.7    21.6    40.2
    Temp3pm         139467   21.687235    6.937594   -5.4   16.6    21.1    26.4    46.7

Comparing the 75th percentile with the maximum suggests that Rainfall, Evaporation, WindGustSpeed, WindSpeed9am and WindSpeed3pm contain outliers: in each case the maximum lies far above the upper quartile.

    figure, axes = plt.subplots(2, 3, figsize=(15, 15))
    sns.boxplot(x='RainTomorrow', y='Rainfall', data=dataset_, ax=axes[0, 0], palette="Set3")
    sns.boxplot(x='RainTomorrow', y='Evaporation', data=dataset_, ax=axes[0, 1], palette="Set3")
    sns.boxplot(x='RainTomorrow', y='WindGustSpeed', data=dataset_, ax=axes[1, 0], palette="Set3")
    sns.boxplot(x='RainTomorrow', y='WindSpeed9am', data=dataset_, ax=axes[1, 1], palette="Set3")
    sns.boxplot(x='RainTomorrow', y='WindSpeed3pm', data=dataset_, ax=axes[1, 2], palette="Set3")
    plt.show()

The box plots confirm that these continuous features contain quite a lot of outliers.

    # Look at the distribution of the continuous features
    # (distplot is deprecated since seaborn 0.11; histplot is its replacement)
    figure, axes = plt.subplots(2, 3, figsize=(30, 15))
    sns.distplot(a=dataset_['Rainfall'].dropna(), ax=axes[0, 0])
    sns.distplot(a=dataset_['Evaporation'].dropna(), ax=axes[0, 1])
    sns.distplot(a=dataset_['WindGustSpeed'].dropna(), ax=axes[1, 0])
    sns.distplot(a=dataset_['WindSpeed9am'].dropna(), ax=axes[1, 1])
    sns.distplot(a=dataset_['WindSpeed3pm'].dropna(), ax=axes[1, 2])
    plt.show()

All of these continuous features are clearly skewed, so the interquartile range (IQR) is used to find the outliers.

    _list = ['Rainfall', 'Evaporation', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm']

    def find_outliers(df, feature):
        IQR = df[feature].quantile(0.75) - df[feature].quantile(0.25)
        Lower_fence = df[feature].quantile(0.25) - (IQR * 3)
        Upper_fence = df[feature].quantile(0.75) + (IQR * 3)
        print('{feature} outliers are values < {lowerboundary} or > {upperboundary}'
              .format(feature=feature, lowerboundary=Lower_fence, upperboundary=Upper_fence))
        lower_count = (df[feature] < Lower_fence).sum()
        upper_count = (df[feature] > Upper_fence).sum()
        print(f'the number of upper outlier {upper_count}')
        print(f'the number of lower outlier {lower_count}')

    for feature in _list:
        find_outliers(dataset_, feature)
        print()

    Rainfall outliers are values < -2.4000000000000004 or > 3.2
    the number of upper outlier 20462
    the number of lower outlier 0

    Evaporation outliers are values < -11.800000000000002 or > 21.800000000000004
    the number of upper outlier 471
    the number of lower outlier 0

    WindGustSpeed outliers are values < -20.0 or > 99.0
    the number of upper outlier 150
    the number of lower outlier 0

    WindSpeed9am outliers are values < -29.0 or > 55.0
    the number of upper outlier 107
    the number of lower outlier 0

    WindSpeed3pm outliers are values < -20.0 or > 57.0
    the number of upper outlier 81
    the number of lower outlier 0
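The fences printed above can be reproduced by hand from the 25% and 75% rows of the describe() output, which is a quick way to check the rule: lower = Q1 - 3 * IQR, upper = Q3 + 3 * IQR. In the sketch below the helper name `iqr_fences` is hypothetical; the quartile values are copied from the describe() output.

```python
def iqr_fences(q1, q3, k=3.0):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles (25%, 75%) copied from the describe() output.
quartiles = {
    "Rainfall": (0.0, 0.8),
    "Evaporation": (2.6, 7.4),
    "WindGustSpeed": (31.0, 48.0),
    "WindSpeed9am": (7.0, 19.0),
    "WindSpeed3pm": (13.0, 24.0),
}

fences = {name: iqr_fences(q1, q3) for name, (q1, q3) in quartiles.items()}
for name, (lower, upper) in fences.items():
    print(f"{name}: lower={lower:.1f}, upper={upper:.1f}")
```

Because more than half of the Rainfall values are 0, its IQR is only 0.8 mm, so the upper fence sits at 3.2 mm and flags over 20,000 ordinary rainy days.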

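For Rainfall, the likely situation in Australia is that when it does rain, it rains heavily, so Rainfall values above Q3 + 3*IQR are kept as they are. For the other four features, values above that threshold are capped at the threshold.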

    # Fill the missing continuous values with the median first, then handle the outliers.
    # When a column contains outliers, the median is the better fill value.
    '''
    Note on imputation:
    the fill values must be computed from the training set only, which reduces overfitting.
    The training-set median is used to fill the corresponding feature in both the training and test sets.
    '''
    dataset_.drop(columns=['RISK_MM'], inplace=True)
    dataset_.shape

    (142193, 25)

    X = dataset_.drop(columns=['RainTomorrow'])
    y = dataset_['RainTomorrow']

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # X_train and X_test are DataFrames; work on explicit copies so that the
    # assignments below do not raise SettingWithCopyWarning
    X_train, X_test = X_train.copy(), X_test.copy()
    X_train.shape, X_test.shape

    ((113754, 24), (28439, 24))

    # Compute the medians on the training set and use them to fill both splits
    train_medians = X_train[numerical].median()
    for df1 in (X_train, X_test):
        df1[numerical] = df1[numerical].fillna(train_medians)

    X_train.isnull().sum(), X_test.isnull().sum()

    (all 24 columns, Location through day, report 0 missing values in both X_train and X_test)

    # Cap values above Q3 + 3*IQR for the continuous features other than Rainfall.
    # As seen earlier, no values fall below Q1 - 3*IQR.
    def process_outliers(df3, Top, feature_):
        return np.where(df3[feature_] > Top, Top, df3[feature_])

    # Upper thresholds (Q3 + 3*IQR) taken from the outlier report
    threshold_dict = {'Evaporation': 21.8, 'WindGustSpeed': 99.0, 'WindSpeed9am': 55.0, 'WindSpeed3pm': 57.0}
    _list = ['Evaporation', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm']
    for df3 in (X_train, X_test):
        for feature in _list:
            top = threshold_dict.get(feature)
            df3[feature] = process_outliers(df3, top, feature)

    X_train[_list].max(), X_test[_list].max()

    (Evaporation      21.8
     WindGustSpeed    99.0
     WindSpeed9am     55.0
     WindSpeed3pm     57.0
     dtype: float64,
     Evaporation      21.8
     WindGustSpeed    99.0
     WindSpeed9am     55.0
     WindSpeed3pm     57.0
     dtype: float64)

    # One-hot encode the categorical features
    catgorical = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
    for i in catgorical:
        print(X_train.loc[:, i].value_counts())

    Location: Canberra 2717, Sydney 2671, Hobart 2593, Darwin 2560, Brisbane 2550, Perth 2548, Ballarat 2448,
      Adelaide 2446, MountGambier 2434, Sale 2420, Watsonia 2420, MelbourneAirport 2415, Nuriootpa 2413,
      Bendigo 2412, PerthAirport 2412, AliceSprings 2411, Woomera 2410, Cobar 2408, SydneyAirport 2407,
      Launceston 2404, WaggaWagga 2403, Tuggeranong 2402, Albury 2393, Townsville 2389, Wollongong 2385,
      Portland 2384, Albany 2382, BadgerysCreek 2381, NorfolkIsland 2380, Penrith 2379, Newcastle 2378,
      CoffsHarbour 2377, Cairns 2372, Mildura 2369, Dartmoor 2358, Witchcliffe 2356, GoldCoast 2346,
      NorahHead 2344, Richmond 2340, SalmonGums 2338, MountGinini 2323, Moree 2300, Walpole 2256,
      PearceRAAF 2225, Williamtown 2022, Melbourne 1926, Nhil 1282, Uluru 1238, Katherine 1227
    WindGustDir: W 15350, SE 7463, E 7255, N 7211, SSE 7197, S 7172, WSW 7164, SW 7028, SSW 6888, WNW 6427,
      ENE 6360, NW 6351, ESE 5797, NE 5674, NNW 5251, NNE 5166
    WindDir9am: N 17112, SE 7286, E 7213, SSE 7107, NW 6904, S 6771, SW 6608, W 6557, NNE 6326, NNW 6294,
      ENE 6214, NE 6073, SSW 6034, ESE 6034, WNW 5755, WSW 5466
    WindDir3pm: SE 11540, W 7967, S 7646, WSW 7469, SW 7373, SSE 7275, N 6988, WNW 6913, NW 6732, ESE 6722,
      E 6669, NE 6524, SSW 6371, ENE 6226, NNW 6203, NNE 5136
    RainToday: No 88530, Yes 25224

    X_train = X_train.replace({'No': 0, 'Yes': 1})
    X_test = X_test.replace({'No': 0, 'Yes': 1})
    X_train_temp = X_train.copy()
    X_test_temp = X_test.copy()
    X_train_temp = pd.get_dummies(X_train_temp, columns=catgorical, drop_first=True)
    X_test_temp = pd.get_dummies(X_test_temp, columns=catgorical, drop_first=True)
    X_train_temp.shape, X_test_temp.shape

    ((113754, 113), (28439, 113))

    X_train_temp

    [Output: preview of X_train_temp showing the first and last 30 rows; the columns run from MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustSpeed, WindSpeed9am, WindSpeed3pm, Humidity9am, Humidity3pm through WindDir3pm_WSW and RainToday_1]

113754 rows × 113 columns
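In the post the training and test frames are one-hot encoded separately and both happen to end up with 113 columns. That is not guaranteed: if a category is absent from one split, the two frames get different columns. A minimal sketch of one way to guard against this, using made-up frames (the toy data and the variable names are assumptions):

```python
import pandas as pd

train = pd.DataFrame({"WindGustDir": ["W", "SE", "N", "W"]})
test = pd.DataFrame({"WindGustDir": ["W", "N", "W"]})  # "SE" never appears

train_enc = pd.get_dummies(train, columns=["WindGustDir"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["WindGustDir"], drop_first=True)

# Encoded separately, the two frames disagree on their columns.
print(list(train_enc.columns), list(test_enc.columns))

# Reindex the test frame to the training columns: categories missing from
# the test split become all-zero columns, and any extras would be dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

With 49 locations and roughly 28,000 test rows the mismatch is unlikely here, but the reindex step costs little and keeps the model input stable.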

    from sklearn.preprocessing import StandardScaler

    X = pd.concat([X_train_temp, X_test_temp])
    y = pd.concat([y_train, y_test])
    print(X.shape)
    print(y.shape)

    scaler = StandardScaler()
    X_train_temp = scaler.fit_transform(X_train_temp)
    # Reuse the statistics fitted on the training set; refitting on the test set
    # would scale the two splits differently
    X_test_temp = scaler.transform(X_test_temp)
    y

    (142193, 113)
    (142193,)
    136356    Yes
    7859       No
    50687      No
    98843      No
    5568       No
             ...
    67274      No
    107403    Yes
    69336      No
    48522      No
    4650      Yes
    Name: RainTomorrow, Length: 142193, dtype: object

Modeling

Logistic regression

1. Fit the model directly
2. Recursive feature elimination
3. Embedded method

    from sklearn.linear_model import LogisticRegression
    LR = LogisticRegression()
    LR.fit(X_train_temp, y_train)

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                       intercept_scaling=1, l1_ratio=None, max_iter=100,
                       multi_class='warn', n_jobs=None, penalty='l2',
                       random_state=None, solver='warn', tol=0.0001, verbose=0,
                       warm_start=False)

    LR.score(X_test_temp, y_test)

    0.8482365765322268

    %%time
    import warnings
    from sklearn.model_selection import cross_val_score
    warnings.filterwarnings('ignore')
    LR = LogisticRegression(n_jobs=-1)
    cross_val_score(LR, X, y, cv=10, n_jobs=-1)

    Wall time: 25.1 s
    array([0.84331927, 0.84472574, 0.84739803, 0.84704641, 0.84535865,
           0.84936709, 0.84893452, 0.8432269 , 0.84589956, 0.84414123])

    LR = LogisticRegression()
    LR.fit(X_train_temp, y_train)
    LR.score(X_train_temp, y_train)

    0.8483218172547778

    LR.score(X_test_temp, y_test)

    0.8482365765322268

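Accuracy on the training set and on the test set is nearly identical (0.8483 versus 0.8482), so the model shows no sign of overfitting.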
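The cross-validation call above is run on the concatenated, unscaled X. One common way to cross-validate with scaling is to put the scaler and the classifier in a pipeline, so that the scaler is refit inside every fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the weather features; the data, the class weights and `max_iter=1000` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather features (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.78, 0.22], random_state=0
)

# The scaler is refit on the training part of every fold, so the held-out
# fold never influences the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean())
```

The same pipeline object can be passed to RFECV or GridSearchCV, which keeps the preprocessing consistent across all three approaches listed at the start of this section.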

Model evaluation

    from sklearn.metrics import confusion_matrix
    y_pre_test = LR.predict(X_test_temp)
    cm = confusion_matrix(y_test, y_pre_test)
    cm

    array([[20895,  1217],
           [ 3099,  3228]], dtype=int64)

    # Rows are the actual classes and columns the predicted classes, in label order ['No', 'Yes']
    cm_matrix = pd.DataFrame(cm, columns=['Predicted No', 'Predicted Yes'], index=['Actual No', 'Actual Yes'])
    plt.figure(figsize=(7,7))
    sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')

    [Figure: heatmap of the confusion matrix]

    from sklearn.metrics import classification_report
    print(classification_report(y_test, y_pre_test))

                  precision    recall  f1-score   support

              No       0.87      0.94      0.91     22112
             Yes       0.73      0.51      0.60      6327

        accuracy                           0.85     28439
       macro avg       0.80      0.73      0.75     28439
    weighted avg       0.84      0.85      0.84     28439
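The figures in the report for the Yes class follow directly from the confusion matrix. The short sketch below recomputes them by hand; the matrix values are copied from the output above, and scikit-learn's convention of rows for actual classes and columns for predicted classes is assumed.

```python
import numpy as np

# Confusion matrix from the post: rows are actual [No, Yes],
# columns are predicted [No, Yes].
cm = np.array([[20895, 1217],
               [3099, 3228]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the days predicted rainy, how many were
recall = tp / (tp + fn)      # of the rainy days, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
```

The accuracy of about 0.85 is carried mostly by the 22,112 No days; the recall of about 0.51 shows that nearly half of the rainy days are missed.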

    from sklearn.metrics import recall_score, precision_score
    y_test = np.where(y_test=='No', 0, 1)
    y_pre_test = np.where(y_pre_test=='No', 0, 1)
    print(precision_score(y_test, y_pre_test))
    # These metrics also accept pos_label, so the string labels 'No'/'Yes' would not need converting
    print(recall_score(y_test, y_pre_test))

    0.7262092238470191
    0.510194404931247

    from sklearn.metrics import f1_score
    f1_score(y_test, y_pre_test)

    0.5993316004455997

    # Sanity check of the F1 formula 2*P*R/(P+R) with P = R = 0.8
    0.8*0.8*2/1.6

    0.8000000000000002

    from sklearn.metrics import roc_curve
    # y_pre_test holds hard 0/1 predictions, not probabilities, so the curve has a single interior point
    fpr, tpr, threshold = roc_curve(y_test, y_pre_test)
    plt.plot(fpr, tpr, c='b')
    plt.plot([0,1], [0,1], 'k--')

    [Figure: ROC curve (blue) against the diagonal chance line]
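Two points from the cell above can be sketched on synthetic data: `pos_label` lets precision and recall work on the string labels directly, and a ROC AUC is normally computed from the predicted probability of the positive class, not from the hard predictions. The data set, the variable names and `max_iter=1000` below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with string labels (an assumption for illustration).
X_demo, y_num = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=0
)
y_demo = np.where(y_num == 1, "Yes", "No")
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# pos_label lets the metrics work on the string labels directly.
precision = precision_score(y_te, y_pred, pos_label="Yes")
recall = recall_score(y_te, y_pred, pos_label="Yes")

# For ROC AUC, score with the probability of the positive class so that
# every threshold contributes a point to the curve.
yes_column = list(clf.classes_).index("Yes")
proba_yes = clf.predict_proba(X_te)[:, yes_column]
y_true = (y_te == "Yes").astype(int)
auc_from_proba = roc_auc_score(y_true, proba_yes)
auc_from_labels = roc_auc_score(y_true, (y_pred == "Yes").astype(int))

print(precision, recall, auc_from_proba, auc_from_labels)
```

With hard labels the curve collapses to one operating point, so the resulting AUC only reflects the 0.5 threshold; the probability-based value describes the ranking quality of the model across all thresholds.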

    from sklearn.metrics import roc_auc_score
    # y_pre_test holds hard 0/1 predictions, not probabilities
    ROC_AUC = roc_auc_score(y_test, y_pre_test)
    ROC_AUC

    0.7275782082543355

    %%time
    from sklearn.feature_selection import RFECV
    rfecv = RFECV(estimator=LR, step=1, cv=5, scoring='accuracy')
    rfecv = rfecv.fit(X_train_temp, y_train)

    Wall time: 17min 18s

    X_train_rfecv = rfecv.transform(X_train_temp)
    print(X_train_rfecv.shape)
    LR.fit(X_train_rfecv, y_train)
    X_test_rfecv = rfecv.transform(X_test_temp)
    y_pred_rfecv = LR.predict(X_test_rfecv)

    (113754, 100)

    from sklearn.metrics import accuracy_score
    # Convert y_test back to string labels so that it matches y_train
    y_test = np.where(y_test==0, 'No', 'Yes')
    LR.score(X_test_rfecv, y_test)

    0.8479552726889131

    %%time
    from sklearn.model_selection import GridSearchCV
    parameters = [{'C': list(range(50, 500, 25))}]
    # The L1 penalty needs a solver that supports it. This run relied on the old default (liblinear);
    # on scikit-learn 0.22 or later, pass solver='liblinear' explicitly.
    LR = LogisticRegression(penalty='l1', n_jobs=-1)
    grid_search = GridSearchCV(estimator=LR, param_grid=parameters, scoring='accuracy', cv=5, verbose=0)
    grid_search.fit(X_train_temp, y_train)

    Wall time: 9min 8s
    GridSearchCV(cv=5, error_score='raise-deprecating',
                 estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                              fit_intercept=True, intercept_scaling=1,
                                              l1_ratio=None, max_iter=100,
                                              multi_class='warn', n_jobs=-1,
                                              penalty='l1', random_state=None,
                                              solver='warn', tol=0.0001, verbose=0,
                                              warm_start=False),
                 iid='warn', n_jobs=None,
                 param_grid=[{'C': [50, 75, 100, 125, 150, 175, 200, 225, 250, 275,
                                    300, 325, 350, 375, 400, 425, 450, 475]}],
                 pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
                 scoring='accuracy', verbose=0)

    grid_search.best_score_, grid_search.best_params_

    (0.8479174358703869, {'C': 125})

    %%time
    # Embedded method
    from sklearn.feature_selection import SelectFromModel
    X_embedded = SelectFromModel(grid_search.best_estimator_, norm_order=1).fit_transform(X_train_temp, y_train)

    Wall time: 3.6 s
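The embedded method relies on the L1 penalty driving some coefficients to zero, after which SelectFromModel keeps only the features with non-negligible weights. The post does not show how many features survive, so here is a minimal sketch on synthetic data; the data set and the choice `C=0.05` are assumptions picked to make the sparsity visible, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with 4 informative features out of 20 (an assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0,
    random_state=0,
)
X_demo = StandardScaler().fit_transform(X_demo)

# A small C means strong L1 regularisation, which pushes many coefficients
# to exactly zero; liblinear is one of the solvers that supports L1.
l1_model = LogisticRegression(penalty="l1", C=0.05, solver="liblinear")

selector = SelectFromModel(l1_model).fit(X_demo, y_demo)
X_selected = selector.transform(X_demo)
kept = int(selector.get_support().sum())

print(X_demo.shape, X_selected.shape, kept)
```

With a large C such as the 125 found by the grid search, the penalty is weak and few coefficients reach zero, so checking `X_embedded.shape` would show how much selection actually took place.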

Inspecting the model through the distribution of its predicted probabilities

    # LR.predict_proba(X_test_temp) returns the predicted class probabilities for each sample
    LR = LogisticRegression(C=125)
    LR.fit(X_train_temp, y_train)
    y_predict_proba = LR.predict_proba(X_test_temp)
    y_predict_proba  # columns 0 and 1 hold the probabilities of 'No' and 'Yes' respectively

    array([[0.79898701, 0.20101299],
           [0.86553243, 0.13446757],
           [0.77149679, 0.22850321],
           ...,
           [0.98964495, 0.01035505],
           [0.83904762, 0.16095238],
           [0.17714232, 0.82285768]])

    sns.set(style="whitegrid")
    sns.distplot(a=y_predict_proba[:, 1], bins=10)

    [Figure: distribution of the predicted probability of 'Yes']

    plt.hist(y_predict_proba[:, 1], bins=10)

    (array([13451.,  4906.,  2661.,  1638.,  1341.,  1073.,   908.,   809.,   838.,   814.]),
     array([8.40201955e-04, 1.00724269e-01, 2.00608337e-01, 3.00492404e-01,
            4.00376471e-01, 5.00260538e-01, 6.00144606e-01, 7.00028673e-01,
            7.99912740e-01, 8.99796807e-01, 9.99680875e-01]),
     <a list of 10 Patch objects>)

    sns.distplot(a=y_predict_proba[:, 0], bins=10)

    [Figure: distribution of the predicted probability of 'No']

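1. The probability distribution is heavily skewed.
2. For most test samples the predicted probability of class 1 ('Yes') is below 0.5, so the model is seldom confident that it will rain.

# TODO: handle the severe class imbalance with oversampling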
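The TODO above could be approached in several ways. One simple option is random oversampling of the minority class with `sklearn.utils.resample`, applied to the training split only. The sketch below uses a made-up frame with roughly the 78/22 No/Yes ratio seen in the test-set support; the column values and variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy training frame with a 78/22 class split (values are made up).
train = pd.DataFrame({
    "Humidity3pm": np.arange(100, dtype=float),
    "RainTomorrow": ["No"] * 78 + ["Yes"] * 22,
})

majority = train[train["RainTomorrow"] == "No"]
minority = train[train["RainTomorrow"] == "Yes"]

# Sample the minority class with replacement until it matches the majority.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["RainTomorrow"].value_counts().to_dict())
```

An alternative that avoids duplicating rows is `class_weight='balanced'`, which LogisticRegression and RandomForestClassifier both accept. Either way, the test split should keep its original class ratio so that the reported metrics stay comparable.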

Random forest

    from sklearn.ensemble import RandomForestClassifier
    rfc = RandomForestClassifier(n_estimators=200)
    rfc.fit(X_train_temp, y_train)

    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                           max_depth=None, max_features='auto', max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=200,
                           n_jobs=None, oob_score=False, random_state=None,
                           verbose=0, warm_start=False)

    rfc.score(X_test_temp, y_test)  # Ugh. After tuning LR for so long, I should have run every model first.

    0.856886669714125

Random forest tuning

    %%time
    m_list = []
    for depth in range(5, 100, 4):
        rfc = RandomForestClassifier(n_estimators=150, max_depth=depth)
        rfc.fit(X_train_temp, y_train)
        score = rfc.score(X_test_temp, y_test)
        m_list.append(score)
    plt.plot(m_list)
    plt.show()

    [Figure: test accuracy against position in the max_depth grid]

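    Wall time: 17min 9s

    m_list.index(max(m_list))

    14

    m_list[14]

    0.858328351911108

    # 14 is the position of the best score in the depth grid, which corresponds to
    # max_depth = 5 + 14 * 4 = 61; the searches below use max_depth=14
    score_list = list()
    for num in range(500, 1500, 100):
        rfc = RandomForestClassifier(n_estimators=200, max_depth=14, max_leaf_nodes=num, n_jobs=-1)
        rfc.fit(X_train_temp, y_train)
        score = rfc.score(X_test_temp, y_test)
        score_list.append(score)
    plt.plot(list(range(500, 1500, 100)), score_list)
    plt.show()

    [Figure: test accuracy against max_leaf_nodes from 500 to 1400]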
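The value returned by `m_list.index(max(m_list))` is a position in the list of scores, not a tree depth. Mapping it back to the depth grid takes one lookup, sketched here with the same range as the tuning loop (the variable names are illustrative):

```python
# The depth grid used in the tuning loop.
depths = list(range(5, 100, 4))

# Position of the best score, as reported by m_list.index(max(m_list)).
best_index = 14

# The corresponding max_depth is the grid value at that position.
best_depth = depths[best_index]
print(len(depths), best_depth)
```

Plotting `depths` on the x-axis, as the later cells do with `range(500, 1500, 100)`, makes this mapping visible in the figure and avoids the confusion.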

    %matplotlib inline
    # The score is clearly being capped, so max_leaf_nodes needs to go higher
    del score_list
    score_list = []
    for k in range(1500, 2500, 100):
        rfc = RandomForestClassifier(n_estimators=200, max_depth=14, max_leaf_nodes=k, n_jobs=-1)
        rfc.fit(X_train_temp, y_train)
        score = rfc.score(X_test_temp, y_test)
        score_list.append(score)
    plt.plot(range(1500, 2500, 100), score_list)
    plt.show()

    [Figure: test accuracy against max_leaf_nodes from 1500 to 2400]

    %%time
    rfc = RandomForestClassifier(n_estimators=200, max_depth=19, max_features=17, max_leaf_nodes=1100, n_jobs=-1)
    rfc.fit(X_train_temp, y_train)
    rfc.score(X_test_temp, y_test)

    Wall time: 24.9 s
    0.8518583635148915
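The tuning loops above pick hyperparameters by scoring on the test set, which means the final test accuracy is no longer an unbiased estimate. A common alternative is to search with cross-validation on the training split and touch the test split once at the end. The sketch below uses synthetic data and a deliberately tiny grid; the data, the grid values and `n_estimators=50` are placeholders, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Small illustrative grid; the values are placeholders, not tuned settings.
param_grid = {"max_depth": [4, 8, None], "max_leaf_nodes": [20, 50, None]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
)
# Hyperparameters are chosen by cross-validation on the training split only.
search.fit(X_tr, y_tr)

# The test split is used once, after the search has finished.
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```

Searching both parameters jointly also avoids fixing max_depth before max_leaf_nodes is explored, since the two interact.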

Naive Bayes

    # Gaussian naive Bayes
    from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB
    gaussian_ = GaussianNB()
    gaussian_.fit(X_train_temp, y_train)
    gaussian_.score(X_test_temp, y_test)

    0.6483701958578009

Artificial neural network

    %%time
    from sklearn.neural_network import MLPClassifier
    clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(50, 50), random_state=1)
    clf.fit(X_train_temp, y_train)

    Wall time: 2min 44s
    MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
                  beta_2=0.999, early_stopping=False, epsilon=1e-08,
                  hidden_layer_sizes=(50, 50, 50), learning_rate='constant',
                  learning_rate_init=0.001, max_iter=200, momentum=0.9,
                  n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
                  random_state=1, shuffle=True, solver='lbfgs', tol=0.0001,
                  validation_fraction=0.1, verbose=False, warm_start=False)

    clf.score(X_test_temp, y_test)

    0.8431731073525792
