Machine Learning Data Science Packages, Day 2: pandas (2)

mac  2024-11-02  11

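1. reindex
The reindex method in pandas conforms a Series or DataFrame to a new set of labels, which lets you add or remove index entries. Labels that were not in the original object are filled with NaN.
Methods: Series.reindex(), DataFrame.reindex()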

In[2]: import numpy as np
In[3]: import pandas as pd
# dates is not defined in the captured session; the index shown in Out[7] is
# consistent with something like dates=pd.date_range('20160101',periods=6)
In[6]: df=pd.DataFrame(np.random.random((6,4)),index=dates,columns=list('ABCD'))
In[7]: df
Out[7]: 
                   A         B         C         D
2016-01-01  0.196507  0.019824  0.330309  0.289678
2016-01-02  0.658054  0.236342  0.429518  0.322824
2016-01-03  0.369265  0.855443  0.161498  0.763341
2016-01-04  0.383210  0.953314  0.364178  0.719806
2016-01-05  0.135650  0.713118  0.609763  0.752052
2016-01-06  0.470165  0.002966  0.393910  0.137355
In[10]: df1=df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
In[11]: df1
Out[11]: 
                   A         B         C         D   E
2016-01-01  0.196507  0.019824  0.330309  0.289678 NaN
2016-01-02  0.658054  0.236342  0.429518  0.322824 NaN
2016-01-03  0.369265  0.855443  0.161498  0.763341 NaN
2016-01-04  0.383210  0.953314  0.364178  0.719806 NaN
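The session above can be condensed into a small self-contained sketch. The data here is deterministic (np.arange instead of random numbers) and the date range is an assumption chosen to match the index shown above; fill_value is a standard reindex parameter that the original session does not use.

```python
import numpy as np
import pandas as pd

# Assumed setup: a 6-day date index like the one in the session above.
dates = pd.date_range("2016-01-01", periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4),
                  index=dates, columns=list("ABCD"))

# Keep the first four rows and add a new column 'E'.
# Labels that did not exist before are filled with NaN.
df1 = df.reindex(index=dates[:4], columns=list(df.columns) + ["E"])

# fill_value replaces the default NaN for newly created labels.
df2 = df.reindex(columns=list(df.columns) + ["E"], fill_value=0.0)

# reindex returns a new object; the original frame is left unchanged.
print(df1.shape)
print(df1["E"].isna().all())
print(df.shape)
```

Because reindex returns a new object, the result has to be assigned to a name (df1 here) to be kept.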

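2. Handling NaN values
.dropna(): drops the rows that contain NaN
.fillna(value): replaces every NaN with value
.isnull(): returns a boolean DataFrame with the same shape as the original, True wherever the element is NaN
.any(): reports, for each column (or for each row with axis=1), whether any element is True; applied to the result of .isnull(), it shows which columns contain NaN
NaN values are skipped when computing the mean, the sum, and similar aggregates.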

In[12]: df.columns
Out[12]: Index(['A', 'B', 'C', 'D'], dtype='object')
In[13]: df1.loc[dates[1:3],'E']=2
In[14]: df1
Out[14]: 
                   A         B         C         D    E
2016-01-01  0.196507  0.019824  0.330309  0.289678  NaN
2016-01-02  0.658054  0.236342  0.429518  0.322824  2.0
2016-01-03  0.369265  0.855443  0.161498  0.763341  2.0
2016-01-04  0.383210  0.953314  0.364178  0.719806  NaN
In[15]: df1.dropna()
Out[15]: 
                   A         B         C         D    E
2016-01-02  0.658054  0.236342  0.429518  0.322824  2.0
2016-01-03  0.369265  0.855443  0.161498  0.763341  2.0
In[16]: df1.fillna(value=5)
Out[16]: 
                   A         B         C         D    E
2016-01-01  0.196507  0.019824  0.330309  0.289678  5.0
2016-01-02  0.658054  0.236342  0.429518  0.322824  2.0
2016-01-03  0.369265  0.855443  0.161498  0.763341  2.0
2016-01-04  0.383210  0.953314  0.364178  0.719806  5.0
In[17]: pd.isnull(df1)
Out[17]: 
                A      B      C      D      E
2016-01-01  False  False  False  False   True
2016-01-02  False  False  False  False  False
2016-01-03  False  False  False  False  False
2016-01-04  False  False  False  False   True
In[18]: pd.isnull(df1).any()
Out[18]: 
A    False
B    False
C    False
D    False
E     True
dtype: bool
In[19]: pd.isnull(df1).any().any()
Out[19]: True
In[20]: df1.mean()  # NaN values are excluded from the mean
Out[20]: 
A    0.401759
B    0.516231
C    0.321376
D    0.523912
E    2.000000
dtype: float64
In[21]: df1.cumsum()
Out[21]: 
                   A         B         C         D    E
2016-01-01  0.196507  0.019824  0.330309  0.289678  NaN
2016-01-02  0.854561  0.256167  0.759827  0.612502  2.0
2016-01-03  1.223826  1.111609  0.921325  1.375843  4.0
2016-01-04  1.607035  2.064923  1.285503  2.095649  NaN
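As a compact recap of the NaN handling shown above, here is a minimal sketch on a small hand-made frame (the values are illustrative, not taken from the session). It also contrasts the default skipna=True behaviour of mean() with skipna=False, which the original session does not show.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: column 'E' has NaN in the first and last rows.
df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                    "E": [np.nan, 2.0, 2.0, np.nan]})

dropped = df1.dropna()                  # rows containing any NaN are removed
filled = df1.fillna(value=5)            # every NaN is replaced by 5
has_nan_per_col = df1.isnull().any()    # one bool per column
has_any_nan = df1.isnull().any().any()  # single bool for the whole frame

mean_skip = df1["E"].mean()                 # NaN ignored by default
mean_strict = df1["E"].mean(skipna=False)   # NaN propagates

print(dropped)
print(filled)
print(has_nan_per_col)
print(has_any_nan, mean_skip, mean_strict)
```

None of these calls modifies df1 in place; each returns a new object.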

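3. Some functions
.apply(func): applies the function func to the data, column by column by default
.value_counts(): lists each distinct value together with how many times it occurs
.mode(): returns the most frequent value(s)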

In[29]: df.apply(np.cumsum)
Out[29]: 
                   A         B         C         D
2016-01-01  0.196507  0.019824  0.330309  0.289678
2016-01-02  0.854561  0.256167  0.759827  0.612502
2016-01-03  1.223826  1.111609  0.921325  1.375843
2016-01-04  1.607035  2.064923  1.285503  2.095649
2016-01-05  1.742685  2.778041  1.895266  2.847701
2016-01-06  2.212849  2.781007  2.289176  2.985057
In[30]: df.apply(lambda x:x.max()-x.min())
Out[30]: 
A    0.522404
B    0.950347
C    0.448265
D    0.625986
dtype: float64
In[31]: def _sum(x):
   ...:     print(type(x))
   ...:     return x.sum()
   ...: df.apply(_sum)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
Out[31]: 
A    2.212849
B    2.781007
C    2.289176
D    2.985057
dtype: float64
In[32]: s=pd.Series(np.random.randint(10,20,size=20))
In[33]: s
Out[33]: 
0     18
1     16
2     11
3     11
4     17
5     14
6     18
7     17
8     13
9     13
10    11
11    11
12    16
13    17
14    17
15    13
16    12
17    15
18    14
19    10
dtype: int32
In[34]: s.value_counts()
Out[34]: 
17    4
11    4
13    3
18    2
16    2
14    2
15    1
12    1
10    1
dtype: int64
In[35]: s.mode()
Out[35]: 
0    11
1    17
dtype: int32
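The session above only applies functions column by column. The sketch below, on made-up integer data, adds the axis=1 variant so the function receives one row at a time, and repeats value_counts() and mode() on a short Series where the result is easy to check by hand.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 5, 3], "B": [10, 20, 40]})

# Default axis=0: the lambda receives each column as a Series.
col_range = df.apply(lambda x: x.max() - x.min())

# axis=1: the lambda receives each row as a Series.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)

s = pd.Series([11, 17, 11, 13, 17, 12])
counts = s.value_counts()   # occurrences per distinct value
modes = s.mode()            # all values tied for most frequent, sorted

print(col_range)
print(row_range)
print(counts)
print(modes)
```

mode() returns a Series rather than a scalar because several values can be tied for the highest count, as 11 and 17 are here.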

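pd.concat(): concatenates several pandas objects into one
.all(): the counterpart of .any(); reports, for each column or row, whether all elements are True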

In[36]: df=pd.DataFrame(np.random.random((10,4)),columns=list('ABCD'))
In[37]: df
Out[37]: 
          A         B         C         D
0  0.309187  0.109764  0.237555  0.878088
1  0.008201  0.082768  0.939499  0.755231
2  0.203507  0.882972  0.166033  0.899489
3  0.528112  0.976442  0.405005  0.476885
4  0.557219  0.404936  0.975680  0.312243
5  0.990264  0.643145  0.396265  0.936465
6  0.027994  0.552443  0.277969  0.985753
7  0.325813  0.469911  0.432550  0.276821
8  0.805665  0.230130  0.561014  0.673377
9  0.010867  0.485834  0.512464  0.527696
In[38]: df.iloc[1:3]
Out[38]: 
          A         B         C         D
1  0.008201  0.082768  0.939499  0.755231
2  0.203507  0.882972  0.166033  0.899489
In[39]: df.iloc[3:7]
Out[39]: 
          A         B         C         D
3  0.528112  0.976442  0.405005  0.476885
4  0.557219  0.404936  0.975680  0.312243
5  0.990264  0.643145  0.396265  0.936465
6  0.027994  0.552443  0.277969  0.985753
In[40]: df.iloc[7:]
Out[40]: 
          A         B         C         D
7  0.325813  0.469911  0.432550  0.276821
8  0.805665  0.230130  0.561014  0.673377
9  0.010867  0.485834  0.512464  0.527696
In[41]: df1=pd.concat([df.iloc[1:3],df.iloc[3:7],df.iloc[7:]])
In[42]: df1
Out[42]: 
          A         B         C         D
1  0.008201  0.082768  0.939499  0.755231
2  0.203507  0.882972  0.166033  0.899489
3  0.528112  0.976442  0.405005  0.476885
4  0.557219  0.404936  0.975680  0.312243
5  0.990264  0.643145  0.396265  0.936465
6  0.027994  0.552443  0.277969  0.985753
7  0.325813  0.469911  0.432550  0.276821
8  0.805665  0.230130  0.561014  0.673377
9  0.010867  0.485834  0.512464  0.527696
In[46]: df1=pd.concat([df.iloc[:3],df.iloc[3:7],df.iloc[7:]])
In[47]: df1
Out[47]: 
          A         B         C         D
0  0.309187  0.109764  0.237555  0.878088
1  0.008201  0.082768  0.939499  0.755231
2  0.203507  0.882972  0.166033  0.899489
3  0.528112  0.976442  0.405005  0.476885
4  0.557219  0.404936  0.975680  0.312243
5  0.990264  0.643145  0.396265  0.936465
6  0.027994  0.552443  0.277969  0.985753
7  0.325813  0.469911  0.432550  0.276821
8  0.805665  0.230130  0.561014  0.673377
9  0.010867  0.485834  0.512464  0.527696
In[48]: df==df1
Out[48]: 
      A     B     C     D
0  True  True  True  True
1  True  True  True  True
2  True  True  True  True
3  True  True  True  True
4  True  True  True  True
5  True  True  True  True
6  True  True  True  True
7  True  True  True  True
8  True  True  True  True
9  True  True  True  True
In[49]: (df==df1).all()
Out[49]: 
A    True
B    True
C    True
D    True
dtype: bool
In[50]: (df==df1).all().all()
Out[50]: True
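The split-and-rebuild check above can be sketched as follows, again with deterministic data in place of random numbers. Two things here go beyond the original session and are included only as related options: DataFrame.equals() as an alternative to (df==df1).all().all(), and ignore_index=True to renumber the rows after concatenation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40, dtype=float).reshape(10, 4),
                  columns=list("ABCD"))

# Split into three row blocks and glue them back together.
pieces = [df.iloc[:3], df.iloc[3:7], df.iloc[7:]]
rebuilt = pd.concat(pieces)

same_elementwise = (df == rebuilt).all().all()
same_equals = df.equals(rebuilt)

# NaN is never equal to itself under ==, so the elementwise check
# reports a mismatch, while equals() treats NaN in the same position as equal.
with_nan = df.copy()
with_nan.loc[0, "A"] = np.nan
nan_elementwise = (with_nan == with_nan).all().all()
nan_equals = with_nan.equals(with_nan)

# ignore_index=True discards the original row labels and renumbers from 0.
renumbered = pd.concat([df.iloc[7:], df.iloc[:3]], ignore_index=True)

print(same_elementwise, same_equals)
print(nan_elementwise, nan_equals)
print(renumbered.index.tolist())
```

If the frames being compared may contain NaN, the elementwise comparison alone is not a reliable equality test.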

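pd.merge(): joins two pandas objects on the key column named by on='...'. With the default how='inner' this behaves like a database inner join, keeping only the keys present in both tables.
.append(): appends a row, such as a Series, to a DataFrame. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat() takes its place.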

In[4]: left=pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})
In[5]: right=pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})
In[6]: left
Out[6]: 
   key  lval
0  foo     1
1  foo     2
In[7]: right
Out[7]: 
   key  rval
0  foo     4
1  foo     5
In[9]: pd.merge(left,right,on='key')
Out[9]: 
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
In[10]: df=pd.DataFrame(np.random.random((10,4)),columns=list('ABCD'))
In[11]: df
Out[11]: 
          A         B         C         D
0  0.917928  0.396669  0.610322  0.666497
1  0.508628  0.718526  0.111458  0.577437
2  0.638503  0.974669  0.321745  0.873634
3  0.792875  0.402313  0.730289  0.441393
4  0.554515  0.133467  0.598767  0.487647
5  0.366506  0.740182  0.365446  0.167777
6  0.483763  0.080658  0.259861  0.766983
7  0.754270  0.632751  0.117197  0.149026
8  0.041530  0.155985  0.052181  0.960280
9  0.469961  0.825286  0.781265  0.888180
In[12]: s=pd.Series(np.random.randint(1,5,size=4),index=list('ABCD'))
In[13]: df.append(s,ignore_index=True)
Out[13]: 
           A         B         C         D
0   0.917928  0.396669  0.610322  0.666497
1   0.508628  0.718526  0.111458  0.577437
2   0.638503  0.974669  0.321745  0.873634
3   0.792875  0.402313  0.730289  0.441393
4   0.554515  0.133467  0.598767  0.487647
5   0.366506  0.740182  0.365446  0.167777
6   0.483763  0.080658  0.259861  0.766983
7   0.754270  0.632751  0.117197  0.149026
8   0.041530  0.155985  0.052181  0.960280
9   0.469961  0.825286  0.781265  0.888180
10  2.000000  2.000000  3.000000  2.000000
In[15]: s=pd.Series(np.random.randint(1,5,size=5),index=list('ABCDE'))
In[16]: s=pd.Series(np.random.randint(1,5,size=5),index=list('ABCDE'))
In[17]: df.append(s)
In[18]: df.append(s,ignore_index=True)
Out[18]: 
           A         B         C         D    E
0   0.917928  0.396669  0.610322  0.666497  NaN
1   0.508628  0.718526  0.111458  0.577437  NaN
2   0.638503  0.974669  0.321745  0.873634  NaN
3   0.792875  0.402313  0.730289  0.441393  NaN
4   0.554515  0.133467  0.598767  0.487647  NaN
5   0.366506  0.740182  0.365446  0.167777  NaN
6   0.483763  0.080658  0.259861  0.766983  NaN
7   0.754270  0.632751  0.117197  0.149026  NaN
8   0.041530  0.155985  0.052181  0.960280  NaN
9   0.469961  0.825286  0.781265  0.888180  NaN
10  1.000000  4.000000  4.000000  4.000000  1.0
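Two follow-ups to the session above, as a hedged sketch on made-up tables. First, the how parameter of pd.merge controls which keys survive; the session only shows the default inner join. Second, since DataFrame.append is no longer available in pandas 2.0 and later, one way to get the same row-append effect is to turn the Series into a one-row frame and use pd.concat; this is an assumed replacement pattern, not the method used in the original post.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

# Default how="inner": only keys present in both tables are kept.
inner = pd.merge(left, right, on="key")

# how="outer": keys from either table are kept; missing values become NaN.
outer = pd.merge(left, right, on="key", how="outer")

# Row append without DataFrame.append: convert the Series to a one-row
# DataFrame (to_frame().T) and concatenate. Column 'E' does not exist in df,
# so the existing rows get NaN there, as in Out[18] above.
df = pd.DataFrame({"A": [0.1, 0.2], "B": [0.3, 0.4]})
s = pd.Series({"A": 1, "B": 2, "E": 3})
appended = pd.concat([df, s.to_frame().T], ignore_index=True)

print(inner)
print(outer)
print(appended)
```

When many rows need to be added, collecting them in a list and calling pd.concat once is generally cheaper than concatenating inside a loop.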