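Machine Learning Notes (2): NumPy
Introduction
NumPy (Numerical Python) is an open-source scientific computing library for Python, built for fast processing of arrays of any dimension.
NumPy supports the common array and matrix operations. For the same numerical task, NumPy code is much more concise than plain Python.
NumPy handles multidimensional arrays through the ndarray object, a fast and flexible container for large datasets.
Why it matters
Comparing the speed of ndarray and the native Python list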
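import random
import time
import numpy as np

# Build a Python list of 10 million random floats
a = []
for i in range(10000000):
    a.append(random.random())

# %time is an IPython/Jupyter magic that reports how long a statement takes
%time sum1 = sum(a)

b = np.array(a)
%time sum2 = np.sum(b)
Wall time: 52 ms
Wall time: 13 ms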
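The timings show that summing the ndarray is several times faster.
Machine learning is dominated by large-scale numerical computation. Without a fast way to do it, Python would probably not be as effective in this field as it is today.
NumPy is designed specifically around operations on ndarray, so its storage efficiency and input/output performance are far better than those of nested Python lists, and the larger the array, the bigger NumPy's advantage.
NumPy's core is written in C, and many of its array operations release the GIL (global interpreter lock) while they run, so they are not limited by the speed of the Python interpreter. This is why NumPy is far more efficient than pure Python code.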
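The comparison above relies on the `%time` magic, which only exists in IPython/Jupyter. As a rough sketch, the same measurement can be made in a plain script with `time.perf_counter`. The helper name `time_call` and the smaller array size are my own choices here, and the absolute timings will vary from machine to machine.

```python
import time
import numpy as np

def time_call(fn, *args):
    """Return (result, elapsed seconds) for a single call of fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# One million values; smaller than the 10 million used above to keep it quick
values = np.random.random(1_000_000)
as_list = values.tolist()

list_sum, list_seconds = time_call(sum, as_list)
array_sum, array_seconds = time_call(np.sum, values)

print(f"list sum : {list_seconds * 1000:.2f} ms")
print(f"array sum: {array_seconds * 1000:.2f} ms")
# Both approaches compute the same total, up to floating-point rounding
print(np.isclose(list_sum, array_sum))
```

A single call is a noisy measurement; for anything more serious the standard-library `timeit` module is probably the better tool.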
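Basic data structure
A NumPy array is a multidimensional array object called ndarray. It consists of two parts:
① the actual data
② the metadata that describes the data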
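To make the data/metadata split a little more concrete, here is a small sketch. It assumes an explicit `int64` dtype so that the byte counts are the same on every platform; the names `flat` and `grid` are only for illustration. Reshaping leaves the data buffer alone and only changes the metadata (shape and strides).

```python
import numpy as np

# Fix the dtype so the byte counts below do not depend on the platform
flat = np.arange(6, dtype=np.int64)
grid = flat.reshape(2, 3)

# Same underlying data buffer ...
print(np.shares_memory(flat, grid))   # True
# ... described by different metadata
print(flat.shape, flat.strides)       # (6,) (8,)
print(grid.shape, grid.strides)       # (2, 3) (24, 8)

# Writing through one array is visible through the other
grid[1, 2] = 100
print(flat)                           # [  0   1   2   3   4 100]
```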
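The multidimensional array ndarray
import numpy as np

ar = np.array([1,2,3,4,5,6,7])
print(ar)           # print the array; note that the elements are not separated by commas
print(ar.ndim)      # number of axes (dimensions)
print(ar.shape)     # size of the array along each axis
print(ar.size)      # total number of elements
print(ar.dtype)     # element type
print(ar.itemsize)  # size of each element in bytes
print(ar.data)      # buffer that holds the actual array elements
ar                  # in a notebook, a bare name shows the array's repr
[1 2 3 4 5 6 7]
1
(7,)
7
int32
4
<memory at 0x0000000005509348>
array([1, 2, 3, 4, 5, 6, 7])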
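Creating arrays
Creating an array with array(): the argument can be a list, a tuple, another array, a range object, and so on. A true generator is not expanded element by element; convert it with list() first or use np.fromiter().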
ar1 = np.array(range(10))
ar2 = np.array([1,2,3.14,4,5])            # mixed ints and floats become float64
ar3 = np.array([[1,2,3],('a','b','c')])   # nested sequences of equal length give a 2-D array
# Rows of unequal length: recent NumPy versions require dtype=object here
ar4 = np.array([[1,2,3],('a','b','c','d')], dtype=object)
print(ar1,type(ar1),ar1.dtype)
print(ar2,type(ar2),ar2.dtype)
print(ar3,ar3.shape,ar3.ndim,ar3.size)
print(ar4,ar4.shape,ar4.ndim,ar4.size)
[0 1 2 3 4 5 6 7 8 9] <class 'numpy.ndarray'> int32
[1.   2.   3.14 4.   5.  ] <class 'numpy.ndarray'> float64
[['1' '2' '3']
 ['a' 'b' 'c']] (2, 3) 2 6
[list([1, 2, 3]) ('a', 'b', 'c', 'd')] (2,) 1 2
Creating an array with arange(): similar to range(), it returns evenly spaced values within a given interval.
print(np.arange(10))
print(np.arange(10.0))
print(np.arange(5,12))
print(np.arange(5.0,12,2))
print(np.arange(10000))
[0 1 2 3 4 5 6 7 8 9]
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[ 5  6  7  8  9 10 11]
[ 5.  7.  9. 11.]
[   0    1    2 ... 9997 9998 9999]
Creating an array with linspace(): it returns num evenly spaced samples computed over the interval [start, stop].
ar1 = np.linspace(2.0, 3.0, num=5)
ar2 = np.linspace(2.0, 3.0, num=5, endpoint=False)   # exclude the stop value
ar3 = np.linspace(2.0, 3.0, num=5, retstep=True)     # also return the step size
print(ar1,type(ar1))
print(ar2)
print(ar3,type(ar3))
[2.   2.25 2.5  2.75 3.  ] <class 'numpy.ndarray'>
[2.  2.2 2.4 2.6 2.8]
(array([2.  , 2.25, 2.5 , 2.75, 3.  ]), 0.25) <class 'tuple'>
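The step sizes in the output above follow directly from the number of gaps between samples. The sketch below reuses the same start, stop and num values and only checks that relationship; the variable names are illustrative.

```python
import numpy as np

start, stop, num = 2.0, 3.0, 5

# endpoint=True (the default): num points span [start, stop], so num - 1 gaps
closed, closed_step = np.linspace(start, stop, num=num, retstep=True)
# endpoint=False: num points span [start, stop), so num gaps
half_open, open_step = np.linspace(start, stop, num=num, endpoint=False, retstep=True)

print(closed_step, (stop - start) / (num - 1))
print(open_step, (stop - start) / num)

# The last sample reaches stop only when the endpoint is included
print(closed[-1] == stop, half_open[-1] < stop)
```

In short, arange() is given the step and works out the count, while linspace() is given the count and works out the step.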
Creating arrays with zeros() / zeros_like() / ones() / ones_like()
ar1 = np.zeros(5)
# np.int was removed in NumPy 1.24; the built-in int works in every version
ar2 = np.zeros((2,2), dtype=int)
print(ar1,ar1.dtype)
print(ar2,ar2.dtype)
print('------')
ar3 = np.array([list(range(5)),list(range(5,10))])
ar4 = np.zeros_like(ar3)   # same shape and dtype as ar3, filled with zeros
print(ar3)
print(ar4)
print('------')
ar5 = np.ones(9)
ar6 = np.ones((2,3,4))
ar7 = np.ones_like(ar3)    # same shape and dtype as ar3, filled with ones
print(ar5)
print(ar6)
print(ar7)
[0. 0. 0. 0. 0.] float64
[[0 0]
 [0 0]] int32
------
[[0 1 2 3 4]
 [5 6 7 8 9]]
[[0 0 0 0 0]
 [0 0 0 0 0]]
------
[1. 1. 1. 1. 1. 1. 1. 1. 1.]
[[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]
[[1 1 1 1 1]
 [1 1 1 1 1]]
Creating an array with eye(): an N×N identity matrix
print(np.eye(5))
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
ndarray data types
[Image not available in the source]
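Since the original image is not available, the following is not a reconstruction of it, only a small sketch that lists a few commonly used dtypes and the size of one element of each. The selection of dtypes is my own.

```python
import numpy as np

# A few frequently used dtypes and the size of one element in bytes
for name in ["bool", "int8", "int32", "int64", "uint8", "float32", "float64", "complex128"]:
    dt = np.dtype(name)
    print(f"{dt.name:<12} kind={dt.kind} itemsize={dt.itemsize}")

# The dtype can be requested explicitly when an array is created
small = np.array([1, 2, 3], dtype=np.float32)
print(small.dtype, small.itemsize)   # float32 4
```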
General-purpose functions
Array shape
.T / .reshape() / .resize()
ar1 = np.arange(10)
ar2 = np.ones((5,2))
print(ar1,'\n',ar1.T)   # transposing a 1-D array leaves it unchanged
print(ar2,'\n',ar2.T)   # .T swaps the axes: (5, 2) becomes (2, 5)
print('------')
ar3 = ar1.reshape(2,5)                  # method form
ar4 = np.zeros((4,6)).reshape(3,8)      # the total number of elements must stay the same
ar5 = np.reshape(np.arange(12),(3,4))   # function form
print(ar1,'\n',ar3)
print(ar4)
print(ar5)
print('------')
ar6 = np.resize(np.arange(5),(3,4))     # repeats the data to fill the new shape
print(ar6)
[0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]
[[1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]]
 [[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]
------
[0 1 2 3 4 5 6 7 8 9]
 [[0 1 2 3 4]
 [5 6 7 8 9]]
[[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
------
[[0 1 2 3]
 [4 0 1 2]
 [3 4 0 1]]
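Two details of the output above are easy to miss: reshape() usually returns a view on the same data, and np.resize() builds a new array by repeating the input. A minimal sketch with made-up variable names:

```python
import numpy as np

base = np.arange(10)

# -1 asks reshape to infer that dimension from the total size
inferred = base.reshape(2, -1)
print(inferred.shape)          # (2, 5)

# reshape returned a view here, so a write shows up in the original
inferred[0, 0] = 99
print(base[0])                 # 99

# np.resize builds a new array, repeating the input as needed
repeated = np.resize(np.arange(3), (2, 4))
print(repeated)
# [[0 1 2 0]
#  [1 2 0 1]]

# reshape refuses a shape whose size does not match
try:
    base.reshape(3, 4)
except ValueError as err:
    print("reshape failed:", err)
```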
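Copying arrays
ar1 = np.arange(10)
ar2 = ar1
print(ar2 is ar1)   # plain assignment binds a second name to the same array
ar1[2] = 9
print(ar1,ar2)      # so a change through ar1 is visible through ar2
ar3 = ar1.copy()
print(ar3 is ar1)   # copy() creates a new, independent array
ar1[0] = 9
print(ar1,ar3)      # ar3 is not affected
True
[0 1 9 3 4 5 6 7 8 9] [0 1 9 3 4 5 6 7 8 9]
False
[9 1 9 3 4 5 6 7 8 9] [0 1 9 3 4 5 6 7 8 9]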
Converting the array type
.astype()
ar1 = np.arange(10,dtype=float)
print(ar1,ar1.dtype)
print('-----')
ar2 = ar1.astype(np.int32)   # astype() returns a new array; ar1 itself is unchanged
print(ar2,ar2.dtype)
print(ar1,ar1.dtype)
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.] float64
-----
[0 1 2 3 4 5 6 7 8 9] int32
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.] float64
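The example above converts whole numbers, so nothing is lost. With fractional values, a float-to-int cast simply drops the fractional part, which is worth seeing once. The `prices` array below is made-up data for illustration.

```python
import numpy as np

prices = np.array([1.9, -1.9, 2.5, 3.5])

# A float -> int cast drops the fractional part (truncates toward zero)
truncated = prices.astype(np.int32)
print(truncated)               # [ 1 -1  2  3]

# Round first if nearest-integer behaviour is wanted; np.rint rounds halves to even
rounded = np.rint(prices).astype(np.int32)
print(rounded)                 # [ 2 -2  2  4]
```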
Stacking arrays
a = np.arange(5)
b = np.arange(5,9)
ar1 = np.hstack((a,b))   # hstack: join horizontally (column-wise)
print(a,a.shape)
print(b,b.shape)
print(ar1,ar1.shape)
a = np.array([[1],[2],[3]])
b = np.array([['a'],['b'],['c']])
ar2 = np.hstack((a,b))   # for 2-D inputs the row counts must match
print(a,a.shape)
print(b,b.shape)
print(ar2,ar2.shape)
print('-----')
a = np.arange(5)
b = np.arange(5,10)
ar1 = np.vstack((a,b))   # vstack: join vertically (row-wise)
print(a,a.shape)
print(b,b.shape)
print(ar1,ar1.shape)
a = np.array([[1],[2],[3]])
b = np.array([['a'],['b'],['c'],['d']])
ar2 = np.vstack((a,b))   # the column counts must match; the row counts may differ
print(a,a.shape)
print(b,b.shape)
print(ar2,ar2.shape)
print('-----')
a = np.arange(5)
b = np.arange(5,10)
ar1 = np.stack((a,b))            # stack: join along a new axis (axis=0 by default)
ar2 = np.stack((a,b),axis = 1)   # axis=1 places the new axis second
print(a,a.shape)
print(b,b.shape)
print(ar1,ar1.shape)
print(ar2,ar2.shape)
[0 1 2 3 4] (5,)
[5 6 7 8] (4,)
[0 1 2 3 4 5 6 7 8] (9,)
[[1]
 [2]
 [3]] (3, 1)
[['a']
 ['b']
 ['c']] (3, 1)
[['1' 'a']
 ['2' 'b']
 ['3' 'c']] (3, 2)
-----
[0 1 2 3 4] (5,)
[5 6 7 8 9] (5,)
[[0 1 2 3 4]
 [5 6 7 8 9]] (2, 5)
[[1]
 [2]
 [3]] (3, 1)
[['a']
 ['b']
 ['c']
 ['d']] (4, 1)
[['1']
 ['2']
 ['3']
 ['a']
 ['b']
 ['c']
 ['d']] (7, 1)
-----
[0 1 2 3 4] (5,)
[5 6 7 8 9] (5,)
[[0 1 2 3 4]
 [5 6 7 8 9]] (2, 5)
[[0 5]
 [1 6]
 [2 7]
 [3 8]
 [4 9]] (5, 2)
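One way to keep hstack, vstack and stack apart is to express each of them with np.concatenate. The sketch below does this for the 1-D inputs used above; it is only meant as a mental model, not a description of how NumPy implements these functions internally.

```python
import numpy as np

a = np.arange(5)
b = np.arange(5, 10)

# For 1-D inputs, hstack is concatenation along the only axis
h = np.concatenate((a, b), axis=0)
print(np.array_equal(h, np.hstack((a, b))))          # True

# vstack treats each 1-D input as a row, then concatenates along axis 0
v = np.concatenate((a[np.newaxis, :], b[np.newaxis, :]), axis=0)
print(np.array_equal(v, np.vstack((a, b))))          # True

# stack(axis=1) gives each input a new trailing axis and joins along it
s = np.concatenate((a[:, np.newaxis], b[:, np.newaxis]), axis=1)
print(np.array_equal(s, np.stack((a, b), axis=1)))   # True
print(h.shape, v.shape, s.shape)                     # (10,) (2, 5) (5, 2)
```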
Splitting arrays
ar = np.arange(16).reshape(4,4)
ar1 = np.hsplit(ar,2)   # split horizontally (by columns) into 2 equal parts
print(ar)
print(ar1,type(ar1))    # the result is a list of arrays
ar2 = np.vsplit(ar,4)   # split vertically (by rows) into 4 equal parts
print(ar2,type(ar2))
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
[array([[ 0,  1],
       [ 4,  5],
       [ 8,  9],
       [12, 13]]), array([[ 2,  3],
       [ 6,  7],
       [10, 11],
       [14, 15]])] <class 'list'>
[array([[0, 1, 2, 3]]), array([[4, 5, 6, 7]]), array([[ 8,  9, 10, 11]]), array([[12, 13, 14, 15]])] <class 'list'>
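hsplit() and vsplit() above divide the array into equal parts. As a supplementary sketch (not part of the original notes), np.split and np.array_split cover the cases where the division is uneven or the cut positions are given explicitly.

```python
import numpy as np

data = np.arange(10)

# np.split needs an even division; 10 elements cannot form 3 equal parts
try:
    np.split(data, 3)
except ValueError as err:
    print("np.split failed:", err)

# np.array_split allows uneven parts; the earlier parts get the extra elements
parts = np.array_split(data, 3)
print([p.tolist() for p in parts])   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Split positions can also be given explicitly as indices
head, middle, tail = np.split(data, [2, 7])
print(head, middle, tail)            # [0 1] [2 3 4 5 6] [7 8 9]
```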
Simple array operations
ar = np.arange(6).reshape(2,3)
print(ar + 10)      # addition
print(ar * 2)       # multiplication
print(1 / (ar+1))   # division
print(ar ** 0.5)    # power
print(ar.mean())    # mean
print(ar.max())     # maximum
print(ar.min())     # minimum
print(ar.std())     # standard deviation
print(ar.var())     # variance
print(ar.sum(), np.sum(ar,axis = 0))      # total sum; axis=0 sums down each column
print(np.sort(np.array([1,4,3,2,5,6])))   # sort
[[10 11 12]
 [13 14 15]]
[[ 0  2  4]
 [ 6  8 10]]
[[1.         0.5        0.33333333]
 [0.25       0.2        0.16666667]]
[[0.         1.         1.41421356]
 [1.73205081 2.         2.23606798]]
2.5
5
0
1.707825127659933
2.9166666666666665
15 [3 5 7]
[1 2 3 4 5 6]
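The `axis` argument in the last lines above is easy to get backwards, so here is a small sketch on the same 2×3 array. It also shows broadcasting, the mechanism that lets `ar + 10` apply a scalar to every element and, here, a row of column means to every row.

```python
import numpy as np

ar = np.arange(6).reshape(2, 3)
# [[0 1 2]
#  [3 4 5]]

col_sums = ar.sum(axis=0)   # collapse the rows    -> one value per column
row_sums = ar.sum(axis=1)   # collapse the columns -> one value per row
print(col_sums)             # [3 5 7]
print(row_sums)             # [ 3 12]

# Broadcasting: the (3,) vector of column means is applied to every row
centered = ar - ar.mean(axis=0)
print(centered)
# [[-1.5 -1.5 -1.5]
#  [ 1.5  1.5  1.5]]
```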
Indexing and slicing
Core topics: basic indexing and slicing / boolean indexing and slicing
Basic indexing and slicing
ar = np.arange(20)
print(ar)
print(ar[4])      # a single element
print(ar[3:6])    # a slice; the end index is excluded
print('-----')
ar = np.arange(16).reshape(4,4)
print(ar, 'number of axes: %i' %ar.ndim)             # a 4x4 array
print(ar[2], 'number of axes: %i' %ar[2].ndim)       # one row, a 1-D array
print(ar[2][1])                                      # one element of that row
print(ar[1:3], 'number of axes: %i' %ar[1:3].ndim)   # a slice of rows is still 2-D
print(ar[2,2])      # row 2, column 2
print(ar[:2,1:])    # rows 0-1, columns 1-3
print('-----')
ar = np.arange(8).reshape(2,2,2)
print(ar, 'number of axes: %i' %ar.ndim)
print(ar[0], 'number of axes: %i' %ar[0].ndim)
print(ar[0][0], 'number of axes: %i' %ar[0][0].ndim)
print(ar[0][0][1], 'number of axes: %i' %ar[0][0][1].ndim)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
4
[3 4 5]
-----
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]] number of axes: 2
[ 8  9 10 11] number of axes: 1
9
[[ 4  5  6  7]
 [ 8  9 10 11]] number of axes: 2
10
[[1 2 3]
 [5 6 7]]
-----
[[[0 1]
  [2 3]]

 [[4 5]
  [6 7]]] number of axes: 3
[[0 1]
 [2 3]] number of axes: 2
[0 1] number of axes: 1
1 number of axes: 0
Boolean indexing and slicing
ar = np.arange(12).reshape(3,4)
i = np.array([True,False,True])
j = np.array([True,True,False,False])
print(ar)
print(i)
print(j)
print(ar[i,:])   # keep the rows where i is True
print(ar[:,j])   # keep the columns where j is True
m = ar > 5       # an element-wise comparison yields a boolean array
print(m)
print(ar[m])     # the elements where the mask is True, as a 1-D array
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[ True False  True]
[ True  True False False]
[[ 0  1  2  3]
 [ 8  9 10 11]]
[[0 1]
 [4 5]
 [8 9]]
[[False False False False]
 [False False  True  True]
 [ True  True  True  True]]
[ 6  7  8  9 10 11]
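Masks like `ar > 5` can be combined. The sketch below uses the same 3×4 array; the only thing to remember is that each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` and `<`.

```python
import numpy as np

ar = np.arange(12).reshape(3, 4)

# Combine masks with & (and), | (or), ~ (not); each comparison needs parentheses
even_and_big = ar[(ar > 2) & (ar % 2 == 0)]
print(even_and_big)            # [ 4  6  8 10]

edges = ar[(ar < 2) | (ar > 9)]
print(edges)                   # [ 0  1 10 11]

# np.count_nonzero counts the True entries of a mask
print(np.count_nonzero(~(ar > 5)))   # 6
```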
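Modifying and copying values through indexes and slices
ar = np.arange(10)
print(ar)
ar[5] = 100
ar[7:9] = 200   # a scalar assigned to a slice is written to every element in it
print(ar)
ar = np.arange(10)
b = ar.copy()   # work on a copy to leave the original untouched
b[7:9] = 200
print(ar)
print(b)
[0 1 2 3 4 5 6 7 8 9]
[  0   1   2   3   4 100   6 200 200   9]
[0 1 2 3 4 5 6 7 8 9]
[  0   1   2   3   4   5   6 200 200   9]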
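The reason copy() matters here is that a basic slice is a view on the original data, while boolean indexing returns a copy. A short sketch with illustrative variable names:

```python
import numpy as np

ar = np.arange(10)

window = ar[7:9]          # basic slicing returns a view, not a copy
window[:] = 200
print(ar)                 # [  0   1   2   3   4   5   6 200 200   9]
print(window.base is ar)  # True

picked = ar[ar > 100]     # boolean indexing returns a copy
picked[:] = -1
print(ar[7:9])            # [200 200]
print(np.shares_memory(picked, ar))   # False
```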
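Random numbers
numpy.random provides random sampling from many probability distributions and is one of the key supporting tools in data analysis.
Generating random numbers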
samples = np.random.normal(size=(4,4))   # a 4x4 sample from the standard normal distribution
print(samples)
[[-0.61634688  0.37581849 -1.31726879  0.49950363]
 [-1.57738834  2.00591884 -0.37702136  0.58266482]
 [ 1.68567692 -0.2055025   1.04342144 -0.63889412]
 [-1.19961568 -0.93250713  0.90465594  0.02473421]]
Uniform distribution
numpy.random.rand(d0, d1, …, dn): returns a single random float, or an N-dimensional array of floats, drawn uniformly from [0, 1).
import matplotlib.pyplot as plt
%matplotlib inline

a = np.random.rand()      # a single float
print(a,type(a))
b = np.random.rand(4)     # a 1-D array of 4 values
print(b,type(b))
c = np.random.rand(2,3)   # a 2x3 array
print(c,type(c))

samples1 = np.random.rand(1000)
samples2 = np.random.rand(1000)
plt.scatter(samples1,samples2)   # scatter plot of 1000 uniformly distributed points
[Figure: scatter plot of the uniformly distributed samples; image not available]
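Normal distribution
numpy.random.randn(d0, d1, …, dn): returns a single float, or an N-dimensional array of floats, drawn from the standard normal distribution (mean 0, standard deviation 1).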
samples1 = np.random.randn(1000)
samples2 = np.random.randn(1000)
plt.scatter(samples1,samples2)   # scatter plot of 1000 normally distributed points
[Figure: scatter plot of the normally distributed samples; image not available]
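Random integers
numpy.random.randint(low, high=None, size=None, dtype='l'): returns a single random integer or an N-dimensional array of integers.
If high is given, values are drawn from [low, high), so high must be greater than low. If high is None, values are drawn from [0, low).
dtype: must be an integer type.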
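print(np.random.randint(2))              # one integer from [0, 2)
print(np.random.randint(2,size=5))       # five integers from [0, 2)
print(np.random.randint(2,6,size=5))     # five integers from [2, 6)
print(np.random.randint(2,size=(2,3)))   # a 2x3 array of integers from [0, 2)
print(np.random.randint(2,6,(2,3)))      # a 2x3 array of integers from [2, 6)
0
[0 1 0 1 1]
[2 4 5 2 3]
[[0 1 1]
 [1 0 1]]
[[4 5 3]
 [4 4 2]]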
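All the outputs in this section change on every run. If reproducible numbers are needed, one option (not used in the original notes) is the Generator interface, `np.random.default_rng`, available since NumPy 1.17. The seed value 42 below is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same stream
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

normal_a = rng_a.normal(size=(4, 4))
normal_b = rng_b.normal(size=(4, 4))
print(np.array_equal(normal_a, normal_b))   # True

uniform = rng_a.random(1000)                # floats in [0, 1), like rand()
integers = rng_a.integers(2, 6, size=1000)  # integers in [2, 6), like randint()
print(uniform.min() >= 0.0, uniform.max() < 1.0)
print(integers.min(), integers.max())
```

The legacy functions shown earlier (rand, randn, randint) keep working; the generator is simply a self-contained object that can be seeded and passed around.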
Operations
Logical operations
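# Simulated changes drawn from a standard normal distribution: one row per stock, one column per day
stock_change = np.random.normal(0, 1, (8, 10))
stock_change = stock_change[0:5, 0:5]   # keep the first 5 rows and the first 5 columns
# Element-wise comparison: True where the change is greater than 0.5
stock_change > 0.5
array([[False,  True, False, False, False],
       [False, False, False, False, False],
       [False, False,  True, False, False],
       [False,  True, False, False, False],
       [False, False, False, False,  True]])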
Boolean assignment: set the elements that satisfy a condition to a given value (boolean indexing)
stock_change[stock_change > 0.5] = 1
stock_change
array([[-0.08172289,  1.        , -0.50373174, -0.9375507 ,  0.34120343],
       [-0.49157673, -0.16351183, -0.70878476,  0.15879504, -0.69573837],
       [ 0.48018901, -0.2707279 ,  1.        ,  0.22445617, -1.42907929],
       [ 0.25910682,  1.        , -1.13403731, -1.07365545,  0.19876236],
       [-1.03328948,  0.19431226, -0.80775053, -1.54511748,  1.        ]])
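General test functions
Check whether every value in stock_change[0:2, 0:5] is a gain
np.all(stock_change[0:2, 0:5] > 0)
False
Check whether any of the first 5 stocks gained at any point during this period
np.any(stock_change[0:5, :] > 0)
True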
The ternary operator np.where
For the first four stocks over the first four days, set changes greater than 0 to 1 and all others to 0
temp = stock_change[:4, :4]
np.where(temp > 0, 1, 0)
array([[0, 1, 0, 0],
       [0, 0, 0, 1],
       [1, 0, 1, 1],
       [1, 1, 0, 0]])
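For the first four stocks over the first four days, set changes greater than 0.5 and less than 1 to 1, and all others to 0
np.where(np.logical_and(temp > 0.5, temp < 1), 1, 0)
For the first four stocks over the first four days, set changes greater than 0.5 or less than -0.5 to 1, and all others to 0 (the notebook shows only the result of this last expression)
np.where(np.logical_or(temp > 0.5, temp < -0.5), 1, 0)
array([[0, 1, 1, 1],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 1, 1, 1]])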
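Because the data above is random, here is the same idea on a small fixed array so that the results can be checked by hand. The values in `temp` are made up; the sketch also shows that `&` and `|` are equivalent to np.logical_and and np.logical_or on boolean arrays.

```python
import numpy as np

# Hypothetical daily changes for 2 stocks over 4 days
temp = np.array([[0.7, -0.8, 0.2, 1.3],
                 [-0.1, 0.6, -0.6, 0.9]])

# & and | on boolean arrays match np.logical_and / np.logical_or
both = np.where((temp > 0.5) & (temp < 1), 1, 0)
either = np.where((temp > 0.5) | (temp < -0.5), 1, 0)
print(both)
# [[1 0 0 0]
#  [0 1 0 1]]
print(either)
# [[1 1 0 1]
#  [0 1 1 1]]

print(np.array_equal(both, np.where(np.logical_and(temp > 0.5, temp < 1), 1, 0)))   # True

# The two value arguments may also be arrays: keep gains, zero out losses
gains_only = np.where(temp > 0, temp, 0.0)
print(gains_only)
```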
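Statistical operations
Next, compute some statistics on the four days of data for these four stocks. Passing axis=1 computes each statistic along a row, which gives one value per stock.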
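print("Largest gain of each of the first four stocks over the first four days: {}".format(np.max(temp, axis=1)))
print("Largest drop of each of the first four stocks over the first four days: {}".format(np.min(temp, axis=1)))
print("Volatility of each of the first four stocks over the first four days: {}".format(np.std(temp, axis=1)))
print("Average change of each of the first four stocks over the first four days: {}".format(np.mean(temp, axis=1)))
Largest gain of each of the first four stocks over the first four days: [1.         0.15879504 1.         1.        ]
Largest drop of each of the first four stocks over the first four days: [-0.9375507  -0.70878476 -0.2707279  -1.13403731]
Volatility of each of the first four stocks over the first four days: [0.71955576 0.32898407 0.4583192  0.90567091]
Average change of each of the first four stocks over the first four days: [-0.13075133 -0.30126957  0.35847932 -0.23714649]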
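Find the day on which each stock gained the most
print("Day with the largest gain for each of the first four stocks: {}".format(np.argmax(temp, axis=1)))
print("Stock with the largest gain on each of the first four days: {}".format(np.argmax(temp, axis=0)))
Day with the largest gain for each of the first four stocks: [1 3 2 1]
Stock with the largest gain on each of the first four days: [2 0 2 2]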
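The two argmax calls differ only in the axis, so a fixed example may help. The `changes` array below is invented (3 stocks as rows, 4 days as columns). The last part, locating the overall maximum with np.unravel_index, goes slightly beyond the original notes.

```python
import numpy as np

# Hypothetical changes: rows are 3 stocks, columns are 4 days
changes = np.array([[0.1, 0.9, -0.3, 0.2],
                    [0.5, -0.2, 0.4, 0.0],
                    [-0.6, 0.3, 0.8, 1.2]])

best_day_per_stock = np.argmax(changes, axis=1)   # one column index per row
best_stock_per_day = np.argmax(changes, axis=0)   # one row index per column
print(best_day_per_stock)    # [1 0 3]
print(best_stock_per_day)    # [1 0 2 2]

# Without axis, argmax indexes the flattened array; unravel_index maps it back
flat_index = np.argmax(changes)
stock, day = np.unravel_index(flat_index, changes.shape)
print(flat_index, stock, day)   # 11 2 3
```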
Matrix operations
a = np.array(
    [[80, 86],
     [82, 80],
     [85, 78],
     [90, 90],
     [86, 82],
     [82, 90],
     [78, 80],
     [92, 94]])                # an 8x2 matrix
b = np.array([[0.7], [0.3]])   # a 2x1 column vector
np.matmul(a, b)
np.dot(a,b)                    # for 2-D arrays this gives the same result as np.matmul
array([[81.8],
       [81.4],
       [82.9],
       [90. ],
       [84.8],
       [84.4],
       [78.6],
       [92.6]])
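Differences between np.matmul and np.dot:
Both perform matrix multiplication. np.matmul does not allow multiplying a matrix by a scalar. For the inner product of two vectors, np.matmul and np.dot give the same result.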
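The points above can be checked with a few lines. The arrays here are made up for the purpose. The final comparison of result shapes for arrays with more than two dimensions is an extra observation beyond the original notes, included because it is where the two functions differ most.

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])
v = np.array([1, 1])

# Identical for 2-D matrices and for 1-D vectors
print(np.array_equal(np.matmul(m, m), np.dot(m, m)))   # True
print(np.matmul(v, v), np.dot(v, v))                   # 2 2

# np.dot accepts a scalar operand; np.matmul rejects it
print(np.dot(m, 2))
# [[2 4]
#  [6 8]]
try:
    np.matmul(m, 2)
except ValueError as err:
    print("matmul rejected the scalar:", type(err).__name__)

# With more than two dimensions the result shapes differ
x = np.ones((2, 3, 4))
y = np.ones((2, 4, 5))
print(np.matmul(x, y).shape)   # (2, 3, 5)
print(np.dot(x, y).shape)      # (2, 3, 2, 5)
```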
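Closing
These are study notes, written to review what I have learned and to pick up something new along the way. If you have any questions, please leave a comment. Thanks!
Next: Machine Learning Notes (3): Pandas (to be continued)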