Logistic regression can solve not only binary classification problems but also multi-class problems. Its goal is to find a sufficiently discriminative decision boundary that separates the classes. Suppose the input feature vector is $x \in \mathbb{R}^{n}$. We then look for a decision boundary $\sum_{i=1}^{n} w_{i} x_{i} + b = 0$ such that a sample is classified as positive when $\sum_{i=1}^{n} w_{i} x_{i} + b \in (0, +\infty)$ and as negative when $\sum_{i=1}^{n} w_{i} x_{i} + b \in (-\infty, 0)$.
In vector form, the boundary is $g(x) = w \cdot x + b = 0$ (where $w$ and $x$ are vectors).
The sigmoid function squashes the decision range for positive and negative samples from $(-\infty, +\infty)$ down to $(0, 1)$: $h(z) = \frac{1}{1+e^{-z}}$
Substituting $z = g(x) = w \cdot x + b$ gives the hypothesis: $h(x) = \frac{1}{1+e^{-(w \cdot x + b)}}$
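As a concrete illustration, here is a minimal NumPy sketch of the hypothesis above (the names `sigmoid` and `hypothesis` are our own, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """The sigmoid h(z) = 1 / (1 + e^{-z}), computed branch-wise for stability."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))               # exp(-z) <= 1, no overflow
    out[~pos] = np.exp(z[~pos]) / (1.0 + np.exp(z[~pos]))  # exp(z) < 1, no overflow
    return out

def hypothesis(X, w, b):
    """h(x) = sigmoid(w·x + b), applied row-wise to the data matrix X."""
    return sigmoid(X @ w + b)
```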
For a positive sample, the larger $g(x)$ is, the closer $h(x)$ gets to 1, i.e. the higher the predicted probability of the positive class, and the larger $\log(h(x))$ becomes. For a negative sample, the smaller $g(x)$ is, the closer $h(x)$ gets to 0, so $1-h(x)$ grows and $\log(1-h(x))$ becomes larger. Since $y$ takes only the two values 0 and 1, the loss function to optimize can be written compactly as: $$L(w) = \frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log\big(h_w(x^{(i)})\big)-(1-y^{(i)})\log\big(1-h_w(x^{(i)})\big)\Big]$$
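A direct NumPy transcription of this loss might look as follows (a sketch building on the `hypothesis` function above; the guard `eps` is our own addition to keep the logarithms finite):

```python
import numpy as np

def cross_entropy_loss(h, y, eps=1e-12):
    """L(w) = (1/m) * sum_i [-y_i * log(h_i) - (1 - y_i) * log(1 - h_i)],
    where h is the vector of predictions h_w(x_i) and y the 0/1 labels."""
    h = np.clip(h, eps, 1.0 - eps)  # avoid log(0) at h = 0 or h = 1
    return np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))
```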
(1) Find the direct relationship between the classification probability $P(Y=1)$ and the input, then judge the class by comparing probability values. In short, compute the following two probability distributions: $P(Y=0 \mid x) = \frac{1}{1+e^{w \cdot x + b}}$
$P(Y=1 \mid x) = \frac{e^{w \cdot x + b}}{1+e^{w \cdot x + b}}$
(2) The odds of an event are the ratio of the probability that the event occurs to the probability that it does not. For an event with probability $p$, the odds are $\frac{p}{1-p}$, and the log-odds, or logit function, is: $\mathrm{logit}(p) = \log\frac{p}{1-p}$ (a short numeric example follows after this list)
(3) For a positive sample, i.e. $Y=1$, the log-odds are: $\log\frac{P(Y=1 \mid x)}{1-P(Y=1 \mid x)} = w \cdot x + b$
(4) Compute the maximum likelihood estimate. For a given training set $T = \{(x_{1},y_{1}),(x_{2},y_{2}),\dots,(x_{m},y_{m})\}$, where $x_{i} \in \mathbb{R}^{n}$ and $y_{i} \in \{0,1\}$, assume $P(Y=1 \mid x) = h(x)$, so that $P(Y=0 \mid x) = 1-h(x)$. The likelihood function is then: $\prod_{i=1}^{m}[h(x_{i})]^{y_{i}}[1-h(x_{i})]^{1-y_{i}}$
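As a quick numeric example of odds and the logit: an event with probability $p = 0.8$ has odds $\frac{0.8}{0.2} = 4$ and logit $\log 4 \approx 1.386$, while $p = 0.5$ gives odds $1$ and logit $0$, which is exactly the case $w \cdot x + b = 0$ on the decision boundary.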
(5) Take the logarithm of the likelihood function:
$$L(w) = \sum_{i=1}^{m}\Big[y^{(i)}\log\big(h_w(x^{(i)})\big)+(1-y^{(i)})\log\big(1-h_w(x^{(i)})\big)\Big] = \sum_{i=1}^{m}\Big[y^{(i)}\log\frac{h_{w}(x^{(i)})}{1-h_{w}(x^{(i)})}+\log\big(1-h_{w}(x^{(i)})\big)\Big] = \sum_{i=1}^{m}\Big[y^{(i)}(w \cdot x_{i} + b)-\log\big(1+e^{w \cdot x_{i} + b}\big)\Big]$$
Note that negating this log-likelihood and averaging over the $m$ samples gives exactly the loss function stated earlier, so maximizing the likelihood is equivalent to minimizing that loss.
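The algebra above can be checked numerically; the sketch below (our own illustration, with randomly generated scores standing in for $w \cdot x_i + b$) compares the two equivalent forms:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)              # z_i = w·x_i + b for five fake samples
y = rng.integers(0, 2, size=5)      # labels in {0, 1}
h = 1.0 / (1.0 + np.exp(-z))        # h(x_i)

form1 = np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
form2 = np.sum(y * z - np.log1p(np.exp(z)))   # y(w·x+b) - log(1 + e^{w·x+b})
assert np.isclose(form1, form2)
```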
(6) Differentiate $L(w)$ with respect to $w$ and $b$:
$$\frac{\partial L(w)}{\partial w} = \sum_{i=1}^{m} y_{i} x_{i} - \sum_{i=1}^{m}\frac{e^{w \cdot x_{i} + b}}{1+e^{w \cdot x_{i} + b}}\, x_{i} = \sum_{i=1}^{m}\big(y_{i} - h(x_{i})\big)\, x_{i}$$
(using $\frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}} = h$, the sigmoid of the linear score)
$$\frac{\partial L(w)}{\partial b} = \sum_{i=1}^{m} y_{i} - \sum_{i=1}^{m}\frac{e^{w \cdot x_{i} + b}}{1+e^{w \cdot x_{i} + b}} = \sum_{i=1}^{m}\big(y_{i} - h(x_{i})\big)$$
(7) Set initial parameters and a learning rate $\alpha$, then update the parameters iteratively. Because $L(w)$ here is the log-likelihood to be maximized, the update takes a gradient-ascent form (equivalently, gradient descent on the negative log-likelihood): $w = w + \alpha \frac{\partial L(w)}{\partial w}$
$b = b + \alpha \frac{\partial L(w)}{\partial b}$
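Putting the pieces together, a minimal training loop might look like this (a sketch under our own naming; `lr` and `n_iters` are illustrative hyperparameters, and we average the gradients over the batch so the learning rate does not depend on the sample count):

```python
import numpy as np

def sigmoid(z):
    # Naive form; fine for this toy example (see the stable version earlier).
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Maximize the log-likelihood by gradient ascent on w and b."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iters):
        h = sigmoid(X @ w + b)      # h(x_i) for every sample
        grad_w = (y - h) @ X / m    # (1/m) * sum_i (y_i - h(x_i)) x_i
        grad_b = np.mean(y - h)     # (1/m) * sum_i (y_i - h(x_i))
        w += lr * grad_w            # ascent: move *up* the likelihood surface
        b += lr * grad_b
    return w, b

# Usage on linearly separable toy data.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w, b = fit_logistic_regression(X, y)
print("train accuracy:", np.mean((sigmoid(X @ w + b) > 0.5) == y))
```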
The resulting decision boundary is shown in the figure below.
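Since the original figure is not reproduced here, the boundary can be drawn with a few lines of matplotlib (continuing from the `X`, `y`, `w`, `b` of the training sketch above; purely illustrative):

```python
import matplotlib.pyplot as plt

# The boundary w·x + b = 0 is the line x2 = -(w[0]*x1 + b) / w[1].
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = -(w[0] * x1 + b) / w[1]

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", alpha=0.6)
plt.plot(x1, x2, "k--", label="decision boundary")
plt.legend()
plt.show()
```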