Naive Bayes
1. Classification Based on Bayesian Decision Theory
2. Conditional Probability
3. Classifying with Conditional Probability
4. Text Classification with Python
5. Exercise: Filtering Spam with Naive Bayes
1. Classification Based on Bayesian Decision Theory
Pros: effective with small amounts of data; handles multi-class problems.
Cons: sensitive to how the input data is prepared.
Works with: nominal data.
Core idea: choose the decision with the highest probability. If $p_1$ is the probability that the point $(x, y)$ belongs to class 1 and $p_2$ the probability that it belongs to class 2, then predict class 1 when $p_1 > p_2$, and class 2 otherwise.
"Naive" means the features are assumed to be mutually independent, and each feature is treated as equally important.
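As a minimal sketch of this decision rule (the probability functions `p1` and `p2` below are hypothetical placeholders, standing in for whatever model estimates the two class probabilities):

```python
def classify(x, y, p1, p2):
    """Pick the class with the higher probability for the point (x, y).

    p1 and p2 are hypothetical callables returning the probability that
    (x, y) belongs to class 1 and class 2, respectively."""
    return 1 if p1(x, y) > p2(x, y) else 2
```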
2. Conditional Probability
The probability of A given that B has occurred:

$$p(A|B) = \frac{p(AB)}{p(B)}$$

Bayes' rule:

$$P(A|B) = \frac{p(B|A)p(A)}{p(B)}$$
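A quick numerical sanity check with made-up probabilities (the values below are purely illustrative):

```python
# Illustrative values: p(A) = 0.3, p(B|A) = 0.5, p(B) = 0.4
pA, pB_given_A, pB = 0.3, 0.5, 0.4

pAB = pB_given_A * pA          # joint probability p(AB) = p(B|A)p(A)
print(pAB / pB)                # p(A|B) via the definition: 0.375
print(pB_given_A * pA / pB)    # p(A|B) via Bayes' rule: 0.375 (same)
```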
3. Classifying with Conditional Probability
For a vector ${\bf{w}}$, the probability that it belongs to class $c_i$ is

$$p(c_i | {\bf{w}}) = \frac{p({\bf{w}} | c_i)p(c_i)}{p({\bf{w}})}$$

If $p(c_1|{\bf{w}}) > p(c_2|{\bf{w}})$, the vector belongs to class $c_1$; if $p(c_1|{\bf{w}}) < p(c_2|{\bf{w}})$, it belongs to class $c_2$. Since the denominator $p({\bf{w}})$ is the same for every class, it can be ignored when comparing. For naive Bayes, the features are assumed to be mutually independent, so

$$p({\bf{w}}|c_i) = p(w_1|c_i)p(w_2|c_i)\cdots p(w_n|c_i)$$
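Multiplying many small per-word probabilities quickly underflows to zero in floating point, which is why the implementation below works with logarithms, turning the product into a sum. A tiny demonstration (the probability values are made up):

```python
import numpy as np

# Hypothetical per-word conditional probabilities p(w_j | c_i)
word_probs = np.full(300, 0.01)

print(np.prod(word_probs))         # 0.0 -- the raw product underflows
print(np.sum(np.log(word_probs)))  # about -1381.6, still fine for comparisons
```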
4. Text Classification with Python
As an example, take posts on an online community's message board, classified as abusive or non-abusive, labeled 1 and 0 respectively.
Prepare the data: building word vectors from text
```python
def loadDataSet():
    '''Create experimental samples. Returns the tokenized document set
    and the class label vector.'''
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

def createVocabList(dataSet):
    '''Create a list of every unique word that appears in any document.'''
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    '''Given a vocabulary and a document, return a vector indicating
    whether each vocabulary word appears in the document.'''
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word: %s is not in my Vocabulary!' % word)
    return returnVec
```
```
[IN]: listOPosts, listClasses = loadDataSet()
[IN]: myVocabList = createVocabList(listOPosts)
[IN]: print(myVocabList)
[OUT]: ['garbage', 'not', 'steak', 'is', 'dog', 'how', 'my', 'food', 'to', 'licks', 'mr',
'buying', 'so', 'problems', 'park', 'stop', 'ate', 'help', 'stupid', 'love', 'flea',
'worthless', 'take', 'posting', 'has', 'cute', 'dalmation', 'quit', 'please', 'him', 'maybe', 'I']
[IN]: print(setOfWords2Vec(myVocabList, listOPosts[0]))
[OUT]: [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
[IN]: print(setOfWords2Vec(myVocabList, listOPosts[3]))
[OUT]: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```
Train the algorithm: computing probabilities from the word vectors. Two implementation details: counts are initialized to 1 and denominators to 2 so that a single zero probability cannot wipe out the whole product, and the probabilities are stored as logarithms to avoid numerical underflow.
```python
import numpy as np

def trainNB0(trainMatrix, trainCategory):
    '''Input: document matrix and label vector.
    Output: p(w|c0), p(w|c1) as log vectors, and p(c1).'''
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Initialize counts to 1 and denominators to 2 (Laplace smoothing)
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2
    p1Denom = 2
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Store log probabilities to avoid underflow later
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive
```
```
[IN]: trainMat = []
[IN]: for postinDoc in listOPosts:
          trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
[IN]: p0V, p1V, pAb = trainNB0(trainMat, listClasses)
[IN]: pAb
[OUT]: 0.5
[IN]: p0V
[OUT]: array([-3.25809654, -3.25809654, -2.56494936, -2.56494936, -2.56494936,
       -2.56494936, -1.87180218, -3.25809654, -2.56494936, -2.56494936,
       -2.56494936, -3.25809654, -2.56494936, -2.56494936, -3.25809654,
       -2.56494936, -2.56494936, -2.56494936, -3.25809654, -2.56494936,
       -2.56494936, -3.25809654, -3.25809654, -3.25809654, -2.56494936,
       -2.56494936, -2.56494936, -3.25809654, -2.56494936, -2.15948425,
       -3.25809654, -2.56494936])
[IN]: p1V
[OUT]: array([-2.35137526, -2.35137526, -3.04452244, -3.04452244, -1.94591015,
       -3.04452244, -3.04452244, -2.35137526, -2.35137526, -3.04452244,
       -3.04452244, -2.35137526, -3.04452244, -3.04452244, -2.35137526,
       -2.35137526, -3.04452244, -3.04452244, -1.65822808, -3.04452244,
       -3.04452244, -1.94591015, -2.35137526, -2.35137526, -3.04452244,
       -3.04452244, -3.04452244, -2.35137526, -3.04452244, -2.35137526,
       -2.35137526, -3.04452244])
```
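The entries of `p0V` and `p1V` are negative because they are log probabilities; a raw probability can be recovered with `np.exp`. For instance, taking the largest entry from `p0V` above:

```python
import numpy as np
print(np.exp(-1.87180218))  # ~0.154, the largest p(w|c0) in the vocabulary
```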
Build the complete classifier
```python
import numpy as np

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    '''Classifier: compare log p(w|c) + log p(c) for the two classes.'''
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    '''Test: train on the sample posts and classify two new entries.'''
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    testEntry = ['love', 'my', 'dog']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
```
```
[IN]: testingNB()
[OUT]: ['love', 'my', 'dog'] classified as: 0
[OUT]: ['stupid', 'garbage'] classified as: 1
```
Bag-of-words document model: the set-of-words model above only records whether a word occurs; the bag-of-words model counts how many times each word occurs (see the comparison after the code below).
```python
def bagOfWords2VecMN(vocabList, inputSet):
    '''Bag-of-words variant: count occurrences instead of marking presence.'''
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
```
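A quick comparison of the two encodings on a document with a repeated word (the tiny vocabulary here is just for illustration):

```python
vocab = ['dog', 'stupid', 'my']
doc = ['stupid', 'dog', 'stupid']

print(setOfWords2Vec(vocab, doc))    # [1, 1, 0] -- presence only
print(bagOfWords2VecMN(vocab, doc))  # [1, 2, 0] -- occurrence counts
```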
5. Exercise: Filtering Spam with Naive Bayes
Parse the text and extract the words:
```python
def textParse(bigString):
    '''Split a big string on non-word characters and keep lowercase
    tokens longer than two characters.'''
    import re
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
```
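For example (an illustrative sentence, not taken from the email corpus):

```
[IN]: textParse('This book is the best book on Python I have ever laid eyes upon.')
[OUT]: ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']
```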
Cross-validation with naive Bayes (randomly holding out 10 of the 50 emails as a test set):
```python
def spamTest():
    import numpy as np
    import random
    docList = []
    classList = []
    fullText = []
    # Load 25 spam and 25 ham emails
    for i in range(1, 26):
        wordList = textParse(open('Ch04/email/spam/%d.txt' % i, encoding='ISO-8859-1').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('Ch04/email/ham/%d.txt' % i, encoding='ISO-8859-1').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    # Randomly move 10 of the 50 indices into the test set
    trainingSet = list(range(50))
    testSet = []
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    # Train on the remaining 40 emails
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    # Evaluate on the held-out emails
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is: ', float(errorCount) / len(testSet))
```
```
[IN]: for i in range(10):
          spamTest()
[OUT]: the error rate is: 0.1
the error rate is: 0.0
the error rate is: 0.1
the error rate is: 0.0
the error rate is: 0.0
the error rate is: 0.1
the error rate is: 0.1
the error rate is: 0.1
the error rate is: 0.1
the error rate is: 0.2
```
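Because the test set is drawn at random, the error rate fluctuates between runs. A small sketch for summarizing it, assuming spamTest is modified to `return float(errorCount) / len(testSet)` instead of printing it:

```python
import numpy as np

# Assumes spamTest() was changed to return the error rate rather than print it
rates = [spamTest() for _ in range(10)]
print('mean error rate over 10 trials:', np.mean(rates))
```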