I have recently been reading *Computer Vision: Models, Learning, and Inference*. This post covers computing the maximum a posteriori (MAP) estimate of the parameters of a categorical distribution, using a Dirichlet distribution as the prior.

The basic principle is as follows:
$$
\begin{aligned}
\hat{\lambda}_{1 \ldots K} &=\underset{\lambda_{1 \ldots K}}{\operatorname{argmax}}\left[\prod_{i=1}^{I} \operatorname{Pr}\left(x_{i} \mid \lambda_{1 \ldots K}\right) \operatorname{Pr}\left(\lambda_{1 \ldots K}\right)\right] \\
&=\underset{\lambda_{1 \ldots K}}{\operatorname{argmax}}\left[\prod_{i=1}^{I} \operatorname{Cat}_{x_{i}}\left[\lambda_{1 \ldots K}\right] \operatorname{Dir}_{\lambda_{1 \ldots K}}\left[\alpha_{1 \ldots K}\right]\right] \\
&=\underset{\lambda_{1 \ldots K}}{\operatorname{argmax}}\left[\prod_{k=1}^{K} \lambda_{k}^{N_{k}} \prod_{k=1}^{K} \lambda_{k}^{\alpha_{k}-1}\right] \\
&=\underset{\lambda_{1 \ldots K}}{\operatorname{argmax}}\left[\prod_{k=1}^{K} \lambda_{k}^{N_{k}+\alpha_{k}-1}\right]
\end{aligned}
$$
The final result is:

$$
\hat{\lambda}_{k}=\frac{N_{k}+\alpha_{k}-1}{\sum_{m=1}^{K}\left(N_{m}+\alpha_{m}-1\right)}
$$
The data is generated with the method from the previous post, so it is not repeated here.
The algorithm is as follows:

```
Input:  training data {x_i}_{i=1}^I, hyperparameters {alpha_k}_{k=1}^K
Output: MAP estimates of parameters theta = {lambda_k}_{k=1}^K
begin
    for k = 1 to K do
        lambda_k = (N_k - 1 + alpha_k) / (I - K + sum_{k=1}^K alpha_k)
    end
end
```
The learning code is as follows:

```cpp
void MAP_categorical_distribution_parameters() {
    vector<int> data = generate_categorical_distribution_data(100000);

    // Count the occurrences N_k of each category
    std::map<int, double> hist{};
    for (size_t i = 0; i < data.size(); i++) {
        ++hist[data[i]];
    }

    // Set the Dirichlet prior hyperparameters (all alpha_k = 1)
    vector<double> alpha_v;
    for (size_t i = 0; i < hist.size(); i++) {
        alpha_v.push_back(1.0);
    }

    // Denominator: sum over m of (N_m + alpha_m - 1)
    double total_p = 0;
    double down = 0;
    for (int i = 0; i < hist.size(); i++) {
        down += hist.at(i) + alpha_v[i] - 1;
    }

    // lambda_k = (N_k + alpha_k - 1) / denominator
    for (int i = 0; i < hist.size(); i++) {
        hist.at(i) = (hist.at(i) + alpha_v[i] - 1) / down;
        total_p += hist.at(i);
        std::cout << hist.at(i) << std::endl;
    }
    cout << "total_p: " << total_p << endl;  // should be 1
}
```

Here all the Dirichlet hyperparameters are set to 1.
The two figures below show the author's experimental comparison from the book.