Andrew Ng Machine Learning Notes (8): Anomaly Detection and Recommender Systems

mac, 2025-11-08

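This post summarizes the anomaly detection and recommender system parts of Andrew Ng's machine learning course. It has two parts:

1. Anomaly detection

2. Recommender systems

1. Anomaly detection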

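Anomaly detection fits a Gaussian distribution to the data and uses it to estimate how probable each example would be if it were normal. If that probability is very low, the example is treated as an anomaly. There is much more worth saying about anomaly detection, which I have written down in my paper notes; here I only discuss the code implementation.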
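For reference, the model described above can be written out as follows. This is a sketch in the course's usual notation, which the text above does not define: n features, a per-feature mean \mu_j and variance \sigma_j^2, a threshold \varepsilon, and the assumption that features are treated as independent.

```latex
p(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_j^{2}}}
       \exp\!\left(-\frac{(x_j-\mu_j)^{2}}{2\sigma_j^{2}}\right),
\qquad
x \text{ is flagged as an anomaly if } p(x) < \varepsilon
```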

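First, use the sample mean as the Gaussian's mu and the sample variance as its sigma^2 (the code divides by m, not m - 1). The function is defined as follows: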

function [mu sigma2] = estimateGaussian(X)
%ESTIMATEGAUSSIAN This function estimates the parameters of a
%Gaussian distribution using the data in X
%   [mu sigma2] = estimateGaussian(X),
%   The input X is the dataset with each n-dimensional data point in one row
%   The output is an n-dimensional vector mu, the mean of the data set
%   and the variances sigma^2, an n x 1 vector
%

% Useful variables
[m, n] = size(X);

% You should return these values correctly
mu = zeros(n, 1);
sigma2 = zeros(n, 1);

% Per-feature mean; sum along dimension 1 so a single-row X also works
mu = (sum(X, 1) ./ m)';

% Per-feature variance (normalized by m), transposed to the documented n x 1 shape
sigma2 = (sum((X - repmat(mu', m, 1)) .^ 2, 1) / m)';

end
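The two assignments in estimateGaussian correspond to the estimates below. This is a sketch in notation assumed here rather than taken from the post: x_j^{(i)} is feature j of example i, and m is the number of rows of X.

```latex
\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)},
\qquad
\sigma_j^{2} = \frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)}-\mu_j\right)^{2},
\qquad j = 1,\dots,n
```

Note the 1/m normalization, which matches the code, rather than the 1/(m-1) of the unbiased sample variance.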

Then compute the probability density of each example according to the formula:

function p = multivariateGaussian(X, mu, Sigma2)
%MULTIVARIATEGAUSSIAN Computes the probability density function of the
%multivariate gaussian distribution.
%    p = MULTIVARIATEGAUSSIAN(X, mu, Sigma2) Computes the probability
%    density function of the examples X under the multivariate gaussian
%    distribution with parameters mu and Sigma2. If Sigma2 is a matrix, it is
%    treated as the covariance matrix. If Sigma2 is a vector, it is treated
%    as the \sigma^2 values of the variances in each dimension (a diagonal
%    covariance matrix)
%

k = length(mu);

% A vector of variances becomes a diagonal covariance matrix
if (size(Sigma2, 2) == 1) || (size(Sigma2, 1) == 1)
    Sigma2 = diag(Sigma2);
end

% Center each example, then evaluate the density row by row
X = bsxfun(@minus, X, mu(:)');
p = (2 * pi) ^ (- k / 2) * det(Sigma2) ^ (-0.5) * ...
    exp(-0.5 * sum(bsxfun(@times, X * pinv(Sigma2), X), 2));

end
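The expression evaluated in multivariateGaussian is, as far as I can tell, the standard multivariate normal density. The sketch below uses k for the dimension, as in the code, and \Sigma for the matrix the code calls Sigma2. The code uses pinv, which coincides with the ordinary inverse whenever \Sigma is invertible.

```latex
p(x;\mu,\Sigma) = (2\pi)^{-k/2}\,\lvert\Sigma\rvert^{-1/2}
\exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)
```

When \Sigma is diagonal with entries \sigma_1^2, ..., \sigma_k^2, which is the case when the function is given a vector of variances, this reduces to a product of k independent univariate Gaussian densities.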

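Next, find a threshold epsilon such that any example whose probability falls below it is flagged as an anomaly. The function that searches for the best threshold is defined as follows: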

function [bestEpsilon bestF1] = selectThreshold(yval, pval)
%SELECTTHRESHOLD Find the best threshold (epsilon) to use for selecting
%outliers
%   [bestEpsilon bestF1] = SELECTTHRESHOLD(yval, pval) finds the best
%   threshold to use for selecting outliers based on the results from a
%   validation set (pval) and the ground truth (yval).
%

bestEpsilon = 0;
bestF1 = 0;
F1 = 0;

stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)

    % ====================== YOUR CODE HERE ======================
    % Instructions: Compute the F1 score of choosing epsilon as the
    %               threshold and place the value in F1. The code at the
    %               end of the loop will compare the F1 score for this
    %               choice of epsilon and set it to be the best epsilon if
    %               it is better than the current choice of epsilon.
    %
    % Note: You can use predictions = (pval < epsilon) to get a binary vector
    %       of 0's and 1's of the outlier prediction

    pred = pval < epsilon;                 % 1 = predicted anomaly
    TP = sum((pred == 1) & (yval == 1));   % true positives
    FP = sum((pred == 1) & (yval == 0));   % false positives
    FN = sum((pred == 0) & (yval == 1));   % false negatives
    prec = TP / (TP + FP);                 % precision
    rec = TP / (TP + FN);                  % recall
    F1 = (2 * prec * rec) / (prec + rec);  % NaN when TP is 0, which never beats bestF1

    if F1 > bestF1
        bestF1 = F1;
        bestEpsilon = epsilon;
    end
end

end
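The score computed inside the loop can be summarized as below, where tp, fp and fn are the true positive, false positive and false negative counts on the validation set. The worked numbers after the formulas are hypothetical, chosen only to illustrate the arithmetic; they are not results from the exercise data.

```latex
\text{prec} = \frac{tp}{tp+fp},
\qquad
\text{rec} = \frac{tp}{tp+fn},
\qquad
F_1 = \frac{2\cdot\text{prec}\cdot\text{rec}}{\text{prec}+\text{rec}}

% Hypothetical counts: tp = 3, fp = 1, fn = 2
\text{prec} = \tfrac{3}{4} = 0.75,
\qquad
\text{rec} = \tfrac{3}{5} = 0.6,
\qquad
F_1 = \frac{2\cdot 0.75\cdot 0.6}{0.75+0.6} = \frac{0.9}{1.35} = \tfrac{2}{3}
```

F1 is used instead of plain accuracy because anomalies are rare: a classifier that never flags anything would be highly accurate, yet its recall would be zero.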

This completes the anomaly detection algorithm. The main script demonstrates it as follows:

%% ================== Part 1: Load Example Dataset ===================
%  We start this exercise by using a small dataset that is easy to
%  visualize.
%
%  Our example case consists of 2 network server statistics across
%  several machines: the latency and throughput of each machine.
%  This exercise will help us find possibly faulty (or very fast) machines.
%

fprintf('Visualizing example dataset for outlier detection.\n\n');

%  The following command loads the dataset. You should now have the
%  variables X, Xval, yval in your environment
load('ex8data1.mat');

%  Visualize the example dataset
plot(X(:, 1), X(:, 2), 'bx');
axis([0 30 0 30]);
xlabel('Latency (ms)');
ylabel('Throughput (mb/s)');

fprintf('Program paused. Press enter to continue.\n');
pause

%% ================== Part 2: Estimate the dataset statistics ===================
%  For this exercise, we assume a Gaussian distribution for the dataset.
%
%  We first estimate the parameters of our assumed Gaussian distribution,
%  then compute the probabilities for each of the points and then visualize
%  both the overall distribution and where each of the points falls in
%  terms of that distribution.
%
fprintf('Visualizing Gaussian fit.\n\n');

%  Estimate mu and sigma2
[mu sigma2] = estimateGaussian(X);

%  Returns the density of the multivariate normal at each data point (row)
%  of X
p = multivariateGaussian(X, mu, sigma2);

%  Visualize the fit
visualizeFit(X, mu, sigma2);
xlabel('Latency (ms)');
ylabel('Throughput (mb/s)');

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ================== Part 3: Find Outliers ===================
%  Now you will find a good epsilon threshold using a cross-validation set
%  probabilities given the estimated Gaussian distribution
%

pval = multivariateGaussian(Xval, mu, sigma2);

[epsilon F1] = selectThreshold(yval, pval);
fprintf('Best epsilon found using cross-validation: %e\n', epsilon);
fprintf('Best F1 on Cross Validation Set:  %f\n', F1);
fprintf('   (you should see a value epsilon of about 8.99e-05)\n');
fprintf('   (you should see a Best F1 value of  0.875000)\n\n');

%  Find the outliers in the training set and plot them
outliers = find(p < epsilon);

%  Draw a red circle around those outliers
hold on
plot(X(outliers, 1), X(outliers, 2), 'ro', 'LineWidth', 2, 'MarkerSize', 10);
hold off

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ================== Part 4: Multidimensional Outliers ===================
%  We will now use the code from the previous part and apply it to a
%  harder problem in which more features describe each datapoint and only
%  some features indicate whether a point is an outlier.
%

%  Loads the second dataset. You should now have the
%  variables X, Xval, yval in your environment
load('ex8data2.mat');

%  Apply the same steps to the larger dataset
[mu sigma2] = estimateGaussian(X);

%  Training set
p = multivariateGaussian(X, mu, sigma2);

%  Cross-validation set
pval = multivariateGaussian(Xval, mu, sigma2);

%  Find the best threshold
[epsilon F1] = selectThreshold(yval, pval);

fprintf('Best epsilon found using cross-validation: %e\n', epsilon);
fprintf('Best F1 on Cross Validation Set:  %f\n', F1);
fprintf('   (you should see a value epsilon of about 1.38e-18)\n');
fprintf('   (you should see a Best F1 value of 0.615385)\n');
fprintf('# Outliers found: %d\n\n', sum(p < epsilon));

2. Recommender systems

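A recommender system (collaborative filtering) is a two-way fitting algorithm: it fits the movie features X and the user parameters Theta at the same time. See my notes for the theory. Since this algorithm has little to do with image processing, I only give a brief code walkthrough here. Below is the function that computes its cost and gradient: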

function [J, grad] = cofiCostFunc(params, Y, R, num_users, num_movies, ...
                                  num_features, lambda)
%COFICOSTFUNC Collaborative filtering cost function
%   [J, grad] = COFICOSTFUNC(params, Y, R, num_users, num_movies, ...
%   num_features, lambda) returns the cost and gradient for the
%   collaborative filtering problem.
%

% Unfold the X and Theta matrices from params
X = reshape(params(1:num_movies*num_features), num_movies, num_features);
Theta = reshape(params(num_movies*num_features+1:end), ...
                num_users, num_features);

% You need to return the following values correctly
J = 0;
X_grad = zeros(size(X));
Theta_grad = zeros(size(Theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost function and gradient for collaborative
%               filtering. Concretely, you should first implement the cost
%               function (without regularization) and make sure it
%               matches our costs. After that, you should implement the
%               gradient and use the checkCostFunction routine to check
%               that the gradient is correct. Finally, you should implement
%               regularization.
%
% Notes: X - num_movies  x num_features matrix of movie features
%        Theta - num_users  x num_features matrix of user features
%        Y - num_movies x num_users matrix of user ratings of movies
%        R - num_movies x num_users matrix, where R(i, j) = 1 if the
%            i-th movie was rated by the j-th user
%
% You should set the following variables correctly:
%
%        X_grad - num_movies x num_features matrix, containing the
%                 partial derivatives w.r.t. each element of X
%        Theta_grad - num_users x num_features matrix, containing the
%                     partial derivatives w.r.t. each element of Theta
%

% Prediction error on rated entries only. Masking the whole difference
% (rather than only X*Theta') stays correct even if Y is nonzero where R is 0.
err = R .* (X * Theta' - Y);

J = sum(sum(err .^ 2)) / 2.0 + ...
    lambda / 2.0 * sum(sum(Theta .^ 2)) + lambda / 2.0 * sum(sum(X .^ 2));

X_grad = err * Theta + lambda .* X;
Theta_grad = err' * X + lambda .* Theta;

% =============================================================

grad = [X_grad(:); Theta_grad(:)];

end
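In equation form, the cost and gradients computed by cofiCostFunc can be sketched as follows. The notation is assumed here, since the post leaves the theory to the notes: x^{(i)} is row i of X (movie features), \theta^{(j)} is row j of Theta (user parameters), y^{(i,j)} and r(i,j) are the entries of Y and R, and k indexes features.

```latex
J = \frac{1}{2}\sum_{(i,j):\,r(i,j)=1}
      \left((\theta^{(j)})^{\top}x^{(i)} - y^{(i,j)}\right)^{2}
  + \frac{\lambda}{2}\sum_{j}\sum_{k}\left(\theta^{(j)}_{k}\right)^{2}
  + \frac{\lambda}{2}\sum_{i}\sum_{k}\left(x^{(i)}_{k}\right)^{2}

\frac{\partial J}{\partial x^{(i)}_{k}}
  = \sum_{j:\,r(i,j)=1}\left((\theta^{(j)})^{\top}x^{(i)} - y^{(i,j)}\right)\theta^{(j)}_{k}
  + \lambda\,x^{(i)}_{k}

\frac{\partial J}{\partial \theta^{(j)}_{k}}
  = \sum_{i:\,r(i,j)=1}\left((\theta^{(j)})^{\top}x^{(i)} - y^{(i,j)}\right)x^{(i)}_{k}
  + \lambda\,\theta^{(j)}_{k}
```

Multiplying element-wise by R in the code plays the role of the restriction r(i,j)=1 in the sums: unrated entries contribute zero error, so the two matrix products yield exactly these sums.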

This algorithm is unrelated to my field; I am just marking it here for reference.
