CS229 6.5 Neural Networks: Implementation of a Sparse Autoencoder

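This is a worked exercise on the sparse autoencoder. The task is roughly as follows: crop 10000 small patches of size 8*8 from a set of natural images, then train a sparse autoencoder on them to obtain the features learned by the hidden layer. The network has 3 layers: 64 input units, 25 hidden units, and 64 output units, since the output must have the same dimension as the input.
The main script runs in five steps after an initial parameter setup. The implementation of each function it calls is listed further down.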
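As a quick sanity check on the layer sizes described above, the following minimal sketch counts the parameters of a 64-25-64 network. It assumes the parameters are the two weight matrices plus one bias per hidden and output unit, which matches the initializeParameters function listed later; `numW1`, `numW2`, `numB` and `numParams` are illustrative names that do not appear in the original code.

```matlab
% Minimal sketch: count the parameters of the 64-25-64 autoencoder.
visibleSize = 8*8;   % 64 input (and output) units
hiddenSize  = 25;    % hidden units

numW1 = hiddenSize * visibleSize;   % weights input -> hidden
numW2 = visibleSize * hiddenSize;   % weights hidden -> output
numB  = hiddenSize + visibleSize;   % one bias per hidden and per output unit

numParams = numW1 + numW2 + numB;   % hypothetical name, for illustration only
fprintf('Total parameters: %d\n', numParams);   % prints: Total parameters: 3289
```

This is the length of the vector theta that the optimizer works on.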
%%======================================================================
%% STEP 0: Here we provide the relevant parameter values that will
%  allow your sparse autoencoder to get good filters; you do not need to
%  change the parameters below.

visibleSize = 8*8;     % number of input units
hiddenSize = 25;       % number of hidden units
sparsityParam = 0.01;  % desired average activation of the hidden units
                       % (denoted by the Greek letter rho, which looks
                       % like a lower-case "p", in the lecture notes)
lambda = 0.0001;       % weight decay parameter
beta = 3;              % weight of sparsity penalty term

%%======================================================================
%% STEP 1: Implement sampleIMAGES
%
%  After implementing sampleIMAGES, the display_network command should
%  display a random sample of 200 patches from the dataset

patches = sampleIMAGES;
display_network(patches(:,randi(size(patches,2),200,1)),8);

%  Obtain random parameters theta
theta = initializeParameters(hiddenSize, visibleSize);

%%======================================================================
%% STEP 2: Implement sparseAutoencoderCost
%
%  You can implement all of the components (squared error cost, weight decay term,
%  sparsity penalty) in the cost function at once, but it may be easier to do
%  it step-by-step and run gradient checking (see STEP 3) after each step.  We
%  suggest implementing the sparseAutoencoderCost function using the following steps:
%
%  (a) Implement forward propagation in your neural network, and implement the
%      squared error term of the cost function.  Implement backpropagation to
%      compute the derivatives.  Then (using lambda=beta=0), run Gradient Checking
%      to verify that the calculations corresponding to the squared error cost
%      term are correct.
%
%  (b) Add in the weight decay term (in both the cost function and the derivative
%      calculations), then re-run Gradient Checking to verify correctness.
%
%  (c) Add in the sparsity penalty term, then re-run Gradient Checking to
%      verify correctness.
%
%  Feel free to change the training settings when debugging your
%  code.  (For example, reducing the training set size or
%  number of hidden units may make your code run faster; and setting beta
%  and/or lambda to zero may be helpful for debugging.)  However, in your
%  final submission of the visualized weights, please use the parameters we
%  gave in STEP 0 above.

[cost, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                     lambda, sparsityParam, beta, patches);

%%======================================================================
%% STEP 3: Gradient Checking
%
% Hint: If you are debugging your code, performing gradient checking on smaller models
% and smaller training sets (e.g., using only 10 training examples and 1-2 hidden
% units) may speed things up.

% First, let's make sure your numerical gradient computation is correct for a
% simple function.  After you have implemented computeNumericalGradient.m,
% run the following:
checkNumericalGradient();

% Now we can use it to check your cost function and derivative calculations
% for the sparse autoencoder.
numgrad = computeNumericalGradient( @(x) sparseAutoencoderCost(x, visibleSize, ...
                        hiddenSize, lambda, sparsityParam, beta, patches), theta);

% Use this to visually compare the gradients side by side
disp([numgrad grad]);

% Compare numerically computed gradients with the ones obtained from backpropagation
diff = norm(numgrad-grad)/norm(numgrad+grad);
disp(diff); % Should be small. In our implementation, these values are
            % usually less than 1e-9.
            % When you get this working, congratulations!

%%======================================================================
%% STEP 4: After verifying that your implementation of
%  sparseAutoencoderCost is correct, you can start training your sparse
%  autoencoder with minFunc (L-BFGS).

%  Randomly initialize the parameters
theta = initializeParameters(hiddenSize, visibleSize);

%  Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % sparseAutoencoderCost.m satisfies this.
options.maxIter = 400;    % Maximum number of iterations of L-BFGS to run
options.display = 'on';

[opttheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, visibleSize, hiddenSize, ...
                            lambda, sparsityParam, beta, patches), theta, options);

%%======================================================================
%% STEP 5: Visualization

W1 = reshape(opttheta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
display_network(W1', 12);

print -djpeg weights.jpg   % save the visualization to a file

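%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Corresponds to STEP 1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Three functions: sampleIMAGES, normalizeData, initializeParameters
function patches = sampleIMAGES()
load IMAGES;    % load the 10 original 512x512 images
patchsize = 8;  % side length of each sampled patch
numpatches = 10000;

% Initialize the patch matrix to zeros. It is 64 x 10000, one patch per column.
patches = zeros(patchsize*patchsize, numpatches);

% IMAGES is a 3D array holding 10 images. IMAGES(:,:,6) is the 512x512 2D array
% of the sixth image, and the command "imagesc(IMAGES(:,:,6)), colormap gray;"
% visualizes it.
% Note: these images appear to have been preprocessed with whitening.
% Indexing a sub-block such as IMAGES(r1:r2,c1:c2,1) yields the patch that
% covers pixels (r1,c1) to (r2,c2) of the first image.

% Randomly pick 1000 patches from each image, 10000 patches in total.
for imageNum = 1:10
    [rowNum, colNum] = size(IMAGES(:,:,imageNum));
    for patchNum = 1:1000   % 1000 patches per image
        % top-left corner of the patch
        xPos = randi([1, rowNum-patchsize+1]);
        yPos = randi([1, colNum-patchsize+1]);
        % store the 8x8 patch as one 64x1 column of the matrix
        patches(:,(imageNum-1)*1000+patchNum) = ...
            reshape(IMAGES(xPos:xPos+7,yPos:yPos+7,imageNum),64,1);
    end
end

% The autoencoder uses the sigmoid activation, whose output lies in [0,1].
% For h_{W,b}(x) = x to be attainable, the input x must also lie in [0,1],
% so the patches are normalized.
patches = normalizeData(patches);
end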

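% Normalization function. It relies on the 3-sigma rule (which the author was
% unsure about): values beyond 3 standard deviations are treated as outliers
% and clipped before rescaling.
function patches = normalizeData(patches)
% Subtract the mean of each patch (column)
patches = bsxfun(@minus, patches, mean(patches));

% s = std(X): when X is a vector, returns its standard deviation (note that
% the denominator is n-1, not n), i.e. the square root of the unbiased
% variance estimate, assuming X holds i.i.d. samples.
% When X is a matrix, std returns a row vector with the standard deviation
% of each column. Here patches(:) is a single vector, so pstd is a scalar.
pstd = 3 * std(patches(:));
patches = max(min(patches, pstd), -pstd) / pstd;   % clip to +/-3 sigma, scale to [-1,1]

% Rescale from [-1,1] to [0.1,0.9]
patches = (patches + 1) * 0.4 + 0.1;
end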
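To make the effect of the normalization concrete, here is a minimal sketch that applies the same three operations (mean removal, clipping at three standard deviations, linear rescaling) to synthetic data. The matrix `X` is random stand-in data rather than real image patches, and the constants 3, 1, 0.4 and 0.1 are assumed to match the function above.

```matlab
% Minimal sketch of the normalization steps on synthetic data.
% X stands in for the patch matrix (one patch per column); it is NOT real image data.
X = randn(64, 500) * 2 + 5;

X = bsxfun(@minus, X, mean(X));        % subtract each column's mean
pstd = 3 * std(X(:));                  % clipping threshold: 3 standard deviations
X = max(min(X, pstd), -pstd) / pstd;   % clip to [-pstd, pstd], then scale to [-1, 1]
X = (X + 1) * 0.4 + 0.1;               % map [-1, 1] linearly onto [0.1, 0.9]

fprintf('min = %.4f, max = %.4f\n', min(X(:)), max(X(:)));
```

Keeping the inputs inside [0.1, 0.9] rather than the full [0, 1] range keeps the targets away from the values a sigmoid output only reaches asymptotically.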

% Parameter initialization
function theta = initializeParameters(hiddenSize, visibleSize)
% Initialize parameters randomly based on layer sizes.
% We'll choose weights uniformly from the interval [-r, r].
r  = sqrt(6) / sqrt(hiddenSize+visibleSize+1);

% rand(a,b) returns an a-by-b matrix of uniformly distributed values between 0 and 1.
% rand(a,b)*2*r therefore lies between 0 and 2r, and rand(a,b)*2*r - r between -r and r.
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;

b1 = zeros(hiddenSize, 1);  % biases feeding the hidden units
b2 = zeros(visibleSize, 1); % biases feeding the output layer

% Concatenate all matrices into a single column vector
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];
end
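The packing into a single vector only works if the cost function later slices the vector in exactly the same order. The following minimal sketch checks that round trip on a toy network. The sizes 3 and 4 are arbitrary, the names `nW`, `W1r`, `W2r`, `b1r` and `b2r` are illustrative, and the constants 6 and 2 in the initialization are assumed here to be the ones used by the function above.

```matlab
% Minimal sketch: pack toy parameters into one vector and recover them.
hiddenSize = 3; visibleSize = 4;            % toy sizes, for illustration only
r  = sqrt(6) / sqrt(hiddenSize + visibleSize + 1);
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;
b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);
theta = [W1(:); W2(:); b1(:); b2(:)];       % same order as initializeParameters

nW  = hiddenSize * visibleSize;             % number of entries in each weight matrix
W1r = reshape(theta(1:nW), hiddenSize, visibleSize);
W2r = reshape(theta(nW+1:2*nW), visibleSize, hiddenSize);
b1r = theta(2*nW+1:2*nW+hiddenSize);
b2r = theta(2*nW+hiddenSize+1:end);

% Displays 1 when both weight matrices survive the round trip unchanged.
disp(isequal(W1, W1r) && isequal(W2, W2r));
```

Because both `W1(:)` and `reshape` use column-major order, no transposition is needed as long as the target shapes match the original ones.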

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Corresponds to STEP 2 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% Returns the value and the gradient of the sparse autoencoder cost %%%%%%%%%
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)
% visibleSize:   number of input units
% hiddenSize:    number of hidden units
% lambda:        weight decay coefficient
% sparsityParam: target average activation of the hidden units (rho)
% beta:          weight of the sparsity penalty term
% data:          64x10000 training matrix; data(:,i) is the i-th training example
%
% The parameters arrive packed into one vector because L-BFGS optimizes over a vector.
% Unpack that vector into the weight matrix and bias vector of each layer.

% Elements 1 through hiddenSize*visibleSize of theta hold W1; reshape them into a matrix.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
% The remaining blocks follow in the same way: W2, then b1, then b2.
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

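% Cost and gradient accumulators, one per parameter block
cost = 0;
W1grad = zeros(size(W1));
W2grad = zeros(size(W2));
b1grad = zeros(size(b1));
b2grad = zeros(size(b2));

Jcost = 0;   % reconstruction error
Jweight = 0; % weight decay penalty
Jsparse = 0; % sparsity penalty
[n, m] = size(data); % m is the number of examples, n the number of features

% Forward pass: compute the pre-activation and activation of every unit.
% W1 is hiddenSize x visibleSize; data is visibleSize x m.
% repmat(b1,1,m) replicates the column vector b1 into a hiddenSize x m matrix.
% Formula: z^(l) = W^(l-1) * a^(l-1) + b^(l-1)
% z2 holds the hidden-layer inputs for all examples (hiddenSize x m, one column per example).
z2 = W1*data + repmat(b1,1,m);  % input to the hidden layer
a2 = sigmoid(z2);               % hidden-layer output
z3 = W2*a2 + repmat(b2,1,m);    % input to the output layer
a3 = sigmoid(z3);               % output-layer output

% Reconstruction error J(W,b): the inner sum runs over the output components,
% the outer sum over all examples.
Jcost = (0.5/m)*sum(sum((a3-data).^2));

% Weight decay term (not yet multiplied by lambda)
Jweight = (1/2)*(sum(sum(W1.^2))+sum(sum(W2.^2)));

% Sparsity term. sum(matrix,2) sums along each row, so this averages every
% hidden unit's activation over all examples.
% rho is a hiddenSize x 1 vector of average hidden activations.
rho = (1/m).*sum(a2,2);

% KL-divergence sparsity penalty
Jsparse = sum(sparsityParam.*log(sparsityParam./rho)+(1-sparsityParam).*log((1-sparsityParam)./(1-rho)));

% Total cost: reconstruction error + weight decay + sparsity penalty
cost = Jcost + lambda*Jweight + beta*Jsparse;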

% Error term delta3 of layer l = 3 (the output layer). In an autoencoder the
% target equals the input, so the target of h_{W,b}(x) is x.
% Note: despite its name, sigmoidInv returns the derivative of the sigmoid, not its inverse.
delta3 = -(data-a3).*sigmoidInv(z3);

% The sparsity penalty contributes an extra term to the hidden-layer error.
% sterm is a hiddenSize x 1 vector.
sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho));

% W2 is 64x25 and delta3 is 64x10000 (one column per example x^(i)); W2' is the transpose of W2.
% repmat(sterm,1,m) replicates sterm into a matrix with m identical columns.
% delta2 is hiddenSize x 10000.
delta2 = (W2'*delta3+repmat(sterm,1,m)).*sigmoidInv(z2);

% W1grad: data' is 10000x64, so delta2*data' is 25x64
W1grad = W1grad+delta2*data';
W1grad = (1/m)*W1grad+lambda*W1;

% W2grad
W2grad = W2grad+delta3*a2';
W2grad = (1/m).*W2grad+lambda*W2;

% b1grad: the gradient with respect to b is a vector, so sum each row over all examples
b1grad = b1grad+sum(delta2,2);
b1grad = (1/m)*b1grad;

% b2grad
b2grad = b2grad+sum(delta3,2);
b2grad = (1/m)*b2grad;

% Pack the gradients back into a single vector
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
end
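For reference, the quantities computed by the cost function above can be written out as follows. This is a transcription of the code into notation, with the assumed correspondences \rho = sparsityParam, \hat\rho = the code variable rho, s_2 = hiddenSize, f = the sigmoid, and \odot = element-wise product; it is meant as a reading aid rather than a derivation.

```latex
\begin{aligned}
J(W,b) &= \frac{1}{2m}\sum_{i=1}^{m}\bigl\lVert a^{(3)}(x^{(i)}) - x^{(i)}\bigr\rVert^{2}
        + \frac{\lambda}{2}\Bigl(\lVert W^{(1)}\rVert_F^{2} + \lVert W^{(2)}\rVert_F^{2}\Bigr)
        + \beta\sum_{j=1}^{s_2}\mathrm{KL}\bigl(\rho\,\Vert\,\hat\rho_j\bigr), \\
\hat\rho_j &= \frac{1}{m}\sum_{i=1}^{m} a^{(2)}_j(x^{(i)}), \qquad
\mathrm{KL}\bigl(\rho\,\Vert\,\hat\rho_j\bigr)
   = \rho\log\frac{\rho}{\hat\rho_j} + (1-\rho)\log\frac{1-\rho}{1-\hat\rho_j}, \\
\delta^{(3)} &= -\bigl(x - a^{(3)}\bigr)\odot f'\bigl(z^{(3)}\bigr), \\
\delta^{(2)} &= \Bigl(W^{(2)\top}\delta^{(3)}
   + \beta\Bigl(-\frac{\rho}{\hat\rho} + \frac{1-\rho}{1-\hat\rho}\Bigr)\Bigr)\odot f'\bigl(z^{(2)}\bigr), \\
\nabla_{W^{(1)}}J &= \frac{1}{m}\sum_{i=1}^{m}\delta^{(2)}(x^{(i)})\,x^{(i)\top} + \lambda W^{(1)}, \qquad
\nabla_{W^{(2)}}J = \frac{1}{m}\sum_{i=1}^{m}\delta^{(3)}(x^{(i)})\,a^{(2)}(x^{(i)})^{\top} + \lambda W^{(2)}, \\
\nabla_{b^{(1)}}J &= \frac{1}{m}\sum_{i=1}^{m}\delta^{(2)}(x^{(i)}), \qquad
\nabla_{b^{(2)}}J = \frac{1}{m}\sum_{i=1}^{m}\delta^{(3)}(x^{(i)}).
\end{aligned}
```

Note that the biases receive neither weight decay nor a direct sparsity term, which is why their gradients are plain averages of the error terms.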

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.  This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)).
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

% Derivative of the sigmoid function: f'(x) = f(x) .* (1 - f(x))
function sigmInv = sigmoidInv(x)
    sigmInv = sigmoid(x).*(1-sigmoid(x));
end
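The derivative formula f'(x) = f(x)(1 - f(x)) can be checked numerically before it is trusted inside backpropagation. The minimal sketch below compares it with a central difference at a few points. The anonymous functions `sigmoid` and `dsigmoid`, the step size `h`, and the test points are choices made for this illustration only.

```matlab
% Minimal sketch: compare the analytic sigmoid derivative with a central difference.
sigmoid  = @(x) 1 ./ (1 + exp(-x));
dsigmoid = @(x) sigmoid(x) .* (1 - sigmoid(x));   % analytic derivative f(x).*(1-f(x))

x = linspace(-5, 5, 11)';                          % a few test points
h = 1e-5;                                          % step size (arbitrary small value)
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2*h);

maxErr = max(abs(numeric - dsigmoid(x)));
fprintf('max abs error: %.3e\n', maxErr);
```

The derivative peaks at x = 0, where it equals 0.25, and shrinks toward zero for large |x|.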

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Corresponds to STEP 3 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Three functions: checkNumericalGradient, simpleQuadraticFunction, computeNumericalGradient
function [] = checkNumericalGradient()
x = [4; 10];

% Exact value and exact gradient of the simple test function at x
[value, grad] = simpleQuadraticFunction(x);

% Numerically estimate the gradient of the simple function at x
% ("@simpleQuadraticFunction" denotes a pointer to a function.)
numgrad = computeNumericalGradient(@simpleQuadraticFunction, x);

% disp() plays the role of print()
disp([numgrad grad]);
fprintf('The above two columns you get should be very similar.\n(Left-Your Numerical Gradient, Right-Analytical Gradient)\n\n');

% norm(X) is equivalent to sqrt(sum(X.^2)). If the implementation is correct
% and EPSILON = 0.0001, the difference should be about 2.1452e-12.
diff = norm(numgrad-grad)/norm(numgrad+grad);
disp(diff);
fprintf('Norm of the difference between numerical and analytical gradient (should be < 1e-9)\n\n');
end

% This simple function is used to verify that computeNumericalGradient is correct.
function [value,grad] = simpleQuadraticFunction(x)
% This function accepts a 2D vector as input.
% Its outputs are:
%   value: h(x1, x2) = x1^2 + 3*x1*x2
%   grad: A 2x1 vector that gives the partial derivatives of h with respect to x1 and x2
% Note that when we pass @simpleQuadraticFunction to computeNumericalGradient, we're assuming
% that computeNumericalGradient will use only the first returned value of this function.
value = x(1)^2 + 3*x(1)*x(2);

grad = zeros(2, 1);
grad(1)  = 2*x(1) + 3*x(2);
grad(2)  = 3*x(1);
end

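% Numerical gradient used for gradient checking
function numgrad = computeNumericalGradient(J, theta)
% theta: the parameters, either a column vector or a scalar
% J: a function that returns a real number; calling y = J(theta) gives the
%    function value at theta

% numgrad starts at zero and has the same dimensions as theta
numgrad = zeros(size(theta));
EPSILON = 1e-4;

% theta is a column vector, so size(theta,1) is its number of rows
n = size(theta,1);

% n-by-n identity matrix
E = eye(n);
for i = 1:n
    % (n,:) selects row n with all columns; (:,n) selects all rows of column n.
    % Since E is the identity, delta is EPSILON in position i and zero elsewhere.
    delta = E(:,i)*EPSILON;
    % central-difference estimate of the i-th partial derivative
    numgrad(i) = (J(theta+delta)-J(theta-delta))/(EPSILON*2.0);
end
end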
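The central-difference scheme can be tried on its own, without the rest of the exercise. The minimal sketch below applies it to h(x1, x2) = x1^2 + 3*x1*x2, with the evaluation point [4; 10] and the step 1e-4 assumed to match the listing. The anonymous function `h`, the perturbation vector `e` and the name `relDiff` are illustrative and not part of the original code.

```matlab
% Minimal sketch: central-difference gradient of h(x1,x2) = x1^2 + 3*x1*x2.
h  = @(x) x(1)^2 + 3*x(1)*x(2);
x0 = [4; 10];          % evaluation point (assumed to match the listing)
EPSILON = 1e-4;        % step size (assumed to match the listing)

numgrad = zeros(size(x0));
for i = 1:numel(x0)
    e = zeros(size(x0));
    e(i) = EPSILON;                                    % perturb only coordinate i
    numgrad(i) = (h(x0 + e) - h(x0 - e)) / (2*EPSILON);
end

grad = [2*x0(1) + 3*x0(2); 3*x0(1)];                   % analytic gradient: [38; 12]
relDiff = norm(numgrad - grad) / norm(numgrad + grad); % same metric as the main script
disp([numgrad grad]);
disp(relDiff);
```

Building one perturbation vector per coordinate avoids allocating the full n-by-n identity matrix, which may matter when theta has thousands of entries. The result is the same either way.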

%% ---------------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Corresponds to STEP 5 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Visualization function %%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [h, array] = display_network(A, opt_normalize, opt_graycolor, cols, opt_colmajor)
% This function visualizes filters in matrix A. Each column of A is a
% filter. We reshape each column into a square image and show it in one
% cell of the visualization panel.
% All other parameters are optional; usually you do not need to worry
% about them.
% opt_normalize: whether we need to normalize the filters so that all of
% them have similar contrast. Default value is true.
% opt_graycolor: whether we use gray as the heat map. Default is true.
% cols: how many columns there are in the display. Default value is the
% square root of the number of columns in A.
% opt_colmajor: you can switch convention to row major for A. In that
% case, each row of A is a filter. Default value is false.
warning off all

if ~exist('opt_normalize', 'var') || isempty(opt_normalize)
    opt_normalize = true;
end

if ~exist('opt_graycolor', 'var') || isempty(opt_graycolor)
    opt_graycolor = true;
end

if ~exist('opt_colmajor', 'var') || isempty(opt_colmajor)
    opt_colmajor = false;
end

% rescale
A = A - mean(A(:));

if opt_graycolor, colormap(gray); end

% compute rows, cols
[L, M] = size(A);
sz = sqrt(L);   % side length of each square filter image
buf = 1;        % width of the border between cells
if ~exist('cols', 'var')
    if floor(sqrt(M))^2 ~= M
        n = ceil(sqrt(M));
        while mod(M, n) ~= 0 && n < 1.2*sqrt(M), n = n+1; end
        m = ceil(M/n);
    else
        n = sqrt(M);
        m = n;
    end
else
    n = cols;
    m = ceil(M/n);
end

array = -ones(buf+m*(sz+buf), buf+n*(sz+buf));

if ~opt_graycolor
    array = 0.1 .* array;
end

if ~opt_colmajor
    k = 1;
    for i = 1:m
        for j = 1:n
            if k > M
                continue;
            end
            clim = max(abs(A(:,k)));
            if opt_normalize
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz)) = reshape(A(:,k),sz,sz)/clim;
            else
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz)) = reshape(A(:,k),sz,sz)/max(abs(A(:)));
            end
            k = k+1;
        end
    end
else
    k = 1;
    for j = 1:n
        for i = 1:m
            if k > M
                continue;
            end
            clim = max(abs(A(:,k)));
            if opt_normalize
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz)) = reshape(A(:,k),sz,sz)/clim;
            else
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz)) = reshape(A(:,k),sz,sz);
            end
            k = k+1;
        end
    end
end

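% Note: the 'EraseMode' property was removed in MATLAB R2014b; on newer
% releases, drop the 'EraseMode','none' pair from the calls below.
if opt_graycolor
    h = imagesc(array,'EraseMode','none',[-1 1]);
else
    h = imagesc(array,'EraseMode','none',[-1 1]);
end
axis image off

drawnow;

warning on all
end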