2 - ML Refresher / Softmax Regression

Date (online)

9/15

Instructor

Kolter

Slides

Video

Basics of machine learning Machine learning as data-driven programmin Three ingredients of a machine learning algorithm Example: softmax regresssion Multi-class classification setting Matrix batch notation Loss function #1: classification error Loss function #2: softmax / cross-entropy loss The softmax regression optimization problem Optimization: gradient descent. Stochastic gradient descent The gradient of the softmax objective Putting it all together

本节主要讲了机器学习的基础知识，然后以softmax regression为例具体展开

Basics of machine learning

Machine learning as data-driven programmin

通过一个实际问题来引入了机器学习。该问题使用常规的编程很难解决，但可以用机器学习方法很好解决。

Three ingredients of a machine learning algorithm

Every machine learning algorithm consists of three different elements:

The hypothesis class: the “program structure”, parameterized via a set of parameters, that describes how we map inputs (e.g., images of digits) to outputs (e.g., class labels, or probabilities of different class labels)

The loss function: a function that specifies how “well” a given hypothesis (i.e., a choice of parameters) performs on the task of interest

An optimization method: a procedure for determining a set of parameters that (approximately) minimize the sum of losses over the training set

和《统计学习方法. 周志华》中所阐述的一样。机器学习算法包括三部分：模型、策略、算法。其中：

模型有一系列参数决定，它确定了一组输入对应的输出；

策略其实就是损失函数，用来衡量给定一个模型在所关注任务上执行效果的好坏；

算法即某种优化方法，用来确定使得损失值最小的时的模型参数。

Example: softmax regresssion

Multi-class classification setting

这里写作，只是为了方便表示，因为和向量的点乘相似。事实上，马上就被矩阵运算取代了。

Matrix batch notation

Often more convenient (and this is how you want to code things for efficiency) to write the data and operations in matrix batch form

Then the linear hypothesis applied to this batch can be written as

Note:

这里写为batch的形式，不仅是数学上表达更为方便，同时在运算上转为了矩阵相乘，可以有多种并行优化的算法用来提速。

上述从向量转为矩阵描述的过程适用所有ML 算法，至少了熟练能推导。

Loss function #1: classification error

The simplest loss function to use in classification is just the classification error, i.e., whether the classifier makes a mistake a or not：

We typically use this loss function to assess the quality of classifiers.

Unfortunately, the error is a bad loss function to use for optimization, i.e., selecting the best parameters, because it is not differentiable.

NOTE:

上述公式中 h(x) 为预测值，是一个 k维向量，在本章第一节有描述

上述定义的损失函数只是适用来评估模型的好坏（即量化模型的性能），比如计算准确率。但不适用来优化模型，因为其不可微分，具体来说，就是如果稍微改变了模型参数，但是该损失值可能不会改变，这就导致了该损失函数不能用来寻找最佳参数

Loss function #2: softmax / cross-entropy loss

Let's convert the hypothesis function to a "probability" by exponentiating and normalizing its entries (to make them all positive and sum to one)

Then let's define a loss to be the (negative) log probability of the true class: this is called softmax or cross-entropy loss

NOTE:

Tips: 交叉熵损失函数又称为 softmax，只不过交叉熵更为专业术语

模型的输出值有点像某种信念，为了把它转化为似概率的东西，或者说把输出和概率联系起来，这里先对其做指数保证了值为正数，然后再进行归一化，保证了和为1。（由概率公理可得，概率值取值范围为[0,1]，且和为1。）这样就使得模型的输出满足某个概率分布

这个归一化是对一个向量，即模型对每个样本的输出来归一化的，这个操作又称之为softmax。然后一般是和负对数融合的一种操作

为了数值计算的方便，通过 log 将概率相乘转化为相加。

一般不会直接计算softmax，而是将softmax和 -log 结合起来，计算最后化解的式子，因为我们关注的是loss，同时指数运算复杂度较高

The softmax regression optimization problem

The third ingredient of a machine learning algorithm is a method for solving the associated optimization problem, i.e., the problem of minimizing the average loss on the training set:

For softmax regression (i.e., linear hypothesis class and softmax loss):

So how do we find that solves this optimization problem ?

NOTE：

优化问题：最小化模型在所有样本的平均误差。这是机器学习的核心问题

上述第一个式子，为假设函数，即模型。为损失函数。所有机器学习监督模型最后都要走这一步或者能化解为这一步（可能有些微小的不同，比如正则项），即最小化训练集上平均误差或者总误差。

上述第二个式子为本例中softmax 回归的损失函数。

Optimization: gradient descent.

For a matrix input, scalar output function , the gradient is defined as the matrix of partial derivatives

Gradient points in the direction that most increases (locally)

To minimize a function, the gradient descent algorithm proceeds by iteratively taking steps in the direction of the negative gradient

where is a step size or learning rate.

NOTE：

矩阵的梯度和矩阵的形状一定是一样的

注意这里是偏导数，求第一个篇导的时候，其他的变量均为常数值！

中的下标表明的要对谁（对哪个变量）求微分；括号中的表明当前变量的取值，即在哪个位置求梯度；

梯度指向函数所在当前位置增加最快的方向。因此，只需要令参数不断的向负梯度方向移动，即可使得损失值降低。

Stochastic gradient descent

If our objective (as is the case in machine learning) is the sum of individual losses, we don’t want to compute the gradient using all examples to make a single update to the parameters

Instead, take many gradient steps each based upon a minibatch (small partition of the data), to make many parameter updates using a single “pass” over data

NOTE：

使用全部的数据，每次计算梯度，都需要计算所有样本上的梯度。计算消耗太大了，甚至数据可能无法全部存放到内存中。

因此有了随机梯度下降，每次迭代只随机选取一部分数据来执行梯度下降。

随机梯度下降，每次迭代时，都是用一部分数据的梯度来近似全部数据的梯度（真实梯度）。这样做的主要优势是快！其次在某些有噪声的数据上，随机梯度下降有了随机性可以跳出鞍点。

上式中为 batch 的大小。

The gradient of the softmax objective

So, how do we compute the gradient for the softmax objective?

Let’s start by deriving the gradient of the softmax loss itself: for vector

So, in vector form:

其中，为只有位置元素为，其余元素为的列向量。

So how do we compute the gradient

The chain rule of multivariate calculus ... but the dimensions of all the matrices and vectors get pretty cumbersome

Approach #1 (a.k.a. the right way): Use matrix differential calculus, Jacobians, Kronecker products, and vectorization

Approach #2 (a.k.a. the hacky quick way that everyone actually does ): Pretend everything is a scalar, use the typical chain rule, and then rearrange / transpose matrices/vectors to make the sizes work 😱 (and check your answer numerically)

NOTE：

计算多变量的微分，可以按照标准的算法来，涉及到 Kronecker 积、约当标准等，比较复杂；也可以把向量或者矩阵看作普通的标量然后使用链式法则，计算得到结果之后，再通过交换位置、转置使得结果形状正确。

Let’s compute the “derivative” of the loss:

红色部分维度为 , 绿色部分形状为，形状应该为，因此：

上面推导了单个样本的情况，以下推导批量使用矩阵形式的梯度：

红色部分形状为 m*k，绿色部分形状为 m*n，但是形状为。因此：

其中为每行一个的样本的标签的 one-hot 向量，中的每行均为一个样本的模型预测向量的softmax归一化之后的结果。

Putting it all together

Repeat until parameters / loss converges

Iterate over minibatches of training set；

Update the parameters