For finding local minima of differentiable functions, the work-horse algorithm of machine learning is gradient descent.
Gradient descent is really a special case of a wider family of non-linear optimization algorithms, and recognizing this fact allows us to address constraints, consider alternative metrics, and be more explicit about the approximations we use and the sequence of subproblems we solve.
The most basic form of gradient descent, where we seek to minimize an objective function over decision variable , is
where is some step-size that we must choose.
This equation is the solution that minimizes a local approximation of plus a regularization term that penalizes us for choosing that results in large updates. That is, gradient descent generates a sequence of candidate solutions by solving a sequence of subproblems of the form
Specifically, Eq. (1) solves
Exercise: Prove this by solving .
There are a number of generalizations of Eq. (2) beyond Eq. (3). For example,
- We may choose non-Euclidean update metrics .
- We may choose based on multiple past iterates.
- We may choose surrogate estimates .
- We may replace with a Lagrangian to address constraints.
In the following sections, I give examples of each of these generalizations.
Let us consider linear approximations of the form
and distances of the form
where is a (pseudo)metric on the tangent space of .
Under these conditions, it follows that
where is the Moore-Penrose (pseudo)inverse of . This update rule is commonly known as “preconditioned” gradient descent.
Importantly, these conditions correspond to the limit of updates in continuous-time for the true objective and any with positive-semidefinite Hessian, wherein
(Amid and Warmuth (2020)). To see this, let
It follows that
Note that we use the dot (e.g., over and ) to represent a time-derivative.
If our objective function is convex, we are guaranteed that the Hessian of , which we denote as , is positive semi-definite. That is,
When we substitute for in Eq. (5) (i.e., when we choose based on the Hessian of ), then the resulting subproblem (Eq. (2)) is simply a minimization over a 2nd-order Taylor approximation of at . The resulting update is known as Newton’s Method.
When used for solving least-squares problems (for residuals ), the Hessian may be calculated as
Ignoring the second term on the last line and retaining only the first as an approximation for yields the Gauss-Newton algorithm. For linear models, the second term is identically zero.
We may also choose values of that derive from spaces other than . For example, when parameterizes a probability distribution , from which the objective function derives, we may measure the magnitude of an update from to by the relative entropy from to .
This choice yields the update rule for Fisher-Rao natural gradient descent:
where is the Fisher metric tensor
One way to understand the effect of the metric is that it (un)warps the space of marginal updates , such that updates of the same induced magnitude contain the same amount of marginal information about .
In continuous time, Fisher-Rao natural gradient descent optimally approximates replicator dynamics (used to model evolution by natural selection) and continuous Bayesian inference (Raab et. al. (2022)). We may also choose different divergences in space, yielding other forms of “natural gradient descent” (Nurbekyan et. al. (2022)).
Vanilla gradient descent does not require a Euclidean metric.
As a counterexample, let where the numerator is an outer product and the denominator an inner product of with itself.
It follows that Therefore, the update rule that solves (Eq. (2)) is vanilla gradient descent:
So far, we have only discussed updates that rely on at most one previous iterate of (or or ) to regularize updates with a penalty term , thus defining “trust-regions” in which the approximation is assumed to be valid. We may also choose update penalties that rely on additional history of the solution candidates .
An example of a multi-iterate approach is given by the classical “Momentum” variant of gradient descent. Consider, for example, a penalty term that quadratically penalizes step-sizes with a linear term that encourages steps in same continued direction, as in
where . Alternatively, consider a quadratic penalty for the difference between the proposed update and the previous update with decay , as in
In either case,
If we solve Eq. (2) by setting for either choice of , the resulting update rule is
which can be decomposed, with an additional state variable , to
This update rule is the classical “Momentum” algorithm (Botev et. al. (2016)) with decay parameterized by .
There is yet another update penalty that may be used to derive the Momentum update rule. Specifically, let
We can naturally extend this distance penalty to account for higher-order derivatives as
where we use the state variable to represent the th time-derivative of , defined empirically by the recursive formula
By this choice of , if we solve Eq. (2) by setting at , we obtain the equation
and therefore, the update rules
For , Letting and , we recover the standard momentum update.
As varies from to , the minimum average error of a local approximation of along this path can be reduced by choosing an intermediate point , somewhere between and , instead of , about which to approximate the local behavior of .
Dealing with Constraints
Consider the basic (constrained) optimization problem
for vector-valued and .
In terms of the corresponding Lagrangian , the primal problem may be written
Intuitively, given that and are chosen adversarially (i.e., after is fixed), the problem is to choose such that the constraints on and are satisfied (otherwise, the objective can be made unboundedly positive by an adversary’s choice of ). Within this “feasible set” of values, should be minimized.
introducing update penalties for state variables , , and .
For the simple choices
and local, linear approximations
we have the update rules
where is taken element-wise and the gradients are taken with respect to (and these gradient components remain uncontracted with the dimensions of or ).
In practice, it is common to eliminate the mutual dependence between variables in Eq. (14) by replacing , , and on the right-hand side of each equation, thus yielding the standard primal-dual algorithm, which maintains iterates for the dual variables and in addition to the primal variable . The error of this approximation has order , though this is not strictly necessary.
When and the true solution to Eq. (14) may be found by iterating the recursive map
such that .
Generalizing Fletcher’s Method
In this section, we show how a rather interesting penalty choice for the dual variable , as used above, completely eliminates the need to track as an independent variable all and provides a straight-forward method for constrained optimization based on local gradients of the objective and the constraint.
Consider the problem
for vector and scalar-valued and .
Given standard assumptions such as convexity, this problem may be solved the sequence of subproblems
which, as in Eq. (4), we express using the local, linear approximations and . Note that we have no need to track the value of between iterates of the above subproblems. This is because, instead of penalizing the update magnitude , we penalize the distance of away from the value that renders a critical point of the Lagrangian. We have omitted an explicit time-index on despite the fact that its value may change with each iteration.
Solving the above subproblem, has the explicit solution given by
where represents an estimate for the gradient of the Lagrangian at .
Manipulating the above equations, we see that
From which it follows
By similar logic,
This suggests the following, alternative expression for our subproblem:
Intuitively, when is positive, in order to make progress towards the constraint , we ensure that the update is aligned with the negative gradient of . When is negative, and the constraint already satisfied, the update cannot be too aligned with increasing . Subject to these constraints may be chosen to make progress towards decreasing the objective .
- Conjugate Natural Selection. Raab et. al. (2022)
- Efficient Natural Gradient Descent Methods for Large-Scale PDE-Based Optimization Problems. Nurbekyan et. al. (2022)
- Reparameterizing Mirror Descent as Gradient Descent. Amid and Warmuth (2020)
- Advanced Algorithms (Lecture Notes). Anupam Gupta (2020)
- Nesterov’s accelerated gradient and momentum as approximations to regularised update descent. Botev et. al. (2016)
- Convex Optimization. Boyd and Vandenberghe (2009)
- Numerical Optimization. Nocedal and Wright (2006)
- Trust Region Methods. Conn, Gould, and Toint (2000)