Research Article

Approximating Hessian matrices using Bayesian inference: a new approach for quasi-Newton methods in stochastic optimization

Received 08 Sep 2022, Accepted 13 Mar 2024, Published online: 29 Apr 2024

Abstract

Using quasi-Newton methods in stochastic optimization is not a trivial task given the difficulty of extracting curvature information from noisy gradients. Moreover, preconditioning noisy gradient observations tends to amplify the noise. We propose a Bayesian approach to obtain a Hessian matrix approximation for stochastic optimization that minimizes the residual of the secant equations while keeping the extreme eigenvalues within a specified range. Thus, the proposed approach assists stochastic gradient descent in converging to local minima without amplifying the gradient noise. We propose maximizing the log posterior using the Newton-CG method. Numerical results on a stochastic quadratic function and an ℓ2-regularized logistic regression problem are presented. In all the cases tested, our approach improves the convergence of stochastic gradient descent, compensating for the overhead of solving the log posterior maximization. In particular, preconditioning the stochastic gradient with the inverse of our Hessian approximation becomes more advantageous the larger the condition number of the problem is.

1. Introduction

First-order optimization methods are a common choice for performing local search when gradients are available. In the strongly convex case, the gradient descent method converges linearly, requiring a number of iterations proportional to the condition number [Citation21, Theorem 2.1.15], that is, the ratio between the upper and lower bounds of the eigenvalues of the Hessian matrix in the search domain. However, first-order methods may not be a practical choice when the condition number of a problem is large, as they advance little towards the optimum at each iteration. One way to mitigate this dependency is to use quasi-Newton methods, which use the secant equation to obtain an approximation of the Hessian matrix of the objective function from subsequent gradient evaluations. Then, one may precondition the gradient with the inverse of the Hessian approximation, furnishing a Newton-like method without the cost of computing the true Hessian or using a finite-differences type of approximation [Citation10].

In the case of stochastic optimization, the stochastic gradient descent method (SGD) is the canonical first-order method. The asymptotic convergence of SGD is dominated by the noise in gradient evaluations (cf. [Citation24, Theorem 3]); thus, controlling the condition number of the problem does not improve convergence. However, controlling the relative statistical error of the gradient estimates using a uniform bound recovers linear convergence per iteration for the minimization of strongly convex functions [Citation8,Citation24]. In this case, the convergence rate depends on the condition number of the problem; hence, preconditioning the gradient with the Hessian inverse might be desirable. The natural approach is to use quasi-Newton methods to approximate the Hessian matrix, but devising quasi-Newton methods for stochastic optimization is challenging due to the noisy gradient observations. Naively fitting the secant equations on noisy gradient observations often leads to catastrophic errors in the Hessian approximations [Citation24, Section 4.3.3].

There are many works addressing the development of quasi-Newton methods in stochastic optimization. One of the first attempts is that of Schraudolph, Yu, and Günther [Citation26], who adapt the L-BFGS method to the stochastic setting by adding damping parameters that control the effect of noise, furnishing the oL-BFGS method. Their approach is empirical, and not much can be said about the quality of the Hessian approximation. Bordes and Pierre [Citation3] propose a diagonal rescaling matrix for linear support vector machines based on oL-BFGS, providing an analysis of their approach. In their numerical tests, their method matched oL-BFGS in convergence per epoch but with a significant improvement in terms of runtimes. Hennig [Citation15] uses a Bayesian approach to learn about an objective function from noisy gradient evaluations: he models the prior distribution of the objective function as a Gaussian process with a squared exponential kernel and, using the linearity of Gaussian processes, derives prior distributions for the gradient and the Hessian of the objective function. Then, assuming Gaussian likelihoods for both the gradient and the Hessian, he deduces closed-form equations for the mean of the (also Gaussian) posterior distribution. Byrd et al. [Citation6] use a sample average approximation to compute the search direction in Newton-CG and L-BFGS; as a consequence, their estimates carry a bias instead of a statistical error. Sohl-Dickstein, Poole, and Ganguli [Citation28] develop a quasi-Newton method for sums of functions where they keep a quadratic approximation for each of these functions, subsampling them at each iteration and updating the respective quadratic approximation with new gradient information. They use a shared, low-dimensional subspace to avoid the cost of storing a Hessian matrix for each function. Byrd et al. [Citation7] propose a stochastic quasi-Newton method comprising a two-loop scheme where the outer loop updates the curvature pairs in an L-BFGS scheme (with large sample sizes), and the inner loop performs preconditioned stochastic gradient descent (with smaller sample sizes). Thus, the method proposed by Byrd et al. [Citation7] does not incorporate the information from the noisy gradients of the inner loops in the Hessian approximation. Moritz, Nishihara, and Jordan [Citation20] improve the stochastic quasi-Newton method of Byrd et al. [Citation7] by combining it with a control variate method to estimate the gradient at each iteration, achieving linear convergence.

Gower, Goldfarb, and Richtárik [Citation13] introduce a block BFGS method that updates the Hessian approximation at each iteration using a sample independent of the sample used to compute the gradient. The proposed block update fits the Hessian inverse to a sketched Hessian computed from a sub-sample of the data, but requires a Cholesky decomposition at each iteration. Wang et al. [Citation30] propose a damped quasi-Newton method for stochastic optimization that uses all noisy gradient evaluations to approximate the Hessian inverse. They provide a convergence proof for nonconvex problems, provided the Hessian eigenvalues have a uniform lower bound. Li [Citation18] preconditions the stochastic gradient estimates using a matrix computed to fit the secant equation while avoiding the amplification of gradient estimate noise. Bollapragada et al. [Citation2] use the standard L-BFGS equations in stochastic optimization by increasing sample sizes to keep the relative error of the gradients uniformly bounded, and recommend a stochastic line-search procedure to be used with their method. Goldfarb, Ren, and Bahamou [Citation12] develop a quasi-Newton method for deep neural networks that approximates Hessian inverses as block matrices such that each block is decomposed as the Kronecker product of smaller matrices. More recently, Shi et al. [Citation27] build on their analysis in [Citation31] to construct a quasi-Newton method for stochastic optimization with a noise-resilient Wolfe–Armijo line search. Their method, however, requires the gradient estimate to be bounded above in norm by a known constant, which they use in their line search. We refer the interested reader to [Citation4,Citation19] for extensive reviews of quasi-Newton methods for stochastic optimization.

Most approaches that tackle the condition number issue in stochastic optimization use equations from deterministic quasi-Newton methods and somehow circumvent the randomness. Here, we propose a Bayesian framework to build a posterior distribution for the Hessian matrix from noisy gradient observations. In Bayesian inference, a posterior distribution is built from prior and likelihood distributions. We model the prior distribution for the Hessian matrix with a probability density function (pdf) that decreases exponentially as the Hessian moves farther away (in the Frobenius norm sense) from the initial Hessian estimate. This approach is motivated by classic deterministic quasi-Newton methods, where a Frobenius norm regularizer centred at the current Hessian is used to circumvent ill-posedness [Citation10]. Moreover, we impose constraints on the extremes of the Hessian matrix eigenvalues, enforcing positive-definiteness and gradient smoothness. We model the likelihood using the secant equation, i.e. how likely it is to obtain the observed gradient differences for a given Hessian. The Hessian that maximizes the posterior distribution, i.e. the Maximum a Posteriori (MAP), is defined as the Hessian approximation.

The choice of prior and likelihood distributions guarantees that the negative log posterior is strongly convex. Therefore, we opt to use the Newton-CG method to find the MAP. Although it is possible to compute the second-order derivatives of the negative log posterior of the Hessian directly, doing so could be expensive and memory-intensive. Keep in mind that, for a d-dimensional problem, the Hessian is a d×d matrix; thus, the second-order derivatives of the log posterior must be gathered in a fourth-order tensor with $d^4$ components. Instead of assembling this tensor, we compute its action on any symmetric matrix of adequate dimensions and then use the conjugate gradient method to find the Newton direction. Moreover, to enforce the stability of Newton-CG, we use the central-path interior-point method to impose the prior conditions on the eigenvalues of the Hessian approximation, i.e. we start with a large penalization parameter and decrease it in steps, keeping the problem well-conditioned. If properly tuned, our approach not only guarantees that Newton-CG converges to the MAP but also results in a slow decay of the Hessian approximation's smallest eigenvalue, thus improving the stability of the SGD preconditioned with our Hessian approximation.

Other works address the problem of approximating a Hessian from noisy gradient observations. Hennig [Citation15] also uses Bayesian inference, but to build a posterior distribution of the objective function as a Gaussian process. The main difference between the proposed approach and that of Hennig [Citation15] is in the way the prior and likelihood distributions are modelled. Moreover, Hennig [Citation15] assumes in his analysis that the components of the gradient noise are independent and have the same variance. In the same spirit as the present work, Li [Citation18] also solves an optimization sub-problem at each iteration to find a Hessian approximation from noisy gradients. His approach differs from ours in the way the available information is used to estimate the preconditioning matrix; here, we use a Bayesian approach to build a posterior distribution of the Hessian matrix, while [Citation18] obtains a point estimate by minimizing a different objective function.

To illustrate the effectiveness of our approach, we solve two stochastic optimization problems. The first is minimizing a quadratic function, and the second is training an ℓ2-regularized logistic regression model. In all cases, we start with SGD preconditioned with a diagonal matrix and update the preconditioning matrix at intervals using the proposed Bayesian approach. We study cases with up to 300 design variables and condition numbers of orders up to $10^9$. We present results combining the proposed Bayesian Hessian approach with different variations of SGD, namely, vanilla SGD, SGD-MICE [Citation8], SVRG [Citation16], and SARAH [Citation22].

1.1. Problem definition

The stochastic optimization problem is stated as follows. Let $\xi$ be the design variable in dimension $d_\xi$ and $\theta$ be a vector-valued random variable in dimension $d_\theta$ equipped with a probability density function $\varrho:\mathbb{R}^{d_\theta}\to\mathbb{R}_+$. Here, $\mathbb{E}[\cdot\,|\,\xi]$ and $\mathbb{V}[\cdot\,|\,\xi]$ are the expectation and variance operators conditioned on $\xi$, respectively. Aiming at minimizing the expectation conditioned on $\xi$, we state the problem as
(1) $\xi^* = \arg\min_{\xi\in\Xi} \mathbb{E}[f(\xi,\theta)\,|\,\xi]$,
where $\Xi\subseteq\mathbb{R}^{d_\xi}$ is a feasible set and $f:\Xi\times\mathbb{R}^{d_\theta}\to\mathbb{R}$. In what follows, let $F(\xi):=\mathbb{E}[f(\xi,\theta)\,|\,\xi]$, where $F:\Xi\to\mathbb{R}$.

In minimizing (1) with respect to the design variable $\xi\in\Xi$, vanilla SGD is constructed with the update rule
(2) $\xi_{k+1} = \xi_k - \eta_k \upsilon_k$,
where $\eta_k$ is the step size at iteration k, and $\upsilon_k$ is an unbiased estimator of the gradient of F at $\xi_k$. For instance, an unbiased estimator $\upsilon_k$ of the gradient of F at $\xi_k$ may be constructed using a Monte Carlo estimator,
(3) $\nabla_\xi F(\xi_k) = \mathbb{E}[\nabla_\xi f(\xi_k,\theta)\,|\,\xi_k] \approx \upsilon_k := \frac{1}{b}\sum_{i=1}^{b} \nabla_\xi f(\xi_k,\theta_i)$,
with b independent and identically distributed (iid) random variables $\theta_i\sim\varrho$. The estimator (3) is a random variable and its use in optimization algorithms gives rise to the so-called Stochastic Optimizers.
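For concreteness, the following is a minimal NumPy sketch of the update rule (2) with the Monte Carlo estimator (3). The callables grad_f and sampler are hypothetical stand-ins for the user's problem, not part of the paper's implementation.

```python
import numpy as np

def sgd(grad_f, sampler, xi0, eta, b, n_iter, seed=0):
    """Vanilla SGD, eq. (2), with the Monte Carlo gradient estimator, eq. (3).

    grad_f(xi, theta): gradient of f with respect to xi for one sample theta.
    sampler(rng, b):   draws b iid samples of theta from the density rho.
    """
    rng = np.random.default_rng(seed)
    xi = np.asarray(xi0, dtype=float).copy()
    for k in range(n_iter):
        thetas = sampler(rng, b)
        # Unbiased Monte Carlo estimate of the gradient of F at xi_k, eq. (3)
        upsilon = np.mean([grad_f(xi, th) for th in thetas], axis=0)
        xi -= eta * upsilon  # update rule (2) with a fixed step size
    return xi
```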

First-order methods are the most popular choice in stochastic optimization due to their robustness to noise. However, these methods might suffer from slow convergence when the condition number of the problem is large. Here, we are interested in finding a matrix $\hat B_k$ that approximates $\nabla^2 F(\xi_k)$ so that its inverse can be used as a preconditioner for the stochastic gradient,
(4) $\xi_{k+1} = \xi_k - \eta_k \hat B_k^{-1}\upsilon_k$,
thus diminishing the effect of the condition number on the optimization convergence. In the deterministic setting, quasi-Newton methods circumvent the sensitivity to the condition number by using the gradient observations to construct a Hessian matrix approximation at the current iterate. However, in the stochastic optimization setting, we first have to address the following questions:

Q1:

How does one obtain a Hessian approximation from noisy gradient observations?

Q2:

How do we avoid amplifying the noise in the gradient estimates with our preconditioner matrix $\hat B_k^{-1}$?

Q3:

Is an accurate Hessian approximation enough to mitigate the effect of the condition number, even though it acts on noisy gradient estimates?

To address Q1, consider the curvature pairs $y_k := \nabla_\xi F(\xi_k) - \nabla_\xi F(\xi_{k-1})$ and $s_k := \xi_k - \xi_{k-1}$. Thus, we first aim to find the Hessian approximation that minimizes the residual of the secant equations, that is,
(5) $\min_{B\in\mathbb{R}^{d_\xi\times d_\xi}} \sum_{\ell=1}^{k} \|B s_\ell - y_\ell\|^2$.
However, we do not have access to $y_k := \mathbb{E}[\nabla_\xi f(\xi_k,\theta)\,|\,\xi_k] - \mathbb{E}[\nabla_\xi f(\xi_{k-1},\theta)\,|\,\xi_{k-1}]$; instead, we may sample finitely many times $\hat y_k(\theta) := \nabla_\xi f(\xi_k,\theta) - \nabla_\xi f(\xi_{k-1},\theta)$. In Section 2, we discuss how to use Bayesian inference to find a matrix $\hat B_k$ given the noisy curvature pairs $\hat y_\ell$ and $s_\ell$.

One of the main problems faced by preconditioning in stochastic optimization is the amplification of the noise in gradient estimates. By matrix-norm compatibility, this amplification is bounded by $\|\hat B_k^{-1}\|$, i.e. it is inversely proportional to the smallest eigenvalue of the matrix $\hat B_k$:
(6) $\mathbb{E}[\|\hat B_k^{-1}\upsilon_k - \mathbb{E}[\hat B_k^{-1}\upsilon_k\,|\,\hat B_k^{-1}]\|^2\,|\,\hat B_k^{-1}] \le \|\hat B_k^{-1}\|^2\,\mathbb{E}[\|\upsilon_k - \mathbb{E}[\upsilon_k]\|^2\,|\,\hat B_k^{-1}]$.
Thus, to tackle Q2, we impose a lower bound constraint on the eigenvalues of $\hat B_k$, i.e. $\hat B_k \succeq \tilde\mu I$.

To address Q3 numerically, we investigate and compare stochastic gradients preconditioned with our inverse Hessian approximation, with and without control of the relative statistical error.

Regarding notation, in this work, we use bold-face lower-case letters for vectors and bold-face upper-case letters for matrices. Moreover, we refer to first-order derivatives of functions with respect to matrices as gradients, even though they are also matrices.

2. Approximating the Hessian and its inverse using Bayesian inference

Important to what follows is the Monte Carlo estimator of $y_k$,
(7) $\bar y_k := \frac{1}{b_k}\sum_{i=1}^{b_k} \hat y_k(\theta_i)$,
with $b_k$ iid random variables $\theta_i\sim\varrho$.

2.1. Bayesian formulation of the Hessian inference problem

For the pair of sets $Y:=\{\bar y_\ell\}_{\ell=1}^{k}$ and $S:=\{s_\ell\}_{\ell=1}^{k}$, we employ Bayes' formula to build a posterior distribution of the Hessian B given Y and S,
(8) $\pi(B\,|\,S,Y) \propto p(S,Y\,|\,B)\,\pi(B)$,
where the likelihood distribution $p(S,Y|B)$ and the prior distribution $\pi(B)$ will be discussed in Sections 2.1.1 and 2.1.2, respectively. We opt to use as a Hessian approximation the matrix that minimizes the negative log posterior subject to the eigenvalue constraints,
(9) $B_k^* := \arg\min_{B}\ \big(\mathcal{L}_k(B) := -\log(\pi(B\,|\,S,Y))\big)$ subject to $\lambda_{\min}(B)\ge\tilde\mu$, $\lambda_{\max}(B)\le L$, $B=B^\top$,
where $\lambda_{\min}$ and $\lambda_{\max}$ are operators that return the minimum and maximum eigenvalues, respectively; L is the smoothness constant of F; and $\tilde\mu>0$ is an arbitrary parameter. If F is $\mu$-convex, $\tilde\mu$ can be chosen as $\mu$; however, a larger $\tilde\mu$ might be desirable to avoid amplifying the noise in the search direction. Since computing $B_k^*$ is not possible in general, we use an approximation $\hat B_k$, obtained numerically, as a Hessian approximation.

2.1.1. Likelihood of observing curvature pairs given a Hessian

We use the secant equations to derive the likelihood of observing the curvature pairs (S,Y) given a Hessian B. Let the difference in gradients be estimated as in (7). From the observations $\hat y_k$, we build a sample covariance matrix $\Sigma_k$. Then, the covariance matrix of the estimator $\bar y_k$ is given by
(10) $\bar\Sigma_k := \mathrm{Cov}[\bar y_k] = \frac{\Sigma_k}{b_k}$.
We are interested in the inverse of $\bar\Sigma_k$. However, to avoid singularity issues, we use a precision matrix approximation $P_\ell$ instead of $\bar\Sigma_\ell^{-1}$ for each $\bar y_\ell\in Y$. We define $P_\ell$ as
(11) $P_\ell := \left(\bar\Sigma_\ell + \sigma_P\,\max(\bar\Sigma_\ell)\, I_{d_\xi}\right)^{-1}$,
where $\sigma_P$ is a regularization parameter and $I_{d_\xi}$ is the identity matrix of size $d_\xi$. Then, we define the negative log likelihood of the curvature pair $s_k$ and $\bar y_k$ given a matrix B as
(12) $-\log(p(s_k,\bar y_k\,|\,B)) = \frac{1}{2\nu}\|B s_k - \bar y_k\|^2_{P_k} + C_k$;
thus, for S and Y generated by a first-order stochastic gradient descent method,
(13) $-\log(p(S,Y\,|\,B)) = \frac{1}{2\nu}\sum_{\ell=1}^{k}\|B s_\ell - \bar y_\ell\|^2_{P_\ell} + C$.
In the above equations, $C=\sum_{\ell=1}^{k}C_\ell$ is a constant and $\nu$ is used to normalize the negative log likelihood,
(14) $\nu = \left\|\sum_{\ell=1}^{k} P_\ell\left(\hat B_{k-1} s_\ell - \bar y_\ell\right) s_\ell^\top\right\|$,
where $\hat B_{k-1}$ is the Hessian approximation at iteration k−1.
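As an illustration, a minimal NumPy transcription of (11)-(13) might read as follows; here $\max(\bar\Sigma_\ell)$ in (11) is read as the largest entry of $\bar\Sigma_\ell$ (our assumption), and the lists S, Y, and P_list hold the $s_\ell$, $\bar y_\ell$, and $P_\ell$.

```python
import numpy as np

def precision_matrix(Sigma_bar, sigma_P):
    """Regularized precision approximation P, eq. (11)."""
    d = Sigma_bar.shape[0]
    return np.linalg.inv(Sigma_bar + sigma_P * Sigma_bar.max() * np.eye(d))

def neg_log_likelihood(B, S, Y, P_list, nu):
    """Negative log likelihood, eq. (13), up to the additive constant C."""
    val = 0.0
    for s, y_bar, P in zip(S, Y, P_list):
        r = B @ s - y_bar          # residual of one secant equation
        val += r @ P @ r           # squared P-weighted norm, ||r||_P^2
    return 0.5 * val / nu
```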

2.1.2. Prior distribution of the Hessian matrix

The prior distribution must encode the knowledge available about the Hessian before the observations of S and Y. In the spirit of quasi-Newton methods, we want to discourage a Hessian that differs much from the current estimate. For this reason, we adopt a negative log prior of B as the squared weighted Frobenius norm of the difference between B and $\hat B_{k-1}$,
(15) $\frac{1}{2}\|B - \hat B_{k-1}\|^2_{F,W} := \frac{1}{2}\|W^{1/2}(B - \hat B_{k-1})W^{1/2}\|_F^2$,
for a nonsingular symmetric matrix of weights W. Another piece of information available about the matrix B is that its eigenvalues must lie between $\tilde\mu$ and L, as described in the constraints of (9). To enforce these eigenvalue constraints on B, we define the prior distribution from its negative logarithm as
(16) $-\log(\pi(B)) = \begin{cases} \frac{1}{2}\|B - \hat B_{k-1}\|^2_{F,W} + C & \text{if } \hat\mu I_{d_\xi} \preceq B \preceq \hat L I_{d_\xi},\\ +\infty & \text{otherwise,}\end{cases}$
where again C is a normalizing constant. Here, we opt to impose the eigenvalue constraints on the negative log prior as logarithmic barriers of the form
(17) $-\log(\det(B - \hat\mu I_{d_\xi})) - \log(\det(\hat L I_{d_\xi} - B))$,
where
(18) $\hat\mu := \tilde\mu/\alpha, \qquad \hat L := \alpha L$,

and $\alpha\ge 1$ is a relaxation parameter. Hence, we propose a family of priors with logarithmic barriers with hyperparameter β as
(19) $-\log(\pi(B;\beta)) = \frac{\rho}{2}\|B - \hat B_{k-1}\|^2_{F,W} - \beta\log(\det(B - \hat\mu I_{d_\xi})) - \beta\log(\det(\hat L I_{d_\xi} - B)) + C$,
being supported on the set of symmetric positive-definite matrices with eigenvalues between $\hat\mu$ and $\hat L$.

The parameter $\rho>0$ weights the Frobenius regularization term and $\beta>0$ weights the logarithmic barrier terms. Note that $-\log(\pi(B;\beta)) \to -\log(\pi(B))$ as $\beta\to 0$ with $\rho=1$.

Some options for W are $I_{d_\xi}$ and weight matrices for which, according to [Citation10], the Frobenius norm in (15) measures the relative error of B with respect to $\hat B_{k-1}$. In practice, estimating such a weight matrix can be expensive; in these cases, we recommend using $W = I_{d_\xi}$.
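The following sketch evaluates the negative log prior (19) under these conventions; the parameter names mirror the symbols above, and eigvalsh is used so that the barrier terms are evaluated only inside their domain.

```python
import numpy as np

def neg_log_prior(B, B_prev, W, rho, beta, mu_hat, L_hat):
    """Negative log prior, eq. (19), up to the additive constant C."""
    d = B.shape[0]
    Delta = B - B_prev
    # (rho/2) * ||Delta||_{F,W}^2 = (rho/2) * tr(Delta W Delta W), cf. eq. (15)
    frob = 0.5 * rho * np.sum(Delta * (W @ Delta @ W))
    ev_lo = np.linalg.eigvalsh(B - mu_hat * np.eye(d))
    ev_hi = np.linalg.eigvalsh(L_hat * np.eye(d) - B)
    if ev_lo.min() <= 0 or ev_hi.min() <= 0:
        return np.inf              # outside the support of the prior
    # log det of an SPD matrix equals the sum of the logs of its eigenvalues
    return frob - beta * (np.log(ev_lo).sum() + np.log(ev_hi).sum())
```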

2.1.3. Posterior distribution of the Hessian matrix

Substituting (13) and (19) in (8), we obtain the posterior distribution of the Hessian matrix B given the curvature pairs S and Y. The negative log posterior with parameter β is
(20) $\mathcal{L}_k(B;\beta) := \frac{1}{2\nu}\sum_{\ell=1}^{k}\|B s_\ell - \bar y_\ell\|^2_{P_\ell} + \frac{\rho}{2}\|B - \hat B_{k-1}\|^2_{F,W} - \beta\log(\det(B - \hat\mu I_{d_\xi})) - \beta\log(\det(\hat L I_{d_\xi} - B)) + C$.
Equation (20) can be interpreted as the sum of the secant equation residuals for each curvature pair (in the metrics induced by the $P_\ell$), plus the logarithmic barrier terms that impose the eigenvalue constraints and a Frobenius norm regularization term.

Since we are interested in minimizing the negative log posterior in (20), it is in our interest to know whether it is strongly convex. The negative log likelihood is convex, but not strictly convex until enough linearly independent curvature pairs $(s_\ell, y_\ell)$ are observed. Regarding the logarithmic barrier terms of the prior, the negative logarithm of the determinant defined on symmetric positive-definite matrices is a convex function [Citation5, Section 3.1.5], however not a strictly convex one. The Frobenius norm regularization term with $W=I_{d_\xi}$ is ρ-convex; therefore, the negative log posterior is strongly convex with parameter ρ. Here, the regularization term works as a Tikhonov regularization term (recall that $\|A\|_F = \|\mathrm{vec}(A)\|_2$), guaranteeing uniqueness of the solution. Properly tuning β and ρ guarantees that the problem of minimizing (20) is well-conditioned. A discussion on how to choose β is presented in Section 2.2.

2.2. Finding the posterior maximizer

Finding the $B_k^*$ that solves the problem in (9) is not a trivial task. Since the negative log posterior is ρ-convex, local search methods like Newton's method can be used to find its minimizer. However, Newton's method requires assembling the fourth-order tensor of second-order derivatives of the negative log posterior, which can be expensive in terms of processing and memory allocation. Instead, we opt to derive the action of this fourth-order tensor over any given matrix and use the conjugate gradient method to find the Newton direction, a procedure known as the Newton-CG method. Since we have a constraint in (9) that B should be symmetric, the first-order optimality condition is
(21) $\mathrm{sym}(\nabla_B \mathcal{L}_k(B;\beta)) = 0$,
where the $\mathrm{sym}(\cdot)$ operator returns the symmetric part of its argument, i.e. $\mathrm{sym}(A) = \frac{1}{2}(A + A^\top)$. Thus, to use the Newton-CG method to solve this problem, at every iteration we need to find the direction D that satisfies
(22) $\mathrm{sym}(\delta_D \nabla_B \mathcal{L}_k(B;\beta)) = -\mathrm{sym}(\nabla_B \mathcal{L}_k(B;\beta))$,
where $\delta_D$ is the directional derivative with respect to B in the direction of D,
(23) $\delta_D g(B) = \lim_{\vartheta\to 0}\frac{g(B + \vartheta D) - g(B)}{\vartheta}$.
To avoid ill-conditioning, we need a large enough regularization parameter ρ, but not too large, since it introduces bias in the posterior maximizer. Also, the bias given by the log-barriers grows with β, the logarithmic barrier parameter, but a small β negatively impacts the smoothness of the posterior distribution. We opt to keep ρ fixed and use an interior-point central-path method [Citation5, Section 11.2.2] to decrease the parameter β; the solution $\hat B_k(\beta)$ of the approximated problem
(24) $\hat B_k(\beta) = \arg\min_B \mathcal{L}_k(B;\beta)$
converges to the solution $B_k^*$ of (9) as $\beta\to 0$. Thus, we start with a large β and a large tolerance tol and perform optimization until the norm of the gradient of the log posterior is below tol. Then, we decrease both β and tol by a factor γ and repeat the process. This procedure is repeated until the desired tolerance is achieved with the desired β. To minimize the negative log posterior for a given β, we use the Newton-CG method with backtracking-Armijo line search; we use the conjugate gradient method to find the Newton direction and then backtracking-Armijo to find a step size. Therefore, to use Newton-CG, we need the gradient of $\mathcal{L}_k$ with respect to B and the directional derivative of this gradient in the direction of any arbitrary symmetric matrix $V\in\mathbb{R}^{d_\xi\times d_\xi}$. The pseudo-code for the minimization of the negative log posterior is presented in Algorithm 1. We use the central-path method to decrease β and tol at every outer-loop step, indexed as i in Algorithm 1. In lines 6 to 10, we have the Newton-CG method, where, at each iteration, the conjugate gradient method is used to find the approximate Newton direction D and a backtracking-Armijo line search is used to find the step size.
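Algorithm 1 is given in the paper; as a complement, the sketch below shows one way the central-path Newton-CG loop just described can be organized. It assumes callables grad_L(B, beta) evaluating (33) and hvp(B, beta, V) evaluating (43), plus a user-supplied backtracking-Armijo routine; the default parameter values are illustrative only.

```python
import numpy as np

def cg_solve(apply_A, Rhs, tol=1e-10, max_iter=200):
    """Matrix-free conjugate gradient in the Frobenius inner product:
    solves apply_A(X) = Rhs for a symmetric matrix X."""
    X = np.zeros_like(Rhs)
    R = Rhs - apply_A(X)
    P = R.copy()
    rs = np.sum(R * R)
    for _ in range(max_iter):
        AP = apply_A(P)
        alpha = rs / np.sum(P * AP)
        X += alpha * P
        R -= alpha * AP
        rs_new = np.sum(R * R)
        if np.sqrt(rs_new) < tol:
            break
        P = R + (rs_new / rs) * P
        rs = rs_new
    return X

def central_path_newton_cg(B0, grad_L, hvp, armijo,
                           beta0=1e-2, tol0=1e-4, gamma=2.0, n_steps=6):
    """Central-path Newton-CG of Section 2.2: minimize L_k(B; beta) while
    decreasing beta and the gradient tolerance by a factor gamma per step."""
    B, beta, tol = B0.copy(), beta0, tol0
    for i in range(n_steps):                 # central-path (outer) loop
        G = grad_L(B, beta)
        while np.linalg.norm(G) > tol:       # Newton-CG (inner) loop
            # Newton direction D from the linear system (22)
            D = cg_solve(lambda V: hvp(B, beta, V), -G)
            eta = armijo(B, D, beta)         # backtracking-Armijo step size
            B = B + eta * D
            G = grad_L(B, beta)
        beta, tol = beta / gamma, tol / gamma
    return B
```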

Next, we present how to compute $\nabla_B\mathcal{L}_k(\hat B_k;\beta)$ and $\delta_V\nabla_B\mathcal{L}_k(\hat B_k;\beta)$ for any symmetric $V\in\mathbb{R}^{d_\xi\times d_\xi}$. Note that, since B is a matrix, $\nabla_B\mathcal{L}_k$ and $\delta_V\nabla_B\mathcal{L}_k$ are also symmetric matrices of the same dimensions as B.

2.3. Gradient of the negative log posterior

To compute $\nabla_B\mathcal{L}_k$, we need the gradients of the negative log likelihood and of the negative log prior,
(25) $\nabla_B\mathcal{L}_k(B;\beta) = -\nabla_B\log(p(S,Y\,|\,B)) - \nabla_B\log(\pi(B;\beta))$.
The gradient of the negative log likelihood can be calculated from (13) as
(26) $-\nabla_B\log(p(S,Y\,|\,B)) = \frac{1}{2\nu}\sum_{\ell=1}^{k}\nabla_B\|B s_\ell - \bar y_\ell\|^2_{P_\ell}$
(27) $= \frac{1}{\nu}\sum_{\ell=1}^{k} P_\ell (B s_\ell - \bar y_\ell)\, s_\ell^\top$.
The gradient of the negative log prior can be derived from (19) as
(28) $-\nabla_B\log(\pi(B;\beta)) = \frac{\rho}{2}\nabla_B\|W^{1/2}(B - \hat B_{k-1})W^{1/2}\|_F^2 - \beta\nabla_B\log(\det(B - \hat\mu I_{d_\xi})) - \beta\nabla_B\log(\det(\hat L I_{d_\xi} - B))$.
The gradient of the Frobenius norm regularizer is
(29) $\frac{1}{2}\nabla_B\|W^{1/2}(B - \hat B_{k-1})W^{1/2}\|_F^2 = W(B - \hat B_{k-1})W$,
whereas the gradients of the logarithmic barrier terms are derived using the Jacobi formula,
(30) $\nabla_B\log(\det(B - \hat\mu I_{d_\xi})) = (B - \hat\mu I_{d_\xi})^{-1}$,
(31) $\nabla_B\log(\det(\hat L I_{d_\xi} - B)) = -(\hat L I_{d_\xi} - B)^{-1}$.
Substituting (29), (30), and (31) in (28),
(32) $-\nabla_B\log(\pi(B;\beta)) = \rho W(B - \hat B_{k-1})W - \beta(B - \hat\mu I_{d_\xi})^{-1} + \beta(\hat L I_{d_\xi} - B)^{-1}$.
Finally, substituting the gradient of the negative log likelihood (27) and the gradient of the negative log prior (32) in (25), we obtain the gradient of the negative log posterior,
(33) $\nabla_B\mathcal{L}_k(B;\beta) = \frac{1}{\nu}\sum_{\ell=1}^{k} P_\ell(B s_\ell - \bar y_\ell)s_\ell^\top + \rho W(B - \hat B_{k-1})W - \beta(B - \hat\mu I_{d_\xi})^{-1} + \beta(\hat L I_{d_\xi} - B)^{-1}$.
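In code, (33) maps directly to a few matrix products; the following sketch assumes the lists S, Y, and P_list and the scalars nu, rho, beta, mu_hat, and L_hat defined in the preceding sections.

```python
import numpy as np

def grad_neg_log_posterior(B, S, Y, P_list, nu, B_prev, W, rho, beta,
                           mu_hat, L_hat):
    """Gradient of the negative log posterior with respect to B, eq. (33)."""
    d = B.shape[0]
    G = np.zeros_like(B)
    for s, y_bar, P in zip(S, Y, P_list):
        G += np.outer(P @ (B @ s - y_bar), s)        # likelihood term, eq. (27)
    G /= nu
    G += rho * W @ (B - B_prev) @ W                  # Frobenius term, eq. (29)
    G -= beta * np.linalg.inv(B - mu_hat * np.eye(d))   # barrier, eq. (30)
    G += beta * np.linalg.inv(L_hat * np.eye(d) - B)    # barrier, eq. (31)
    return 0.5 * (G + G.T)   # keep only the symmetric part, cf. eq. (21)
```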

2.4. Directional derivative of the gradient of the negative log posterior

As discussed in Section 2.2, to find the maximizer of the posterior distribution using Newton-CG, we need to compute the directional derivative of the gradient of the negative log posterior in a given direction. Thus, for any symmetric matrix $V\in\mathbb{R}^{d_\xi\times d_\xi}$, we derive $\delta_V\nabla_B\mathcal{L}_k$ as
(34) $\delta_V\nabla_B\mathcal{L}_k(B;\beta) = -\delta_V\nabla_B\log(p(S,Y\,|\,B)) - \delta_V\nabla_B\log(\pi(B;\beta))$.
Following from (27),
(35) $-\delta_V\nabla_B\log(p(S,Y\,|\,B)) = \lim_{\vartheta\to 0}\frac{-\nabla_B\log p(S,Y\,|\,B + \vartheta V) + \nabla_B\log(p(S,Y\,|\,B))}{\vartheta}$
(36) $= \frac{1}{\nu}\lim_{\vartheta\to 0}\frac{\sum_{\ell=1}^{k} P_\ell((B + \vartheta V)s_\ell - \bar y_\ell)s_\ell^\top - \sum_{\ell=1}^{k} P_\ell(B s_\ell - \bar y_\ell)s_\ell^\top}{\vartheta}$
(37) $= \frac{1}{\nu}\sum_{\ell=1}^{k} P_\ell V s_\ell s_\ell^\top$.
The directional derivative of the gradient of the negative log prior is derived from (32),
(38) $-\delta_V\nabla_B\log(\pi(B;\beta)) = \rho\,\delta_V(W(B - \hat B_{k-1})W) - \beta\,\delta_V(B - \hat\mu I_{d_\xi})^{-1} + \beta\,\delta_V(\hat L I_{d_\xi} - B)^{-1}$.
The Frobenius regularizer term from (38) is derived as
(39) $\delta_V(W(B - \hat B_{k-1})W) = \lim_{\vartheta\to 0}\frac{W(B + \vartheta V - \hat B_{k-1})W - W(B - \hat B_{k-1})W}{\vartheta} = WVW$.

Following the same rationale for the log-barrier terms,
(40) $\delta_V(B - \hat\mu I_{d_\xi})^{-1} = \lim_{\vartheta\to 0}\frac{((B + \vartheta V) - \hat\mu I_{d_\xi})^{-1} - (B - \hat\mu I_{d_\xi})^{-1}}{\vartheta}$
$= \lim_{\vartheta\to 0}\frac{\left((B - \hat\mu I_{d_\xi})(I_{d_\xi} + \vartheta(B - \hat\mu I_{d_\xi})^{-1}V)\right)^{-1} - (B - \hat\mu I_{d_\xi})^{-1}}{\vartheta}$
$= \lim_{\vartheta\to 0}\frac{(I_{d_\xi} + \vartheta(B - \hat\mu I_{d_\xi})^{-1}V)^{-1}(B - \hat\mu I_{d_\xi})^{-1} - (B - \hat\mu I_{d_\xi})^{-1}}{\vartheta}$
$= \lim_{\vartheta\to 0}\frac{(I_{d_\xi} - \vartheta(B - \hat\mu I_{d_\xi})^{-1}V + O(\vartheta^2))(B - \hat\mu I_{d_\xi})^{-1} - (B - \hat\mu I_{d_\xi})^{-1}}{\vartheta}$
$= -(B - \hat\mu I_{d_\xi})^{-1} V (B - \hat\mu I_{d_\xi})^{-1}$,
and, analogously,
(41) $\delta_V(\hat L I_{d_\xi} - B)^{-1} = (\hat L I_{d_\xi} - B)^{-1} V (\hat L I_{d_\xi} - B)^{-1}$.
Substituting (39), (40), and (41) back into (38), we get the directional derivative of the gradient of the negative log prior,
(42) $-\delta_V\nabla_B\log(\pi(B;\beta)) = \rho WVW + \beta(B - \hat\mu I_{d_\xi})^{-1}V(B - \hat\mu I_{d_\xi})^{-1} + \beta(\hat L I_{d_\xi} - B)^{-1}V(\hat L I_{d_\xi} - B)^{-1}$.
Finally, substituting (37) and (42) into (34), we obtain the directional derivative of the gradient of the negative log posterior,
(43) $\delta_V\nabla_B\mathcal{L}_k(B;\beta) = \frac{1}{\nu}\sum_{\ell=1}^{k} P_\ell V s_\ell s_\ell^\top + \rho WVW + \beta(B - \hat\mu I_{d_\xi})^{-1}V(B - \hat\mu I_{d_\xi})^{-1} + \beta(\hat L I_{d_\xi} - B)^{-1}V(\hat L I_{d_\xi} - B)^{-1}$.
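The action (43) is exactly what the conjugate gradient routine needs; a sketch, with the same assumed names as before:

```python
import numpy as np

def hvp_neg_log_posterior(B, V, S, P_list, nu, W, rho, beta, mu_hat, L_hat):
    """Action of the second derivative of L_k on a symmetric V, eq. (43)."""
    d = B.shape[0]
    A = np.zeros_like(B)
    for s, P in zip(S, P_list):
        A += np.outer(P @ (V @ s), s)      # likelihood term, eq. (37)
    A /= nu
    A += rho * W @ V @ W                   # Frobenius term, eq. (39)
    K_mu = np.linalg.inv(B - mu_hat * np.eye(d))
    K_L = np.linalg.inv(L_hat * np.eye(d) - B)
    A += beta * K_mu @ V @ K_mu            # barrier term from eq. (40)
    A += beta * K_L @ V @ K_L              # barrier term from eq. (41)
    return 0.5 * (A + A.T)   # keep only the symmetric part, cf. eq. (22)
```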

2.5. Hessian approximation inverse

To use the Hessian approximation $B_k$ in a Newton-like update such as (4), one needs either to solve the linear system $B_k d_k = -\nabla_\xi F(\xi_k)$ or to find the inverse of the Hessian approximation, $H_k = B_k^{-1}$, so that $d_k = -H_k\nabla_\xi F(\xi_k)$.

To avoid inverting $\hat B_k$ directly, we propose to use the Newton–Raphson method to find an approximation $\hat H_k$ such that $\|\hat B_k\hat H_k - I_{d_\xi}\|_F \le \mathrm{tol}_H$; we can start the Newton–Raphson iteration from the previous inverse approximation, thus saving significant computational time. The Newton–Raphson method for this problem is given by
(44) $\hat H_k^{i+1} = \hat H_k^i - (R'(\hat H_k^i))^{-1} R(\hat H_k^i)$,
where $R(H) = \hat B_k H - I_{d_\xi}$ and $R'(H) = \hat B_k$; approximating $(R'(\hat H_k^i))^{-1} = \hat B_k^{-1}$ by the current iterate $\hat H_k^i$ yields
(45) $\hat H_k^{i+1} = (2 I_{d_\xi} - \hat H_k^i \hat B_k)\hat H_k^i$.
The recursive procedure is repeated until $\|R(\hat H_k^i)\|_F < \mathrm{tol}_H$. Then, we set $\hat H_k := \hat H_k^i$ as an approximation of $H_k$ in optimization. The convergence of the Newton–Raphson method depends on the initial guess of the inverse, $\hat H_k^0$; the method converges if the spectral norm of $R(\hat H_k^0)$ is less than one [Citation23]. We start from the inverse estimate at the previous iteration, $\hat H_k^0 = \hat H_{k-1}$. If the method does not converge, that is, if the residue increases, we reset $\hat H_k^i = 2/(\hat L + \hat\mu)\, I_{d_\xi}$. This starting point is similar to the one suggested in [Citation23] for Hermitian positive definite $\hat B_k$, $\hat H_k^0 = 2/(\lambda_{\max} + \lambda_{\min})\, I_{d_\xi}$, where $\lambda_{\max}$ and $\lambda_{\min}$ are, respectively, the largest and smallest eigenvalues of $\hat B_k$.
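A sketch of this inverse iteration, including the reset just described:

```python
import numpy as np

def approximate_inverse(B, H0, mu_hat, L_hat, tol_H=1e-8, max_iter=100):
    """Newton-Raphson iteration (45) for H ~ B^{-1}, warm-started at H0
    and reset to 2/(L_hat + mu_hat) * I if the residue increases."""
    d = B.shape[0]
    I = np.eye(d)
    H = H0.copy()
    res = np.linalg.norm(B @ H - I, 'fro')
    for _ in range(max_iter):
        if res < tol_H:
            break
        H_new = (2.0 * I - H @ B) @ H      # iteration (45)
        res_new = np.linalg.norm(B @ H_new - I, 'fro')
        if res_new > res:                  # diverging: reset the iterate
            H_new = (2.0 / (L_hat + mu_hat)) * I
            res_new = np.linalg.norm(B @ H_new - I, 'fro')
        H, res = H_new, res_new
    return H
```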

3. Preconditioned stochastic gradient descent methods

Instead of the common update of SGD as presented in (2), we investigate the case where the stochastic gradient estimate is preconditioned with our inverse Hessian approximation,
(46) $\xi_{k+1} = \xi_k - \eta_k \hat H_k \upsilon_k$.
If the function to be minimized is strongly convex, SGD with a fixed step size converges linearly to a neighbourhood of the optimum [Citation14].

Here, we expect the preconditioned SGD to converge to a neighbourhood of the optimum faster than vanilla SGD. One possible side effect of preconditioning the stochastic gradient estimate is that the noise in the gradient might be amplified [Citation18]. Therefore, it is fundamental to control the error in the gradient estimates using variance control techniques, e.g. MICE [Citation8], SVRG [Citation16], SARAH [Citation22]. Note that the variance control methods mentioned before are specializations of (2), differing only in the way they compute $\upsilon_k$. Moreover, acknowledging the possibly prohibitive cost of updating the Hessian estimate at every iteration, we update the Hessian only after a given number of iterations or gradient evaluations.

Some variance control methods are based on computing differences between gradients at known points and are thus well-suited to be combined with our approach. The MICE algorithm [Citation8] builds a hierarchy of iterates, then progressively computes gradient differences between subsequent iterates in this hierarchy to control the gradient estimator error. Since both the differences between the iterates and between the gradients are available, curvature pairs $s_\ell$ and $\bar y_\ell$ can be obtained without any extra cost in terms of memory or gradient evaluations. If we refer to the algorithm in (2) using MICE to obtain $\upsilon_k$ as SGD-MICE, preconditioning the gradient estimates by our inverse Hessian approximation as in (46) furnishes the SGD-MICE-Bay method. For the finite sum case, other control variate methods are available, e.g. SVRG [Citation16], SARAH [Citation22], SAGA [Citation9], SAG [Citation25]. However, SVRG and SARAH have an advantage over SAGA and SAG here in that curvature pairs can be obtained directly from the algorithms without extra memory or processing costs, as sketched below. We refer to SVRG and SARAH preconditioned by our inverse Hessian approximation as SVRG-Bay and SARAH-Bay, respectively.
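As an illustration of how such pairs come for free, the sketch below instruments a plain SVRG loop (finite-sum setting) to record the gradient differences it already computes; pairing each iterate against the snapshot point is a choice made for this sketch, not necessarily the pairing used in the experiments.

```python
import numpy as np

def svrg_with_curvature_pairs(grad_i, xi0, N, eta, n_epochs, m, seed=0):
    """SVRG sketch that records curvature pairs (s, y_hat) as a by-product.

    grad_i(xi, i): gradient of the i-th summand at xi; N: number of summands.
    """
    rng = np.random.default_rng(seed)
    xi_snap = np.asarray(xi0, dtype=float).copy()
    pairs = []
    for _ in range(n_epochs):
        g_full = np.mean([grad_i(xi_snap, i) for i in range(N)], axis=0)
        xi = xi_snap.copy()
        for _ in range(m):
            i = rng.integers(N)
            g_cur = grad_i(xi, i)
            g_old = grad_i(xi_snap, i)
            upsilon = g_cur - g_old + g_full   # SVRG gradient estimate
            s = xi - xi_snap
            if np.linalg.norm(s) > 0:          # skip the degenerate first step
                # g_cur - g_old over the step s is a noisy curvature observation
                pairs.append((s, g_cur - g_old))
            xi -= eta * upsilon
        xi_snap = xi                           # restart from the last iterate
    return xi_snap, pairs
```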

3.1. Convergence of preconditioned stochastic gradient descent

Assume that the relative statistical error of the gradient estimator is uniformly bounded above by a constant ϵ and that an arbitrary lower bound $\tilde\mu$ is imposed on the eigenvalues of the Hessian approximation, i.e. $\tilde\mu = \varepsilon\mu$ with $\varepsilon \ge 1$. On the one hand, a large value of ε decreases the noise amplification, as previously discussed, and at the same time improves the stability of the preconditioned stochastic gradient descent. On the other hand, if μ is indeed the smallest eigenvalue of the true Hessian matrix of F, imposing the constraint that the Hessian approximation must have smallest eigenvalue $\tilde\mu \ge \mu$ causes the Hessian approximation to differ from the true Hessian. In our analysis, we seek to quantify the effect of the choice of the parameters ε and ϵ and of the step size η.

Assumption 3.1

Regularity and convexity of the objective function

The objective function F has Lipschitz continuous second-order derivatives and is μ-convex. Namely, for all $x,y\in\Xi$,
(47) $\|\nabla^2 F(x) - \nabla^2 F(y)\| \le M\|x - y\|$,
(48) $F(y) \ge F(x) + \nabla F(x)\cdot(y - x) + \frac{\mu}{2}\|y - x\|^2$.

Assumption 3.2

Unbiasedness and bounded statistical error of gradient estimator

The gradient estimator $\upsilon_k$ is an unbiased estimator of $\nabla F(\xi_k)$ with an upper bound ϵ on its relative statistical error,
(49) $\mathbb{E}[\upsilon_k\,|\,\xi_k] = \nabla F(\xi_k)$,
(50) $\mathbb{E}[\|\upsilon_k - \mathbb{E}[\upsilon_k\,|\,\xi_k]\|^2\,|\,\xi_k] \le \epsilon^2\|\nabla F(\xi_k)\|^2$.

Assumption 3.3

Bounded error of Hessian approximation

The approximation $\hat B_k$ of $\nabla^2 F(\xi_k)$ has a lower bound $\tilde\mu := \varepsilon\mu$ on its eigenvalues with $\varepsilon \ge 1$, and the product of the true Hessian with the inverse approximation $\hat H_k$ is the identity matrix plus an error whose norm is bounded above by a linear function of ε,
(51) $\hat B_k \succeq \tilde\mu I_{d_\xi}, \quad \tilde\mu := \varepsilon\mu, \quad \varepsilon \ge 1$,
(52) $\nabla^2 F(\xi_k)\hat H_k = I + E_k$,
(53) $\|E_k\| \le C_\varepsilon(\varepsilon - 1)$,
where $C_\varepsilon > 0$ is a constant.

Theorem 3.1

Convergence of preconditioned stochastic gradient descent

If Assumptions 3.1, 3.2, and 3.3 are satisfied and
(54) $\|\nabla F(\xi_k)\| < \frac{2\varepsilon^2\mu^2\left(1 - \epsilon - C_\varepsilon(1+\epsilon)(\varepsilon - 1) - r/\eta\right)}{M\eta(1+\epsilon^2)}$
holds for all k>0, then the gradients of the iterates generated by the preconditioned stochastic gradient descent described in (46) converge linearly with rate 1−r,
(55) $\mathbb{E}[\|\nabla F(\xi_{k+1})\|] \le (1-r)^{k+1}\|\nabla F(\xi_0)\|$.

Proof.

From Assumption 3.1, we obtain an upper bound for the norm of the gradient at $\xi_{k+1}$,
(56) $\|\nabla F(\xi_{k+1})\| \le \|\nabla F(\xi_k) + \nabla^2 F(\xi_k)(\xi_{k+1} - \xi_k)\| + \frac{M}{2}\|\xi_{k+1} - \xi_k\|^2$.
Using the preconditioned SGD update in (46), one can further develop the inequality above as
(57) $\|\nabla F(\xi_{k+1})\| \le \|\nabla F(\xi_k) - \eta\nabla^2 F(\xi_k)\hat H_k\upsilon_k\| + \frac{M\eta^2}{2}\|\hat H_k\upsilon_k\|^2 \le \|\nabla F(\xi_k) - \eta\nabla^2 F(\xi_k)\hat H_k\nabla F(\xi_k)\| + \eta\|\nabla^2 F(\xi_k)\hat H_k(\upsilon_k - \nabla F(\xi_k))\| + \frac{M\eta^2}{2}\|\hat H_k\upsilon_k\|^2$
(58) $\le \|(I_{d_\xi} - \eta\nabla^2 F(\xi_k)\hat H_k)\nabla F(\xi_k)\| + \eta\|\nabla^2 F(\xi_k)\hat H_k\|\,\|\upsilon_k - \nabla F(\xi_k)\| + \frac{M\eta^2}{2}\|\hat H_k\|^2\|\upsilon_k\|^2$.
Taking the expectation conditioned on $\xi_k$ and using Assumption 3.2,
(59) $\mathbb{E}[\|\nabla F(\xi_{k+1})\|\,|\,\xi_k] \le \|(I_{d_\xi} - \eta\nabla^2 F(\xi_k)\hat H_k)\nabla F(\xi_k)\| + \eta\epsilon\|\nabla^2 F(\xi_k)\hat H_k\|\,\|\nabla F(\xi_k)\| + \frac{M\eta^2}{2}(1+\epsilon^2)\|\hat H_k\|^2\|\nabla F(\xi_k)\|^2$.
Using Assumption 3.3,
(60) $\mathbb{E}[\|\nabla F(\xi_{k+1})\|\,|\,\xi_k] \le \left(\|I_{d_\xi} - \eta(I_{d_\xi} + E_k)\| + \eta\epsilon\|I_{d_\xi} + E_k\|\right)\|\nabla F(\xi_k)\| + \frac{M\eta^2}{2\tilde\mu^2}(1+\epsilon^2)\|\nabla F(\xi_k)\|^2 \le \left((1-\eta) + \eta\|E_k\| + \eta\epsilon(1 + \|E_k\|)\right)\|\nabla F(\xi_k)\| + \frac{M\eta^2}{2\tilde\mu^2}(1+\epsilon^2)\|\nabla F(\xi_k)\|^2 \le \left(1 - \eta(1-\epsilon) + \eta(1+\epsilon)C_\varepsilon(\varepsilon-1)\right)\|\nabla F(\xi_k)\| + \frac{M\eta^2}{2\varepsilon^2\mu^2}(1+\epsilon^2)\|\nabla F(\xi_k)\|^2$.
Letting the right-hand side of the above inequality be strictly less than $(1-r)\|\nabla F(\xi_k)\|$ results in (54). Taking expectations and unrolling the recursion complete the proof.

Theorem 3.1 shows that increasing the parameter ε, which controls the smallest eigenvalue of the Hessian approximation, improves the stability of the preconditioned stochastic gradient descent. Moreover, controlling ε and the step size η to satisfy (54) for a given r guarantees global convergence. By letting ϵ=0, thus resulting in deterministic gradient observations, ε=1, meaning that $\tilde\mu = \mu$, and a unitary step size η=1, one recovers the quadratic convergence of Newton's method under the condition that $\|\nabla F(\xi_k)\| < 2\mu^2/M$ (see (60)).

Since our analysis depends on quantities that are hard to characterize in practice, we do not use it to control ε and η. Instead, we start the optimization with $\varepsilon = (L+\mu)/(2\mu)$ and control the parameter ρ of the Frobenius regularization of the prior; the regularization term penalizes large changes in the Hessian approximation. Thus, at every Hessian approximation update, ε progressively decays. Figure 1 illustrates this behaviour for the example in Section 4.1.

Figure 1. Value of $\tilde\mu$ versus iteration for SGD-MICE preconditioned with our Hessian inverse approximation, applied to the minimization of the quadratic problem in the example in Section 4.1.

3.2. Time complexity

Given that the interval between Hessian updates is set to be $O(d_\xi)$ iterations, the extra cost per iteration of finding the Hessian approximation using our Bayesian approach is $O(d_\xi^2)$. The numbers of Newton iterations and conjugate gradient iterations do not depend on the problem's dimensionality but on the tolerances set for each algorithm. The cost of each conjugate gradient iteration is the cost of evaluating $\delta_V\nabla_B\mathcal{L}_k(B;\beta)$ in (43), which is $O(d_\xi^3)$ due to matrix multiplications. If, instead of using the conjugate gradient method to compute the Newton direction, one builds the fourth-order tensor of second-order derivatives of the log posterior with respect to the Hessian, the time complexity increases to $O(d_\xi^4)$.

In each iteration, we have to compute $P_k$, the precision matrix for each curvature pair. If a full matrix is computed, we have a complexity of $O(d_\xi^2)$ per iteration. If, however, we opt to ignore the off-diagonal terms, this complexity is improved to $O(d_\xi)$. We can further decrease the memory overhead of our method by storing only a scalar, the trace of the covariance matrix, for each curvature pair. The reasoning behind this approach is that the trace of the covariance matrix works as a weight for the curvature pairs: curvature pairs whose gradient difference is better estimated receive a larger weight when computing the likelihood. When our Bayesian Hessian approximation is used with the MICE gradient estimator, the trace of the covariance matrix is available without any extra cost. In numerical tests, we observed that keeping the trace of the covariance matrix offered the best balance between accuracy and computational time.

We do not discuss here the cost of computing $\upsilon_k$ satisfying condition (50), as it depends on the method used to estimate it. For example, if $\upsilon_k$ is a Monte Carlo estimator with adaptive batch sizes, the cost of evaluating $\upsilon_k$ is $O(\epsilon^{-2}\|\nabla F(\xi_k)\|^{-2})$. If the conditions of Theorem 3.1 are satisfied, the gradient sampling cost is $O(\epsilon^{-2}(1-r)^{-2k})$.

4. Numerical examples

In this section, we solve some numerical examples with and without our Bayesian approach to validate its performance. In all cases, we present the convergence of the optimality gap, $F(\xi_k) - F(\xi^*)$, and in some cases, the convergence of the central-path Newton-CG used to find the Hessian approximations. The parameters of the Bayesian approach are kept fixed for all the numerical examples, namely: the logarithmic barrier parameter, $\beta = 10^{-2}$; the Frobenius norm regularizer parameter, $\rho = 10^{-2}$; the tolerance on the norm of the gradient of the negative log posterior, $\mathrm{tol} = 10^{-6}$; the constraint on the lower bound of the eigenvalues of B, $\tilde\mu$, set as the strong-convexity parameter μ; the relaxation parameter, α=1.05; the number of central-path steps, six; and the factor of decrease of both $\mathrm{tol}_i$ and $\beta_i$ at each ith central-path step, γ=2. Moreover, to keep our Bayesian approach competitive in terms of computational time and memory allocation, we do not compute the full matrix $P_\ell$ for each curvature pair, using the inverse of the trace of the covariance matrix instead. Note that the trace of the covariance matrix of the difference between gradients is available from MICE, since it is used to control the relative error of the gradient [Citation8].

The results presented in this section were obtained using a Python implementation of the proposed method, whose source code can be found on GitHub (Footnote 1). Moreover, the source code of the MICE gradient estimator, also used in this section, can be found on Bitbucket (Footnote 2). A repository with all the code necessary to reproduce the results presented in this section is available on GitHub (Footnote 3).

4.1. Quadratic function

The first numerical test is a quadratic function with noise added to its gradient. We want to minimize $\mathbb{E}_\theta[f(\xi,\theta)\,|\,\xi]$ with
(61) $f(\xi,\theta) = \frac{1}{2}\xi^\top A\xi - b^\top\xi + \xi^\top\Sigma^{1/2}\theta$,
where A is a square symmetric positive-definite matrix with $1 \preceq A \preceq \kappa$, $\Sigma$ is a symmetric and positive definite covariance matrix, θ is a vector of independent standard normal distributed random variables, and the vector b is defined as $b = \mathbf{1}_{d_\xi}$, where $\mathbf{1}_{d_\xi}$ is a vector of ones of dimension $d_\xi$. The eigenvalues of A are sampled uniformly between 1 and κ, with necessarily one eigenvalue being 1 and another being κ. The gradient of f with respect to ξ is
(62) $\nabla f(\xi,\theta) = A\xi - b + \Sigma^{1/2}\theta$;
thus, the covariance matrix of $\nabla f$ is Σ.
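A sketch of how such a test instance can be generated follows; a diagonal $\Sigma^{1/2}$ is assumed here for simplicity (the problem statement only requires Σ to be symmetric positive definite).

```python
import numpy as np

def make_quadratic_problem(d, kappa, seed=0):
    """Random instance of the quadratic test problem (61)-(62)."""
    rng = np.random.default_rng(seed)
    eigs = rng.uniform(1.0, kappa, size=d)
    eigs[0], eigs[-1] = 1.0, kappa            # enforce the extreme eigenvalues
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    A = Q @ np.diag(eigs) @ Q.T               # SPD with cond(A) = kappa
    b = np.ones(d)
    sqrt_Sigma = np.eye(d)                    # assumed noise covariance factor
    def grad_f(xi, theta):
        return A @ xi - b + sqrt_Sigma @ theta    # eq. (62)
    return A, b, grad_f
```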

We solve this problem for two different values of the condition number κ, namely $10^3$ and $10^6$, and for a fixed number of dimensions $d_\xi = 10$. We compare the performances of different methods in the solution of this problem: SGD, SGD-MICE, SGD-Bay with decreasing step size, SGD-Bay with fixed step size, and SGD-MICE-Bay. In all cases, we run the optimization until $10^7$ gradients are evaluated. For SGD-Bay, since the number of gradient evaluations per iteration is fixed, we know the number of iterations a priori and thus choose the interval between Hessian updates to give 15 equally distributed updates. For SGD-MICE-Bay, we opt to have an update every $d_\xi$ iterations. The step sizes and batch sizes used for each method are presented in Table 1, where, for the cases when MICE is used, we show the tolerance ϵ on the relative statistical error of the gradient.

Table 1. Step sizes and batch sizes used for Example 4.1.

In Figure 2, we present the optimality gap versus iteration for both $\kappa=10^3$ (left) and $\kappa=10^6$ (right) for the previously mentioned methods. In the cases where our Bayesian approach is used, the Hessian updates are illustrated as coloured squares with black edges. For the case where κ=1000, SGD-MICE-Bay achieved, in 39 iterations, a lower optimality gap than SGD in $10^7$ iterations. Moreover, SGD presents oscillations around the optimum, making it difficult to obtain a reliable optimum estimate. In the case of SGD-Bay, with and without the decreasing step size, we observe oscillations of larger amplitude due to the preconditioning with our Hessian approach. This behaviour is due to the lack of variance control in SGD; not even the decreasing step size is enough to avoid the oscillations. In the case of an even larger condition number, $\kappa=10^6$, SGD-MICE-Bay far outperforms the other methods while requiring 178 iterations. Both SGD and SGD-MICE, being first-order methods, cannot improve the optimality gap significantly once close to the optimum. As for SGD-Bay, the same oscillating behaviour is observed for both the fixed and decreasing step cases. From these results, it is clear that, in large condition number cases, using second-order information is necessary for efficient convergence. Moreover, we observe that controlling the variance of the gradient estimator is of central importance to maintaining convergence in the stochastic setting, avoiding the amplification of noise due to the preconditioning.

Figure 2. Convergence of the optimality gap versus iteration with κ=1000 (left) and $\kappa=10^6$ (right) for Example 4.1. The Hessian updates are marked with coloured squares with black edges. In both cases, SGD-MICE-Bay reduced the optimality gap more than the other methods and in a much smaller number of iterations. Moreover, SGD-MICE-Bay did not oscillate around the optimum like both SGD-Bay cases.

One crucial aspect of our method is the cost of updating the Hessian approximation, i.e. the cost of the Newton-CG method in this setting. For SGD-MICE-Bay in the κ=1000 case, we present in Figure 3 the convergence of the gradient norm $\|\nabla_B\mathcal{L}\|$ per Newton iteration. The black numbers on top of each update denote the number of conjugate gradient steps needed to find the Newton direction. Each colour represents a different central-path step, i.e. at each colour change, both β and tol are decreased by a factor γ=2, with each tol denoted by a dash-dotted line. For this problem, the runtime of each Hessian update is a fraction of a second. Given that $d_\xi=10$, the Hessian matrix has 100 components. Still, in the worst case, 51 conjugate gradient iterations were needed to find the Newton direction, showing that the problem of finding $\hat B$ is well-conditioned. Moreover, the total number of Newton iterations in the three cases did not exceed 25.

Figure 3. For each of the three Hessian updates of SGD-MICE-Bay with κ=1000 in Example 4.1, we present the convergence of the gradient norm for the Newton-CG method. The number of curvature pairs available and the time taken are presented for each Hessian update. Each colour represents a different step of the central-path method with a different logarithmic barrier parameter β and a different residue tolerance. The tolerance for each central-path step is presented as a dash-dotted line. The number above each Newton iteration represents the number of conjugate gradient iterations needed to find the Newton direction.

Controlling the extremes of the eigenvalues is essential to keep preconditioned SGD stable and efficient. To motivate why our approach is better suited to stochastic optimization than using the BFGS formulae to approximate the Hessian, we also compute, for every Hessian update of SGD-MICE-Bay, a Hessian approximation using BFGS. In Figure 4, we present, for both κ=1000 (left) and $\kappa=10^6$ (right), the extreme eigenvalues for our Bayesian approach, for BFGS, and the true extremes. The same curvature pairs are used for both our Bayesian approach and BFGS. For both condition number cases, the BFGS method keeps the minimum eigenvalue above the true value; however, the largest eigenvalue far exceeds the true largest eigenvalue. In contrast, our Bayesian approach keeps the eigenvalues of the Hessian approximation between the extreme values due to the logarithmic barrier constraints.

Figure 4. Maximum and minimum eigenvalues for both our Bayesian approach and BFGS for Example 4.1 with κ=1000 (left) and $\kappa=10^6$ (right). We use the same curvature pairs for both cases, obtained from SGD-MICE-Bay. The true extremes of the eigenvalues are presented as black lines. Dashed lines represent the largest eigenvalues, and dash-dotted lines represent the smallest eigenvalues. Our approach gets closer to the true extreme eigenvalues than BFGS, which stays farther from the smallest eigenvalue and exceeds the largest one.

As motivated by (6) and Theorem 3.1, increasing $\hat\mu$ decreases the noise amplification caused by preconditioning the gradients with our Bayesian Hessian approximation. Figure 5 presents a comparison between SGD-Bay runs with a fixed step size for different values of $\hat\mu$. One thing to notice is the reduction of the amplitude of the oscillations as $\hat\mu$ increases. Moreover, we observe a clear linear convergence for the more conservative cases, $\hat\mu=1000$ and $\hat\mu=10{,}000$. Comparing these results with the ones presented in Figure 2 for the same condition number $\kappa=10^6$, it is clear that increasing $\hat\mu$ improves the performance of SGD-Bay even without controlling the gradient statistical error or decreasing the step size. However, in the general convex case, SGD-Bay with a fixed step size will converge to a neighbourhood of the optimum and oscillate around it with a fixed amplitude; increasing $\hat\mu$ or decreasing the step size may reduce the amplitude of the oscillations but not eliminate them.

Figure 5. Optimality gap versus iteration for SGD-Bay with fixed step size on Example 4.1 for different values of $\hat\mu$. The coloured squares with black edges represent the Hessian updates of each method. The different values of $\hat\mu$ result in different amplitudes for the oscillations around the optimum of SGD-Bay; larger values of $\hat\mu$ result in smaller amplitudes of oscillations (cf. (6)).

4.2. Logistic regression

Training logistic regression models for the classification of finite data is an empirical risk minimization problem, thus a problem of minimizing a sum of functions. When training a model on a large dataset, it is common to use SGD to reduce the iteration cost by subsampling the data points used in each iteration. Since inexact information about the gradient is used in each iteration, deterministic quasi-Newton methods usually fail to recover the Hessian and its inverse. Here, we use our Bayesian framework to recover the Hessian from noisy gradient observations and precondition the gradient estimates with the Hessian inverse. We compare plain SGD, SGD-MICE, SGD-MICE-Bay, SVRG, SVRG-Bay, SARAH, and SARAH-Bay on the task of training a binary classification logistic regression model with ℓ2 regularization on five different datasets: mushrooms, ijcnn1, w8a [Citation17], cod-rna [Citation29], and HIGGS [Citation1] (Footnote 4).

The ℓ2-regularized log loss function we seek to minimize is
(63) $F(\xi) = \frac{1}{N}\sum_{i=1}^{N} f(\xi,\theta_i=(x_i,y_i)) = \frac{1}{N}\sum_{i=1}^{N}\log\left(1 + \exp(-y_i\,\xi^\top x_i)\right) + \frac{\lambda}{2}\|\xi\|^2$,
where each data point $(x_i,y_i)$ is such that $x_i\in\mathbb{R}^{d_\xi}$ and $y_i\in\{-1,1\}$. For all datasets, we use a regularization parameter $\lambda = 10^{-5}$.
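For reference, a self-contained evaluation of (63) and its gradient might look as follows (the datasets themselves are loaded separately).

```python
import numpy as np

def logloss_and_grad(xi, X, y, lam=1e-5):
    """l2-regularized logistic loss (63) and its gradient.
    X: (N, d) data matrix; y: labels in {-1, +1}; lam: regularization."""
    z = y * (X @ xi)                        # margins y_i * <xi, x_i>
    loss = np.mean(np.logaddexp(0.0, -z)) + 0.5 * lam * xi @ xi
    sig = 1.0 / (1.0 + np.exp(z))           # sigmoid(-z_i)
    grad = -(X.T @ (sig * y)) / len(y) + lam * xi
    return loss, grad
```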

For SGD, only a singleton sample is used, i.e. the minibatch size is set to 1. Also, as in the example in Section 4.1, a decreasing schedule is used for the step size. For SGD-MICE, the sample size is adaptively controlled to keep the relative error of the gradient estimates below ϵ=0.5, with a step size of $\eta = 2/((L+\mu)(1+\epsilon^2))$ [Citation8]. For the HIGGS dataset, the choice ϵ=0.5 was too conservative, thus we used ϵ=0.8. As for SGD-MICE-Bay, we used the same values of ϵ; however, since we precondition the gradient estimates with the inverse of the Hessian, we use a step size of $\eta = (1+\epsilon^2)^{-1}$. Therefore, with the initial Hessian guess with diagonal entries $B_{ii} = (L+\mu)/2$, SGD-MICE-Bay recovers exactly SGD-MICE with the optimal step size. Similarly, for SVRG and SARAH, we use a fixed step size of $\eta_k = 0.1/L$, whereas for SVRG-Bay and SARAH-Bay we use $\eta_k = 0.1$. For SVRG, SVRG-Bay, SARAH, and SARAH-Bay, we start with a full batch, iterate using a mini-batch size of five, and restart every two epochs.

In Table 2, we present the sizes, numbers of features, and condition numbers of the resulting optimization problems for the five datasets, which differ significantly in their characteristics.

Table 2. For each of the datasets used in Example 4.2, we present its data size, number of features, and condition number of the optimization problem.

In Figure 6, we present the convergence of the optimality gap versus iteration (left) and versus runtime in seconds (right) for the mushrooms dataset. For SGD, SGD-MICE, SVRG, and SARAH, we present their convergences as solid lines and their respective preconditioned cases as dashed lines. For all the methods, preconditioning the gradient estimates using our Bayesian approach improved the final optimality gap by at least one order of magnitude. In terms of convergence of the optimality gap versus runtime, we observe that computing the Hessian approximation adds an overhead to the optimization process; however, the improvement in convergence compensates for this overhead. In this example, SGD-MICE-Bay achieved an optimality gap ten times smaller than that found by SGD-MICE and more than two orders of magnitude smaller than the one found by SGD. Preconditioning the gradient estimates of SVRG and SARAH with our Bayesian Hessian inverse approximation reduced the final optimality gap by more than one order of magnitude.

Figure 6. Convergence of the optimality gap versus iteration (left) and runtime (right) for Example 4.2, mushrooms dataset. The Hessian updates are marked with coloured squares with black edges.

In Figure 7, the convergence of the Newton-CG method is presented for each of the Hessian updates of SGD-MICE-Bay for the mushrooms dataset. Note that a relatively small number of Newton steps is needed for each central-path step, 8 in the worst case. Moreover, the number of conjugate gradient steps does not exceed 18, which is remarkably small considering that the problem has 112×112 variables. This result indicates that the problem is well-conditioned.

Figure 7. Convergence of the Newton-CG method for each of the Hessian updates of SGD-MICE-Bay in Example 4.2, mushrooms dataset. The number of curvature pairs available and the time taken are presented for each Hessian update. Each colour represents a different step of the central-path method with a different logarithmic barrier parameter β and a different residue tolerance. The tolerance for each central-path step is presented as a dash-dotted line. The number above each Newton iteration represents the number of conjugate gradient iterations needed to find the Newton direction.

Figure 8 presents the convergence for the ijcnn1 dataset. Here, as in the case of the mushrooms dataset, our Bayesian approach improved the convergence in all cases. However, the most remarkable improvement is that of SVRG-Bay, which achieved an optimality gap more than two orders of magnitude smaller than vanilla SVRG with just one Hessian update. Both SVRG-Bay and SGD-MICE-Bay achieved very similar optimality gaps, with SVRG-Bay requiring less runtime than SGD-MICE-Bay. In Figure 9, we present the convergence of Newton-CG for the single Hessian update of SGD-MICE-Bay for the ijcnn1 dataset. Note that the time spent computing the Hessian approximation is 0.33 seconds; for comparison, computing the true Hessian took 0.54 seconds on average over 100 evaluations.

Figure 8. Convergence of the optimality gap versus iteration (left) and runtime (right) for Example 4.2, ijcnn1 dataset. The Hessian updates are marked with coloured squares with black edges.

Figure 9. Convergence of the Newton-CG method for each of the Hessian updates of SGD-MICE-Bay in Example 4.2, ijcnn1 dataset. The number of curvature pairs available and the time taken are presented for each Hessian update. Each colour represents a different step of the central-path method with a different logarithmic barrier parameter β and a different residue tolerance. The tolerance for each central-path step is presented as a dash-dotted line. The number above each Newton iteration represents the number of conjugate gradient iterations needed to find the Newton direction.

The w8a dataset is the case with the largest number of features, 300, meaning that the Hessian to be approximated by our Bayesian approach has 300×300 components. The convergence for the w8a dataset is presented in Figure 10. Coupling SVRG, SARAH, and SGD-MICE with our Bayesian approach improved the convergence in all cases, although the improvement was more significant for SVRG and SARAH. In terms of runtime, however, SGD-MICE performed better than its competitors, and SGD-MICE-Bay had a small edge over SGD-MICE. Figure 11 presents the convergence of Newton-CG for SGD-MICE-Bay on the w8a dataset. As for the mushrooms dataset, only a few Newton steps (at most 24) and conjugate gradient iterations (at most 34) were needed despite the large dimensionality of the problem, 90,000 variables. Even with this large Hessian, the time spent finding an approximation (7.32 s in the first update and 15.9 s in the second) is competitive with the 8.52 s needed to compute the true Hessian, averaged over 100 evaluations. Recall that our approach has stability advantages over using the true Hessian inverse as a preconditioner because the logarithmic barrier on the smallest eigenvalue controls the amplification of gradient noise.
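
The following sketch conveys why bounding the smallest eigenvalue matters. It uses a simple spectral floor as a hypothetical stand-in for the effect of the logarithmic barrier; it is not our barrier formulation itself.

```python
# Spectral floor before preconditioning (a hypothetical stand-in for
# the effect of the log barrier on the smallest eigenvalue).
import numpy as np

def precondition(noisy_grad, H_approx, lam_min=1e-3):
    w, V = np.linalg.eigh(H_approx)  # symmetric eigendecomposition
    w = np.maximum(w, lam_min)       # keep eigenvalues above lam_min
    # Applying the inverse now amplifies gradient noise by at most 1/lam_min.
    return V @ ((V.T @ noisy_grad) / w)
```

With the exact Hessian inverse, components of the gradient noise along near-null eigenvectors are multiplied by the reciprocal of a near-zero eigenvalue; the floor caps this factor at 1/lam_min.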

Figure 10. Convergence of the optimality gap versus iteration (left) and runtime (right) for Example 4.2, w8a dataset. The Hessian updates are marked with coloured squares with black edges.

Figure 11. Convergence of the Newton-CG method for each of the Hessian updates of SGD-MICE-Bay in Example 4.2, w8a dataset. The number of curvature pairs available and the time taken are presented for each Hessian update. Each colour represents a different step of the central-path method with a different logarithmic barrier parameter β and a different residue tolerance. The tolerance for each central-path step is presented as a dash-dotted line. The number above each Newton iteration represents the number of conjugate gradient iterations needed to find the Newton direction.

Training the logistic regression model on the cod-rna dataset is the problem with the largest condition number studied here, 6.48×10⁹. We expect our Bayesian approach to assist the optimization methods in converging to the true solution despite this large condition number. The convergence of the optimality gap for the cod-rna dataset is presented in Figure 12. Note that SVRG, SARAH, and SGD-MICE get stuck after a few iterations, improving very little for most of the optimization process. Their preconditioned counterparts, however, perform much better, decreasing the optimality gap continuously. Looking at the convergence of the optimality gap versus runtime for SGD-MICE-Bay, we observe a sharp decrease after the first Hessian update that persists until the method stops. Figure 13 presents the convergence of Newton-CG for the four Hessian updates of SGD-MICE-Bay. In the second Hessian update, at the third central-path step, the Newton-CG method stalled for 12 iterations, leading us to believe that the parameters β and/or ρ were not properly set for this example. Still, the Newton-CG method recovered and completed the Hessian approximation in a reasonable time, 1.89 s, showing the robustness of our approach.
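
For reference, the outer loop that produces the colour-coded steps in Figures 7, 9, 11, and 13 has the schematic structure below. The geometric schedules for β and the residue tolerance, and the role of ρ as a reduction factor, are illustrative assumptions rather than the exact schedule used in the experiments.

```python
# Schematic central-path outer loop (schedules are assumptions for
# illustration; solve_inner is, e.g., the Newton-CG sketch above).
def central_path(x, solve_inner, beta0=1.0, rho=0.1, tol0=1e-2, n_steps=4):
    beta, tol = beta0, tol0
    for _ in range(n_steps):
        # Maximize the log posterior plus the beta-weighted barrier
        # terms, warm-starting from the previous central-path step.
        x = solve_inner(x, beta, tol)
        beta *= rho  # weaken the barrier ...
        tol *= rho   # ... and tighten the residue tolerance
    return x
```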

Figure 12. Convergence of the optimality gap versus iteration (left) and runtime (right) for Example 4.2, cod-rna dataset. The Hessian updates are marked with coloured squares with black edges.

Figure 13. Convergence of the Newton-CG method for each of the Hessian updates of SGD-MICE-Bay in Example 4.2, cod-rna dataset. The number of curvature pairs available and the time taken are presented for each Hessian update. Each colour represents a different step of the central-path method with a different logarithmic barrier parameter β and a different residue tolerance. The tolerance for each central-path step is presented as a dash-dotted line. The number above each Newton iteration represents the number of conjugate gradient iterations needed to find the Newton direction.

The HIGGS dataset contains 11×10⁶ data points, making it the largest studied here. Since the cost of our Hessian approximation is independent of the data size, we expect a time advantage over computing the true Hessian. Figure 14 presents the convergence of the optimality gap for the HIGGS dataset. Here we observe that, in contrast to the previous results, our Bayesian approach worsened the performance of SARAH and SVRG. For SGD-MICE, however, our method improved the final optimality gap by more than one order of magnitude. Concerning runtime, SGD-MICE-Bay also performed better than the other methods, despite performing 9 Hessian updates. The longest Hessian update was the last one, lasting 2.36 s; in comparison, computing the true Hessian for this example takes 130.58 s on average. One possible explanation for the poor performance of SVRG-Bay and SARAH-Bay is that, contrary to SGD-MICE, SVRG and SARAH do not control the relative statistical error of their gradient estimates.
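
To make this distinction concrete, the sketch below shows a generic way of controlling the relative statistical error of a gradient estimate: the sample size is doubled until the estimated standard error of the sample mean drops below a fraction ε of the mean's norm. It illustrates the idea only and is not the MICE estimator; `sample_grad` and the constants are hypothetical.

```python
# Generic relative statistical error control (illustration only, not
# MICE; sample_grad is a hypothetical callable returning one
# stochastic gradient sample).
import numpy as np

def gradient_with_relative_error(sample_grad, eps=0.5, n0=32, n_max=2**16):
    grads = [sample_grad() for _ in range(n0)]
    while True:
        G = np.asarray(grads)
        mean = G.mean(axis=0)
        # Standard error of the sample-mean gradient: sqrt(tr(Cov)/n).
        std_err = np.sqrt(G.var(axis=0, ddof=1).sum() / len(grads))
        if std_err <= eps * np.linalg.norm(mean) or len(grads) >= n_max:
            return mean
        grads += [sample_grad() for _ in range(len(grads))]  # double the sample
```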

Figure 14. Convergence of the optimality gap versus iteration (left) and runtime (right) for Example 4.2, HIGGS dataset. The Hessian updates are marked with coloured squares with black edges.

5. Conclusion

We presented a Bayesian approach to approximate the Hessian of a function given noisy observations of its gradient. This problem is of great importance in stochastic optimization, where preconditioning gradient estimates with the inverse of the Hessian matrix can improve the convergence of stochastic gradient descent methods. The proposed Bayesian approach takes into consideration not only the secant equations, as in deterministic quasi-Newton methods, but also the noise in the gradients. To mitigate the well-known amplification of gradient noise, we control the smallest eigenvalue of the Hessian approximation. To maximize the log posterior of the Hessian matrix, we use a Newton-CG method with a central-path approach that imposes the eigenvalue constraints. The numerical results presented show that our approach is not only interesting from a theoretical perspective but also provides practical advantages. Both for a stochastic quadratic function and for the training of a logistic regression model with ℓ2 regularization, using our Bayesian approximation of the Hessian matrix resulted in better convergence without a significant increase in runtime. For future research, we suggest using low-rank or diagonal approximations of the Hessians.
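
Schematically, the objective assembled by these ingredients can be written as follows; the Gaussian weighting of the secant residues and the placement of the barrier terms are shown in an illustrative form, not as the exact expressions derived earlier in the paper:

```latex
% Schematic log posterior (illustrative form; weights and barrier
% placement are assumptions, not the paper's exact expressions).
\log p\bigl(B \mid \{(s_k, y_k)\}_k\bigr)
  = -\frac{1}{2}\sum_{k} \frac{\lVert B\, s_k - y_k \rVert^{2}}{\sigma_k^{2}}
  + \beta \Bigl[ \log\bigl(\lambda_{\min}(B) - \mu_{\ell}\bigr)
               + \log\bigl(\mu_{u} - \lambda_{\max}(B)\bigr) \Bigr]
  + \mathrm{const}
```

Here the s_k are iterate differences, the y_k the corresponding noisy gradient differences, σ_k² scales the gradient noise, and the β-weighted logarithmic barrier keeps the extreme eigenvalues of B inside [μ_ℓ, μ_u].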

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was partially supported by the KAUST Office of Sponsored Research (OSR) under Award number OSR-2019-CRG8-4033 in the KAUST Competitive Research Grants Program Round 8, the Alexander von Humboldt Foundation, and the Flexible Interdisciplinary Research Collaboration Fund at the University of Nottingham (Project ID 7466664).

Notes on contributors

André Gustavo Carlon

Dr André Gustavo Carlon completed his Bachelor's degree in Civil Engineering at the Universidade Federal de Santa Catarina (UFSC) in Brazil in 2013. He then pursued his Master's and PhD degrees at the same institution, defending his Master's thesis on structural optimization in 2015 and his PhD dissertation on uncertainty quantification in 2019. Dr Carlon has since been a postdoctoral fellow in the stochastic numerics group at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. His research focuses on stochastic optimization, structural optimization, reliability engineering, and optimal experimental design.

Luis Espath

Dr Luis Espath obtained his Engineering Diploma from the Pontifical Catholic University of Rio Grande do Sul (PUCRS) in Brazil in 2007. He then completed both his Master's and Doctoral degrees at the Federal University of Rio Grande do Sul (UFRGS) in Brazil, concluding his studies in 2013. Dr Espath served as a postdoctoral fellow at King Abdullah University of Science and Technology (KAUST) until 2019. He then joined RWTH Aachen University in Germany as a Research Scientist, where he defended his Habilitation thesis in Mathematics in the field of Theoretical Mechanics in 2022. His research interests include theoretical and computational mechanics, uncertainty quantification, stochastic optimization, and machine learning.

Raúl Tempone

Dr Raúl Tempone received his PhD from the Royal Institute of Technology in Stockholm, Sweden, in 2002. He subsequently held postdoctoral positions at the Institute for Computational Engineering and Sciences (ICES) at the University of Texas at Austin and served as an Assistant Professor at Florida State University in Tallahassee. Since 2009, Dr Tempone has been a Full Professor at King Abdullah University of Science and Technology (KAUST) in Thuwal, Saudi Arabia. In 2018, he was awarded the Alexander von Humboldt Professorship at RWTH Aachen University in Germany. Dr Tempone's research focuses on numerical analysis, specifically the development and analysis of efficient and robust numerical methods for problems involving stochastic models and differential equations in various fields, including computational mechanics, quantitative finance, biological and chemical modeling, and wireless communications.


References

  • P. Baldi, P. Sadowski, and D. Whiteson, Searching for exotic particles in high-energy physics with deep learning, Nat. Commun. 5 (2014), pp. 1–9.
  • R. Bollapragada, J. Nocedal, D. Mudigere, H.J. Shi, and P.T.P. Tang, A progressive batching L-BFGS method for machine learning, in International Conference on Machine Learning, PMLR, 2018, pp. 620–629.
  • A. Bordes, L. Bottou, and P. Gallinari, SGD-QN: Careful quasi-Newton stochastic gradient descent, J. Mach. Learn. Res. 10 (2009), pp. 1737–1754.
  • L. Bottou, F.E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Rev. 60 (2018), pp. 223–311.
  • S. Boyd, S.P. Boyd, and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
  • R.H. Byrd, G.M. Chin, W. Neveitt, and J. Nocedal, On the use of stochastic Hessian information in optimization methods for machine learning, SIAM J. Optim. 21 (2011), pp. 977–995.
  • R.H. Byrd, S.L. Hansen, J. Nocedal, and Y. Singer, A stochastic quasi-Newton method for large-scale optimization, SIAM J. Optim. 26 (2016), pp. 1008–1031.
  • A. Carlon, L. Espath, R. Lopez, and R. Tempone, Multi-iteration stochastic optimizers, preprint (2020). Available at arXiv.
  • A. Defazio, F. Bach, and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems, Montreal, QC, Canada, 2014, pp. 1646–1654.
  • J.E. Dennis Jr and J.J. Moré, Quasi-Newton methods, motivation and theory, SIAM Rev. 19 (1977), pp. 46–89.
  • D. Dua and C. Graff, UCI machine learning repository (2022). Available at http://archive.ics.uci.edu/ml.
  • D. Goldfarb, Y. Ren, and A. Bahamou, Practical quasi-Newton methods for training deep neural networks, Adv. Neural Inf. Process. Syst. 33 (2020), pp. 2386–2396.
  • R. Gower, D. Goldfarb, and P. Richtárik, Stochastic block BFGS: Squeezing more curvature out of data, in International Conference on Machine Learning, PMLR, 2016, pp. 1869–1878.
  • R.M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik, SGD: General analysis and improved rates, in International Conference on Machine Learning, PMLR, 2019, pp. 5200–5209.
  • P. Hennig, Fast probabilistic optimization from noisy gradients, in International Conference on Machine Learning, PMLR, 2013, pp. 62–70.
  • R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Adv. Neural Inf. Process. Syst. 26 (2013), pp. 315–323.
  • R. Kohavi, Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid, in KDD, Portland, OR, USA, Vol. 96, 1996, pp. 202–207.
  • X.L. Li, Preconditioned stochastic gradient descent, IEEE Trans. Neural Netw. Learn. Syst. 29 (2017), pp. 1454–1466.
  • A. Mokhtari and A. Ribeiro, Stochastic quasi-Newton methods, Proc. IEEE 108 (2020), pp. 1906–1922.
  • P. Moritz, R. Nishihara, and M. Jordan, A linearly-convergent stochastic L-BFGS algorithm, in Artificial Intelligence and Statistics, Cadiz, Spain, 2016, pp. 249–258.
  • Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Vol. 87, Springer Science & Business Media, Cham, Switzerland, 2013.
  • L.M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč, SARAH: A novel method for machine learning problems using stochastic recursive gradient, in International Conference on Machine Learning, PMLR, 2017, pp. 2613–2621.
  • V. Pan and J. Reif, Efficient parallel solution of linear systems, in Proceedings of the Seventeenth Annual ACM Symposium on Theory of Computing, Providence, RI, USA, 1985, pp. 143–152.
  • B.T. Polyak, Introduction to Optimization, Vol. 1, Optimization Software, Inc., Publications Division, New York, 1987.
  • M. Schmidt, N. Le Roux, and F. Bach, Minimizing finite sums with the stochastic average gradient, Math. Program. 162 (2017), pp. 83–112.
  • N.N. Schraudolph, J. Yu, and S. Günter, A stochastic quasi-Newton method for online convex optimization, in Artificial Intelligence and Statistics, PMLR, 2007, pp. 436–443.
  • H.J.M. Shi, Y. Xie, R. Byrd, and J. Nocedal, A noise-tolerant quasi-Newton algorithm for unconstrained optimization, SIAM J. Optim. 32 (2022), pp. 29–55.
  • J. Sohl-Dickstein, B. Poole, and S. Ganguli, Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods, in International Conference on Machine Learning, Beijing, China, 2014, pp. 604–612.
  • A.V. Uzilov, J.M. Keegan, and D.H. Mathews, Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change, BMC Bioinformatics 7 (2006), pp. 1–30.
  • X. Wang, S. Ma, D. Goldfarb, and W. Liu, Stochastic quasi-Newton methods for nonconvex stochastic optimization, SIAM J. Optim. 27 (2017), pp. 927–956.
  • Y. Xie, R. Byrd, and J. Nocedal, Analysis of the BFGS method with errors, preprint (2019). Available at arXiv:1901.09063.