{"title": "Stochastic Cubic Regularization for Fast Nonconvex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2899, "page_last": 2908, "abstract": "This paper proposes a stochastic variant of a classic algorithm---the cubic-regularized Newton method [Nesterov and Polyak]. The proposed algorithm efficiently escapes saddle points and finds approximate local minima for general smooth, nonconvex functions in only $\\mathcal{\\tilde{O}}(\\epsilon^{-3.5})$ stochastic gradient and stochastic Hessian-vector product evaluations. The latter can be computed as efficiently as stochastic gradients. This improves upon the $\\mathcal{\\tilde{O}}(\\epsilon^{-4})$ rate of stochastic gradient descent. Our rate matches the best-known result for finding local minima without requiring any delicate acceleration or variance-reduction techniques.", "full_text": "Stochastic Cubic Regularization for Fast Nonconvex\n\nOptimization\n\nNilesh Tripuraneni\u21e4 Mitchell Stern\u21e4 Chi Jin\n\nJeffrey Regier Michael I. Jordan\n\n{nilesh_tripuraneni,mitchell,chijin,regier}@berkeley.edu\n\njordan@cs.berkeley.edu\n\nUniversity of California, Berkeley\n\nAbstract\n\nThis paper proposes a stochastic variant of a classic algorithm\u2014the cubic-\nregularized Newton method [Nesterov and Polyak, 2006]. The proposed algorithm\nef\ufb01ciently escapes saddle points and \ufb01nds approximate local minima for general\nsmooth, nonconvex functions in only \u02dcO(\u270f3.5) stochastic gradient and stochastic\nHessian-vector product evaluations. The latter can be computed as ef\ufb01ciently as\nstochastic gradients. This improves upon the \u02dcO(\u270f4) rate of stochastic gradient\ndescent. Our rate matches the best-known result for \ufb01nding local minima without\nrequiring any delicate acceleration or variance-reduction techniques.\n\n1\n\nIntroduction\n\nWe consider the problem of nonconvex optimization in the stochastic approximation framework\n[Robbins and Monro, 1951]:\n\nmin\nx2Rd\n\nf (x) := E\u21e0\u21e0D[f (x; \u21e0)].\n\n(1)\n\nIn this setting, we only have access to the stochastic function f (x; \u21e0), where the random variable \u21e0 is\nsampled from an underlying distribution D. The task is to optimize the expected function f (x), which\nin general may be nonconvex. This framework covers a wide range of problems, including the of\ufb02ine\nsetting where we minimize the empirical loss over a \ufb01xed amount of data, and the online setting\nwhere data arrives sequentially. One of the most prominent applications of stochastic optimization\nhas been in large-scale statistics and machine learning problems, such as the optimization of deep\nneural networks.\nClassical analysis in nonconvex optimization only guarantees convergence to a \ufb01rst-order stationary\npoint (i.e., a point x satisfying krf (x)k = 0), which can be a local minimum, a local maximum,\nor a saddle point. This paper goes further, proposing an algorithm that escapes saddle points and\nconverges to a local minimum. A local minimum is de\ufb01ned as a point x satisfying krf (x)k = 0\nand r2f (x) \u232b 0. Finding such a point is of special interest for a large class of statistical learning\nproblems where local minima are global or near-global solutions (e.g. Choromanska et al. [2015],\nSun et al. [2016a,b], Ge et al. [2017]).\nAmong \ufb01rst-order stochastic optimization algorithms, stochastic gradient descent (SGD) is perhaps\nthe simplest and most versatile. While SGD is computationally inexpensive, the best current guarantee\nfor \ufb01nding an \u270f-approximate local minimum (see De\ufb01nition 1) requires O(\u270f4poly(d)) iterations\n[Ge et al., 2015], which is inef\ufb01cient in the high-dimensional regime.\n\n\u21e4Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fIn contrast, second-order methods which have access to the Hessian of f can exploit curvature to more\neffectively escape saddles and arrive at local minima. However, constructing the full Hessian can be\nprohibitively expensive in high-dimensions. Thus, recent work has explored the use of Hessian-vector\nproducts r2f (x) \u00b7 v, which can be computed as ef\ufb01ciently as gradients in many cases including\nneural networks [Pearlmutter, 1994].\nAmong second-order algorithms, one of the most natural extensions of the gradient descent algorithm\nis the cubic-regularized Newton method of Nesterov and Polyak [2006]. Whereas gradient descent\n\ufb01nds the minimizer of a local second-order Taylor expansion at each step,\n\n\uf8fff (xt) + rf (xt)>(x xt) +\n\n`\n\n2kx xtk2 ,\n\nxGD\nt+1 = argmin\n\nx\n\nthe cubic regularized Newton method \ufb01nds the minimizer of a local third-order Taylor expansion,\n\nxCubic\nt+1 = argmin\n\nx\n\nhf (xt) + rf (xt)>(x xt) +\n\n1\n2\n\n(x xt)>r2f (xt)(x xt) +\n\n\u21e2\n\n6kx xtk3i.\n\nMost previous work on the cubic-regularized Newton method has focused on the non-stochastic or\npartially-stochastic settings. This leads us to ask the central questions of this paper: Can we design\na fully stochastic variant of the cubic-regularized Newton method? If so, is such an algorithm\nfaster than SGD?\nIn this work, we answer both questions in the af\ufb01rmative, bridging the gap between its use in the\nnon-stochastic and stochastic settings.\nSpeci\ufb01cally, we propose a stochastic variant of the cubic-regularized Newton method. We pro-\nvide a non-asymptotic analysis of its complexity, showing that the proposed algorithm \ufb01nds an\n\u270f-second-order stationary point using only \u02dcO(\u270f3.5) stochastic gradient and stochastic Hessian-vector\nevaluations, where \u02dcO(\u00b7) hides poly-logarithmic factors.1 Our rate improves upon the \u02dcO(\u270f4) rate of\nstochastic gradient descent, and matches the best-known result for \ufb01nding local minima without the\nneed for any delicate acceleration or variance reduction techniques (see Section 1.1 for details).\nWe also empirically show that the stochastic cubic-regularized Newton method proposed in this\npaper performs favorably on both synthetic and real non-convex problems relative to state-of-the-art\noptimization methods.\n1.1 Related Work\n\nThere has been a recent surge of interest in optimization methods that can escape saddle points and\n\ufb01nd \u270f-approximate local minima (see De\ufb01nition 1) in various settings. We provide a brief summary\nof these results. All iteration complexities in this section are stated in terms of \ufb01nding approximate\nlocal minima, and only highlight the dependency on \u270f and d.\n\n1.1.1 Singleton Function\nThis line of work optimizes over a general function f without any special structural assumptions.\nIn this setting, the optimization algorithm has direct access to the gradient or Hessian oracles at\neach iteration. The work of Nesterov and Polyak [2006] \ufb01rst proposed the cubic-regularized Newton\nmethod, which requires O(\u270f1.5) gradient and Hessian oracle calls to the entire f, to \ufb01nd an \u270f-\nsecond-order stationary point. Later, the ARC algorithm [Cartis et al., 2011] and trust-region methods\n[Curtis et al., 2017] were also shown to achieve the same guarantee with similar Hessian oracle\naccess. However, these algorithms rely on having access to the full Hessian at each iteration, which is\nprohibitive in high dimensions.\nRecently, instead of using the full Hessian, Carmon and Duchi [2016] showed that using a gradient\ndescent solver for the cubic regularization subproblem allows their algorithm to \ufb01nd \u270f-second-order\nstationary points in \u02dcO(\u270f2) Hessian-vector product evaluations. With acceleration techniques, the\nnumber of Hessian-vector products can be reduced to \u02dcO(\u270f1.75) [Carmon et al., 2016, Agarwal et al.,\n2017, Royer and Wright, 2017].\n\n1A single query, for a single realization \u21e0 \u21e0 D, to rf (x, \u21e0) or r2f (x, \u21e0)\u00b7 v (for pre-speci\ufb01ed v) is referred\n\nto as a stochastic gradient or stochastic Hessian-vector oracle evaluation.\n\n2\n\n\fMethod\n\nRuntime\n\nVariance Reduction\n\nStochastic Gradient Descent [Ge et al., 2015]\n\nNatasha 2 [Allen-Zhu, 2017]\n\nStochastic Cubic Regularization (this paper)\n\nO(\u270f4poly(d))\n\u02dcO(\u270f3.5)2\n\u02dcO(\u270f3.5)\n\nnot needed\n\nneeded\n\nnot needed\n\nTable 1: Comparison of our results to existing results for stochastic, nonconvex optimization with\nprovable convergence to approximate local minima.\n\nMeanwhile, in the realm of entirely Hessian-free results, Jin et al. [2017] showed that a simple variant\nof gradient descent can \ufb01nd \u270f-second stationary points in \u02dcO(\u270f2) gradient evaluations.\nNote this line of work doesn\u2019t accommodate stochastic gradients or stochastic Hessians. Restricting\naccess to only stochastic function queries makes the optimization problem more challenging.\n\n\u270f3/2 + n3/4d\n\nnPn\n\n1.1.2 Finite-Sum Setting\nIn the \ufb01nite-sum setting (also known as the of\ufb02ine setting) where f (x) := 1\ni=1 fi(x), one\nassumes that algorithms have access to the individual functions fi. In this setting, variance reduction\ntechniques can be exploited [Johnson and Zhang, 2013]. Agarwal et al. [2017] give an algorithm\nrequiring \u02dcO( nd\n\u270f7/4 ) Hessian-vector oracle calls (each to an fi(x)) to \ufb01nd an \u270f-approximate\nlocal minimum. A similar result is achieved by the algorithm proposed by Reddi et al. [2017].\n1.1.3 Stochastic Approximation\nThe framework of stochastic approximation where f (x) := E\u21e0\u21e0D[f (x; \u21e0)] only assumes access to\na stochastic gradient and Hessian via f (x; \u21e0). In this setting, Ge et al. [2015] showed that the total\ngradient iteration complexity for SGD to \ufb01nd an \u270f-second-order stationary point was O(\u270f4poly(d)).\nMore recently, Kohler and Lucchi [2017] consider a subsampled version of the cubic regularization\nalgorithm, but do not provide a non-asymptotic analysis for their algorithm to \ufb01nd an approximate\nlocal minimum; they also assume access to exact (expected) function values at each iteration which\nare not available in the fully stochastic setting. Xu et al. [2017] consider the case of stochastic\nHessians, but also require access to exact gradients and function values at each iteration. Recently,\nAllen-Zhu [2017] proposed an algorithm with a mechanism exploiting variance reduction that \ufb01nds a\nsecond-order stationary point with \u02dcO(\u270f3.5)2 Hessian-vector product evaluations.\nAfter this work, Allen-Zhu and Li [2017] and Xu and Yang [2017] show how to use gradient evalua-\ntions to ef\ufb01ciently approximate Hessian-vector products. Using this technique along with variance\nreduction, they both provide algorithms achieving an \u02dcO(\u270f3.5) rate using gradient evaluations.\nWe note that our result matches the best results so far, using a simpler approach without any delicate\nacceleration or variance-reduction techniques. See Table 1 for a brief comparison.\n2 Preliminaries\nWe are interested in stochastic optimization problems of the form minx2Rd f (x) := E\u21e0\u21e0D[f (x; \u21e0)],\nwhere \u21e0 is a random variable with distribution D. In general, the function f (x) may be nonconvex.\nThis formulation covers both the standard of\ufb02ine setting where the objective function can be expressed\nas a \ufb01nite sum of n individual functions f (x, \u21e0i), as well as the online setting where data arrives\nsequentially.\nOur goal is to minimize the function f (x) using only stochastic gradients rf (x; \u21e0) and stochastic\nHessian-vector products r2f (x; \u21e0) \u00b7 v, where v is a vector of our choosing. Although it is expensive\nand often intractable in practice to form the entire Hessian, computing a Hessian-vector product is as\ncheap as computing a gradient when our function is represented as an arithmetic circuit [Pearlmutter,\n1994], as is the case for neural networks.\nNotation: We use bold uppercase letters A, B to denote matrices and bold lowercase letters x, y to\ndenote vectors. For vectors we use k\u00b7k to denote the `2-norm, and for matrices we use k\u00b7k to denote\n2The original paper reports a rate of \u02dcO(\u270f3.25) due to a different de\ufb01nition of \u270f-second-order stationary\n\npoint, min(r2f (x)) O(\u270f1/4) , which is weaker than the standard de\ufb01nition as in De\ufb01nition 1.\n\n3\n\n\fthe spectral norm and min(\u00b7) to denote the minimum eigenvalue. Unless otherwise speci\ufb01ed, we use\nthe notation O(\u00b7) to hide only absolute constants which do not depend on any problem parameter,\nand the notation \u02dcO(\u00b7) to hide only absolute constants and logarithmic factors.\n2.1 Assumptions\nThroughout the paper, we assume that the function f (x) is bounded below by some optimal value f\u21e4.\nWe also make the following assumptions about function smoothness:\nAssumption 1. The function f (x) has\n\n\u2022 `-Lipschitz gradients and \u21e2-Lipschitz Hessians: for all x1 and x2,\n\nkrf (x1) rf (x2)]k \uf8ff `kx1 x2k;r2f (x1) r2f (x2) \uf8ff \u21e2kx1 x2k.\n\nThe above assumptions state that the gradient and the Hessian cannot change dramatically in a small\nlocal area, and are standard in prior work on escaping saddle points and \ufb01nding local minima.\nNext, we make the following variance assumptions about stochastic gradients and stochastic Hessians:\nAssumption 2. The function f (x, \u21e0) has\n\n\u2022 for all x, Ehkrf (x, \u21e0) rf (x)k2i \uf8ff 2\n\u2022 for all x, Ehr2f (x, \u21e0) r2f (x)2i \uf8ff 2\n\n1 and krf (x, \u21e0) rf (x)k \uf8ff M1 a.s.;\n\n2 andr2f (x, \u21e0) r2f (x) \uf8ff M2 a.s.\n\nWe note that the above assumptions are not essential to our result, and can be replaced by any\nconditions that give rise to concentration. Moreover, the a.s. bounded Hessian assumption can be\nremoved if one further assumes f (x, \u21e0) has `0-Lipschitz gradients for all \u21e0, which guarantees Hessian\nconcentration with no further assumption on the variance of r2f (x, \u21e0) [Tropp et al., 2015].\n2.2 Cubic Regularization\n\nOur target in this paper is to \ufb01nd an \u270f-second-order stationary point, which we de\ufb01ne as follows:\nDe\ufb01nition 1. For a \u21e2-Hessian Lipschitz function f, we say that x is an \u270f-second-order stationary\npoint (or \u270f-approximate local minimum) if\n\nkrf (x)k \uf8ff \u270f\n\nand min(r2f (x)) p\u21e2\u270f.\n\n(2)\n\nAn \u270f-second-order stationary point not only has a small gradient, but also has a Hessian which is\nclose to positive semi-de\ufb01nite. Thus it is often also referred to as an \u270f-approximate local minimum.\nIn the deterministic setting, cubic regularization [Nesterov and Polyak, 2006] is a classic algorithm\nfor \ufb01nding a second-order stationary point of a \u21e2-Hessian-Lipschitz function f (x). In each iteration,\nit \ufb01rst forms a local upper bound on the function using a third-order Taylor expansion around the\ncurrent iterate xt:\n\nmt(x) = f (xt) + rf (xt)>(x xt) +\n\n1\n2\n\n(x xt)>r2f (xt)(x xt) +\n\n\u21e2\n6kx xtk3.\n\nThis is called the cubic submodel. Then, cubic regularization minimizes this cubic submodel to obtain\nthe next iterate: xt+1 = argminx mt(x). When the cubic submodel can be solved exactly, cubic\n\nregularization requires O\u21e3p\u21e2(f (x0)f\u21e4)\n\n\u270f1.5\n\n\u2318 iterations to \ufb01nd an \u270f-second-order stationary point.\n\nTo apply this algorithm in the stochastic setting, three issues need to be addressed: (1) we only have\naccess to stochastic gradients and Hessians, not the true gradient and Hessian; (2) our only means of\ninteraction with the Hessian is through Hessian-vector products; (3) the cubic submodel cannot be\nsolved exactly in practice, only up to some tolerance. We discuss how to overcome each of these\nobstacles in our paper.\n\n3 Main Results\nWe begin with a general-purpose stochastic cubic regularization meta-algorithm in Algorithm 1,\nwhich employs a black-box subroutine to solve stochastic cubic subproblems. At a high level, in\n\n4\n\n\fAlgorithm 1 Stochastic Cubic Regularization (Meta-algorithm)\nInput: mini-batch sizes n1, n2, initialization x0, number of iterations Tout, and \ufb01nal tolerance \u270f.\n1: for t = 0, . . . , Tout do\nSample S1 {\u21e0i}n1\n2:\ngt 1\n3:\n4: Bt[\u00b7] 1\n5: , m Cubic-Subsolver(gt, Bt[\u00b7], \u270f)\n6:\n7:\n8:\n9:\n10:\nend if\n11:\n12: end for\nOutput: x\u21e4 if the early termination condition was reached, otherwise the \ufb01nal iterate xTout+1.\n\ni=1, S2 {\u21e0i}n2\ni=1.\n|S1|P\u21e0i2S1 rf (xt; \u21e0i)\n|S2|P\u21e0i2S2 r2f (xt, \u21e0i)(\u00b7)\n100q \u270f3\n Cubic-Finalsolver(gt, Bt[\u00b7], \u270f)\nx\u21e4 xt + \nbreak\n\nxt+1 xt + \nif m 1\n\n\u21e2 then\n\norder to deal with stochastic gradients and Hessians, we sample two independent minibatches S1 and\nS2 at each iteration. Denoting the average gradient and average Hessian by\n\ngt =\n\n1\n\n|S1| X\u21e0i2S1\n\nrf (xt, \u21e0i), Bt =\n\n1\n\n|S2| X\u21e0i2S2\n\nr2f (xt, \u21e0i),\n\nthis implies a stochastic cubic submodel:\n\n(3)\n\n(4)\n\nmt(x) = f (xt) + (x xt)>gt +\n\n1\n2\n\n(x xt)>Bt(x xt) +\n\n\u21e2\n6kx xtk3.\n\nAlthough the subproblem depends on Bt, we note that our meta-algorithm never explicitly formulates\nthis matrix, only providing the subsolver access to Bt through Hessian-vector products, which\nwe denote by Bt[\u00b7] : Rd ! Rd. We hence assume that the subsolver performs gradient-based\noptimization to solve the subproblem, as rmt(x) depends on Bt only via Bt[x xt].\nAfter sampling minibatches for the gradient and the Hessian, Algorithm 1 makes a call to a black-box\ncubic subsolver to optimize the stochastic submodel mt(x). The subsolver returns a parameter\nchange , i.e., an approximate minimizer of the submodel, along with the corresponding change\nin submodel value, m := mt(xt + ) mt(xt). The algorithm then updates the parameters by\nadding to the current iterate, and checks whether m satis\ufb01es a stopping condition.\nIn more detail, the Cubic-Subsolver subroutine takes a vector g and a function for computing Hessian-\nvector products B[\u00b7], then optimizes the third-order polynomial \u02dcm() = >g + 1\n2 >B[] +\n6kk3. Let ? = argmin \u02dcm() denote the minimizer of this polynomial. In general, the\n\u21e2\nsubsolver cannot return the exact solution ?. We hence tolerate a certain amount of suboptimality:\nCondition 1. For any \ufb01xed, small constant c, Cubic-Subsolver(g, B[\u00b7], \u270f) terminates within T (\u270f)\ngradient iterations (which may depend on c), and returns a satisfying at least one of the following:\n\n1. max{ \u02dcm(), f (xt + ) f (xt)} \uf8ff \u2326(p\u270f3/\u21e2). (Case 1)\n2. kk \uf8ff k?k + cq \u270f\n\n\u21e2 and, if k?k 1\n\n2p\u270f/\u21e2, then \u02dcm() \uf8ff \u02dcm(?) + c\n\n(Case 2)\n\n12 \u00b7 \u21e2k?k3.\n\nThe \ufb01rst condition is satis\ufb01ed if the parameter change results in submodel and function decreases\nthat are both suf\ufb01ciently large (Case 1). If that fails to hold, the second condition ensures that \nis not too large relative to the true solution ?, and that the cubic submodel is solved to precision\nc \u00b7 \u21e2k?k3 when k?k is large (Case 2).\nAs mentioned above, we assume the subsolver uses gradient-based optimization to solve the subprob-\nlem so that it will only access the Hessian only through Hessian-vector products. Accordingly, it\ncan be any standard \ufb01rst-order algorithm such as gradient descent, Nesterov\u2019s accelerated gradient\n\n5\n\n\fAlgorithm 2 Cubic-Subsolver via Gradient Descent\nInput: g, B[\u00b7], tolerance \u270f.\n1: if kgk `2\n\u21e2 then\n2: Rc g>B[g]\ng\n3: Rc\nkgk\n4: else\n5: 0, c0 p\u270f\u21e2\n\n\u21e2kgk2 +r\u21e3 g>B[g]\n\u21e2kgk2 \u23182\n` , \u2318 1\n\n+ 2kgk\u21e2\n\n20`\n\n2kk)\n\n\u02dcg g + \u21e3 for \u21e3 \u21e0 Unif(Sd1)\nfor t = 1, . . . ,T (\u270f) do\n \u2318(\u02dcg + B[] + \u21e2\nend for\n\n6:\n7:\n8:\n9:\n10: end if\n11: m g> + 1\nOutput: , m\n\n2 >B[] + \u21e2\n\n6kk3\n\nAlgorithm 3 Cubic-Finalsolver via Gradient Descent\nInput: g, B[\u00b7], tolerance \u270f.\n1: 0, gm g, \u2318 1\n2 do\n2: while kgmk > \u270f\n3: \u2318gm\n4:\n5: end while\nOutput: \n\ngm g + B[] + \u21e2\n\n2kk\n\n20`\n\ndescent, etc. Gradient descent is of particular interest as it can be shown to satisfy Condition 1 (see\nLemma 1).\nHaving given an overview of our meta-algorithm and veri\ufb01ed the existence of a suitable subsolver,\nwe are ready to present our main theorem:\nTheorem 1. There exists an absolute constant c such that if f (x) satis\ufb01es Assumptions 1, 2,\nc2\u270f2 ) log\u21e3 dp\u21e2f\nCubicSubsolver satis\ufb01es Condition 1 with c, n1 max( M1\nmax( M2\nput an \u270f-second-order stationary point of f with probability at least 1 within\n\u21e2\u270f\u25c6 \u00b7 T (\u270f)\u25c6\u25c6\n\n\u270f1.5c \u2318, and n2 \n\u270f1.5c \u2318, then for all > 0 and f f (x0) f\u21e4, Algorithm 1 will out-\n\u270f1.5 \u2713max\u2713 M1\n\nc2\u21e2\u270f ) log\u21e3 dp\u21e2f\n\u02dcO\u2713p\u21e2f\n\n\u270f2\u25c6 + max\u2713 M2p\u21e2\u270f\n\ncp\u21e2\u270f , 2\n\nc\u270f , 2\n\n(5)\n\n2\n2\n\n2\n1\n\n\u270f\n\n,\n\n,\n\n1\n\n2\n\ntotal stochastic gradient and Hessian-vector product evaluations.\n\nIn the limit where \u270f is small the result simpli\ufb01es:\n\n1\n\n,\n\nc1M1\n\n4\n2\nc2\n2M 2\n\n2 \u21e2o, then under the settings of Theorem 1 we can conclude that\nRemark 1. If \u270f \uf8ff minn 2\nAlgorithm 1 will output an \u270f-second-order stationary point of f with probability at least 1 within\n\u02dcO\u2713p\u21e2f\n\u270f1.5 \u2713 2\n(6)\n\n\u21e2\u270f \u00b7 T (\u270f)\u25c6\u25c6\n\ntotal stochastic gradient and Hessian-vector product evaluations.\nTheorem 1 states that after Tout = \u02dcO(\n) iterations, stochastic cubic regularization\n(Algorithm 1) will have found an \u270f-second-order stationary point w.h.p. Within each iteration, we\nrequire n1 = \u02dcO( 2\n\u21e2\u270f ) samples for Hessian-vector\nproduct averaging. Recall that the averaged gradient gt is \ufb01xed for a given cubic submodel, while\nthe averaged Hessian-vector product Bt[v] needs to be recalculated every time the cubic subsolver\n\n\u270f2 ) samples for gradient averaging and n2 = \u02dcO( 2\n\np\u21e2(f (x0)f\u21e4)\n\n\u270f2 +\n\n2\n2\n\n\u270f1.5\n\n1\n\n2\n\n1\n\n6\n\n\f2\n\n1\n\n\u270f2 + 2\n\nqueries the gradient r \u02dcm(\u00b7). At most T (\u270f) such queries will be made by de\ufb01nition. Thus, each\niteration takes \u02dcO( 2\n\u21e2\u270f \u00b7 T (\u270f)) stochastic gradient/Hessian-vector product evaluations when \u270f is\nsmall (see Remark 1).\nFinally, we note that lines 8-11 of Algorithm 1 give the termination condition of our meta-algorithm.\nWhen the decrease in submodel value m is too small, our theory guarantees xt + ? is an \u270f-\nsecond-order stationary point, where ? is the optimal solution of the cubic submodel. However,\nCubic-Subsolver may only produce an inexact solution satisfying Condition 1, which is not\nsuf\ufb01cient for xt + to be an \u270f-second-order stationary point. We therefore call Cubic-Finalsolver\n(which just uses gradient descent and is detailed in Algorithm 3) to solve the subproblem with higher\nprecision.\n3.1 Gradient Descent as a Cubic-Subsolver\nOne concrete example of a cubic subsolver is a simple variant of gradient descent (Algorithm 2)\nas studied in Carmon and Duchi [2016]. The two main differences relative to standard gradient\ndescent are: (1) lines 1\u20133: when g is large, the submodel \u02dcm() may be ill-conditioned, so instead of\ndoing gradient descent, the iterate only moves one step in the g direction, which already guarantees\nsuf\ufb01cient descent; (2) line 6: the algorithm adds a small perturbation to g to avoid a certain \u201chard\"\ncase for the cubic submodel. We refer readers to Carmon and Duchi [2016] for more details about\nAlgorithm 2.\nAdapting their result for our setting, we obtain the following lemma, which states that the gradient\ndescent subsolver satis\ufb01es our Condition 1.\nLemma 1. There exists an absolute constant c0, such that under the same assumptions on f (x)\nand the same choice of parameters n1, n2 as in Theorem 1, Algorithm 2 satis\ufb01es Condition 1 with\nprobability at least 1 0 with T (\u270f) \uf8ff \u02dcO( `p\u21e2\u270f ).\nOur next corollary applies gradient descent (Algorithm 2) as the approximate cubic subsolver in our\nmeta-algorithm (Algorithm 1), which immediately gives the total number of gradient and Hessian-\nvector evaluations for the full algorithm.\n\nCorollary 1. Under the same settings as Theorem 1, if \u270f \uf8ff minn 2\n2 \u21e2o, and if we instantiate\nthe Cubic-Subsolver subroutine with Algorithm 2, then with probability greater than 1 , Algorithm\n1 will output an \u270f-second-order stationary point of f (x) within\np\u21e2\u270f\u25c6\u25c6\n\n\u02dcO\u2713p\u21e2f\n\n\u270f1.5 \u2713 2\n\n2\n2\n\u21e2\u270f \u00b7\n\n4\n2\nc2\n2M 2\n\n1\n\n\u270f2 +\n\n(7)\n\ntotal stochastic gradient and Hessian-vector product evaluations.\n\n1\n\nc1M1\n\n,\n\n`\n\nFrom Corollary 1, we observe that the dominant term in solving the submodel is 2\n\u270f2 when \u270f is\nsuf\ufb01ciently small, giving a total iteration complexity of \u02dcO(\u270f3.5) when other problem-dependent\nparameters are constant. This improves on the \u02dcO(\u270f4poly(d)) complexity attained by SGD.\n4 Proof Sketch\n\n1\n\nThis section sketches the crucial steps needed to understand and prove our main theorem (Theorem\n1). We describe our high-level approach, and provide a proof sketch in Appendix A in the stochastic\nsetting, assuming oracle access to an exact subsolver. For the case of an inexact subsolver and full\nproof details, we defer to Appendix B.\nRecall that at iteration t of Algorithm 1, a stochastic cubic submodel mt is constructed around the\ncurrent iterate xt with the form given in the following Equation, mt(x) = f (xt) + (x xt)>gt +\n6kx xtk3, where gt and Bt are the averaged stochastic gradients and\n2 (x xt)>Bt(x xt) + \u21e2\n1\nHessians. At a high level, we will show that for each iteration, the following two claims hold:\nClaim 1. If xt+1 is not an \u270f-second-order stationary point of f (x), the cubic submodel has large\ndescent mt(xt+1) mt(xt).\nClaim 2. If the cubic submodel has large descent mt(xt+1) mt(xt), then the true function also\nhas large descent f (xt+1) f (xt).\n\n7\n\n\fFigure 1: The piecewise cubic function w(x) used along one of the dimensions in the synthetic\nexperiment. The other dimension uses a scaled quadratic.\n\nGiven these claims, it is straightforward to argue for the correctness of our approach. We know that if\nwe observe a large decrease in the cubic submodel value mt(xt+1) mt(xt) during Algorithm 1,\nthen by Claim 2 the function will also have large descent. But since f is bounded below, this cannot\nhappen inde\ufb01nitely, so we must eventually encounter an iteration with small cubic submodel descent.\nWhen that happens, we can conclude by Claim 1 that xt+1 is an \u270f-second-order stationary point.\nWe note that Claim 2 is especially important in the stochastic setting, as we no longer have access to\nthe true function but only the submodel. Claim 2 ensures that progress in mt still indicates progress\nin f, allowing the algorithm to terminate at the correct time. A more detailed proof sketch and full\nproofs can be found in the Appendix.\n5 Experiments\nIn this section, we provide empirical results on synthetic and real-world data sets to demonstrate the\nef\ufb01cacy of our approach. All experiments are implemented using TensorFlow [Abadi et al., 2016],\nwhich allows for ef\ufb01cient computation of Hessian-vector products using the method described by\nPearlmutter [1994].\n\n5.1 Synthetic Nonconvex Problem\n\nWe begin by constructing a nonconvex problem with a saddle point to compare our proposed approach\nagainst stochastic gradient descent. Let w(x) be the W-shaped scalar function depicted in Figure 1,\nwith a local maximum at the origin and two local minima on either side. We defer the exact form of\nw(x) to Appendix D.\nWe aim to solve the problem\n\nmin\n\nx2R2\u21e5w(x1) + 10x2\n2\u21e4 ,\n\nwith independent noise drawn from N (0, 1) added separately to each component of every gradient\nand Hessian-vector product evaluation. By construction, the objective function has a saddle point at\nthe origin with Hessian eigenvalues of -0.2 and 20, providing a simple but challenging test case.\nWe ran our method and SGD on this problem, plotting the objective value versus the number of oracle\ncalls in Figure 2. The batch sizes and learning rates for each method are tuned separately to ensure a\nfair comparison; see Appendix D for details. We observe that our method is able to escape the saddle\npoint at the origin and converge to one of the global minima faster than SGD.\n5.2 Deep Autoencoder\nIn addition to the synthetic problem above, we also investigate the performance of our approach\non a more practical problem from deep learning, namely training a deep autoencoder on MNIST\n[LeCun and Cortes, 2010]. Our architecture consists of a fully connected encoder with dimensions\n(28 \u21e5 28) ! 512 ! 256 ! 128 ! 32 together with a symmetric decoder. We use a softplus\nnonlinearity (de\ufb01ned as softplus(x) = log(1 + exp(x))) for each hidden layer, an elementwise\nsigmoid for the \ufb01nal layer, and a pixelwise `2 loss between the input and the reconstructed output\nas our objective function. Results on this autoencoding task are presented in Figure 3. In addition\nto training the model with our method and SGD, we also include results using AdaGrad, a popular\nadaptive \ufb01rst-order method with strong empirical performance [Duchi et al., 2011], together with\nresults for a method combining variance reduction and second-order information proposed by Reddi\net al. [2017]. More details about our experimental setup can be found in Appendix D.\n\n8\n\n-0.50.5-0.006-0.004-0.0020.0020.0040.006\fFigure 2: Results on our synthetic nonconvex\noptimization problem. Stochastic cubic regular-\nization escapes the saddle point at the origin and\nconverges to a global optimum faster than SGD.\n\nFigure 3: Results on the MNIST deep autoencod-\ning task. Multiple saddle points are present in\nthe optimization problem. Stochastic cubic reg-\nularization is able to escape them most quickly,\nallowing it to reach a local minimum faster than\nother methods.\n\nWe observe that stochastic cubic regularization quickly escapes two saddle points and descends toward\na local optimum, while AdaGrad takes two to three times longer to escape each saddle point, and\nSGD is slower still. This demonstrates that incorporating curvature information via Hessian-vector\nproducts can assist in escaping saddle points in practice. The hybrid method consisting of SVRG\ninterleaved with Oja\u2019s algorithm for Hessian descent also improves upon SGD but does not match the\nperformance of our method or AdaGrad.\n\n6 Conclusion\nIn this paper, we presented a stochastic algorithm based on the classic cubic-regularized Newton\nmethod for nonconvex optimization. Our algorithm provably \ufb01nds \u270f-approximate local minima in\n\u02dcO(\u270f3.5) total gradient and Hessian-vector product evaluations, improving upon the \u02dcO(\u270f4poly(d))\nrate of SGD. Our experiments show the favorable performance of our method relative to SGD on\nboth a synthetic and a real-world problem.\n\nReferences\nMartin Abadi, Paul Barham, et al. Tensor\ufb02ow: A system for large-scale machine learning. In 12th\n\nUSENIX Symposium on Operating Systems Design and Implementation, pages 265\u2013283, 2016.\n\nNaman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate\nlocal minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT\nSymposium on Theory of Computing, 2017.\n\nZeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD.\n\narXiv:1708.08694, 2017.\n\narXiv preprint\n\nZeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via \ufb01rst-order oracles. arXiv\n\npreprint arXiv:1711.06673, 2017.\n\nYair Carmon and John Duchi. Gradient descent ef\ufb01ciently \ufb01nds the cubic-regularized non-convex\n\nNewton step. arXiv preprint arXiv:1612.00547, 2016.\n\nYair Carmon, John Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex\n\noptimization. arXiv preprint arXiv:1611.00756, 2016.\n\nCoralia Cartis, Nicholas Gould, and Philippe Toint. Adaptive cubic regularisation methods for\nunconstrained optimization. Part II: worst-case function-and derivative-evaluation complexity.\nMathematical Programming, 130(2):295\u2013319, 2011.\n\n9\n\n\fAnna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The\nloss surfaces of multilayer networks. In Proceedings of the Eighteenth International Conference\non Arti\ufb01cial Intelligence and Statistics, 2015.\n\nFrank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a\nworst-case iteration complexity of O(\u270f3/2) for nonconvex optimization. Mathematical Program-\nming, 162(1-2):1\u201332, 2017.\nJohn C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning\n\nand stochastic optimization. Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n\nRong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points \u2014 online stochastic\ngradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory,\n2015.\n\nRong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A\nuni\ufb01ed geometric analysis. In Proceedings of the 34th International Conference on Machine\nLearning, 2017.\n\nPrateek Jain, Chi Jin, Sham M Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming PCA:\nMatching matrix Bernstein and near-optimal \ufb01nite sample guarantees for Oja\u2019s algorithm. In\nConference on Learning Theory, pages 1147\u20131164, 2016.\n\nChi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape\nsaddle points ef\ufb01ciently. In Proceedings of the 34th International Conference on Machine Learning,\n2017.\n\nRie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance\n\nreduction. In Advances in Neural Information Processing Systems, 2013.\n\nJonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex opti-\n\nmization. In Proceedings of the 34th International Conference on Machine Learning, 2017.\n\nYann LeCun and Corinna Cortes. MNIST Handwritten Digit Database, 2010. URL http://yann.\n\nlecun.com/exdb/mnist/.\n\nYurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer\n\nScience & Business Media, 2013.\n\nYurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance.\n\nMathematical Programming, 108(1):177\u2013205, 2006.\n\nBarak A Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147\u2013160,\n\n1994.\n\nSashank J Reddi, Manzil Zaheer, et al. A generic approach for escaping saddle points. arXiv preprint\n\narXiv:1709.01434, 2017.\n\nHerbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical\n\nStatistics, pages 400\u2013407, 1951.\n\nCl\u00e9ment W Royer and Stephen J Wright. Complexity analysis of second-order line-search algorithms\n\nfor smooth nonconvex optimization. arXiv preprint arXiv:1706.03131, 2017.\n\nJu Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere I: Overview and\n\nthe geometric picture. IEEE Transactions on Information Theory, 2016a.\n\nJu Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. In IEEE International\n\nSymposium on Information Theory. IEEE, 2016b.\n\nin Machine Learning, 8(1-2):1\u2013230, 2015.\n\nJoel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends R\nPeng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Newton-type methods for non-convex\n\noptimization under inexact Hessian information. arXiv preprint arXiv:1708.07164, 2017.\n\nYi Xu and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost\n\nlinear time. arXiv preprint arXiv:1711.01944, 2017.\n\n10\n\n\f", "award": [], "sourceid": 1520, "authors": [{"given_name": "Nilesh", "family_name": "Tripuraneni", "institution": "UC Berkeley"}, {"given_name": "Mitchell", "family_name": "Stern", "institution": "UC Berkeley"}, {"given_name": "Chi", "family_name": "Jin", "institution": "University of California, Berkeley"}, {"given_name": "Jeffrey", "family_name": "Regier", "institution": "UC Berkeley"}, {"given_name": "Michael", "family_name": "Jordan", "institution": "UC Berkeley"}]}