Higher-order critical points: definitions and curiosities

Categories: optimization, higher-order derivatives, critical points, counterexamples
Everyone knows what a 1st or 2nd order critical point is. But how do we define higher-order criticality? In this post we consider several natural definitions of higher-order critical points, and show that, surprisingly, these definitions are all different. Our goal is to point out some facts all optimizers should know (and which at times might be hard to find in a single place), and help avoid some common misconceptions.
Author

Christopher Criscitiello (christopher.criscitiello@epfl.ch)

Published

September 20, 2024

Modified

October 3, 2024

\[ \newcommand{\reals}{{\mathbb{R}}} \newcommand{\D}{\mathrm{D}} \]

Intro

We want to minimize a function \(f \colon \mathbb{R}^d \to \mathbb{R}\), which we assume is \(C^\infty\) for simplicity: \[\min_{x \in \mathbb{R}^d} f(x).\]

In general finding a global minimum of \(f\) can be very difficult, so we often settle for finding a local minimum, i.e., a point \(x\) such that \(f(y) \geq f(x)\) for all \(y\) in a neighborhood of \(x\).

If \(x\) is a local minimum of \(f\), then it must satisfy the following two properties:

  1. \(\nabla f(x) = 0\), where \(\nabla f(x) \in \mathbb{R}^d\) is the gradient of \(f\) at \(x\);

  2. \(\nabla^2 f(x) \succeq 0\), where \(\nabla^2 f(x) \in \mathbb{R}^{d \times d}\) is the Hessian of \(f\) at \(x\).

If item (1) is satisfied, \(x\) is said to be 1-critical for \(f\); if items (1) and (2) are both satisfied, then \(x\) is said to be 2-critical for \(f\). In general, neither of these conditions suffices to guarantee that \(x\) is a local minimum of \(f\): they are necessary but not sufficient.
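For instance, these two conditions can be checked directly with a computer algebra system. A minimal sympy sketch (the example function and the candidate point are arbitrary choices for illustration):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
f = x1**2 + x2**4          # an arbitrary example; (0, 0) is in fact a global minimum
point = {x1: 0, x2: 0}

grad = sp.Matrix([sp.diff(f, v) for v in (x1, x2)]).subs(point)   # gradient at the point
hess = sp.hessian(f, (x1, x2)).subs(point)                        # Hessian at the point

is_1_critical = grad == sp.zeros(2, 1)                            # condition (1): gradient vanishes
is_2_critical = is_1_critical and min(hess.eigenvals()) >= 0      # condition (2): Hessian PSD
print(is_1_critical, is_2_critical)
```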

Thus, it is natural to seek stronger additional conditions a local minimum must satisfy. The most natural ones are higher-order generalizations of \(1\)- and \(2\)-criticality, which we call \(p\)-criticality (for \(p\) a positive integer) — this is the subject of this blog post. But before we dive into this, let us briefly review Taylor’s theorem (see also this informative MSE post).

Quick review of Taylor’s theorem

The \(C^\infty\) function \(f \colon \reals^d \to \reals\) has the Taylor expansion (around \(x\)) \[f(x + v) \sim \sum_{n=0}^\infty \frac{1}{n!} \D^n f(x)[v, \ldots, v] \quad \quad \text{ for $v \in \reals^d$}.\]

Remark

We have the following classical relationships between \(\D^n\) and the gradient and Hessian:

  • \(\D f(x)[v] = \langle \nabla f(x), v \rangle\)

  • \(\D^2 f(x)[v,v] = \langle v, \nabla^2 f(x)[v]\rangle\)

The \(p\)-th order Taylor approximation \(f_p\) to \(f\) is the degree-\(p\) polynomial \[f_p(x + v) := \sum_{n=0}^p \frac{1}{n!} \D^n f(x)[v, \ldots, v].\] In the case \(d=1\), this simplifies to the familiar formula \[f_p(x + t) := \sum_{n=0}^p \frac{f^{(n)}(x)}{n!} t^n = f(x) + f'(x) t + \frac{f''(x)}{2} t^2 + \frac{f'''(x)}{6} t^3 + \ldots + \frac{f^{(p)}(x)}{p!} t^p.\]

Taylor’s theorem says that for \(v\) small enough, \(f\) is close to \(f_p\):

Taylor’s theorem

Let \(f\) be \(C^{\infty}\). For every \(\epsilon > 0\), there exists \(C > 0\) such that \[|f(x + v) - f_p(x+v)| \leq C \|v\|^{p+1} \quad \quad \text{for all $v \in \reals^d$ such that $\|v\| \leq \epsilon$.}\]
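Before turning to the proof, here is a small sympy sketch illustrating the statement (the function, the expansion point and the order are arbitrary choices): it builds \(f_p\) and checks numerically that the remainder is \(O(|t|^{p+1})\), i.e., that the ratio below stays bounded as \(t \to 0\).

```python
import sympy as sp

x, t = sp.symbols('x t', real=True)
f = sp.exp(x) * sp.cos(x)   # an arbitrary smooth example
x0, p = 0, 3                # expansion point and order (arbitrary choices)

# p-th order Taylor approximation: f_p(x0 + t) = sum_{n=0}^p f^(n)(x0)/n! * t^n
f_p = sum(sp.diff(f, x, n).subs(x, x0) / sp.factorial(n) * t**n for n in range(p + 1))
print("f_p(x0 + t) =", sp.expand(f_p))

# Taylor's theorem: |f(x0 + t) - f_p(x0 + t)| <= C |t|^(p+1) for |t| <= epsilon.
# The ratio below stays bounded (it tends to |f^(p+1)(x0)| / (p+1)! as t -> 0).
for h in [0.1, 0.01, 0.001]:
    remainder = abs(float(f.subs(x, x0 + h) - f_p.subs(t, h)))
    print(f"t = {h}:  remainder / t^(p+1) = {remainder / h ** (p + 1):.4f}")
```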

Although very classical, let’s recall the proof of Taylor’s theorem; after all the main idea is very simple.

Proof of Taylor’s theorem

We will prove the following formula by induction on \(p \geq 0\): \[ \begin{align}\label{eq1}\tag{1} f(x + v) - f_p(x + v) = \int_0^1 \int_0^{t_1} \cdots \int_0^{t_{p}} \D^{p+1} f(x + t_{p+1} v)[v, \ldots, v] d t_{p+1} d t_p \cdots d t_1. \end{align} \]

But for the moment, take this formula for granted. Let \[c = \sup_{u, w \in \reals^d : \|u\|=1, \|w\| \leq \epsilon} |\D^{p+1} f(x + w)[u, \ldots, u]|.\] By continuity of \(\D^{p+1} f\) and compactness, this \(\sup\) is actually a \(\max\), and the maximum is finite. Therefore, we have \[ \begin{align} |f(x + v) - f_p(x + v)| &\leq \|v\|^{p+1} \cdot \int_0^1 \int_0^{t_1} \cdots \int_0^{t_{p}} \Bigg|\D^{p+1} f(x + t_{p+1} v)\bigg[\frac{v}{\|v\|}, \ldots, \frac{v}{\|v\|}\bigg]\Bigg| d t_{p+1} d t_p \cdots d t_1 \\ &\leq \frac{c}{(p+1)!} \|v\|^{p+1}. \end{align} \]


Now, let’s prove equation \(\eqref{eq1}\) by induction.

Base case \(p=0\): \[ \begin{align} f(x + v) - f(x) &= \int_0^1 \frac{d}{dt}[f(x + t v)] dt = \int_0^1 \D f(x + t v)[v] dt. \end{align} \]

Inductive step: Starting with equation \(\eqref{eq1}\) and subtracting \(\frac{1}{(p+1)!} \D^{p+1} f(x)[v, \ldots, v]\) from both sides, we have \[ \begin{align} f(x + v) - f_{p+1}(x + v) &= \int_0^1 \int_0^{t_1} \cdots \int_0^{t_{p}} \D^{p+1} f(x + t_{p+1} v)[v, \ldots, v] - \D^{p+1} f(x)[v, \ldots, v] d t_{p+1} d t_p \cdots d t_1 \\ &= \int_0^1 \int_0^{t_1} \cdots \int_0^{t_{p}} \int_0^{t_{p+1}} \frac{d}{dt_{p+2}}[\D^{p+1} f(x + t_{p+2} v)[v, \ldots, v]] d t_{p+2} d t_{p+1} \cdots d t_1 \\ &= \int_0^1 \int_0^{t_1} \cdots \int_0^{t_{p}} \int_0^{t_{p+1}} \D^{p+2} f(x + t_{p+2} v)[v, \ldots, v] d t_{p+2} d t_{p+1} \cdots d t_1 \end{align}\] which concludes the inductive step, and proves equation \(\eqref{eq1}\).

\(p\)-criticality for 1D functions

Assume \(d=1\), so that \(f \colon \reals \to \reals\). Let \(q\) be the smallest positive integer such that \(f^{(q)}(x) \neq 0\) (with \(q = \infty\) if such an integer does not exist). By Taylor’s theorem, \(f(x+t) = f(x) + \frac{f^{(q)}(x)}{q!} t^q + O(t^{q+1})\), provided \(q < \infty\).

Equivalent definitions of \(p\)-critical point for 1D functions

\(x \in \reals\) is \(p\)-critical for \(f \colon \reals \to \reals\) if any of the following hold:

  • [through neighborhoods] There is a \(C > 0\) such that \(f(x+t) \geq f(x) - C |t|^{p+1}\) for all \(t\) small enough.

  • [through derivatives] Either (i) \(q > p\) (i.e., \(f_p\) is constant) or (ii) \(q \leq p\) and \(q\) is even and \(f^{(q)}(x) > 0\).

  • [through series] \(f_p\) has a local minimum at \(x\): \(f_p(x + t) \geq f_p(x) = f(x)\) for all \(t\) small enough.

It is a straightforward exercise to show that these three definitions are equivalent (try it!). Note that the definition through derivatives gives us a way to determine whether \(x\) is \(p\)-critical from information about the derivatives of \(f\); this can be quite useful. Moreover, note that if case (ii) holds, then in fact \(x\) is also a (strict) local minimum. A few examples may help:

  • If \(p=1\) or \(2\), this matches the notion of \(1\)- and \(2\)-criticality defined in the Intro.

  • \(x\) is \(3\)-critical for \(f\) if \(f'(x) = 0\) and either (i) \(f''(x) = f'''(x) = 0\) or (ii) \(f''(x) > 0\).
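The definition through derivatives is also easy to turn into a small computer check. A sketch with sympy (the helper name is ours, just for illustration):

```python
import sympy as sp

def is_p_critical_1d(f, x, x0, p):
    """1D test through derivatives: let q be the smallest positive integer with
    f^(q)(x0) != 0; then x0 is p-critical iff q > p, or q <= p with q even and f^(q)(x0) > 0."""
    for q in range(1, p + 1):
        dq = sp.diff(f, x, q).subs(x, x0)
        if dq != 0:
            return q % 2 == 0 and dq > 0
    return True   # all derivatives up to order p vanish, i.e., q > p

x = sp.symbols('x', real=True)
print(is_p_critical_1d(x**4, x, 0, 5))   # True:  q = 4 is even and f''''(0) = 24 > 0
print(is_p_critical_1d(x**3, x, 0, 2))   # True:  q = 3 > p = 2
print(is_p_critical_1d(x**3, x, 0, 3))   # False: q = 3 <= p and q is odd
```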

From the definition of \(p\)-criticality through neighborhoods, it is immediate that if \(x\) is a local minimum of \(f \colon \reals \to \reals\), then \(x\) is \(p\)-critical for all \(p \geq 1\). It turns out the converse is NOT true in general.

Counterexample: When does \(p\)-critical imply local minimum?

Consider the “infinitely flat” 1D function \[f(x) = x e^{-1/x^2} \text{ for } x \neq 0, \qquad f(0) = 0.\] Exercise: Show that \(f\) is \(C^\infty\), \(0\) is \(p\)-critical for all \(p \geq 1\), and yet \(0\) is not a local minimum of \(f\).

Figure 1: Plot of \(f(x) = x e^{-1/x^2}\).
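A quick numerical sanity check of this exercise (a sketch, not a proof; the sample points and the choice \(p = 10\) are arbitrary):

```python
import math

def f(t):
    # f(t) = t * exp(-1/t^2) for t != 0, with f(0) := 0.
    return 0.0 if t == 0 else t * math.exp(-1.0 / t**2)

# Not a local minimum at 0: f is negative at points arbitrarily close to 0.
print([f(t) for t in (-0.5, -0.3, -0.2, -0.1)])

# Yet 0 is p-critical for every p >= 1: |f(t)| / |t|^(p+1) -> 0 as t -> 0,
# so f(t) >= f(0) - C |t|^(p+1) holds for t small enough (here p = 10).
p = 10
for t in (0.5, 0.3, 0.2, 0.1):
    print(t, abs(f(t)) / abs(t) ** (p + 1))
```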

However, if we additionally know that \(f\) is real-analytic, then the converse is true. The following proposition is an easy consequence of our characterization of 1D \(p\)-criticality through derivatives.

Proposition: When does \(p\)-critical imply local minimum in 1D?
  • If \(f \colon \reals \to \reals\) is real-analytic and \(x\) is \(p\)-critical for all \(p \geq 1\), then \(x\) is a local minimum of \(f\).

  • If \(f \colon \reals \to \reals\) is a degree-\(k\) polynomial and \(x\) is \(k\)-critical for \(f\), then \(x\) is a local minimum of \(f\).

\(p\)-criticality for arbitrary dimension

Now let us move on to arbitrary dimension \(d\). We may consider the following definitions for \(x\) being \(p\)-critical for \(f\). We shall see that, in general, none of these definitions are equivalent.

Definitions of \(p\)-critical point

\(x \in \reals^d\) is a \(p\)-critical point of \(f\) through “straight lines”, “curves”, or “neighborhoods” if, respectively:

  (a) [through straight lines] For all \(v \in \mathbb{R}^d\), the 1D function \(t \mapsto f(x + t v)\) has a \(p\)-th order critical point at \(t=0\).

  (b) [through curves] For all \(C^p\) curves \(\gamma\) with \(\gamma(0) = x\) and \(\gamma'(0) \neq 0\), the 1D function \(t \mapsto f(\gamma(t))\) has a \(p\)-th order critical point at \(t=0\).

  (c) [through neighborhoods] There are \(\epsilon > 0\) and \(C > 0\) such that \(f(x + v) \geq f(x) - C \|v\|^{p+1}\) for all \(v \in \reals^d\) such that \(\|v\| < \epsilon\). (This is definition 1 in this paper of Anandkumar and Ge.)

Some observations:

  • Definitions (a), (b) and (c) are equivalent for \(p=1, 2, 3\). See the next section for a proof for the case \(p=3\) (the cases \(p=1,2\) are left as an exercise).

  • Surprisingly, for \(p \geq 4\), these definitions are not equivalent, and in fact (c) implies (b) implies (a). However, the converses are not true even for polynomial functions. In the “Counterexamples” section below, we give counterexamples to this effect.

  • From the definition through neighborhoods and the implications (c) implies (b) implies (a), it is immediate that if \(x\) is a local minimum of \(f\), then \(x\) is \(p\)-critical for all \(p\) (for all three definitions). We already saw that the converse is not true in general, even for \(d=1\). What if we restrict \(f\) to be real-analytic or even polynomial? For an answer, see the penultimate section of this blog.

Proof that (c) implies (b) implies (a)

It is immediate that (b) implies (a). That (c) implies (b) is also straightforward. We assume \(f(x+v) \geq f(x) - C \|v\|^{p+1}\) for \(\|v\|\) small enough. Hence, for \(t\) small enough, we have \[\begin{align}\label{eq2}\tag{2} f(\gamma(t)) \geq f(x) - C \|\gamma(t) - x\|^{p+1}. \end{align}\] Taylor’s theorem tells us that \(\gamma(t) = \gamma(0) + \gamma'(0) t + o(t)\), and so \(\|\gamma(t) - x\|^{p+1} = \|\gamma'(0)\|^{p+1} \cdot |t|^{p+1} + o(|t|^{p+1})\). Plugging this into equation \(\eqref{eq2}\), we conclude \(f(\gamma(t)) \geq f(x) - C' |t|^{p+1}\) for \(t\) small enough (which is 1D \(p\)-criticality through neighborhoods).

Warning

A common mistake (one I’ve made before myself and which spurred this blog post!) is to assume that some or all of definitions (a), (b), (c) are equivalent.

Based on the equivalent 1D definitions of \(p\)-criticality, we might also suggest a fourth definition for arbitrary dimension \(d\):

  (d) [through series] \(x\) is \(p\)-critical through series if \(f_p\) has a local minimum at \(x\).

However, this is an unsatisfying definition because it is NOT true that if \(x\) is a local minimum then \(x\) is \(p\)-critical through series, as the next counterexample indicates.

Counterexample: local minimum does not imply \(p\)-critical through series, for \(d\geq 2\)

Consider the 2D cubic polynomial \[ \begin{align}\label{counterexample}\tag{3} f(x,y) = y^2 - x^2 y - x y^2. \end{align} \] One checks \((0,0)\) is a \(3\)-critical point through neighborhoods (using tools from the next section). Yet, \((0,0)\) is not a local minimum of \(f\) (Exercise: why?).

So we know that there exists \(C > 0\) and \(\epsilon > 0\) such that \[g(x,y) := f(x,y) + C\|(x,y)\|^4 \geq 0\] for all \((x,y)\) with \(\|(x,y)\|\leq \epsilon\). In particular, the smooth function \(g\) has a local minimum at \((0,0)\). On the other hand, \(g_3 = f\) does not have a local minimum at \((0,0)\), and so \((0,0)\) is not \(3\)-critical through series for \(g\).

As an added bonus, we also see that if \(x\) is a \(3\)-critical point of a cubic polynomial (in this case \(f\)), it is NOT necessarily a local minimum (contrast this with the 1D case!).
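Here is a small sketch of both claims, combining a symbolic computation along one particular curve (which also answers the exercise above) with a non-rigorous grid check of the neighborhood bound; the curve \((t, t^2/2)\), the constant \(C = 1\) and the grid are our choices for illustration:

```python
import numpy as np
import sympy as sp

x, y, t = sp.symbols('x y t', real=True)
f = y**2 - x**2 * y - x * y**2

# Not a local minimum: along the curve (t, t^2/2), the leading term is -t^4/4 < 0.
print(sp.expand(f.subs({x: t, y: t**2 / 2})))

# Numerical evidence for 3-criticality through neighborhoods:
# f(x, y) >= -C (x^2 + y^2)^2 near the origin, here with C = 1 on a small grid.
s = np.linspace(-0.3, 0.3, 201)
X, Y = np.meshgrid(s, s)
print(np.all(Y**2 - X**2 * Y - X * Y**2 >= -(X**2 + Y**2)**2))
```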

Additional remarks
  • In Chapter 12 of their book, Cartis, Gould and Toint consider two definitions of \(p\)-critical points: “old” and “new”. The “old” definition is based on curves (like definition (b) above). The “new” definition is based on multivariate Taylor expansion.

  • The terminology “through straight lines, curves, neighborhoods” is nonstandard (we made it up). To be clear, we are not claiming that any of the material here is new (although if I do not attribute a source, then it means I have not seen it before). If any readers have useful suggestions about references, please let me know!

Third-order critical points (\(p=3\))

Let’s show that definitions (a), (b), (c) are equivalent when \(p=3\). This is not true for \(p \geq 4\) — see the subsequent section.

Proposition 1

Definitions (a), (b), (c) [through straight lines, through curves, through neighborhoods] are equivalent for \(p=3\).

Proof of Proposition 1

We just need to show (a) implies (c). WLOG, assume \(0\) is 3-critical for \(f\) through straight lines, and \(f(0) = 0\). We want to show \(0\) is 3-critical for \(f\) through neighborhoods. We know that

  • \(\nabla f(0) = 0\) and \(\nabla^2 f(0) \succeq 0\) (since 3-criticality through lines implies 2-criticality),

  • and \(\D^3 f(0)[u, u, u]=0\) for all \(u \in \mathrm{ker}[\nabla^2 f(0)]\) (why?), where \(\mathrm{ker}[\nabla^2 f(0)]\) denotes the kernel of the Hessian.

Consider \(f_3\), the third-order Taylor approximation of \(f\): \[f_3(v) = \frac{1}{2} \langle v, \nabla^2 f(0)[v] \rangle + \frac{1}{6} \D^3 f(0)[v,v,v].\]
It is enough to show that \(0\) is 3-critical for \(f_3\) through neighborhoods, since \(|f(v) - f_3(v)| = O(\|v\|^4)\) by Taylor’s theorem.

Let \(v \in \reals^d\). Let \(u\) be the orthogonal projection of \(v\) onto \(\mathrm{ker}[\nabla^2 f(0)]\). Defining \(w = v - u\), we then have:

  • \(u\) and \(w\) are orthogonal, and \(\|v\|^2 = \|u\|^2 + \|w\|^2\);

  • \(\nabla^2 f(0) [w]\) is orthogonal to \(\mathrm{ker}[\nabla^2 f(0)]\);

  • \(\nabla^2 f(0)[w,w] \geq \lambda \|w\|^2,\) where \(\lambda > 0\) is the smallest nonzero eigenvalue of \(\nabla^2 f(0)\). (If \(\lambda\) doesn’t exist, then \(f_3 \equiv 0\) and we are done.)

Let \(\rho = \max_{u, w, s \in S^{d-1}} |\D^3 f(0)[u,w,s]|\), where \(S^{d-1}\) is the unit sphere in \(\reals^d\). Using \(\D^3 f(0)[u, u, u]=0\), we have \[ \begin{align} f_3(v) &= \frac{1}{2} \nabla^2 f(0)[w,w] + \frac{1}{6} \D^3 f(0)[u+w, u+w, u+w] \\ &\geq \frac{\lambda}{2} \|w\|^2 - \frac{\rho}{6} \|w\|^3 - \frac{\rho}{2} \|u\|^2 \cdot \|w\| - \frac{\rho}{2} \|u\| \cdot \|w\|^2. \end{align} \] Hence, if \(\|v\| \leq \lambda / \rho\), then \(\|w\| \leq \lambda / \rho\), and so \[f_3(v) \geq \frac{\lambda}{4} \Big[\|w\|^2 - \frac{2 \rho}{\lambda} \|u\|^2 \cdot \|w\| - \frac{2 \rho}{\lambda} \|u\| \cdot \|w\|^2\Big].\] Using the inequality \(y^2 - a x^2 y - a x y^2 + a^2 (x^2+y^2)^2 \geq 0\) for all \(x, y \geq 0\) and \(a \geq 0\) (a short proof is given below), applied with \(x = \|u\|\), \(y = \|w\|\) and \(a = \frac{2\rho}{\lambda}\), we conclude \(f_3(v) \geq - \frac{\lambda}{4}\big(\frac{2 \rho}{\lambda}\big)^2 \|v\|^4\) for all \(\|v\| \leq \lambda / \rho\).
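For completeness, here is one way to see the omitted inequality in the case \(x, y \geq 0\) and \(a \geq 0\), which is all that is needed above. Write \(r^2 = x^2 + y^2\), so that \(x \leq r\) and \(x + y \leq 2r\). Then \[a x^2 y + a x y^2 = a\, x y (x + y) \leq 2 a r^2 y \leq y^2 + a^2 r^4,\] where the last step is the AM-GM inequality \(2uv \leq u^2 + v^2\) with \(u = y\) and \(v = a r^2\). Rearranging gives \(y^2 - a x^2 y - a x y^2 + a^2 (x^2 + y^2)^2 \geq 0\).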

This result about \(3\)-criticality is well-known — see for example Section 3 in this paper of Anandkumar and Ge. In a similar spirit, Ahmadi and Zhang give a closely related characterization of local minima of cubic polynomials, e.g., see slide 9 here.

Counterexamples

Counterexample: (a) does NOT imply (b)

Consider the 2D polynomial \[f(x,y) = (x^2 + y^2 - 6 x) (y^2 - 8 x).\] This example is due to this MSE post. You can show that for every \(v \in \reals^2\), the 1D function \(t \mapsto f(t v)\) has a local minimum at \(t=0\) — see the MSE post for details. In particular, we conclude that, for all \(p \geq 1\), \((0,0)\) is a \(p\)-critical point through straight lines for \(f\).

However, \((0,0)\) is NOT \(4\)-critical through curves — consider the curve \(t \mapsto (t^2/7, t)\). So we see that for \(p\geq 4\), being \(p\)-critical through lines and curves is not the same.
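Both claims are easy to check symbolically; a quick sympy sketch (the symbolic direction \((a, b)\) and the curve are exactly those from the discussion above):

```python
import sympy as sp

x, y, t, a, b = sp.symbols('x y t a b', real=True)
f = (x**2 + y**2 - 6*x) * (y**2 - 8*x)

# Along any straight line t -> (a*t, b*t): the lowest-order term is 48*a^2*t^2
# (and b^4*t^4 when a = 0), so t = 0 is a local minimum of every restriction to a line.
print(sp.expand(f.subs({x: a*t, y: b*t})))

# Along the curve t -> (t^2/7, t): the lowest-order term is -t^4/49 < 0,
# so (0,0) is not 4-critical through curves.
print(sp.expand(f.subs({x: t**2 / 7, y: t})))
```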

Counterexample: (b) does NOT imply (c)

Consider the 2D polynomial \[f(x,y) = (y^2-x^3)(y^2-4x^3).\] This example is due to this book of Udriste: see Theorem 8.8 in Chapter 1.8 and the subsequent notes.

It’s an exercise to show that for every \(C^2\) curve \(\gamma\) with \(\gamma(0) = 0\) and \(\gamma'(0) \neq 0\), the 1D function \(t \mapsto f(\gamma(t))\) has a local minimum at \(t=0\), yet \((0,0)\) is not a local minimum of \(f\). The intuition is that in order for \(f(\gamma(t)) < f(0,0) = 0\), \(\gamma\) must be sandwiched between the curves \(y^2 = x^3\) and \(y^2 = 4 x^3\) (see Figure 2), i.e., \(\gamma\) should be something like \(\gamma(t) = (t, \sqrt{2 t^3})\). But it’s impossible for this to happen while \(\gamma\) is \(C^2\) and satisfies \(\gamma'(0) \neq 0\).

We conclude in particular that for all \(p \geq 1\), \((0,0)\) is a \(p\)-critical point of \(f\) through curves. On the other hand, \((0,0)\) is NOT \(6\)-critical through neighborhoods for \(f\) — consider \(f\) along the curve \(t \mapsto (t, \sqrt{2 t^3})\). So being \(p\)-critical through curves and through neighborhoods is not the same (at least for \(p \geq 6\)).
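A quick sympy check of the computation along this curve (restricted to \(t > 0\) so the square root is real):

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
t = sp.symbols('t', positive=True)   # t > 0 so that sqrt(2 t^3) is real
f = (y**2 - x**3) * (y**2 - 4*x**3)

# Along the curve t -> (t, sqrt(2 t^3)): f = (2t^3 - t^3)(2t^3 - 4t^3) = -2 t^6.
print(sp.expand(f.subs({x: t, y: sp.sqrt(2 * t**3)})))

# Since ||(t, sqrt(2 t^3))|| ~ t as t -> 0, a bound f >= f(0,0) - C ||v||^7 would
# require -2 t^6 >= -C t^7 up to lower-order terms, which fails for small t.
# Hence (0,0) is not 6-critical through neighborhoods.
```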

Remark: This example also shows that to prove that a function has a local minimum, it is not enough to just check it has a local minimum when restricted to all \(C^2\) curves with nonzero velocity (let alone straight lines!). On the other hand, it is enough to check this is true for all \(C^1\) curves — see Theorem 8.8 of Udriste and also this paper of Udriste and coauthors.

Remark: One may expect that (b) does not imply (c) for bizarre non-analytic functions. However, it is particularly striking to me that the counterexample here is a low-degree polynomial.

Figure 2: Plot of the sublevel set \(\{(x,y) : (y^2-x^3)(y^2-4x^3) \leq 0 \}\) from the counterexample “(b) does NOT imply (c)”.

What is the “right” definition of \(p\)-criticality?

In the previous counterexamples, we saw that there exist polynomials \(f\) for which \(0\) is \(p\)-critical through straight lines (or even through curves) for all \(p \geq 1\), and yet \(0\) is not a local minimum of \(f\). Do such examples exist for \(p\)-criticality through neighborhoods? It turns out the answer is no!

Proposition 2
  • If \(f \colon \reals^d \to \reals\) is real-analytic and \(0\) is \(p\)-critical through neighborhoods for all \(p \geq 1\), then \(0\) is a local minimum of \(f\).

  • For each \(d \geq 1\) and \(k \geq 1\), there is a (finite) positive integer \(p(k,d)\) such that: if \(f \colon \reals^d \to \reals\) is a degree-\(k\) polynomial and \(0\) is \(p(k,d)\)-critical through neighborhoods, then \(0\) is a local minimum of \(f\).

For a proof of this proposition, you will have to wait for the next blog post! Note that in general \(p(k, d) > k\), as exhibited by counterexample \(\eqref{counterexample}\) — contrast this with the 1D case where \(p(k, 1) = k\).

So among the three definitions of \(p\)-criticality (through straight lines, curves, neighborhoods), which one deserves the title of \(p\)-criticality (no “through …” qualifier)? Given Proposition 2 and the polynomial counterexamples, I argue that it is \(p\)-criticality through neighborhoods (and indeed this seems to be somewhat reflected in the literature). However, since \(p\)-criticality through neighborhoods is the strongest definition, it may be hard to find such points.

On the complexity of finding \(p\)-critical points

Anandkumar and Ge show that it is possible to efficiently find points which are \(3\)-critical through neighborhoods. In a related line of work, Ahmadi and Zhang investigate the complexity of finding local minima for polynomials (especially for cubic polynomials): see these slides and this paper.

On the other hand, Anandkumar and Ge show that even finding points which are \(4\)-critical through neighborhoods is NP-hard. This suggests that optimization algorithms often have a hard time avoiding higher-order (a.k.a. flat) critical points. Therefore, often one tries to show a function under consideration does not have higher-order critical points: this leads us to the notion of Morse/Morse-Bott/Morse-Kirwan functions. Much can be said about these (and they are very useful) — perhaps the subject of future blog posts!


Acknowledgments: We thank Nicolas Boumal for several helpful suggestions and improvements.