
Gradient flows and the Łojasiewicz inequality, with an application to higher-order criticality

Categories: optimization, gradient flows, Łojasiewicz inequality, higher-order derivatives, critical points

We show that every saddle point of a real-analytic function has a gradient flow line converging to it. We use this fact to prove a claim made in our previous blog post on higher-order criticality: if \(x\) is a \(p\)-critical point of a real-analytic function \(f\) for all positive integers \(p\), then \(x\) is a local minimum of \(f\). The key tools used throughout are gradient flows and the Łojasiewicz inequality.
Authors

Christopher Criscitiello (christopher.criscitiello@epfl.ch)

Quentin Rebjock (quentin.rebjock@epfl.ch)

Published

October 3, 2024

\[ \newcommand{\abs}[1]{\left|{#1}\right|} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\bareta}{{\overline{\eta}}} \newcommand{\barf}{{\overline{f}}} \newcommand{\barg}{{\overline{g}}} \newcommand{\barM}{\overline{\mathcal{M}}} \newcommand{\barnabla}{\overline{\nabla}} \newcommand{\barRcurv}{\bar{\mathcal{R}}} \newcommand{\barRmop}{\bar{R}_\mathrm{m}} \newcommand{\barRmoptensor}{\mathbf{{\bar{R}_\mathrm{m}}}} \newcommand{\barx}{{\overline{x}}} \newcommand{\barX}{{\overline{X}}} \newcommand{\barxi}{{\overline{\xi}}} \newcommand{\bary}{{\overline{y}}} \newcommand{\barY}{{\overline{Y}}} \newcommand{\bfeta}{{\bm{\eta}}} \newcommand{\bfF}{\mathbf{F}} \newcommand{\bfOmega}{\boldsymbol{\Omega}} \newcommand{\bfR}{\mathbf{R}} \newcommand{\bftheta}{{\bm{\theta}}} \newcommand{\bfX}{{\mathbf{X}}} \newcommand{\bfxi}{{\bm{\xi}}} \newcommand{\bfY}{{\mathbf{Y}}} \newcommand{\calA}{\mathcal{A}} \newcommand{\calB}{\mathcal{B}} \newcommand{\calP}{\mathcal{P}} \newcommand{\calC}{\mathcal{C}} \newcommand{\calE}{\mathcal{E}} \newcommand{\calF}{\mathcal{F}} \newcommand{\calG}{\mathcal{G}} \newcommand{\calH}{\mathcal{H}} \newcommand{\calJ}{\mathcal{J}} \newcommand{\calK}{\mathcal{K}} \newcommand{\calL}{\mathcal{L}} \newcommand{\ones}{\textbf{1}} \newcommand{\calM}{\mathcal{M}} \newcommand{\calN}{\mathcal{N}} \newcommand{\calO}{\mathcal{O}} \newcommand{\calS}{\mathcal{S}} \newcommand{\calT}{\mathcal{T}} \newcommand{\calU}{\mathcal{U}} \newcommand{\calV}{\mathcal{V}} \newcommand{\calW}{\mathcal{W}} \newcommand{\calX}{\mathcal{X}} \newcommand{\Cm}{{\mathbb{C}^m}} \newcommand{\CN}{{\mathbb{C}^{\,N}}} \newcommand{\Cn}{\mathbb{C}^n} \newcommand{\CNN}{{\mathbb{C}^{\ N\times N}}} \newcommand{\Cnn}{\mathbb{C}^{n\times n}} \newcommand{\CnotR}{{\mathbb{C}\ \backslash\mathbb{R}}} \newcommand{\col}{\operatorname{col}} \newcommand{\im}{\operatorname{im}} \newcommand{\complex}{{\mathbb{C}}} \newcommand{\Cov}{\mathbf{C}} \newcommand{\Covm}{C} \newcommand{\Ric}{\mathrm{Ric}} \newcommand{\Rmcurv}{\mathrm{Rm}} \newcommand{\D}{\mathrm{D}} \newcommand{\dblquote}[1]{``#1''} \newcommand{\dd}[1]{\frac{\mathrm{d}}{\mathrm{d}#1}} \newcommand{\ddiag}{\mathrm{ddiag}} \newcommand{\ddp}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\Ddq}{\frac{\D}{\dq}} \newcommand{\dds}{\frac{\mathrm{d}}{\mathrm{d}s}} \newcommand{\Ddt}{\frac{\D}{\dt}} \newcommand{\ddt}{\frac{\mathrm{d}}{\mathrm{d}t}} \newcommand{\Ddttwo}{\frac{\D^2}{\dt^2}} \newcommand{\defeq}{\triangleq} \newcommand{\Dexp}{\D\!\exp} \newcommand{\error}{\mathrm{error}} \newcommand{\diag}{\mathrm{diag}} \newcommand{\Diag}{\mathrm{Diag}} \newcommand{\Vol}{\mathrm{Vol}} \newcommand{\polylog}{\mathrm{polylog}} \newcommand{\poly}{\mathrm{poly}} \newcommand{\dist}{\mathrm{dist}} \newcommand{\Dlog}{\D\!\log} \newcommand{\dmu}{\mathrm{d}\mu} \newcommand{\dphi}{\mathrm{d}\phi} \newcommand{\dq}{\mathrm{d}q} \newcommand{\ds}{\mathrm{d}s} \newcommand{\dt}{\mathrm{d}t} \newcommand{\dtau}{\mathrm{d}\tau} \newcommand{\dtheta}{\mathrm{d}\theta} \newcommand{\du}{\mathrm{d}u} \newcommand{\dx}{\mathrm{d}x} \newcommand{\Ebad}{E_\textrm{bad}} \newcommand{\Egood}{E_\textrm{good}} \newcommand{\eig}{{\mathrm{eig}}} \newcommand{\ellth}{$\ell$th } \newcommand{\embdist}{\mathrm{embdist}} \newcommand{\ER}{{Erd\H{o}s-R\'enyi }} \newcommand{\Exp}{\mathrm{Exp}} \newcommand{\expect}{\mathbb{E}} \newcommand{\expectt}[1]{\mathbb{E}\left\{{#1}\right\}} \newcommand{\expp}[1]{{\exp\!\big({#1}\big)}} \newcommand{\feq}{{f_\mathrm{eq}}} 
\newcommand{\FIM}{\mathbf{F}} \newcommand{\FIMm}{F} \newcommand{\floor}[1]{\lfloor #1 \rfloor} \newcommand{\flow}{f_{\mathrm{low}}} \newcommand{\FM}{\mathfrak{F}(\M)} \newcommand{\frobnorm}[2][F]{\left\|{#2}\right\|_\mathrm{#1}} \newcommand{\frobnormbig}[1]{\big\|{#1}\big\|_\mathrm{F}} \newcommand{\frobnormsmall}[1]{\|{#1}\|_\mathrm{F}} \newcommand{\GLr}{{\mathbb{R}^{r\times r}_*}} \newcommand{\Gr}{\mathrm{Gr}} \newcommand{\grad}{\mathrm{grad}} \newcommand{\Hess}{\mathrm{Hess}} \newcommand{\HH}{\mathrm{H}} \newcommand{\Id}{\operatorname{Id}} % need to be distinguishable from identity matrix \newcommand{\II}{I\!I} \newcommand{\inj}{\mathrm{inj}} \newcommand{\inner}[2]{\left\langle{#1},{#2}\right\rangle} \newcommand{\innerbig}[2]{\big\langle{#1},{#2}\big\rangle} \newcommand{\innersmall}[2]{\langle{#1},{#2}\rangle} \newcommand{\ith}{$i$th } \newcommand{\jth}{$j$th } \newcommand{\Klow}{{K_{\mathrm{low}}}} \newcommand{\Kmax}{K_{\mathrm{max}}} \newcommand{\kth}{$k$th } \newcommand{\Kup}{{K_{\mathrm{up}}}} \newcommand{\Klo}{{K_{\mathrm{lo}}}} \newcommand{\lambdamax}{\lambda_\mathrm{max}} \newcommand{\lambdamin}{\lambda_\mathrm{min}} \newcommand{\langevin}{\mathrm{Lang}} \newcommand{\langout}{\mathrm{LangUni}} \newcommand{\length}{\operatorname{length}} \newcommand{\Log}{\mathrm{Log}} \newcommand{\longg}{{\mathrm{long}}} \newcommand{\M}{\mathcal{M}} \newcommand{\mle}{{\mathrm{MLE}}} \newcommand{\mlez}{{\mathrm{MLE0}}} \newcommand{\MSE}{\mathrm{MSE}} \newcommand{\N}{\mathrm{N}} \newcommand{\norm}[1]{\left\|{#1}\right\|} \newcommand{\nulll}{\operatorname{null}} \newcommand{\hull}{\mathrm{hull}} \newcommand{\Od}{{\mathrm{O}(d)}} \newcommand{\On}{{\mathrm{O}(n)}} \newcommand{\One}{\mathds{1}} \newcommand{\Op}{{\mathrm{O}(p)}} \newcommand{\opnormsmall}[1]{\|{#1}\|_\mathrm{op}} \newcommand{\p}{\mathcal{P}} \newcommand{\pa}[1]{\left(#1\right)} \newcommand{\paa}[1]{\left\{#1\right\}} \newcommand{\pac}[1]{\left[#1\right]} \newcommand{\pdd}[1]{\frac{\partial}{\partial #1}} \newcommand{\polar}{\mathrm{polar}} \newcommand{\Precon}{\mathrm{Precon}} \newcommand{\proj}{\mathrm{proj}} \newcommand{\Proj}{\mathrm{Proj}} \newcommand{\ProjH}{\mathrm{Proj}^h} \newcommand{\PSD}{\mathrm{PSD}} \newcommand{\ProjV}{\mathrm{Proj}^v} \newcommand{\qf}{\mathrm{qf}} \newcommand{\rank}{\operatorname{rank}} \newcommand{\Rball}{R_{\mathrm{ball}}} \newcommand{\Rcurv}{\mathcal{R}} \newcommand{\Rd}{{\mathbb{R}^{d}}} \newcommand{\Rdd}{{\mathbb{R}^{d\times d}}} \newcommand{\Rdp}{{\mathbb{R}^{d\times p}}} \newcommand{\reals}{{\mathbb{R}}} \newcommand{\Retr}{\mathrm{R}} \newcommand{\Rk}{{\mathbb{R}^{k}}} \newcommand{\Rm}{{\mathbb{R}^m}} \newcommand{\Rmle}{{\hat\bfR_\mle}} \newcommand{\Rmm}{{\mathbb{R}^{m\times m}}} \newcommand{\Rmn}{{\mathbb{R}^{m\times n}}} \newcommand{\Rmnr}{{\mathbb{R}_r^{m\times n}}} \newcommand{\Rmop}{R_\mathrm{m}} \newcommand{\Rmoptensor}{\mathbf{{R_\mathrm{m}}}} \newcommand{\Rmr}{\mathbb{R}^{m\times r}} \newcommand{\RMSE}{\mathrm{RMSE}} \newcommand{\Rn}{{\mathbb{R}^n}} \newcommand{\Rnk}{\mathbb{R}^{n\times k}} \newcommand{\Rnm}{\mathbb{R}^{n\times m}} \newcommand{\Rnn}{{\mathbb{R}^{n\times n}}} \newcommand{\RNN}{{\mathbb{R}^{N\times N}}} \newcommand{\Rnp}{{\mathbb{R}^{n\times p}}} \newcommand{\Rnr}{\mathbb{R}^{n\times r}} \newcommand{\Rp}{{\mathbb{R}^{p}}} \newcommand{\Rpp}{\mathbb{R}^{p\times p}} \newcommand{\Rrn}{{\mathbb{R}^{r\times n}}} \newcommand{\Rrr}{\mathbb{R}^{r\times r}} \newcommand{\Rtt}{{\mathbb{R}^{3\times 3}}} \newcommand{\sbd}[1]{\sbdop\!\left({#1}\right)} 
\newcommand{\sbdop}{\operatorname{symblockdiag}} \newcommand{\sbdsmall}[1]{\sbdop({#1})} \newcommand{\Sd}{\mathbb{S}^{d}} \newcommand{\Sdd}{{\mathbb{S}^{d\times d}}} \newcommand{\short}{{\mathrm{short}}} \newcommand{\sigmamax}{\sigma_{\operatorname{max}}} \newcommand{\sigmamin}{\sigma_{\operatorname{min}}} \newcommand{\sign}{\mathrm{sign}} \newcommand{\ske}{\operatorname{skew}} \newcommand{\Skew}{\operatorname{Skew}} \newcommand{\skeww}[1]{\operatorname{skew}\!\left( #1 \right)} \newcommand{\Sm}{\mathbb{S}^{m-1}} \newcommand{\smallfrobnorm}[1]{\|{#1}\|_\mathrm{F}} \newcommand{\Smdmd}{{\mathbb{S}^{md\times md}}} \newcommand{\Smm}{{\mathbb{S}^{m\times m}}} \newcommand{\Sn}{\mathbb{S}^{n-1}} \newcommand{\Snn}{{\mathbb{S}^{n\times n}}} \newcommand{\So}{{\mathbb{S}^{1}}} \newcommand{\SOd}{\operatorname{SO}(d)} \newcommand{\SOdm}{\operatorname{SO}(d)^m} \newcommand{\son}{{\mathfrak{so}(n)}} \newcommand{\SOn}{{\mathrm{SO}(n)}} \newcommand{\sot}{{\mathfrak{so}(3)}} \newcommand{\SOt}{{\mathrm{SO}(3)}} \newcommand{\sotwo}{{\mathfrak{so}(2)}} \newcommand{\SOtwo}{{\mathrm{SO}(2)}} \newcommand{\Sp}{{\mathbb{S}^{2}}} \newcommand{\spann}{\mathrm{span}} \newcommand{\gspan}{\mathrm{gspan}} \newcommand{\Spp}{{\mathbb{S}^{p\times p}}} \newcommand{\sqfrobnorm}[2][F]{\frobnorm[#1]{#2}^2} \newcommand{\sqfrobnormbig}[1]{\frobnormbig{#1}^2} \newcommand{\sqfrobnormsmall}[1]{\frobnormsmall{#1}^2} \newcommand{\sqnorm}[1]{\left\|{#1}\right\|^2} \newcommand{\Srr}{{\mathbb{S}^{r\times r}}} \newcommand{\St}{\mathrm{St}} \newcommand{\Cent}{\mathrm{Centered}} \newcommand{\Holl}{\mathrm{Hollow}} \newcommand{\Stdpm}{\St(d, p)^m} \newcommand{\Stnp}{{\mathrm{St}(n,p)}} \newcommand{\Stwo}{{\mathbb{S}^{2}}} \newcommand{\symm}[1]{\symmop\!\left( #1 \right)} \newcommand{\symmop}{\operatorname{sym}} \newcommand{\symmsmall}[1]{\symmop( #1 )} \newcommand{\Symn}{\mathrm{Sym}(n)} \newcommand{\Symnminp}{\mathrm{Sym}(n-p)} \newcommand{\Symp}{\mathrm{Sym}(p)} \newcommand{\Sym}{\mathrm{Sym}} \newcommand{\T}{\mathrm{T}} \newcommand{\TODO}[1]{{\color{red}{[#1]}}} \newcommand{\trace}{\mathrm{Tr}} \newcommand{\Trace}{\mathrm{Tr}} \newcommand{\Trans}{\mathrm{Transport}} \newcommand{\Transbis}[2]{\mathrm{Transport}_{{#1}}({#2})} \newcommand{\transpose}{^\top\! } \newcommand{\uniform}{\mathrm{Uni}} \newcommand{\var}{\mathrm{var}} \newcommand{\vecc}{\mathrm{vec}} \newcommand{\veccc}[1]{\vecc\left({#1}\right)} \newcommand{\VV}{\mathrm{V}} \newcommand{\XM}{\mathfrak{X}(\M)} \newcommand{\xorigin}{x_{\mathrm{ref}}} \newcommand{\Zeq}{{Z_\mathrm{eq}}} \newcommand{\aref}[1]{\hyperref[#1]{A\ref{#1}}} \newcommand{\frobnorm}[1]{\left\|{#1}\right\|_\mathrm{F}} \newcommand{\sqfrobnorm}[1]{\frobnorm{#1}^2} \]

Introduction

In our previous blog post, we considered several definitions of \(p\)-criticality (through straight lines, curves, and neighborhoods), and showed that no two of these definitions are equivalent. Let’s recall them here:

Definitions of \(p\)-critical point

The point \(x \in \reals^d\) is \(p\)-critical for \(f\) through “straight lines”, “curves”, or “neighborhoods” if:

  1. [through straight lines] For all \(v \in \mathbb{R}^d\), the 1D function \(t \mapsto f(x + t v)\) has a \(p\)-th order critical point at \(t=0\).

  2. [through curves] For all \(C^p\) curves \(\gamma\) with \(\gamma(0) = x\) and \(\gamma'(0) \neq 0\), the 1D function \(t \mapsto f(\gamma(t))\) has a \(p\)-th order critical point at \(t=0\).

  3. [through neighborhoods] There is an \(\epsilon > 0\) and \(C > 0\) such that \(f(x + v) \geq f(x) - C \|v\|^{p+1}\) for all \(v \in \reals^d\) such that \(\|v\| \leq \epsilon\). (This is definition 1 in this paper of Anandkumar and Ge.)

We concluded that the “right” definition of \(p\)-criticality is through neighborhoods, due to the following two properties (not enjoyed by the other definitions).

Proposition 1

If \(f \colon \reals^d \to \reals\) is real-analytic and \(0\) is \(p\)-critical through neighborhoods for all \(p \geq 1\), then \(0\) is a local minimum of \(f\).

Proposition 2

For each \(d \geq 1\) and \(k \geq 1\), there is a (finite) positive integer \(p(k,d)\) such that: if \(f \colon \reals^d \to \reals\) is a degree-\(k\) polynomial and \(0\) is \(p(k,d)\)-critical through neighborhoods, then \(0\) is a local minimum of \(f\).

Counterexamples: Propositions 1 and 2

For Proposition 1, the real-analytic assumption is necessary, i.e., \(f\) being \(C^\infty\) is not enough — recall the 1D counterexample \(f(x) = x e^{-1/x^2}\) from our previous blog post.

For Proposition 2, it is easy to show that \(p(k, 1) = k\) works. However, for \(d > 1\), \(p(k, d) = k\) is not enough — recall the counterexample \[f(x,y) = y^2 - x^2 y - x y^2\] from our previous blog post. We argued that \((0,0)\) is \(3\)-critical for \(f\), but it is not a local minimum (hence \(p(3, 2) > 3\)). Exercise: Show that \(p(3, 2) = 5\) works.
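As a quick sanity check that \((0,0)\) is not a local minimum of this \(f\): along the curve \(y = x^2\) we get \(f(x, x^2) = x^4 - x^4 - x^5 = -x^5 < 0\) for all \(x > 0\). The short script below (ours, not from the original posts) confirms this numerically.

```python
# Sanity check (not from the post): f(x, y) = y^2 - x^2*y - x*y^2 takes
# negative values arbitrarily close to (0, 0), e.g. along the curve y = x^2.
def f(x, y):
    return y**2 - x**2 * y - x * y**2

for x in [1e-1, 1e-2, 1e-3]:
    print(f"f({x:g}, {x**2:g}) = {f(x, x**2):.3e}")  # equals -x^5 < 0
```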

Propositions 1 and 2 are consequences of the following result.

Proposition 3

Let \(f \colon \reals^d \to \reals\) be real-analytic, with \(f(0) = 0, \nabla f(0) = 0\). If \(0\) is not a local minimum of \(f\), then there exists \(x_0 \in \reals^d \setminus \{0\}\) such that positive gradient flow \(\eqref{posgradflow}\) started from \(x_0\) converges to \(0\).

Proposition 3 is probably known — see these papers of Moussu, Nowel and Szafraniec, and Szafraniec. We present a proof at the end of this post. This proof may be new, and, more importantly, it is self-contained and hopefully easy to parse!

The key tools to prove Propositions 1, 2, 3 are gradient flows and the Łojasiewicz inequality.

Counterexample: Proposition 3

The assumption that \(f\) is real-analytic in Proposition 3 is necessary, i.e., it is not enough for \(f\) to be only \(C^\infty\). Indeed, consider a smooth 1D function which oscillates infinitely often, like the following: \[f(x) = \sin(6\pi/x) e^{-1/x^2}.\] Non-constant gradient flow trajectories cannot converge to the origin because there are \(1\)-critical points arbitrarily close to it (on both sides).

The function above is somewhat pathological as its set of \(1\)-critical points is not a smooth manifold. What if we additionally assume the set of \(1\)-critical points of \(f\) is a smooth manifold? Even this is not enough for Proposition 3 to hold: an example is furnished by the so-called “Mexican hat function” and variants thereof (see pages 4-5 in this paper by Absil, Mahony and Andrews).

Figure 1: Plot of \(f(x) = \sin(6\pi/x) e^{-1/x^2}\).
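For reference, a plot like Figure 1 can be generated with a few lines of matplotlib (a sketch of ours, not the authors' script); the only care needed is to define \(f(0) = 0\) by hand.

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot f(x) = sin(6*pi/x) * exp(-1/x^2), extended by f(0) = 0.
def f(x):
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(x == 0.0, 0.0,
                        np.sin(6 * np.pi / x) * np.exp(-1.0 / x**2))

x = np.linspace(-1.5, 1.5, 4001)
plt.plot(x, f(x))
plt.xlabel("x")
plt.ylabel("f(x)")
plt.title(r"$f(x) = \sin(6\pi/x) e^{-1/x^2}$")
plt.show()
```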

Gradient flows

Let \(f \colon \reals^d \to \reals\) be \(C^\infty\). A positive gradient flow trajectory started at \(x_0 \in \reals^d\) is a solution of the ODE \[ \begin{align}\label{posgradflow}\tag{$+\text{gradflow}$} x'(t) = \nabla f(x(t)), \quad \quad t \in \reals, \quad \quad x(0) = x_0. \end{align} \] Likewise, negative gradient flow is a solution of \[ \begin{align}\label{neggradflow}\tag{$-\text{gradflow}$} x'(t) = -\nabla f(x(t)), \quad \quad t \in \reals, \quad \quad x(0) = x_0. \end{align} \] A gradient flow is always defined on some open interval around \(t = 0\); however, it may not be defined on all of \(\reals\). If \(\nabla f(x_0)\) is nonzero then gradient flow never reaches a \(1\)-critical point in finite time (since the flow map \(\Phi\) defined below is a diffeomorphism in an appropriate sense). Therefore, the quantity \(f(x(t))\) is always increasing for positive gradient flow with \(\nabla f(x_0) \neq 0\). This is because \[\frac{d}{dt}[f(x(t))] = \D f(x(t))[x'(t)] = \|\nabla f(x(t))\|^2 > 0 \quad \quad \text{for all $t$}.\] (And likewise the function value decreases for negative gradient flow.)
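To make this concrete, here is a minimal numerical sketch (ours, using the toy saddle \(f(x,y) = x^2 - y^2\), which is not a function from the post) that integrates the positive gradient flow with scipy and checks that \(f(x(t))\) increases along the trajectory.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy example: the saddle f(x, y) = x^2 - y^2.
def f(x):
    return x[0]**2 - x[1]**2

def grad_f(t, x):
    # Gradient of f; solve_ivp expects the signature fun(t, x).
    return np.array([2.0 * x[0], -2.0 * x[1]])

# Positive gradient flow x'(t) = +grad f(x(t)) started at x0.
x0 = [0.1, 0.3]
sol = solve_ivp(grad_f, t_span=(0.0, 1.0), y0=x0, dense_output=True)

for t in np.linspace(0.0, 1.0, 6):
    print(f"t = {t:.2f}   f(x(t)) = {f(sol.sol(t)):+.5f}")  # increasing in t
```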

We define the positive gradient flow map \(\Phi\) by \[ \begin{align}\label{flowmap}\tag{flowmap} \Phi(x_0, t) = x(t) \end{align} \] where \(x(\cdot)\) is the positive gradient flow starting at \(x_0\). The flow map \(\Phi\) is defined on an open subset of \(\reals^d \times \reals\) (called the maximal flow domain). It is a fact that \(\Phi\) is differentiable on its domain. Further, for a given \(t\), the map \(x \mapsto \Phi(x, t)\) is a diffeomorphism (from the set of points on which it is well defined and onto its image). For more information on vector-field flows more generally, we recommend Chapter 9 “Integral Curves and Flows” of Introduction to Smooth Manifolds by John Lee.

For any \(F \in \reals\), let \(\tau(F, x_0)\) denote the first time \(t \geq 0\) at which the positive gradient flow \(\eqref{posgradflow}\) leaves the sublevel set \[\{x \in \reals^d : f(x) < F\},\] with the convention that \(\tau(F, x_0) = +\infty\) if the flow never leaves the sublevel set in finite time. As a shorthand, we also define \[\tau(x) := \tau(0, x).\] Note that if \(\nabla f(x_0) \neq 0\) and \(\tau(F, x_0)\) is finite, then it is the only time \(t\) for which \(f(x(t)) = F\), since \(f(x(t))\) is increasing. For a given \(F\), we can define a map \(x \mapsto \tau(F, x)\) from \(\reals^d\) to \([0, \infty]\). We have the following useful fact.
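The exit time \(\tau(F, x_0)\) is easy to compute numerically with event detection; here is a sketch (ours), reusing the toy saddle from above.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy saddle again: f(x, y) = x^2 - y^2, positive gradient flow.
def f(x):
    return x[0]**2 - x[1]**2

def grad_f(t, x):
    return np.array([2.0 * x[0], -2.0 * x[1]])

F = 0.0  # with the post's shorthand, tau(0, x0) = tau(x0)
def hits_level(t, x):
    return f(x) - F
hits_level.terminal = True   # stop the integration at the crossing
hits_level.direction = 1     # only count upward crossings

x0 = np.array([0.1, 0.3])    # f(x0) = -0.08 < F
sol = solve_ivp(grad_f, t_span=(0.0, 50.0), y0=x0, events=hits_level,
                rtol=1e-10, atol=1e-12)
print("tau(F, x0) =", sol.t_events[0][0])  # exact value: log(3)/4 ≈ 0.2747
```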

Lemma 1

Let \(f \colon \reals^d \to \reals\) be \(C^\infty\), and let \(x_0 \in \reals^d, F \in \reals\). Assume \(\tau(F, x_0)\) is finite. Then the map \[x \mapsto \tau(F, x)\] is finite and differentiable in a neighborhood of \(x_0\).

Proof of Lemma 1

The main idea is to apply the implicit function theorem. At the end, we also sketch another approach based on rescaling/reparameterizing the gradient flow.

We know that \((x, t) = (x_0, \tau(F, x_0))\) is a solution of the equation \[f(\Phi(x, t)) = F\] recalling the definition \(\eqref{flowmap}\) of the flow map \(\Phi\). To apply the implicit function theorem, we need to check that the \(t\)-differential of \(f(\Phi(x, t))\) is nonzero at \((x, t) = (x_0, \tau(F, x_0))\). Indeed, \[\frac{d}{dt}[f(\Phi(x_0, t))]_{t = \tau(F, x_0)} = \D f(\Phi(x_0, \tau(F, x_0)))[x'(\tau(F, x_0))] = \|\nabla f(x(\tau(F, x_0)))\|^2,\] which is nonzero since gradient flow cannot reach a \(1\)-critical point in finite time.

Hence applying the implicit function theorem, there is a neighborhood \(U\) of \(x_0\) and a unique differentiable function \(T \colon U \to \reals\) such that \[f(\Phi(x, T(x))) = F, \quad \quad \forall x \in U.\] Due to the uniqueness of \(\tau\) mentioned previously, we conclude that \(\tau(F, x) = T(x)\), which proves the lemma.


For those curious, let’s sketch an alternative approach, employing a useful trick. Consider the rescaled positive gradient flow: \[x'(s) = \frac{\nabla f(x(s))}{\|\nabla f(x(s))\|^2}, \quad x(0) = x_0.\] Recall that multiplying the vector field of a flow by a positive function just amounts to reparameterizing time, so the flow lines of \(x(\cdot)\) do not change, just their time parameterization does.

This rescaled/reparameterized flow has the property that it crosses level sets at constant speed, as \[\frac{d}{ds}[f(x(s))] = 1.\] In particular, the time \(s\) at which the flow leaves the sublevel set \(\{x \in \reals^d : f(x) < F\}\) equals \(s = F - f(x_0)\), provided \(F \geq f(x_0)\) and we are careful to control where \(\nabla f\) vanishes. We can then conclude that \(x \mapsto \tau(F, x)\) is a differentiable function of \(s = F - f(x)\), which itself is smooth.
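Here is a quick numerical check of this unit-rate property (ours, again on the toy saddle from the sketches above, staying away from its critical point at the origin).

```python
import numpy as np
from scipy.integrate import solve_ivp

# Rescaled positive gradient flow x'(s) = grad f / ||grad f||^2
# for the toy saddle f(x, y) = x^2 - y^2.
def f(x):
    return x[0]**2 - x[1]**2

def rescaled_grad(s, x):
    g = np.array([2.0 * x[0], -2.0 * x[1]])
    return g / np.dot(g, g)

x0 = np.array([0.1, 0.3])
sol = solve_ivp(rescaled_grad, t_span=(0.0, 0.05), y0=x0,
                dense_output=True, rtol=1e-10, atol=1e-12)

for s in np.linspace(0.0, 0.05, 6):
    # Along the rescaled flow, f(x(s)) = f(x0) + s exactly.
    print(f"s = {s:.3f}   f(x(s)) - f(x0) = {f(sol.sol(s)) - f(x0):.6f}")
```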

The Łojasiewicz inequality

In the 1960s, Stanisław Łojasiewicz proved the following theorem, which is an extremely useful tool for controlling gradient flows. (Notation: throughout, \(B(0, \delta)\) denotes the open ball of radius \(\delta\) centered at the origin.)

Theorem: Łojasiewicz Inequality

Let \(f \colon \reals^d \to \reals\) be real-analytic, with \(f(0) = 0\). There exist \(\delta_L > 0, c > 0, \mu \in [0, 1)\), such that \[ \begin{align}\label{Lojasiewiczinequality}\tag{Ł} \|\nabla f(x)\| \geq c |f(x)|^{\mu} \quad \quad \forall x \in B(0, \delta_L) =: U_L. \end{align} \]

Inequality \(\eqref{Lojasiewiczinequality}\) is known as the Łojasiewicz inequality — its proof is beyond the scope of this blog.
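To get a feel for the exponent \(\mu\), consider the 1D monomial \(f(x) = x^k\) with \(k \geq 1\) (a worked example of ours, not from the post). Since \(|f(x)| = |x|^k\), \[\|\nabla f(x)\| = k\,|x|^{k-1} = k \left(|x|^{k}\right)^{\frac{k-1}{k}} = k\,|f(x)|^{1 - \frac{1}{k}},\] so \(\eqref{Lojasiewiczinequality}\) holds (here even globally) with \(c = k\) and \(\mu = 1 - 1/k\). The flatter \(f\) is at its critical point, the closer \(\mu\) gets to \(1\).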

If \(f\) satisfies the Łojasiewicz inequality, then its gradient flows cannot stray too far from their starting points.

Lemma 2: Łojasiewicz inequality implies growth

Assume \(f\) is \(C^\infty\), has \(f(0) = 0, \nabla f(0) = 0\), and satisfies the Łojasiewicz inequality \(\eqref{Lojasiewiczinequality}\) with constants \(c, \mu\) in neighborhood \(U_L = B(0, \delta_L)\). There is a neighborhood \(U_M = B(0, \delta_M), \delta_M \leq \delta_L/2,\) such that if \(x(t)\) is the positive gradient flow \(\eqref{posgradflow}\) started from \(x_0 \in U_M\) with \(f(x_0) < 0\), then \[x(t) \in U_L \quad \quad \text{for all $t \in [0, \tau(x_0)]$}\] and moreover \[ \begin{align} \label{star1eqn} \tag{localization} \|x_0 - x(t)\| \leq (c(1-\mu))^{-1} |f(x_0)|^{1-\mu} \quad \quad \text{for all $t \in [0, \tau(x_0)]$}. \end{align} \]

In Lemma 2, \(\tau(x_0)\) may be either finite or infinite. If it is infinite, then \(x(t)\) must converge to a critical point of \(f\) (e.g., see the exercise at the end of this section), and we interpret \(x(\tau(x_0))\) as that limit point. Either way, \(x(\tau(x_0))\) lies on the level set \(\{x : f(x) = 0\}\).
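For intuition, here is a 1D example where everything can be computed by hand (ours, not from the post). Take \(f(x) = -x^2\), which satisfies \(\eqref{Lojasiewiczinequality}\) with \(c = 2\) and \(\mu = 1/2\) (indeed \(|f'(x)| = 2|x| = 2|f(x)|^{1/2}\)). The positive gradient flow started at \(x_0 > 0\) solves \(x'(t) = -2x(t)\), so \(x(t) = x_0 e^{-2t}\), \(f(x(t)) < 0\) for all \(t\), and \(\tau(x_0) = +\infty\). The bound \(\eqref{star1eqn}\) then reads \[\|x_0 - x(t)\| = x_0 \left(1 - e^{-2t}\right) \,\leq\, x_0 = (c(1-\mu))^{-1} |f(x_0)|^{1-\mu},\] which indeed holds, with equality in the limit \(t \to \infty\).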

The proof of Lemma 2 is classical and originally due to Łojasiewicz. We present a proof based on Theorem 2.2 of this paper by Absil, Mahony and Andrews.

Proof of Lemma 2

Take any \(x(0) \in U_L\), and let \(T\) be the first time the positive gradient flow starting at \(x(0)\) leaves the set \(\{x \in U_L : f(x) < 0\}\). For \(t \in [0, T)\), we have \(f(x(t)) < 0\) and \[ \begin{align} \frac{d}{dt}\Big[ (- f (x(t)))^{1-\mu} \Big] &= (1-\mu) (- f(x(t)))^{-\mu} \frac{d}{dt}[- f(x(t))] \\ &= -(1-\mu) |f(x(t))|^{-\mu} \D f(x(t))[x'(t)] \\ &= -(1-\mu) |f(x(t))|^{-\mu} \|\nabla f(x(t))\| \cdot \|x'(t)\| \\ &\leq -c(1-\mu) |f(x(t))|^{-\mu} |f(x(t))|^{\mu} \cdot \|x'(t)\| \\ &= -c(1-\mu) \|x'(t)\|. \end{align} \] Therefore, for any \(t \in [0, T)\) we have \[ \begin{align}\label{usefulinequality}\tag{1} \|x(0) - x(t)\| &\leq \int_0^t \|x'(\tau)\| d \tau \leq (c(1-\mu))^{-1} \Big[|f(x(0))|^{1-\mu} - |f(x(t))|^{1-\mu}\Big] \\ &\leq (c(1-\mu))^{-1} |f(x(0))|^{1-\mu}. \end{align} \] The first inequality is because \(\int_0^t \|x'(\tau)\| d \tau\) is the length of the curve \(x(\cdot)\) between \(x(0)\) and \(x(t)\).

By smoothness of \(f\) and \(f(0) = 0\), there exists a \(\delta_M > 0\) such that \[ \begin{align}\label{deltaM}\tag{2} \|x\| + (c(1-\mu))^{-1} |f(x)|^{1-\mu} \leq \delta_L / 2, \quad \quad \forall x \in B(0, \delta_M). \end{align} \] We claim that any positive gradient flow started at \(x(0) \in U_M\) stays in \(B(0, \delta_L/2) \subseteq U_L\) until time \(t = \tau(x(0))\). For contradiction, assume not. Then for \(t\) close to \(T\), \(x(t)\) is arbitrarily close to the boundary of \(U_L\) (i.e., \(\|x(t)\|\) is close to \(\delta_L\)). On the other hand, inequalities \(\eqref{usefulinequality}\) and \(\eqref{deltaM}\) imply \[ \begin{align} \|x(t)\| \leq \|x(0)\| + (c(1-\mu))^{-1} |f(x(0))|^{1-\mu} \leq \delta_L/2 \quad \quad \forall t \in [0, T) \end{align} \] which is a contradiction, proving our claim.

We conclude that \(x(t) \in B(0, \delta_L/2)\) and equation \(\eqref{usefulinequality}\) holds if \(x(0) \in U_M\) and \(t \in [0, \tau(x(0)))\). We leave it as a (simple) exercise to verify this also holds for \(t \in [0, \tau(x(0))]\).

Exercise: Assume \(f\) satisfies the Łojasiewicz inequality \(\eqref{Lojasiewiczinequality}\) with constants \(c, \mu\) in \(U_L = B(0, \delta_L)\). Also assume that the gradient flow line \(x(\cdot)\) started at \(x(0) \in U_L\) stays in \(U_L\) for all \(t \geq 0\). Show that \(x(t)\) must converge to a critical point of \(f\). [Hint: Use Lemma 2, with a compactness argument.]

Proofs of Propositions 1 and 2: When does \(p\)-critical imply local minimum?

We are now ready to prove Proposition 4 below. Propositions 1 and 2 are immediate consequences.

Proposition 4

Suppose \(f \colon \reals^d \to \reals\) is \(C^\infty\) with \(f(0) = 0\), and satisfies the Łojasiewicz inequality \(\eqref{Lojasiewiczinequality}\) with constants \(c, \mu\) in neighborhood \(U_L\). If \(0\) is \(p\)-critical through neighborhoods for some \(p > (1-\mu)^{-1} - 1\), then \(0\) is a local minimum of \(f\).

Proof of Proposition 4

By definition of \(p\)-criticality through neighborhoods, there are \(\delta_p > 0\) and \(C_p > 0\) such that \[ \begin{align}\label{pcriticality}\tag{$p$-criticality} f(x) - f(0) = f(x) \geq -C_p \|x\|^{p+1} \quad \quad \forall x \in B(0, \delta_p). \end{align} \] Suppose, for contradiction, that \(0\) is not a local minimum. Then by Proposition 3 there exists an \(x_0 \in U_L \cap B(0, \delta_p)\) such that the positive gradient flow \(x(t)\) started from \(x(0) = x_0\) converges to \(0\) (and stays in \(U_L \cap B(0, \delta_p)\) and has \(f(x(t)) < 0\)). In particular, invoking Lemma 2, we have \[\|x(t)\| \leq (c(1-\mu))^{-1} |f(x(t))|^{1-\mu} \quad \quad \forall t \geq 0.\] On the other hand, \(\eqref{pcriticality}\) says \[|f(x(t))| \leq C_p \|x(t)\|^{p+1}.\] Putting these together, we have for all \(t \geq 0\): \[\|x(t)\| \leq (c(1-\mu))^{-1} C_p^{1-\mu} \|x(t)\|^{(p+1)(1-\mu)}.\] This is a contradiction since \((p+1)(1-\mu) > 1\) and \(\|x(t)\| \to 0\).

Proposition 1 follows immediately from Proposition 4 and Łojasiewicz’s theorem previously stated.

Proposition 2 follows immediately from Proposition 4 together with bounds on the best Łojasiewicz exponent \(\mu\) given by D’Acunto and Kurdyka. Specifically, D’Acunto and Kurdyka show that if \(f \colon \reals^d \to \reals\) is a degree-\(k\) polynomial, then \(f\) satisfies the Łojasiewicz inequality with exponent \(\mu = 1 - (3 k)^{-d}\).
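Concretely (this bookkeeping is ours and surely not optimal): plugging \(\mu = 1 - (3k)^{-d}\) into Proposition 4, the requirement \(p > (1-\mu)^{-1} - 1 = (3k)^d - 1\) is met by the integer \[p(k, d) = (3k)^d,\] so this choice works in Proposition 2.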

Proof of Proposition 3: every saddle has a gradient flow line converging to it

It only remains to prove Proposition 3. The idea is simple. Choose a sequence of points \(y_0, y_1, \ldots\) converging to \(0\) with \(f(y_k) < 0\). For each \(y_k\) we look at the positive gradient flow line \(y_k(\cdot)\) starting at \(y_k\) until it leaves the sublevel set \(\{x : f(x) < 0\}\). By Lemma 2, the endpoints \(y_k(\tau(y_k))\) of the flow lines also converge to 0.

Now run the negative gradient flow starting from \(y_k\) until it first hits the boundary of \(B(0, \delta_x), \delta_x > 0\); call the intersection point \(x_k\). Then we know that the positive gradient flow line starting at \(x_k\) ends at \(y_k(\tau(y_k))\), which converges to \(0\) as \(k \to \infty\).

The sequence \(x_0, x_1, \ldots\) accumulates to some \(x_\infty\) on the boundary of \(B(0, \delta_x)\). It stands to reason that the positive gradient flow line started from \(x_\infty\) converges to \(0\). This is indeed true. Let’s look at the details.

Figure 2: Diagram corresponding to the proof of Proposition 3.

Proof of Proposition 3

We know that \(f\) satisfies the Łojasiewicz inequality \(\eqref{Lojasiewiczinequality}\) with constants \(c, \mu\) in a neighborhood \(U_L = B(0, \delta_L)\). By Lemma 2, we know there is a neighborhood \(U_M = B(0, \delta_M), \delta_M \leq \delta_L/2,\) such that if \(x(0) \in U_M\) and \(f(x(0)) < 0\) then \(x(t) \in U_L\) for all \(t \in [0, \tau(x(0))]\).

Constructing the sequence \(\mathbf{y_0, y_1, \ldots \to 0}\): Since \(0\) is not a local minimum, there is a sequence of points \(y_0, y_1, \ldots\) in \(U_M\) which converges to \(0\) and satisfies \(f(y_k) < 0\) for all \(k\). Let \(y_k(t)\) denote the positive gradient flow started at \(y_k(0) = y_k\). Equation \(\eqref{star1eqn}\) of Lemma 2 tells us that the entire trajectories \(\{y_k(t) : t \in [0, \tau(y_k)]\}\) converge to \(0\) (uniformly in \(t\)) as \(k \to \infty\), since \(y_k \to 0\) and \(f(y_k) \to 0\). In particular, we conclude \(f(y_k(\tau(y_k))) = 0\), and \(y_k(\tau(y_k))\) converges to \(0\) as \(k \to \infty\).

Constructing the sequence \(\mathbf{x_0, x_1, \ldots \to x_\infty}\): Consider the negative gradient flow lines started from each \(y_k\). Observe that each of these flows must eventually leave the ball \(U_L\): otherwise it would be trapped and would converge to a critical point of \(f\) in \(U_L\); by the Łojasiewicz inequality such a critical point has function value \(0\), contradicting the fact that the function value along the negative flow stays below \(f(y_k) < 0\). Fix \(\delta_x \in (\delta_M, \delta_L)\). For each \(k\), let \(x_k\) be the first point where this negative gradient flow crosses the boundary of \(B(0, \delta_x)\) (so \(\|x_k\| = \delta_x\)).

Let \(x_k(t)\) denote the positive gradient flow starting from \(x_k(0) = x_k\). Without loss of generality (by compactness and taking a convergent subsequence), we can assume \(x_0, x_1, x_2, \ldots\) converge to some \(x_\infty\) (of course with \(\|x_\infty\| = \delta_x\)). Let’s summarize what we know about the positive gradient flows \(x_k(t)\):

  1. Each flow starts at \(x_k(0) = x_k\) on the boundary of \(B(0, \delta_x)\), and \(f(x_k) < 0\).

  2. \(x_k(t) \in U_L\) for all \(k \geq 0\) and \(t \in [0, \tau(x_k)]\).

  3. \(x_k(\tau(x_k)) = y_k(\tau(y_k))\) converges to \(0\), and \(f(x_k(\tau(x_k))) = 0\) for all \(k \geq 0\).

Finishing up: To conclude, we want to show that \(x_\infty(t)\), the positive gradient flow starting from \(x_\infty\), converges to \(0\) as \(t \to \tau(x_\infty)\). That is, we want to show that for each \(\epsilon > 0\), there is a \(T \in (0,\tau(x_\infty))\) such that \(\|x_\infty(t)\| \leq \epsilon\) for all \(t \in [T, \tau(x_\infty))\).

So let \(\epsilon \in (0, \delta_x)\). Choose \(K\) big enough so that \(\|x_k(\tau(x_k))\| \leq \epsilon / 4\) for all \(k \geq K\) (which we can do by property 3 above). Equation \(\eqref{star1eqn}\) of Lemma 2 gives us \[\delta_x / 2 < \delta_x - \epsilon/4 \leq \|x_k - x_k(\tau(x_k))\| \leq (c(1-\mu))^{-1}|f(x_k)|^{1-\mu}\] for all \(k \geq K\). In other words, this gives us a lower bound on \(|f(x_k)|\) in terms of \(\delta_x\).

Define \(F_\epsilon < 0\) through \[(c(1-\mu))^{-1} |F_\epsilon|^{1-\mu} = \epsilon / 4.\] From this and \(\epsilon / 4 < \delta_x / 2\), we find that \(|F_\epsilon| < |f(x_k)|\) for all \(k \geq K\), and so \(0 < \tau(F_\epsilon, x_k) < \tau(x_k)\) for \(k \geq K\). Additionally, equation \(\eqref{star1eqn}\) implies \[ \begin{align}\label{star2eqn}\tag{3} \|x_k(\tau(F_\epsilon, x_k))\| \leq \|x_k(\tau(x_k))\| + (c(1-\mu))^{-1} |F_\epsilon|^{1-\mu} \leq \epsilon/4 + \epsilon/4 = \epsilon/2 \quad \quad \forall k \geq K. \end{align} \]

By continuity of the flow map \(\Phi\) and continuity of \(x \mapsto \tau(F_\epsilon, x)\) (by Lemma 1), we have \[ \lim_{k \to \infty} \Phi(x_k, \tau(F_\epsilon, x_k)) = \Phi(x_\infty, \tau(F_\epsilon, x_\infty)) = x_{\infty}(\tau(F_\epsilon, x_\infty)), \] and \(\|x_{\infty}(\tau(F_\epsilon, x_\infty))\| \leq \epsilon/2\) by \(\eqref{star2eqn}\). Invoking equation \(\eqref{star1eqn}\) one more time, we conclude \[\|x_{\infty}(t)\| \leq \|x_{\infty}(\tau(F_\epsilon, x_\infty))\| + (c(1-\mu))^{-1} |F_\epsilon|^{1-\mu} \leq \epsilon / 2 + \epsilon / 4 \leq \epsilon\] for all \(t \in [\tau(F_\epsilon, x_\infty), \tau(x_\infty))\), as desired.


Acknowledgements: We thank Nicolas Boumal for several helpful pointers.