Gradient flows and the Łojasiewicz inequality, with an application to higher-order criticality

optimization

gradient flows

Łojasiewicz inequality

higher-order derivatives

critical points

We show that every saddle point of a real-analytic function has a gradient flow line converging to it. We use this fact to prove a claim made in our previous blog post on higher-order criticality: if x is a p-critical point of a real-analytic function f for all positive integers p, then x is a local minimum of f. The key tools used throughout are gradient flows and the Łojasiewicz inequality.

Authors

Christopher Criscitiello (crisciti@wharton.upenn.edu)

Quentin Rebjock (quentin.rebjock@epfl.ch)

Published

October 3, 2024

Introduction

In our previous blog post, we considered several definitions of p-criticality (through straight lines, curves and neighborhoods), and showed that no two of these definitions are equivalent. Let’s recall them here:

Definitions of p-critical point

The point x \in \reals^d is p-critical for f through “straight lines”, “curves”, or “neighborhoods” if:

[through straight lines] For all v \in \mathbb{R}^d, the 1D function t \mapsto f(x + t v) has a p-th order critical point at t=0.
[through curves] For all C^p curves \gamma with \gamma(0) = x and \gamma'(0) \neq 0, the 1D function t \mapsto f(\gamma(t)) has a p-th order critical point at t=0.
[through neighborhoods] There is an \epsilon > 0 and C > 0 such that f(x + v) \geq f(x) - C \|v\|^{p+1} for all v \in \reals^d such that \|v\| \leq \epsilon. (This is definition 1 in this paper of Anandkumar and Ge.)

We concluded that the “right” definition of p-criticality is through neighborhoods, due to the following two properties (not enjoyed by the other definitions).

Proposition 1

If f \colon \reals^d \to \reals is real-analytic and 0 is p-critical through neighborhoods for all p \geq 1, then 0 is a local minimum of f.

Proposition 2

For each d \geq 1 and k \geq 1, there is a (finite) positive integer p(k,d) such that: if f \colon \reals^d \to \reals is a degree-k polynomial and 0 is p(k,d)-critical through neighborhoods, then 0 is a local minimum of f.

Counterexamples: Propositions 1 and 2

For Proposition 1, the real-analytic assumption is necessary, i.e., f being C^\infty is not enough — recall the 1D counterexample f(x) = x e^{-1/x^2} from our previous blog post.

For Proposition 2, it is easy to show that p(k, 1) = k works. However, for d > 1, p(k, d) = k is not enough — recall the counterexample f(x,y) = y^2 - x^2 y - x y^2 from our previous blog post. We argued that (0,0) is 3-critical for f, but it is not a local minimum (hence p(3, 2) > 3). Exercise: Show that p(3, 2) = 5 works.
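To make the cubic counterexample concrete, here is a minimal numerical sketch. The helper `y_star` is our own computation, not taken from the previous post: for fixed x < 1, the map y \mapsto f(x, y) is a convex parabola, and minimizing it gives f(x, y_star(x)) = -x^4 / (4(1-x)). The script evaluates f along this curve and prints the ratio f / x^4, which should approach -1/4.

```python
def f(x, y):
    """Cubic counterexample from the post."""
    return y**2 - x**2 * y - x * y**2


def y_star(x):
    """Minimizer of y -> f(x, y) for fixed x < 1 (our own computation).

    f(x, y) = (1 - x) * y**2 - x**2 * y is a convex parabola in y,
    with vertex at y = x**2 / (2 * (1 - x)).
    """
    return x**2 / (2.0 * (1.0 - x))


xs = [1e-1, 1e-2, 1e-3]
values = [f(x, y_star(x)) for x in xs]
ratios = [v / x**4 for v, x in zip(values, xs)]

for x, v, r in zip(xs, values, ratios):
    print(f"x = {x:.0e}   f = {v:.3e}   f / x^4 = {r:.5f}")
```

The values are negative arbitrarily close to the origin (so (0,0) is not a local minimum), yet they only decay like the fourth power of the distance to the origin, which is compatible with the 3-criticality bound f(v) \geq -C \|v\|^4.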

Propositions 1 and 2 are consequences of the following result.

Proposition 3

Let f \colon \reals^d \to \reals be real-analytic, with f(0) = 0, \nabla f(0) = 0. If 0 is not a local minimum of f, then there exists x_0 \in \reals^d \setminus \{0\} such that positive gradient flow (+gradflow) started from x_0 converges to 0.

Proposition 3 is probably known — see these papers of Moussu, Nowel and Szafraniec, and Szafraniec. We present a proof at the end of this post. This proof may be new, and, more importantly, it is self-contained and hopefully easy to parse!

The key tools to prove Propositions 1, 2, 3 are gradient flows and the Łojasiewicz inequality.

Counterexample: Proposition 3

The assumption that f is real-analytic in Proposition 3 is necessary, i.e., it is not enough for f to be only C^\infty. Indeed, consider a smooth 1D function which oscillates infinitely often, like the following: f(x) = \sin(6\pi/x) e^{-1/x^2}. Non-constant gradient flow trajectories cannot converge to the origin because there are 1-critical points arbitrarily close to it (on both sides).

The function above is somewhat pathological as its set of 1-critical points is not a smooth manifold. What if we additionally assume the set of 1-critical points of f is a smooth manifold? Even this is not enough for Proposition 3 to hold: an example is furnished by the so-called “Mexican hat function” and variants thereof (see pages 4-5 in this paper by Absil, Mahony and Andrews).

Figure 1: Plot of f(x) = \sin(6\pi/x) e^{-1/x^2}.
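The accumulation of critical points can be checked with a short script. This is only a sketch: the sample points x = 6/(n + 1/2) are our own choice, made because \sin(6\pi/x) equals (-1)^n there. If the sign of f alternates along these points, then f has a local extremum (hence a 1-critical point) between sample points, and these points approach 0. Since f is odd, the same happens on the negative side.

```python
import math


def f(x):
    """The oscillating C-infinity function from the post (with f(0) = 0)."""
    if x == 0.0:
        return 0.0
    return math.sin(6.0 * math.pi / x) * math.exp(-1.0 / x**2)


# At x = 6 / (n + 1/2), sin(6*pi/x) = sin(pi*(n + 1/2)) = (-1)**n.
ns = list(range(7, 41))
points = [6.0 / (n + 0.5) for n in ns]
signs = [1 if f(x) > 0 else -1 for x in points]

# True when consecutive sample points carry opposite signs of f.
alternating = all(signs[i] == -signs[i + 1] for i in range(len(signs) - 1))
print("sample points range from", points[0], "down to", points[-1])
print("signs alternate:", alternating)
```

We stop at n = 40 only to stay far from floating-point underflow of e^{-1/x^2}; the sign pattern itself continues for all n.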

Gradient flows

Let f \colon \reals^d \to \reals be C^\infty. A positive gradient flow trajectory started at x_0 \in \reals^d is a solution of the ODE \begin{align}\tag{$+$gradflow} x'(t) = \nabla f(x(t)), \quad \quad t \in \reals, \quad \quad x(0) = x_0. \end{align} Likewise, negative gradient flow is a solution of \begin{align}\tag{$-$gradflow} x'(t) = -\nabla f(x(t)), \quad \quad t \in \reals, \quad \quad x(0) = x_0. \end{align} A gradient flow is always defined on some open interval around t = 0; however, it may not be defined on all of \reals. If \nabla f(x_0) is nonzero then gradient flow never reaches a 1-critical point in finite time (since the flow map \Phi defined below is a diffeomorphism in an appropriate sense). Therefore, the quantity f(x(t)) is always increasing for positive gradient flow with \nabla f(x_0) \neq 0. This is because \frac{d}{dt}[f(x(t))] = \mathrm{D} f(x(t))[x'(t)] = \|\nabla f(x(t))\|^2 > 0 \quad \quad \text{for all $t$}. (And likewise the function value decreases for negative gradient flow.)
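As a quick sanity check of the monotonicity claim, here is a minimal sketch that discretizes both flows with explicit Euler steps. The saddle f(x, y) = x^2 - y^2, the starting point, and the step size are illustrative assumptions of ours; explicit Euler only approximates the true flow.

```python
def f(p):
    """Illustrative saddle f(x, y) = x**2 - y**2 (our choice, not from the post)."""
    x, y = p
    return x**2 - y**2


def grad_f(p):
    """Gradient of f(x, y) = x**2 - y**2."""
    x, y = p
    return (2.0 * x, -2.0 * y)


def euler_flow(p0, step, n_steps, sign=+1.0):
    """Explicit Euler discretization of x'(t) = sign * grad f(x(t))."""
    path = [p0]
    p = p0
    for _ in range(n_steps):
        g = grad_f(p)
        p = (p[0] + sign * step * g[0], p[1] + sign * step * g[1])
        path.append(p)
    return path


path_up = euler_flow((0.1, 1.0), step=1e-3, n_steps=500, sign=+1.0)
path_down = euler_flow((0.1, 1.0), step=1e-3, n_steps=500, sign=-1.0)

values_up = [f(p) for p in path_up]
values_down = [f(p) for p in path_down]
increasing = all(a < b for a, b in zip(values_up, values_up[1:]))
decreasing = all(a > b for a, b in zip(values_down, values_down[1:]))
print("f increases along +gradflow:", increasing)
print("f decreases along -gradflow:", decreasing)
```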

We define the positive gradient flow map \Phi by \begin{align}\tag{flowmap} \Phi(x_0, t) = x(t) \end{align} where x(\cdot) is the positive gradient flow starting at x_0. The flow map \Phi is defined on an (open) subset of \reals^d \times \reals (this domain is called the flow’s maximal flow domain). It is a fact that \Phi is differentiable on its domain. Further, for a given t, the map x \mapsto \Phi(x, t) is a diffeomorphism (from the set of points on which it is well defined and onto its image). For more information on vector-field flows more generally, we recommend Chapter 9 “Integral Curves and Flows” of Introduction to Smooth Manifolds by John Lee.

For any F \in \reals, let \tau(F, x_0) denote the first time t \geq 0 at which the positive gradient flow (+gradflow) leaves the sublevel set \{x \in \reals^d : f(x) < F\}, with the convention that \tau(F, x_0) = +\infty if the flow never leaves the sublevel set in finite time. As a shorthand, we also define \tau(x) := \tau(0, x). Note that if \nabla f(x_0) \neq 0 and \tau(F, x_0) is finite, then it is the only time t for which f(x(t)) = F, since f(x(t)) is increasing. For a given F, we can define a map x \mapsto \tau(F, x) from \reals^d to [0, \infty]. We have the following useful fact.

Lemma 1

Let f \colon \reals^d \to \reals be C^\infty, and let x_0 \in \reals^d, F \in \reals. Assume \tau(F, x_0) is finite. Then the map x \mapsto \tau(F, x) is finite and differentiable in a neighborhood of x_0.

Proof of Lemma 1

The main idea is to apply the implicit function theorem. At the end, we also sketch another approach based on rescaling/reparameterizing the gradient flow.

We know that (x, t) = (x_0, \tau(F, x_0)) is a solution of the equation f(\Phi(x, t)) = F recalling the definition (flowmap) of the flow map \Phi. To apply the implicit function theorem, we need to check that the t-differential of f(\Phi(x, t)) is nonzero at (x, t) = (x_0, \tau(F, x_0)). Indeed, \frac{d}{dt}[f(\Phi(x_0, t))]_{t = \tau(F, x_0)} = \mathrm{D} f(\Phi(x_0, \tau(F, x_0)))[x'(\tau(F, x_0))] = \|\nabla f(x(\tau(F, x_0)))\|^2, which is nonzero since gradient flow cannot reach a 1-critical point in finite time.

Hence applying the implicit function theorem, there is a neighborhood U of x_0 and a unique differentiable function T \colon U \to \reals such that f(\Phi(x, T(x))) = F, \quad \quad \forall x \in U. Due to the uniqueness of \tau mentioned previously, we conclude that \tau(F, x) = T(x), which proves the lemma.

For those curious, let’s sketch an alternative approach, employing a useful trick. Consider the rescaled positive gradient flow: x'(s) = \frac{\nabla f(x(s))}{\|\nabla f(x(s))\|^2}, \quad x(0) = x_0. Recall that multiplying the vector field of a flow by a positive function just amounts to reparameterizing time, so the flow lines of x(\cdot) do not change, just their time parameterization does.

This rescaled/reparameterized flow has the property that it crosses level sets at constant speed, as \frac{d}{ds}[f(x(s))] = 1. In particular, the time s at which the flow leaves the sublevel set \{x \in \reals^d : f(x) < F\} equals s = F - f(x_0), provided F \geq f(x_0) and we are careful to control where \nabla f vanishes. We can then conclude that x \mapsto \tau(F, x) is a differentiable function of s = F - f(x), which itself is smooth.
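The constant-speed property of the rescaled flow is easy to observe numerically. The sketch below integrates x'(s) = \nabla f / \|\nabla f\|^2 with a classical Runge-Kutta scheme and checks that f gains (approximately) s after time s. The function f(x, y) = x^2 - y^2, the starting point, and the integrator are illustrative assumptions of ours; the trajectory used here stays away from the origin, where the rescaled field is undefined.

```python
def f(p):
    x, y = p
    return x**2 - y**2


def rescaled_field(p):
    """grad f / ||grad f||^2 for f(x, y) = x**2 - y**2 (undefined at the origin)."""
    gx, gy = 2.0 * p[0], -2.0 * p[1]
    norm_sq = gx * gx + gy * gy
    return (gx / norm_sq, gy / norm_sq)


def rk4(field, p0, s_final, n_steps):
    """Classical fourth-order Runge-Kutta integration of p'(s) = field(p(s))."""
    h = s_final / n_steps
    p = p0
    for _ in range(n_steps):
        k1 = field(p)
        k2 = field((p[0] + 0.5 * h * k1[0], p[1] + 0.5 * h * k1[1]))
        k3 = field((p[0] + 0.5 * h * k2[0], p[1] + 0.5 * h * k2[1]))
        k4 = field((p[0] + h * k3[0], p[1] + h * k3[1]))
        p = (
            p[0] + h * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0,
            p[1] + h * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0,
        )
    return p


x0 = (0.1, 1.0)
s = 0.5
x_s = rk4(rescaled_field, x0, s, n_steps=1000)
gain = f(x_s) - f(x0)
print("f(x(s)) - f(x0) =", gain, "  (s =", s, ")")
```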

The Łojasiewicz inequality

In the 1960s, Stanisław Łojasiewicz proved the following theorem, which is an extremely useful tool for controlling gradient flows. (Notation: throughout, B(0, \delta) denotes the open ball of radius \delta centered at the origin.)

Theorem: Łojasiewicz Inequality

Let f \colon \reals^d \to \reals be real-analytic, with f(0) = 0. There exist \delta_L > 0, c > 0, \mu \in [0, 1), such that \begin{align}\tag{Ł} \|\nabla f(x)\| \geq c |f(x)|^{\mu} \quad \quad \forall x \in B(0, \delta_L) =: U_L. \end{align}

Inequality (Ł) is known as the Łojasiewicz inequality — its proof is beyond the scope of this blog.
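To get a feel for (Ł), here is a small sketch with two 1D examples of our own choosing (neither is taken from the theorem's proof). For the monomial f(x) = x^4 we have |f'(x)| = 4|x|^3 = 4|f(x)|^{3/4}, so (Ł) holds with \mu = 3/4 and c = 4. For the flat C^\infty function g(x) = e^{-1/x^2}, the ratio |g'(x)| / |g(x)|^\mu equals 2|x|^{-3} e^{-(1-\mu)/x^2}, which tends to 0 as x \to 0 for every \mu < 1, so no constant c > 0 can work.

```python
import math

# Monomial f(x) = x**4: |f'(x)| = 4 |x|**3 = 4 |f(x)|**(3/4), so mu = 3/4, c = 4.
mu_poly, c_poly = 0.75, 4.0
xs = [0.5, 0.1, 0.01, -0.3]
poly_ratios = [abs(4.0 * x**3) / abs(x**4) ** mu_poly for x in xs]


def flat_ratio(x, mu):
    """|g'(x)| / |g(x)|**mu for the flat function g(x) = exp(-1 / x**2), x != 0."""
    g = math.exp(-1.0 / x**2)
    dg = 2.0 / x**3 * g
    return abs(dg) / g**mu


# mu = 0.9 is an arbitrary illustrative exponent; any mu < 1 behaves the same way.
flat_ratios = [flat_ratio(x, mu=0.9) for x in (0.2, 0.1, 0.05)]

print("monomial ratios (should all be close to c = 4):", poly_ratios)
print("flat-function ratios (shrinking toward 0):", flat_ratios)
```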

If f satisfies the Łojasiewicz inequality, then its gradient flows cannot stray too far from their starting points.

Lemma 2: Łojasiewicz inequality implies growth

Assume f is C^\infty, has f(0) = 0, \nabla f(0) = 0, and satisfies the Łojasiewicz inequality (Ł) with constants c, \mu in a neighborhood U_L = B(0, \delta_L). There is a neighborhood U_M = B(0, \delta_M), \delta_M \leq \delta_L/2, such that if x(t) is the positive gradient flow (+gradflow) started from x_0 \in U_M with f(x_0) < 0, then x(t) \in U_L \quad \quad \text{for all $t \in [0, \tau(x_0)]$} and moreover \begin{align} \tag{localization} \|x_0 - x(t)\| \leq (c(1-\mu))^{-1} |f(x_0)|^{1-\mu} \quad \quad \text{for all $t \in [0, \tau(x_0)]$}. \end{align}

In Lemma 2, \tau(x(0)) may be either finite or infinite. If it is infinite, then x(t) must converge to a critical point of f (e.g., see the exercise at the end of this section), and x(\tau(x(0))) is that limiting point. Either way, x(\tau(x(0))) lies on the level set \{x : f(x) = 0\}.
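Here is a minimal sketch checking the bound (localization) on an explicit example. The choices are ours: for f(x, y) = x^2 - y^2 we have \|\nabla f\| = 2\sqrt{x^2 + y^2} \geq 2|f|^{1/2} on all of \reals^2, so (Ł) holds with c = 2 and \mu = 1/2 without restricting to a small ball. The positive gradient flow is available in closed form, which lets us compare the displacement with the bound directly.

```python
import math

C, MU = 2.0, 0.5  # f(x, y) = x**2 - y**2 satisfies ||grad f|| >= 2 |f|**(1/2) everywhere


def f(p):
    return p[0] ** 2 - p[1] ** 2


def flow(p0, t):
    """Closed-form +gradflow of f: x' = 2x, y' = -2y."""
    return (p0[0] * math.exp(2.0 * t), p0[1] * math.exp(-2.0 * t))


def exit_time(p0):
    """tau(p0): first time with f = 0, assuming |y0| > |x0| > 0."""
    return math.log(abs(p0[1]) / abs(p0[0])) / 4.0


p0 = (0.1, 1.0)
tau = exit_time(p0)
bound = abs(f(p0)) ** (1.0 - MU) / (C * (1.0 - MU))

# Sample the trajectory on [0, tau] and record the largest displacement from p0.
ts = [tau * i / 200 for i in range(201)]
max_dist = max(math.dist(p0, flow(p0, t)) for t in ts)
print("tau(x0) =", tau)
print("max displacement =", max_dist, "<= bound =", bound)
```

Note that `exit_time` would be infinite for a starting point with x0 = 0: that trajectory runs straight into the critical point at the origin, which is the second case discussed above.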

The proof of Lemma 2 is classical and originally due to Łojasiewicz. We present a proof based on Theorem 2.2 of this paper by Absil, Mahony and Andrews.

Proof of Lemma 2

Take any x(0) \in U_L, and let T be the first time the positive gradient flow starting at x(0) leaves the set \{x \in U_L : f(x) < 0\}. For t \in [0, T), we have f(x(t)) < 0 and \begin{align*} \frac{d}{dt}\Big[ (- f (x(t)))^{1-\mu} \Big] &= (1-\mu) (- f(x(t)))^{-\mu} \frac{d}{dt}[- f(x(t))] \\ &= -(1-\mu) |f(x(t))|^{-\mu} \mathrm{D} f(x(t))[x'(t)] \\ &= -(1-\mu) |f(x(t))|^{-\mu} \|\nabla f(x(t))\| \cdot \|x'(t)\| \\ &\leq -c(1-\mu) |f(x(t))|^{-\mu} |f(x(t))|^{\mu} \cdot \|x'(t)\| \\ &= -c(1-\mu) \|x'(t)\|. \end{align*} Therefore, for any t \in [0, T) we have \begin{align*}\tag{1} \|x(0) - x(t)\| &\leq \int_0^t \|x'(\tau)\| d \tau \leq (c(1-\mu))^{-1} \Big[|f(x(0))|^{1-\mu} - |f(x(t))|^{1-\mu}\Big] \\ &\leq (c(1-\mu))^{-1} |f(x(0))|^{1-\mu}. \end{align*} The first inequality is because \int_0^t \|x'(\tau)\| d \tau is the length of the curve x(\cdot) between x(0) and x(t).

By smoothness of f and f(0) = 0, there exists a \delta_M > 0 such that \begin{align}\tag{2} \|x\| + (c(1-\mu))^{-1} |f(x)|^{1-\mu} \leq \delta_L / 2, \quad \quad \forall x \in B(0, \delta_M). \end{align}

We claim that any positive gradient flow started at x(0) \in U_M stays in B(0, \delta_L/2) \subseteq U_L until time t = \tau(x(0)). For contradiction, assume not. Then for t close to T, x(t) is arbitrarily close to the boundary of U_L (i.e., \|x(t)\| is close to \delta_L). On the other hand, inequalities (1) and (2) imply \begin{align*} \|x(t)\| \leq \|x(0)\| + (c(1-\mu))^{-1} |f(x(0))|^{1-\mu} \leq \delta_L/2 \quad \quad \forall t \in [0, T) \end{align*} which is a contradiction, proving our claim.

We conclude that x(t) \in B(0, \delta_L/2) and equation (1) holds if x(0) \in U_M and t \in [0, \tau(x(0))). We leave it as a (simple) exercise to verify this also holds for t \in [0, \tau(x(0))].

Exercise: Assume f satisfies the Łojasiewicz inequality (Ł) with constants c, \mu in U_L = B(0, \delta_L). Also assume that the gradient flow line x(\cdot) started at x(0) \in U_L stays in U_L for all t \geq 0. Show that x(t) must converge to a critical point of f. [Hint: Use Lemma 2, with a compactness argument.]

Proofs of Propositions 1 and 2: When does p-critical imply local minimum?

We are now ready to prove Proposition 4 below. Propositions 1 and 2 are immediate consequences.

Proposition 4

Suppose f \colon \reals^d \to \reals is C^\infty with f(0) = 0, and satisfies the Łojasiewicz inequality (Ł) with constants c, \mu in a neighborhood U_L. If 0 is p-critical through neighborhoods for some p > (1-\mu)^{-1} - 1, then 0 is a local minimum of f.

Proof of Proposition 4

By definition of p-criticality through neighborhoods, there are \delta_p > 0 and C_p > 0 such that \begin{align}\tag{$p$-criticality} f(x) - f(0) = f(x) \geq -C_p \|x\|^{p+1} \quad \quad \forall x \in B(0, \delta_p). \end{align} Suppose, for contradiction, that 0 is not a local minimum. Then by Proposition 3 there exists an x_0 \in U_L \cap B(0, \delta_p) such that positive gradient flow x(t) started from x(0) = x_0 converges to 0 (and stays in U_L \cap B(0, \delta_p) and has f(x(t)) < 0). In particular, invoking Lemma 2, we have \|x(t)\| \leq (c(1-\mu))^{-1} |f(x(t))|^{1-\mu} \quad \quad \forall t \geq 0. On the other hand, (p-criticality) says |f(x(t))| \leq C_p \|x(t)\|^{p+1}. Putting these together, we have for all t \geq 0: \|x(t)\| \leq (c(1-\mu))^{-1} C_p^{1-\mu} \|x(t)\|^{(p+1)(1-\mu)}. This is a contradiction since (p+1)(1-\mu) > 1 and \|x(t)\| \to 0.

Proposition 1 follows immediately from Proposition 4 and Łojasiewicz’s theorem previously stated.

Proposition 2 follows immediately from Proposition 4, and bounds on the best Łojasiewicz exponent \mu given by D’Acunto and Kurdyka. Specifically, D’Acunto and Kurdyka show that if f \colon \reals^d \to \reals is a degree-k polynomial, then f satisfies the Łojasiewicz inequality with exponent \mu = 1 - (3 k)^{-d}.
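For concreteness, the arithmetic behind Proposition 2 can be spelled out in a few lines. This sketch only combines the threshold p > (1-\mu)^{-1} - 1 from Proposition 4 with the exponent \mu = 1 - (3k)^{-d} quoted above; the function name is ours. Since (1-\mu)^{-1} = (3k)^d is an integer, the smallest admissible integer is p = (3k)^d.

```python
def criticality_order_bound(k: int, d: int) -> int:
    """Smallest integer p with p > 1 / (1 - mu) - 1, where mu = 1 - (3k)**(-d).

    Since 1 / (1 - mu) = (3k)**d is an integer, the condition p > (3k)**d - 1
    is the same as p >= (3k)**d.
    """
    if k < 1 or d < 1:
        raise ValueError("k and d must be positive integers")
    return (3 * k) ** d


for k, d in [(2, 1), (3, 2), (4, 3)]:
    print(f"degree k = {k}, dimension d = {d}: p = {criticality_order_bound(k, d)} suffices")
```

This value is only what the quoted exponent yields and is far from tight: as noted earlier, p(k, 1) = k already works in one dimension, and p(3, 2) = 5 works for cubics in the plane.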

Proof of Proposition 3: every saddle has a gradient flow line converging to it

It only remains to prove Proposition 3. The idea is simple. Choose a sequence of points y_0, y_1, \ldots converging to 0 with f(y_k) < 0. For each y_k we look at the positive gradient flow line y_k(\cdot) starting at y_k until it leaves the sublevel set \{x : f(x) < 0\}. By Lemma 2, the endpoints y_k(\tau(y_k)) of the flow lines also converge to 0.

Now run the negative gradient flow starting from y_k until it first hits the boundary of B(0, \delta_x), \delta_x > 0; call the intersection point x_k. Then we know that the positive gradient flow line starting at x_k ends at y_k(\tau(y_k)), which converges to 0 as k \to \infty.

The sequence x_0, x_1, \ldots accumulates to some x_\infty on the boundary of B(0, \delta_x). It stands to reason that the positive gradient flow line started from x_\infty converges to 0. This is indeed true. Let’s look at the details.
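Before the details, here is a toy run of this construction, under assumptions of our own: we take f(x, y) = x^2 - y^2, the radius \delta_x = 0.5, and the sequence y_k = (0.5/k, 1/k). For this f the negative gradient flow is (x_0 e^{-2t}, y_0 e^{2t}), so the first hitting point x_k on the sphere can be computed in closed form. We expect x_k to approach x_\infty = (0, \delta_x), whose positive gradient flow (0, \delta_x e^{-2t}) indeed converges to 0.

```python
import math

DELTA = 0.5  # radius delta_x of the sphere (an arbitrary illustrative choice)


def hit_sphere(y0):
    """First point where -gradflow of f(x, y) = x**2 - y**2 from y0 meets ||p|| = DELTA.

    The -gradflow is (x0 * exp(-2t), y0 * exp(2t)). With u = exp(4t), the
    condition ||p||^2 = DELTA^2 reads  y0^2 u^2 - DELTA^2 u + x0^2 = 0,
    and we need the root u > 1 (forward time).
    """
    a, b = y0
    u = (DELTA**2 + math.sqrt(DELTA**4 - 4.0 * a * a * b * b)) / (2.0 * b * b)
    return (a / math.sqrt(u), b * math.sqrt(u))


ys = [(0.5 / k, 1.0 / k) for k in range(4, 200, 20)]  # y_k -> 0 with f(y_k) < 0
xs = [hit_sphere(y) for y in ys]

for y, x in zip(ys[:3], xs[:3]):
    print("y_k =", y, " ->  x_k =", x)
print("last x_k:", xs[-1], " (limit should be (0, DELTA))")
```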

Proof of Proposition 3

We know that f satisfies the Łojasiewicz inequality (Ł) with constants c, \mu in a neighborhood U_L = B(0, \delta_L). By Lemma 2, we know there is a neighborhood U_M = B(0, \delta_M), \delta_M \leq \delta_L/2, such that if x(0) \in U_M and f(x(0)) < 0 then x(t) \in U_L for all t \in [0, \tau(x(0))].

Constructing the sequence \mathbf{y_0, y_1, \ldots \to 0}: Since 0 is not a local minimum, there is a sequence of points y_0, y_1, \ldots in U_M which converges to 0, and satisfies f(y_k) < 0 for all k. Let y_k(t) denote positive gradient flow started at y_k(0) = y_k. Equation (localization) of Lemma 2 tells us that the entire trajectories \{y_k(t) : t \in [0, \tau(y_k)]\} converge to 0, since y_k \to 0 and f(y_k) \to 0. In particular, we conclude f(y_k(\tau(y_k))) = 0, and y_k(\tau(y_k)) converges to 0 as k \to \infty.

Constructing the sequence \mathbf{x_0, x_1, \ldots \to x_\infty}: Consider the negative gradient flow lines started from each y_k. Observe that each of these flows must eventually leave the ball U_L (otherwise they are trapped and must converge to a critical point, which must have function value 0 by the Łojasiewicz inequality, a contradiction). Fix \delta_x \in (\delta_M, \delta_L). For each k, let x_k be the first point where this negative gradient flow crosses the boundary of B(0, \delta_x) (so \|x_k\| = \delta_x).

Let x_k(t) denote the positive gradient flow starting from x_k(0) = x_k. Without loss of generality (by compactness and taking a convergent subsequence), we can assume x_0, x_1, x_2, \ldots converge to some x_\infty (of course with \|x_\infty\| = \delta_x). Let’s summarize what we know about the positive gradient flows x_k(t):

(i) Each flow starts at x_k(0) = x_k on the boundary of B(0, \delta_x), and f(x_k) < 0.
(ii) x_k(t) \in U_L for all k \geq 0 and t \in [0, \tau(x_k)].
(iii) x_k(\tau(x_k)) = y_k(\tau(y_k)) converges to 0, and f(x_k(\tau(x_k))) = 0 for all k \geq 0.

Finishing up: To conclude, we want to show that x_\infty(t), the positive gradient flow starting from x_\infty, converges to 0 as t \to \tau(x_\infty). That is, we want to show that for each \epsilon > 0, there is a T \in (0,\tau(x_\infty)) such that \|x_\infty(t)\| \leq \epsilon for all t \in [T, \tau(x_\infty)).

So let \epsilon \in (0, \delta_x). Choose K big enough so that \|x_k(\tau(x_k))\| \leq \epsilon / 4 for all k \geq K (which we can do by property iii). Equation (localization) of Lemma 2 gives us \delta_x / 2 < \delta_x - \epsilon/4 \leq \|x_k - x_k(\tau(x_k))\| \leq (c(1-\mu))^{-1}|f(x_k)|^{1-\mu} for all k \geq K. In other words, this gives us a lower bound on |f(x_k)| in terms of \delta_x.

Define F_\epsilon < 0 through (c(1-\mu))^{-1} |F_\epsilon|^{1-\mu} = \epsilon / 4. From this and \epsilon / 4 < \delta_x / 2, we find that |F_\epsilon| < |f(x_k)| for all k \geq K, and so 0 < \tau(F_\epsilon, x_k) < \tau(x_k) for k \geq K. Additionally, equation (localization) implies \begin{align}\tag{3} \|x_k(\tau(F_\epsilon, x_k))\| \leq \|x_k(\tau(x_k))\| + (c(1-\mu))^{-1} |F_\epsilon|^{1-\mu} \leq \epsilon/4 + \epsilon/4 = \epsilon/2 \quad \quad \forall k \geq K. \end{align}

By continuity of the flow map \Phi and continuity of x \mapsto \tau(F_\epsilon, x) (by Lemma 1), we have \lim_{k \to \infty} \Phi(x_k, \tau(F_\epsilon, x_k)) = \Phi(x_\infty, \tau(F_\epsilon, x_\infty)) = x_{\infty}(\tau(F_\epsilon, x_\infty)), and \|x_{\infty}(\tau(F_\epsilon, x_\infty))\| \leq \epsilon/2 by (3). Invoking equation (localization) one more time, we conclude \|x_{\infty}(t)\| \leq \|x_{\infty}(\tau(F_\epsilon, x_\infty))\| + (c(1-\mu))^{-1} |F_\epsilon|^{1-\mu} \leq \epsilon / 2 + \epsilon / 4 \leq \epsilon for all t \in [\tau(F_\epsilon, x_\infty), \tau(x_\infty)), as desired.

How to cite

If you would like to cite this post, you can use:

@misc{criscitiello2024gradientflowslojasiewicz,
  author       = {Christopher Criscitiello and Quentin Rebjock},
  title        = {Gradient flows and the {\L}ojasiewicz inequality, with an application to higher-order criticality},
  year         = {2024},
  month        = oct,
  howpublished = {\url{https://ccriscitiello.github.io/downhillblog/posts/Loja_p_implies_local/}},
  note         = {Blog post, published October 3, 2024}
}

Acknowledgements: We thank Nicolas Boumal for several helpful pointers.