What is the difference between gradient descent and Newton's method?

Gradient descent uses only the gradient to update x and typically converges linearly or sublinearly. Newton's method also uses the Hessian to build a local quadratic model; near a minimizer with a positive definite Hessian it often converges quadratically, but each iteration costs more because the Hessian must be computed and a linear system solved.

How should I choose a step size?

Options include fixed alpha, diminishing schedules, backtracking with Armijo condition, exact line search, or adaptive schemes. The choice depends on problem conditioning and the cost of function/gradient evaluations.

When should I use accelerated methods like momentum or CG?

Use momentum or Nesterov when you want faster convergence on smooth problems; CG is particularly effective for quadratic or near-quadratic objectives, while BFGS offers a practical quasi-Newton alternative without computing the full Hessian.

gradient-methods

npx machina-cli add skill parcadei/Continuous-Claude-v3/gradient-methods --openclaw

Gradient Methods

When to Use

Use this skill when working on gradient-based optimization problems.

Decision Tree

Basic Gradient Descent
- Update: x_{k+1} = x_k - alpha * grad f(x_k)
- Step size alpha: fixed, diminishing, or line search
- Convergence: O(1/k) in function value for convex f with Lipschitz-continuous gradient, linear for strongly convex f (given a suitable step size)
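
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration with a fixed step size, not code from the skill; the name `gradient_descent` and the test quadratic are assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent: x_{k+1} = x_k - alpha * grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:  # stop once the gradient is small
            return x, k
        x = x - alpha * g
    return x, max_iter

# Illustrative objective (an assumption): f(x) = x0^2 + 10 * x1^2, minimized at the origin.
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

x_min, iters = gradient_descent(grad_f, [1.0, 1.0], alpha=0.05)
print("minimizer:", x_min, "iterations:", iters)
```

For this quadratic the step size must stay below 2/20 = 0.1 (two over the largest Hessian eigenvalue) for the iteration to converge.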

Step Size Selection

Method	Approach
Fixed	alpha constant (requires tuning)
Backtracking	Armijo condition: f(x - alpha*grad) <= f(x) - c*alpha*||grad||^2
Exact line search	minimize f(x - alpha*grad) over alpha
Adaptive	Adam, RMSprop (ML applications)
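
As a sketch of the backtracking row, the loop below shrinks alpha until the Armijo sufficient-decrease condition holds along the negative gradient. The function name `backtracking_step`, the constants c = 1e-4 and rho = 0.5, and the test objective are illustrative assumptions.

```python
import numpy as np

def backtracking_step(f, grad, x, alpha0=1.0, c=1e-4, rho=0.5, max_shrinks=50):
    """Return a step size alpha satisfying the Armijo condition
    f(x - alpha*g) <= f(x) - c * alpha * ||g||^2, with g = grad f(x)."""
    g = grad(x)
    fx = f(x)
    g_sq = g @ g
    alpha = alpha0
    for _ in range(max_shrinks):
        if f(x - alpha * g) <= fx - c * alpha * g_sq:
            break
        alpha *= rho  # step too long: shrink and retry
    return alpha

# Illustrative objective (an assumption): f(x) = x0^2 + 10 * x1^2.
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

x = np.array([1.0, 1.0])
alpha = backtracking_step(f, grad_f, x)
print("accepted step size:", alpha)
```

Each rejected trial costs one extra function evaluation, which is the usual trade-off against an exact line search.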

Accelerated Methods
- Momentum: add velocity term
- Nesterov: look-ahead gradient
- Conjugate gradient: for quadratic functions
- scipy.optimize.minimize(f, x0, method='CG') - conjugate gradient
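
The momentum and Nesterov items differ only in where the gradient is evaluated. The sketch below shows both in one loop; the function name, the values alpha = 0.02 and beta = 0.9, and the test quadratic are assumptions for illustration.

```python
import numpy as np

def momentum_descent(grad, x0, alpha=0.02, beta=0.9, nesterov=False,
                     tol=1e-8, max_iter=10_000):
    """Gradient descent with a velocity term v.
    Classical momentum uses grad f(x); Nesterov uses grad f(x + beta*v)."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for k in range(max_iter):
        if np.linalg.norm(grad(x)) < tol:
            return x, k
        point = x + beta * v if nesterov else x  # look-ahead point for Nesterov
        v = beta * v - alpha * grad(point)
        x = x + v
    return x, max_iter

# Illustrative objective (an assumption): f(x) = x0^2 + 10 * x1^2.
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

x_hb, k_hb = momentum_descent(grad_f, [1.0, 1.0])
x_nag, k_nag = momentum_descent(grad_f, [1.0, 1.0], nesterov=True)
print("momentum:", x_hb, k_hb)
print("nesterov:", x_nag, k_nag)
```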

Newton's Method
- Update: x_{k+1} = x_k - H(x_k)^{-1} * grad f(x_k)
- Requires the Hessian H (expensive per iteration, but quadratic convergence near a minimizer)
- Quasi-Newton (BFGS): approximate Hessian
- scipy.optimize.minimize(f, x0, method='BFGS')
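
A minimal sketch of the pure Newton update, assuming the gradient and Hessian are available in closed form. It solves the linear system H p = grad f instead of forming the inverse, and has no line search or safeguard for an indefinite Hessian; the function names are illustrative.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Pure Newton iteration: x_{k+1} = x_k - H(x_k)^{-1} grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        p = np.linalg.solve(hess(x), g)  # solve H p = g; avoids an explicit inverse
        x = x - p
    return x, max_iter

# Rosenbrock function f(x) = (x0 - 1)^2 + 100 * (x1 - x0^2)^2 with hand-derived derivatives.
def rosen_grad(x):
    return np.array([
        2.0 * (x[0] - 1.0) - 400.0 * x[0] * (x[1] - x[0] ** 2),
        200.0 * (x[1] - x[0] ** 2),
    ])

def rosen_hess(x):
    return np.array([
        [2.0 - 400.0 * x[1] + 1200.0 * x[0] ** 2, -400.0 * x[0]],
        [-400.0 * x[0], 200.0],
    ])

x_star, iters = newton(rosen_grad, rosen_hess, [0.0, 0.0])
print("minimizer:", x_star, "iterations:", iters)
```

In practice a damped or trust-region variant is usually preferred, since the pure iteration can diverge far from a minimizer.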

Convergence Diagnostics
- Monitor ||grad f|| < tolerance
- Check function value decrease
- Watch for oscillation (step size too large)
- sympy_compute.py diff "f" --var x for gradient
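
The first three checks can be sketched as a small wrapper around fixed-step gradient descent that records the objective history. The function name, the returned dictionary keys, and the two step sizes are assumptions for illustration.

```python
import numpy as np

def run_with_diagnostics(f, grad, x0, alpha, tol=1e-6, max_iter=200):
    """Fixed-step gradient descent that records f at every iterate."""
    x = np.asarray(x0, dtype=float)
    f_hist = [f(x)]
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g
        f_hist.append(f(x))
    grad_norm = np.linalg.norm(grad(x))
    # Count iterations where f went up: a sign that the step size is too large.
    increases = sum(b > a for a, b in zip(f_hist, f_hist[1:]))
    return {"x": x, "grad_norm": grad_norm,
            "converged": bool(grad_norm < tol), "f_increases": int(increases)}

# Illustrative objective (an assumption): f(x) = x0^2 + 10 * x1^2.
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

good = run_with_diagnostics(f, grad_f, [1.0, 1.0], alpha=0.05)
bad = run_with_diagnostics(f, grad_f, [1.0, 1.0], alpha=0.11)  # above 2/20, so unstable
print("alpha=0.05:", good["converged"], good["f_increases"])
print("alpha=0.11:", bad["converged"], bad["f_increases"])
```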

Tool Commands

Scipy_Bfgs

uv run python -c "from scipy.optimize import minimize; res = minimize(lambda x: (x[0]-1)**2 + 100*(x[1]-x[0]**2)**2, [0, 0], method='BFGS'); print('Rosenbrock min at', res.x)"
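
The one-liner above lets SciPy approximate the gradient by finite differences. As a hedged variant, the script below passes an analytic gradient through the `jac` argument, using SciPy's built-in `rosen` and `rosen_der` helpers; the starting point matches the command above.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([0.0, 0.0])

# jac=rosen_der supplies the exact gradient instead of a finite-difference estimate.
res = minimize(rosen, x0, method="BFGS", jac=rosen_der)

print("success:", res.success)
print("minimizer:", res.x)
print("iterations:", res.nit, "gradient norm:", np.linalg.norm(res.jac))
```

Supplying the gradient typically reduces the number of function evaluations and avoids finite-difference error near the solution.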

Scipy_Cg

uv run python -c "from scipy.optimize import minimize; res = minimize(lambda x: x[0]**2 + x[1]**2, [1, 1], method='CG'); print('Min at', res.x)"
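
For a strictly convex quadratic f(x) = 0.5 * x^T A x - b^T x, conjugate gradient can be written directly as a linear solver for A x = b. The sketch below is the textbook linear CG iteration, not SciPy's nonlinear `method='CG'`; the function name and the 2x2 test system are illustrative assumptions.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    """Linear CG for symmetric positive definite A: minimizes 0.5*x^T A x - b^T x."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x        # residual, equal to the negative gradient of the quadratic
    p = r.copy()         # first search direction is steepest descent
    rs = r @ r
    for _ in range(n):   # at most n steps in exact arithmetic
        if np.sqrt(rs) < tol:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)          # exact line search along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p      # next direction, A-conjugate to the previous ones
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
print("solution:", x, "residual norm:", np.linalg.norm(A @ x - b))
```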

Sympy_Gradient

uv run python -m runtime.harness scripts/sympy_compute.py diff "x**2 + y**2" --var "[x, y]"
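
The internals of scripts/sympy_compute.py are not shown here, so the following is only an assumed equivalent that calls SymPy directly: it differentiates the same expression with respect to x and y and then turns the symbolic gradient into a numeric function.

```python
import sympy as sp

x, y = sp.symbols("x y")
f = x**2 + y**2

# Symbolic gradient: one partial derivative per variable.
gradient = [sp.diff(f, v) for v in (x, y)]
print(gradient)  # [2*x, 2*y]

# Numeric version of the gradient, usable inside an optimizer loop.
grad_fn = sp.lambdify((x, y), gradient, "numpy")
print(grad_fn(1.0, 2.0))
```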

Key Techniques

From indexed textbooks:

[nonlinear programming_tif] Gradient Methods: These methods use gradient information to iteratively approach the optimum. Convergence: Addressing convergence properties. Descent Directions and Stepsize Rules: Focuses on how to choose descent directions and appropriate step sizes.
[nonlinear programming_tif] The application of gradient methods to unconstrained optimal control problems is straightforward in principle. For example, the steepest descent method takes the form [update equation illegible in the extracted text]. Thus, given u^k, one computes x^k by forward propagation of the system equation, and then p^k by backward propagation of the adjoint equation.
[nonlinear programming_tif] Zoutendijk's method [exercise statement illegible in the extracted text]: (a) Show that [...]. (b) Prove that {d^k} is gradient related, thus establishing stationarity of the [...]. (Min-H Method for Optimal Control) Consider the problem of finding sequences u = [...]
[nonlinear programming_tif] Illustration of the function f of Exercise 1. (Stability) (www) We are often interested in whether optimal solutions change radically when the problem data are slightly perturbed. This issue is addressed by stability analysis, to be contrasted with sensitivity analysis, which deals with how much optimal solutions change when problem data change.

Cognitive Tools Reference

See .claude/skills/math-mode/SKILL.md for full tool documentation.

Source

git clone https://github.com/parcadei/Continuous-Claude-v3

The skill file is .claude/skills/math/optimization/gradient-methods/SKILL.md on the main branch.

Overview

Gradient Methods covers the core gradient-based strategies for optimization, including basic gradient descent, step-size rules, accelerated variants, and Newton-type methods. It also outlines convergence diagnostics to ensure progress and practical tooling like SciPy optimizers. This guide helps practitioners pick the right method and settings for different problem classes.

How This Skill Works

Begin with a descent direction given by the gradient and update the iterate with an appropriate step size. Depending on the problem, you choose from fixed, diminishing, backtracking (Armijo), exact line search, or adaptive methods. For faster convergence on smooth or quadratic problems, you can use accelerated schemes or Newton/quasi-Newton updates, and you validate progress by monitoring gradient norms and objective values, sometimes leveraging SciPy optimizers.

When to Use It

Starting with a basic gradient descent baseline on a differentiable objective.
Tuning step size with fixed, diminishing, backtracking, or exact line search.
Seeking faster convergence via momentum, Nesterov, or conjugate gradient for quadratics.
When Hessians are available or affordable: Newton's method or quasi-Newton (BFGS).
When you must verify progress with convergence diagnostics and gradient norms.

Quick Start

Step 1: Define objective f and an initial guess x0.
Step 2: Pick a method and step-size rule (e.g., backtracking or fixed alpha).
Step 3: Iterate until convergence criteria are met (e.g., ||grad f|| < tol), or use SciPy minimize with the chosen method.

Best Practices

Start with a simple gradient descent baseline and escalate only as needed.
Choose a step-size strategy aligned with your problem: fixed, diminishing, backtracking, or line search.
Use adaptive methods (Adam, RMSprop) for ML-scale problems where appropriate.
Regularly monitor ||grad f|| and objective value to detect stagnation or oscillations.
Leverage established solvers (e.g., SciPy minimize with CG or BFGS) for reliable performance.

Example Use Cases

Minimize Rosenbrock function using SciPy's BFGS to illustrate quasi-Newton efficiency.
Minimize a simple quadratic f(x, y) = x^2 + y^2 with CG to show fast convergence on convex problems.
Apply backtracking Armijo line search to ensure descent when a fixed alpha overshoots.
Tackle an unconstrained optimal control problem using gradient methods (forward/backward propagation insights).
Train a small ML model using Adam/RMSprop for adaptive step sizes in stochastic settings.
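
For the adaptive step-size use case, the Adam update can be sketched in NumPy as below. This is a minimal illustration on a deterministic quadratic rather than a real stochastic training loop; the function name, the learning rate of 0.05, and the step count are assumptions, while beta1, beta2, and eps are the commonly used defaults.

```python
import numpy as np

def adam(grad, x0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: per-coordinate step sizes from running moments of the gradient."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # running mean of gradients (first moment)
    v = np.zeros_like(x)  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**t)  # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2**t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Illustrative objective (an assumption): f(x) = x0^2 + 10 * x1^2.
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

x_adam = adam(grad_f, [1.0, 1.0])
print("Adam result:", x_adam)
```

In a stochastic setting `grad` would return a mini-batch gradient estimate, and the learning rate is usually decayed over time.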

Frequently Asked Questions