Implicit Gradient Transport - Sébastien M. R. Arnold

Reducing the variance in online optimization by transporting past gradients

Summary

Most stochastic optimization methods use gradients once before discarding them. While variance reduction methods have shown that reusing past gradients can be beneficial when there is a finite number of datapoints, they do not easily extend to the online setting. One issue is the staleness introduced by reusing past gradients. We propose to correct this staleness using the idea of implicit gradient transport (IGT), which transforms gradients computed at previous iterates into gradients evaluated at the current iterate without using the Hessian explicitly. In addition to reducing the variance and bias of our updates over time, IGT can be used as a drop-in replacement for the gradient estimate in a number of well-understood methods such as heavy ball or Adam. We show experimentally that it achieves state-of-the-art results on a wide range of architectures and benchmarks. Additionally, the IGT gradient estimator yields the optimal asymptotic convergence rate for online stochastic optimization in the restricted setting where the Hessians of all component functions are equal.
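In symbols, a sketch of the estimator: writing \(f_t\) for the loss sampled at step \(t\), \(g_t\) for the IGT estimate, and taking the averaging weight \(\gamma_t = t/(t+1)\), each new gradient is evaluated at an extrapolated point before being folded into the running average,

\[
g_t = \gamma_t\, g_{t-1} + (1 - \gamma_t)\, \nabla f_t\!\left(\theta_t + \tfrac{\gamma_t}{1-\gamma_t}\,(\theta_t - \theta_{t-1})\right),
\]

and \(g_t\) then replaces the usual stochastic gradient in the update (plain SGD, heavy ball, or Adam).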


Pseudocode

Pseudo-code of Heavyball using the IGT gradient estimator.
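Below is a minimal NumPy sketch of that pseudocode. The function name heavyball_igt, the hyperparameter values, and the toy quadratic objective are illustrative assumptions, not the released implementation.

import numpy as np

def heavyball_igt(theta0, grad, num_steps, lr=0.1, momentum=0.9):
    """Heavy ball with the IGT gradient estimator (sketch).

    `grad(theta)` returns a stochastic gradient at `theta`. At step t,
    with gamma_t = t / (t + 1), the fresh gradient is evaluated at the
    extrapolated point theta_t + (gamma_t / (1 - gamma_t)) * (theta_t - theta_prev)
    and averaged into the running IGT estimate g.
    """
    theta_prev = np.array(theta0, dtype=float)
    g = grad(theta_prev)          # initial estimate: a plain stochastic gradient
    w = -lr * g                   # heavy ball velocity
    theta = theta_prev + w
    for t in range(1, num_steps + 1):
        gamma = t / (t + 1)
        # Gradient at the extrapolated ("transported") point.
        shifted = theta + (gamma / (1.0 - gamma)) * (theta - theta_prev)
        g = gamma * g + (1.0 - gamma) * grad(shifted)
        # Standard heavy ball step, with g as a drop-in gradient estimate.
        w = momentum * w - lr * g
        theta_prev, theta = theta, theta + w
    return theta

# Toy usage: noisy gradients of the quadratic f(theta) = 0.5 * ||theta||^2.
rng = np.random.default_rng(0)
noisy_grad = lambda theta: theta + 0.1 * rng.standard_normal(theta.shape)
print(heavyball_igt(np.ones(3), noisy_grad, num_steps=200))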


Illustration

Schematic representation of implicit gradient transport: computing the gradient at an extrapolated parameter value provides the correction needed to “transport” a gradient estimate from \(\theta_{t-1}\) to the current iterate.
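A worked sketch of the identity behind this picture, assuming (as in the summary) that all component functions share the same Hessian \(H\), and writing \(g_{t-1}\) for the average of the past gradients at \(\theta_{t-1}\): transporting that average to the current iterate requires the correction

\[
\frac{1}{t}\sum_{i=0}^{t-1} \nabla f_i(\theta_t) = g_{t-1} + H\,(\theta_t - \theta_{t-1}),
\]

and a single gradient taken at the extrapolated point supplies both the fresh gradient and that correction,

\[
\nabla f_t\!\left(\theta_t + \tfrac{\gamma_t}{1-\gamma_t}\,(\theta_t - \theta_{t-1})\right)
= \nabla f_t(\theta_t) + \tfrac{\gamma_t}{1-\gamma_t}\, H\,(\theta_t - \theta_{t-1}),
\]

so the IGT estimate is moved from \(\theta_{t-1}\) to \(\theta_t\) without ever forming \(H\).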


Reference

@inproceedings{NEURIPS2019_1dba5eed,
 author = {Arnold, S\'{e}bastien M. R. and Manzagol, Pierre-Antoine and Babanezhad Harikandeh, Reza and Mitliagkas, Ioannis and Le Roux, Nicolas},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett},
 pages = {5391--5402},
 publisher = {Curran Associates, Inc.},
 title = {Reducing the variance in online optimization by transporting past gradients},
 url = {https://proceedings.neurips.cc/paper/2019/file/1dba5eed8838571e1c80af145184e515-Paper.pdf},
 volume = {32},
 year = {2019}
}

Contact
Séb Arnold - seb.arnold@usc.edu

About

Reusing the information in past gradients reduces variance and accelerates convergence.

Venue

NeurIPS 2019
