(l-onnx-docai-onnx-preview-training-Adagrad)=

# ai.onnx.preview.training - Adagrad


(l-onnx-opai-onnx-preview-training-adagrad-1)=

## Adagrad - 1 (ai.onnx.preview.training)

### Version

- **name**: [Adagrad (GitHub)](https://github.com/onnx/onnx/blob/main/docs/Operators.md#ai.onnx.preview.training.Adagrad)
- **domain**: `ai.onnx.preview.training`
- **since_version**: `1`
- **function**: `False`
- **support_level**: `SupportType.COMMON`
- **shape inference**: `True`

This version of the operator has been available
**since version 1 of domain ai.onnx.preview.training**.

### Summary

Compute one iteration of ADAGRAD, a stochastic gradient based optimization
algorithm. This operator can conduct the optimization of multiple tensor variables.

Let's define the behavior of this operator. As you can imagine, ADAGRAD requires
some parameters:

 - The initial learning-rate "R".
 - The update count "T". That is, the number of training iterations conducted.
 - An L2-norm regularization coefficient "norm_coefficient".
 - A learning-rate decay factor "decay_factor".
 - A small constant "epsilon" to avoid division by zero.

At each ADAGRAD iteration, the optimized tensors are moved along a direction
computed based on their estimated gradient and accumulated squared gradient. Assume
that only a single tensor "X" is updated by this operator. We need the value of "X",
its gradient "G", and its accumulated squared gradient "H". Therefore, variables in
this operator's input list are sequentially "R", "T", "X", "G", and "H". Other
parameters are given as attributes because they are usually constants. Also, the
corresponding output tensors are the new value of "X" (called "X_new"), and then
the new accumulated squared gradient (called "H_new"). Those outputs are computed
from the given inputs following the pseudo code below.

Let "+", "-", "*", and "/" be element-wise arithmetic operations with
numpy-style broadcasting support. The pseudo code to compute those outputs is:

    // Compute a scalar learning-rate factor. At the first update of X, T is generally
    // 0 (0-based update index) or 1 (1-based update index).
    r = R / (1 + T * decay_factor);

    // Add gradient of 0.5 * norm_coefficient * ||X||_2^2, where ||X||_2 is the 2-norm.
    G_regularized = norm_coefficient * X + G;

    // Compute new accumulated squared gradient.
    H_new = H + G_regularized * G_regularized;

    // Compute the adaptive part of per-coordinate learning rate. Note that Sqrt(...)
    // computes element-wise square-root.
    H_adaptive = Sqrt(H_new) + epsilon;

    // Compute the new value of "X".
    X_new = X - r * G_regularized / H_adaptive;
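
The pseudo code above can be sketched in plain Python as follows. This is a minimal illustration rather than the reference implementation: the function name `adagrad_step` is hypothetical, tensors are assumed to be flat lists of floats of equal length, and numpy-style broadcasting is not modeled.

```python
import math


def adagrad_step(r_init, t, x, g, h,
                 norm_coefficient=0.0, decay_factor=0.0, epsilon=0.0):
    """Apply one ADAGRAD update to a single flattened tensor.

    Hypothetical helper: x, g, and h are flat lists of floats with the
    same length (value, gradient, accumulated squared gradient).
    """
    # Scalar learning-rate factor: r = R / (1 + T * decay_factor).
    r = r_init / (1 + t * decay_factor)

    x_new, h_new = [], []
    for xi, gi, hi in zip(x, g, h):
        # Add the gradient of 0.5 * norm_coefficient * ||X||_2^2.
        g_reg = norm_coefficient * xi + gi
        # New accumulated squared gradient.
        h_i = hi + g_reg * g_reg
        # Adaptive part of the per-coordinate learning rate.
        # With epsilon == 0 and h_i == 0 this divides by zero.
        h_adaptive = math.sqrt(h_i) + epsilon
        x_new.append(xi - r * g_reg / h_adaptive)
        h_new.append(h_i)
    return x_new, h_new


# Example with R = 0.1, T = 0, X = [1.0], G = [2.0], H = [0.0].
x_new, h_new = adagrad_step(0.1, 0, [1.0], [2.0], [0.0])
# x_new is approximately [0.9] and h_new is [4.0].
```

The function returns the two outputs in the same order as the operator: the new value first, then the new accumulated squared gradient.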

If this operator is assigned to optimize multiple inputs, for example "X_1" and "X_2", the same
pseudo code may be extended to handle all tensors jointly. More specifically, we can view "X" as a
concatenation of "X_1" and "X_2" (their gradients and accumulated squared gradients should
be concatenated too) and then reuse the entire pseudo code.
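
Because every step of the update is element-wise, updating the concatenation should give the same result as updating each tensor on its own. The sketch below illustrates this under simplifying assumptions: the helper `adagrad_update` and the sample values are hypothetical, tensors are flattened to lists of floats, and the learning rate `r` is taken as already decayed.

```python
import math


def adagrad_update(r, x, g, h, norm_coefficient=0.0, epsilon=0.0):
    """Element-wise ADAGRAD update on flat lists (hypothetical helper)."""
    x_new, h_new = [], []
    for xi, gi, hi in zip(x, g, h):
        g_reg = norm_coefficient * xi + gi
        h_i = hi + g_reg * g_reg
        x_new.append(xi - r * g_reg / (math.sqrt(h_i) + epsilon))
        h_new.append(h_i)
    return x_new, h_new


# Two illustrative tensors: value, gradient, accumulated squared gradient.
x1, g1, h1 = [1.0, 2.0], [0.5, -0.5], [0.1, 0.2]
x2, g2, h2 = [3.0], [1.0], [0.3]

# Update each tensor separately.
sep_x1, sep_h1 = adagrad_update(0.1, x1, g1, h1, epsilon=1e-6)
sep_x2, sep_h2 = adagrad_update(0.1, x2, g2, h2, epsilon=1e-6)

# Update the concatenation of both tensors in a single call.
cat_x, cat_h = adagrad_update(0.1, x1 + x2, g1 + g2, h1 + h2, epsilon=1e-6)

# The joint update matches the separate updates placed side by side.
joint_matches_separate = (cat_x == sep_x1 + sep_x2
                          and cat_h == sep_h1 + sep_h2)
```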

Note that ADAGRAD was first proposed in http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.
In that reference paper, this operator is a special case of the composite mirror
descent update in Figure 1.

### Attributes

* **decay_factor - FLOAT** (default is `0.0`):

  The decay factor of the learning rate after one update. The effective learning rate is computed by r = R / (1 + T * decay_factor). Defaults to 0 so that increasing the update count doesn't reduce the learning rate.

* **epsilon - FLOAT** (default is `0.0`):

  Small scalar to avoid dividing by zero.

* **norm_coefficient - FLOAT** (default is `0.0`):

  Regularization coefficient in 0.5 * norm_coefficient * ||X||_2^2. Defaults to 0, which means no regularization.
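
To make the effect of `decay_factor` concrete, the snippet below evaluates r = R / (1 + T * decay_factor) for a few update counts. The function name `effective_learning_rate` and the chosen values (R = 1.0, decay_factor = 0.25) are illustrative assumptions only.

```python
def effective_learning_rate(r_init, t, decay_factor=0.0):
    """Return r = R / (1 + T * decay_factor) for update count t."""
    return r_init / (1 + t * decay_factor)


# Hypothetical settings: R = 1.0 and decay_factor = 0.25, for T = 0..4.
schedule = [effective_learning_rate(1.0, t, decay_factor=0.25)
            for t in range(5)]

# With the default decay_factor of 0 the rate stays at R for any T.
constant_rate = effective_learning_rate(0.1, 1000)
```

With these values the rate starts at 1.0 for T = 0 and falls to 0.5 at T = 4, while `constant_rate` stays at 0.1.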

### Inputs

Between 3 and 2147483647 inputs.

- **R** (heterogeneous) - **T1**:

  The initial learning rate.
- **T** (heterogeneous) - **T2**:

  The update count of "X". It should be a scalar.
- **inputs** (variadic) - **T3**:

  The current values of optimized tensors, followed by their respective gradients, followed by their respective accumulated squared gradients. For example, if two tensors "X_1" and "X_2" are optimized, the input list would be ["X_1", "X_2", gradient of "X_1", gradient of "X_2", accumulated squared gradient of "X_1", accumulated squared gradient of "X_2"].
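
The grouped layout of the variadic inputs can be illustrated with a small helper that splits the list back into its three groups. This is only a sketch: `split_adagrad_inputs` is a hypothetical name, it takes only the variadic part (the tensors after "R" and "T"), and the strings `G_1`, `H_1`, and so on are placeholders for the gradient and accumulator tensors.

```python
def split_adagrad_inputs(inputs):
    """Split [X_1..X_n, G_1..G_n, H_1..H_n] into three equally sized groups.

    Returns (values, gradients, accumulators). Hypothetical helper that
    handles only the variadic part of the input list.
    """
    if len(inputs) == 0 or len(inputs) % 3 != 0:
        raise ValueError(
            "expected 3 * n tensors: values, gradients, accumulators")
    n = len(inputs) // 3
    return inputs[:n], inputs[n:2 * n], inputs[2 * n:]


# Placeholder names standing in for the actual tensors.
names = ["X_1", "X_2", "G_1", "G_2", "H_1", "H_2"]
values, gradients, accumulators = split_adagrad_inputs(names)
# values == ["X_1", "X_2"]
# gradients == ["G_1", "G_2"]
# accumulators == ["H_1", "H_2"]
```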

### Outputs

Between 1 and 2147483647 outputs.

- **outputs** (variadic) - **T3**:

  Updated values of optimized tensors, followed by their updated accumulated squared gradients. For example, if two tensors "X_1" and "X_2" are optimized, the output list would be [new value of "X_1", new value of "X_2", new accumulated squared gradient of "X_1", new accumulated squared gradient of "X_2"].

### Type Constraints

* **T1** in ( `tensor(double)`, `tensor(float)` ):

  Constrain input types to float scalars.
* **T2** in ( `tensor(int64)` ):

  Constrain input types to 64-bit integer scalars.
* **T3** in ( `tensor(double)`, `tensor(float)` ):

  Constrain input and output types to float tensors.