Optimizers

The various optimizers that you can use to tune your parameters
struct dynet::SimpleSGDTrainer
#include <training.h>

Stochastic gradient descent trainer.

This trainer performs stochastic gradient descent, the go-to optimization procedure for neural networks. In the standard setting, the learning rate at epoch \(t\) is \(\eta_t=\frac{\eta_0}{1+\eta_{\mathrm{decay}}t}\).

Reference: reference needed

Inherits from dynet::Trainer
Public Functions
dynet::SimpleSGDTrainer::SimpleSGDTrainer(Model &m, real e0 = 0.1, real edecay = 0.0)
Constructor.

Parameters:
- m: Model to be trained
- e0: Initial learning rate
- edecay: Learning rate decay parameter
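To make the schedule and update rule concrete, here is a minimal plain-Python sketch of inverse-time learning rate decay followed by a vanilla SGD step. This is illustrative only, not DyNet's implementation; the function names are made up.

```python
def sgd_lr(e0, edecay, t):
    """Learning rate at epoch t under inverse-time decay: e0 / (1 + edecay * t)."""
    return e0 / (1.0 + edecay * t)

def sgd_step(theta, grad, lr):
    """One plain SGD update: theta <- theta - lr * grad."""
    return [p - lr * g for p, g in zip(theta, grad)]

lr0 = sgd_lr(0.1, 0.05, 0)    # epoch 0: the initial rate, 0.1
lr10 = sgd_lr(0.1, 0.05, 10)  # epoch 10: 0.1 / 1.5
theta = sgd_step([1.0, -2.0], [0.5, 0.5], lr0)
```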
struct dynet::CyclicalSGDTrainer
#include <training.h>

Cyclical learning rate SGD.

This trainer performs stochastic gradient descent with a cyclical learning rate, as proposed in Smith, 2015. It uses a triangular function with optional exponential decay. More specifically, at each update, the learning rate \(\eta\) is updated according to:

\( \begin{split} \text{cycle} &= \left\lfloor 1 + \frac{\texttt{it}}{2 \times\texttt{step_size}} \right\rfloor\\ x &= \left\vert \frac{\texttt{it}}{\texttt{step_size}} - 2 \times \text{cycle} + 1\right\vert\\ \eta &= \eta_{\text{min}} + (\eta_{\text{max}} - \eta_{\text{min}}) \times \max(0, 1 - x) \times \gamma^{\texttt{it}}\\ \end{split} \)

Reference: Cyclical Learning Rates for Training Neural Networks

Inherits from dynet::Trainer
Public Functions
dynet::CyclicalSGDTrainer::CyclicalSGDTrainer(Model &m, float e0_min = 0.01, float e0_max = 0.1, float step_size = 2000, float gamma = 0.0, float edecay = 0.0)
Constructor.

Parameters:
- m: Model to be trained
- e0_min: Lower learning rate
- e0_max: Upper learning rate
- step_size: Period of the triangular function, in number of iterations (not epochs). According to the original paper, this should be set to around 2-8 times the number of training iterations per epoch
- gamma: Learning rate upper bound decay parameter
- edecay: Learning rate decay parameter. Ideally you shouldn't use this with a cyclical learning rate, since decay is already handled by \(\gamma\)
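The triangular formula above can be sketched directly in plain Python. This is not DyNet code; the function name is made up, and a no-decay default of gamma = 1 is my own choice for illustration.

```python
import math

def cyclical_lr(it, eta_min=0.01, eta_max=0.1, step_size=2000, gamma=1.0):
    """Triangular cyclical learning rate with optional exponential decay."""
    cycle = math.floor(1 + it / (2 * step_size))
    x = abs(it / step_size - 2 * cycle + 1)
    return eta_min + (eta_max - eta_min) * max(0.0, 1 - x) * gamma ** it

cyclical_lr(0)     # start of a cycle: eta_min
cyclical_lr(2000)  # mid-cycle peak: eta_max (with gamma = 1)
cyclical_lr(4000)  # end of the cycle: back to eta_min
```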
struct dynet::MomentumSGDTrainer
#include <training.h>

Stochastic gradient descent with momentum.

This is a modified version of the SGD algorithm with momentum to stabilize the gradient trajectory. The modified gradient is \(\theta_{t+1}=\mu\theta_{t}+\nabla_{t+1}\), where \(\mu\) is the momentum.

Reference: reference needed

Inherits from dynet::Trainer
Public Functions
dynet::MomentumSGDTrainer::MomentumSGDTrainer(Model &m, real e0 = 0.01, real mom = 0.9, real edecay = 0.0)
Constructor.

Parameters:
- m: Model to be trained
- e0: Initial learning rate
- mom: Momentum
- edecay: Learning rate decay parameter
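A minimal plain-Python sketch of one momentum step (accumulate the modified gradient, then move the parameters along it). This is illustrative only, not DyNet's implementation; the helper name is made up.

```python
def momentum_step(theta, v, grad, e0=0.01, mom=0.9):
    """One SGD-with-momentum step: v <- mom * v + grad; theta <- theta - e0 * v."""
    v = [mom * vi + gi for vi, gi in zip(v, grad)]
    theta = [p - e0 * vi for p, vi in zip(theta, v)]
    return theta, v

# Repeated identical gradients make the accumulated velocity grow.
theta, v = momentum_step([1.0], [0.0], [1.0])  # v = 1.0
theta, v = momentum_step(theta, v, [1.0])      # v = 0.9 * 1.0 + 1.0 = 1.9
```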
struct dynet::AdagradTrainer
#include <training.h>

Adagrad optimizer.

The Adagrad algorithm assigns a different learning rate to each parameter according to the following formula: \(\delta_\theta^{(t)}=-\frac{\eta_0}{\sqrt{\epsilon+\sum_{i=0}^{t-1}(\nabla_\theta^{(i)})^2}}\nabla_\theta^{(t)}\)

Reference: Duchi et al., 2011

Inherits from dynet::Trainer
Public Functions
dynet::AdagradTrainer::AdagradTrainer(Model &m, real e0 = 0.1, real eps = 1e-20, real edecay = 0.0)
Constructor.

Parameters:
- m: Model to be trained
- e0: Initial learning rate
- eps: Bias parameter \(\epsilon\) in the Adagrad formula
- edecay: Learning rate decay parameter
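A plain-Python sketch of the per-parameter Adagrad rule. It folds the current gradient into the running sum before scaling, as is standard; this is illustrative only, not DyNet's implementation, and the helper name is made up.

```python
import math

def adagrad_step(theta, g_sq_sum, grad, e0=0.1, eps=1e-20):
    """Adagrad: scale each parameter's step by the history of its squared gradients."""
    g_sq_sum = [s + g * g for s, g in zip(g_sq_sum, grad)]
    theta = [p - e0 / math.sqrt(eps + s) * g
             for p, s, g in zip(theta, g_sq_sum, grad)]
    return theta, g_sq_sum

# First step with gradient 0.5: sum = 0.25, step = 0.1 / 0.5 * 0.5 = 0.1.
theta, s = adagrad_step([1.0], [0.0], [0.5])
```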
struct dynet
::
AdadeltaTrainer
¶ - #include <training.h>
AdaDelta optimizer.
The AdaDelta optimizer is a variant of Adagrad where \(\frac{\eta_0}{\sqrt{\epsilon+\sum_{i=0}^{t-1}(\nabla_\theta^{(i)})^2}}\) is replaced by \(\frac{\sqrt{\epsilon+\sum_{i=0}^{t-1}\rho^{t-i-1}(1-\rho)(\delta_\theta^{(i)})^2}}{\sqrt{\epsilon+\sum_{i=0}^{t-1}(\nabla_\theta^{(i)})^2}}\), hence eliminating the need for an initial learning rate.
Reference : ADADELTA: An Adaptive Learning Rate Method
Inherits from dynet::Trainer
Public Functions
dynet::AdadeltaTrainer::AdadeltaTrainer(Model &m, real eps = 1e-6, real rho = 0.95, real edecay = 0.0)
Constructor.

Parameters:
- m: Model to be trained
- eps: Bias parameter \(\epsilon\) in the AdaDelta formula
- rho: Update parameter for the moving average of updates in the numerator
- edecay: Learning rate decay parameter
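The two moving averages (of squared updates in the numerator, of squared gradients in the denominator) can be sketched in plain Python as follows. This is a sketch of the AdaDelta rule from the paper, not DyNet's implementation; the helper name is made up.

```python
import math

def adadelta_step(theta, acc_g, acc_dx, grad, rho=0.95, eps=1e-6):
    """AdaDelta: no learning rate; the step size adapts from two moving averages."""
    new_theta, new_g, new_dx = [], [], []
    for p, ag, ad, g in zip(theta, acc_g, acc_dx, grad):
        ag = rho * ag + (1 - rho) * g * g          # moving avg of squared gradients
        dx = -math.sqrt(ad + eps) / math.sqrt(ag + eps) * g
        ad = rho * ad + (1 - rho) * dx * dx        # moving avg of squared updates
        new_theta.append(p + dx)
        new_g.append(ag)
        new_dx.append(ad)
    return new_theta, new_g, new_dx

theta, ag, ad = adadelta_step([1.0], [0.0], [0.0], [1.0])
```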
struct dynet
::
RMSPropTrainer
¶ - #include <training.h>
RMSProp optimizer.
The RMSProp optimizer is a variant of Adagrad where the squared sum of previous gradients is replaced with a moving average with parameter \(\rho\).
Reference : reference needed
Inherits from dynet::Trainer
Public Functions
dynet::RMSPropTrainer::RMSPropTrainer(Model &m, real e0 = 0.001, real eps = 1e-08, real rho = 0.9, real edecay = 0.0)
Constructor.

Parameters:
- m: Model to be trained
- e0: Initial learning rate
- eps: Bias parameter \(\epsilon\) in the RMSProp formula
- rho: Update parameter for the moving average (rho = 0 is equivalent to using Adagrad)
- edecay: Learning rate decay parameter
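A plain-Python sketch of one RMSProp step using the common form of the rule (exponential moving average of squared gradients). This is illustrative only, not DyNet's implementation; implementations differ in details such as whether \(\epsilon\) sits inside or outside the square root, and the helper name is made up.

```python
import math

def rmsprop_step(theta, ms, grad, e0=0.001, eps=1e-8, rho=0.9):
    """RMSProp: divide each step by the RMS of a moving average of squared gradients."""
    ms = [rho * m + (1 - rho) * g * g for m, g in zip(ms, grad)]
    theta = [p - e0 * g / (math.sqrt(m) + eps)
             for p, m, g in zip(theta, ms, grad)]
    return theta, ms

theta, ms = rmsprop_step([1.0], [0.0], [1.0])
```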
struct dynet
::
AdamTrainer
¶ - #include <training.h>
Adam optimizer.
The Adam optimizer is similar to RMSProp but uses unbiased estimates of the first and second moments of the gradient
Reference : Adam: A Method for Stochastic Optimization
Inherits from dynet::Trainer
Public Functions
dynet::AdamTrainer::AdamTrainer(Model &m, float e0 = 0.001, float beta_1 = 0.9, float beta_2 = 0.999, float eps = 1e-8, real edecay = 0.0)
Constructor.

Parameters:
- m: Model to be trained
- e0: Initial learning rate
- beta_1: Moving average parameter for the mean
- beta_2: Moving average parameter for the variance
- eps: Bias parameter \(\epsilon\)
- edecay: Learning rate decay parameter
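The bias-corrected moment estimates can be sketched in plain Python, following the Adam paper. This is illustrative only, not DyNet's implementation; the helper name is made up, and the step counter t starts at 1.

```python
import math

def adam_step(theta, m, v, grad, t, e0=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: moving averages of the mean and variance, corrected for startup bias."""
    new_theta, new_m, new_v = [], [], []
    for p, mi, vi, g in zip(theta, m, v, grad):
        mi = b1 * mi + (1 - b1) * g            # first-moment moving average
        vi = b2 * vi + (1 - b2) * g * g        # second-moment moving average
        m_hat = mi / (1 - b1 ** t)             # bias correction (t counts from 1)
        v_hat = vi / (1 - b2 ** t)
        new_theta.append(p - e0 * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

# On the very first step the bias correction makes the step size about e0.
theta, m, v = adam_step([1.0], [0.0], [0.0], [1.0], t=1)
```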
struct dynet
::
Trainer
¶ - #include <training.h>
General trainer struct.
Subclassed by dynet::AdadeltaTrainer, dynet::AdagradTrainer, dynet::AdamTrainer, dynet::CyclicalSGDTrainer, dynet::MomentumSGDTrainer, dynet::RMSPropTrainer, dynet::SimpleSGDTrainer
Public Functions
dynet::Trainer::Trainer(Model &m, real e0, real edecay = 0.0)
General constructor for a Trainer.

Parameters:
- m: Model to be trained
- e0: Initial learning rate
- edecay: Learning rate decay
void dynet::Trainer::update(real scale = 1.0)
Update parameters.

Update the parameters according to the appropriate update rule.

Parameters:
- scale: The scaling factor for the gradients
void dynet::Trainer::update(const std::vector<unsigned> &updated_params, const std::vector<unsigned> &updated_lookup_params, real scale = 1.0)
Update a subset of parameters.

Update some but not all of the parameters included in the model. This is the update_subset() function in the Python bindings. The parameters to be updated are specified by index, which can be found for Parameter and LookupParameter objects through the "index" variable (or the get_index() function in the Python bindings).

Parameters:
- updated_params: The parameter indices to be updated
- updated_lookup_params: The lookup parameter indices to be updated
- scale: The scaling factor for the gradients
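The idea of updating only a subset of parameters by index can be sketched in plain Python with a simple SGD rule. This is a toy illustration of the concept, not DyNet's implementation; the function name is made up.

```python
def update_subset(theta, grads, updated_params, lr=0.1):
    """Apply an SGD step only to the parameters whose indices are listed."""
    for i in updated_params:
        theta[i] = theta[i] - lr * grads[i]
    return theta

# Only indices 0 and 2 move; index 1 is left untouched.
theta = update_subset([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [0, 2])
```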
Public Members
bool dynet::Trainer::sparse_updates_enabled
Whether to perform sparse updates.

DyNet trainers support two types of updates for lookup parameters: sparse and dense. Sparse updates are the default. They have the potential to be faster, as they only touch the parameters that have non-zero gradients. However, they may not always be faster (particularly on GPU with mini-batch training), and they are not precisely numerically correct for some update rules such as MomentumTrainer and AdamTrainer. If you set this variable to false, the trainer will perform dense updates, which are guaranteed to be correct and may sometimes even be faster.