Optimization Methods and Batch Size

  1. RMSProp: Adjusts the learning rate according to the gradient. Unlike Momentum SGD, it does not correct the gradient itself; it only scales the learning rate.

  2. Adam: Momentum + RMSProp. The best of both worlds: RMSProp's learning-rate scaling (short-term adjustment) and Momentum's smoothed gradients (long-term adjustment). RMSProp increases or decreases the update amount based on the magnitude of the current gradient; by replacing that raw gradient with a moving average, abrupt gradient changes are also smoothly suppressed. Because the moving average accumulates history, its smoothing effect strengthens in the later stages of training, so the Momentum correction also becomes stronger. Conversely, in the early stages of training both effects are weak, which allows larger updates and efficient initial learning.

Momentum and Moving Average

Moving Average

Computing a moving average yields a smoothed version of a graph with volatile gradient changes. It can smooth out the jagged parts of SGD into gentle curves, suppressing excessive changes in the gradient. It can be computed recursively as a weighted sum of the previous average and the current value.

$$\nu_x = \beta\nu_{x-1} + (1-\beta)V$$
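The recurrence above can be sketched in a few lines; this is a minimal illustration, assuming a synthetic noisy signal (the signal and `beta` value are arbitrary choices, not from the original):

```python
import numpy as np

def ema(values, beta=0.9):
    """Exponential moving average: nu_x = beta * nu_{x-1} + (1 - beta) * V."""
    nu = 0.0
    out = []
    for v in values:
        nu = beta * nu + (1 - beta) * v  # weighted sum of previous average and current value
        out.append(nu)
    return np.array(out)

# A noisy signal: the EMA damps the jagged spikes into a gentle curve.
noisy = np.sin(np.linspace(0, 6, 100)) + np.random.default_rng(0).normal(0, 0.5, 100)
smooth = ema(noisy)
```

The step-to-step changes of `smooth` are much smaller than those of `noisy`, which is exactly the smoothing effect described above.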

Momentum

Momentum simply subtracts the moving-averaged gradient instead of the current gradient.

$$\nu_t = \beta\nu_{t-1} + (1-\beta)G \quad\rightarrow\quad w_t = w_{t-1} - \alpha\nu_t$$
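As a sketch of the update above, here is one Momentum SGD step applied to a toy quadratic (the function, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

def momentum_step(w, nu, grad, alpha=0.1, beta=0.9):
    """One Momentum SGD update: smooth the gradient, then step."""
    nu = beta * nu + (1 - beta) * grad   # nu_t = beta*nu_{t-1} + (1-beta)*G
    w = w - alpha * nu                   # w_t  = w_{t-1} - alpha*nu_t
    return w, nu

# Minimize f(w) = w^2 (gradient 2w) starting from w = 5.0.
w, nu = 5.0, 0.0
for _ in range(200):
    w, nu = momentum_step(w, nu, grad=2 * w)
```

After a few hundred steps `w` has converged close to the minimum at 0; the moving average keeps the trajectory smooth rather than reacting to each raw gradient.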

Together They Make Adam

$$\nu_t = \beta_1\nu_{t-1} + (1-\beta_1)G$$

$$s_t = \beta_2 s_{t-1} + (1-\beta_2)G^2$$

$$w_t = w_{t-1} - \alpha\frac{\nu_t}{\sqrt{s_t + \epsilon}}$$
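The three update rules can be sketched directly in code. Note this follows the formula as written here; the original Adam paper additionally applies bias-correction factors, which are omitted in this minimal sketch (the toy objective and hyperparameters are illustrative assumptions):

```python
import numpy as np

def adam_step(w, nu, s, grad, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: Momentum term nu plus RMSProp term s."""
    nu = beta1 * nu + (1 - beta1) * grad     # smoothed gradient (Momentum)
    s = beta2 * s + (1 - beta2) * grad ** 2  # smoothed squared gradient (RMSProp)
    w = w - alpha * nu / np.sqrt(s + eps)    # learning-rate scaling by sqrt(s)
    return w, nu, s

# Minimize f(w) = w^2 (gradient 2w) starting from w = 5.0.
w, nu, s = 5.0, 0.0, 0.0
for _ in range(1000):
    w, nu, s = adam_step(w, nu, s, grad=2 * w)
```

Early on, both `nu` and `s` are small, so updates are relatively large; as training proceeds, the accumulated moving averages increasingly damp the steps, matching the behavior described above.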


The Effect of Batch Size

  1. Sensitivity to individual data samples
  2. Stability of update values
  3. Memory usage

1. Sensitivity to Individual Data Samples

As the number of samples increases, the mini-batch estimate converges to the population value (its variance shrinks in proportion to 1/B). Conversely, when the number of samples is small, the loss and gradient vary greatly depending on which samples happen to be drawn.

$$\theta_{t+1} \leftarrow \theta_{t} - \epsilon(t) \frac{1}{B} \sum_{b=0}^{B-1} \frac{\partial \mathcal{L}(\theta, \mathbf{m}_b)}{\partial \theta}$$

Here, weights are updated in mini-batch units. The average is taken per mini-batch, and parameters are updated for each mini-batch. This means that the smaller the mini-batch size, the more sensitively it reacts to each individual data point. Conversely, the larger the mini-batch size, the more it is averaged out, capturing the overall characteristics of the mini-batch rather than individual data points.

The smaller the mini-batch size, the more sensitively it reacts to individual data points, leading to unstable learning. However, this can be mitigated to some extent by adjusting parameters such as the learning rate.
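The 1/B averaging effect can be demonstrated empirically. This is a toy setup, assuming a squared-error loss on synthetic data (the distribution parameters and batch sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=10_000)

def minibatch_grad(theta, batch):
    """Gradient of L = mean((theta - x)^2) over one mini-batch."""
    return np.mean(2 * (theta - batch))

def grad_std(batch_size, n_batches=1000, theta=0.0):
    """Spread of the gradient estimate across many sampled mini-batches."""
    grads = [minibatch_grad(theta, rng.choice(data, batch_size))
             for _ in range(n_batches)]
    return float(np.std(grads))
```

Comparing `grad_std(4)` with `grad_std(256)` shows that small batches give far noisier gradient estimates, which is why small-batch training reacts so sensitively to individual samples.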

2. Stability of Update Values

The number of parameter updates per epoch increases as the mini-batch size decreases, causing the update values to become jagged. This can be addressed by taking a moving average, among other methods.

3. Memory Usage

The larger the batch, the more intermediate variables (inputs, activations, gradients) must be held in memory at once. CUDA out-of-memory errors are common when the batch size is too large for the GPU.
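A rough back-of-the-envelope estimate illustrates the linear relationship; this is a simplified sketch for a single dense layer's float32 activations only (real frameworks also store weights, gradients, and optimizer state):

```python
def activation_bytes(batch_size, hidden_dim, dtype_bytes=4):
    """Memory for one layer's float32 activations grows linearly with batch size."""
    return batch_size * hidden_dim * dtype_bytes

# Batch 1024, hidden dimension 4096 -> 16 MiB of activations for one layer.
mib = activation_bytes(1024, 4096) / 2**20
```

Doubling the batch size doubles this term, so halving the batch size is a common first fix for CUDA out-of-memory errors.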