Optimization Methods and Batch Size

  1. RMSProp: Adjusts the learning rate according to the gradient. Unlike Momentum SGD, it does not correct the gradient itself; it only scales the learning rate.

  2. Adam: Momentum + RMSProp. The best of both worlds: RMSProp's learning-rate scaling (short-term adjustment) and Momentum's smoothed gradients (long-term adjustment). RMSProp increases or decreases the update amount based on the magnitude of the current gradient; by replacing that raw gradient with a moving average, abrupt gradient changes are also smoothly suppressed. Because the moving average accumulates history, its smoothing effect strengthens in the later stages of training, so the Momentum correction also becomes stronger. Conversely, in the early stages of training both effects are weak, which allows larger updates and efficient initial learning.

Momentum and Moving Average

Moving Average

Computing a moving average yields a smoothed version of a graph with volatile gradient changes. It can smooth out the jagged parts of SGD into gentle curves, suppressing excessive changes in the gradient. It can be computed recursively as a weighted sum of the previous average and the current value.

$$\nu_x = \beta\nu_{x-1} + (1-\beta)V$$
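The recurrence above can be sketched in a few lines; this is a minimal illustration, assuming a synthetic noisy signal (the signal and `beta` value are arbitrary choices, not from the original):

```python
import numpy as np

def ema(values, beta=0.9):
    """Exponential moving average: nu_x = beta * nu_{x-1} + (1 - beta) * V."""
    nu = 0.0
    out = []
    for v in values:
        nu = beta * nu + (1 - beta) * v  # weighted sum of previous average and current value
        out.append(nu)
    return np.array(out)

# A noisy signal: the EMA damps the jagged spikes into a gentle curve.
noisy = np.sin(np.linspace(0, 6, 100)) + np.random.default_rng(0).normal(0, 0.5, 100)
smooth = ema(noisy)
```

The step-to-step changes of `smooth` are much smaller than those of `noisy`, which is exactly the smoothing effect described above.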

Momentum

Momentum simply subtracts the moving-averaged gradient instead of the current gradient.

$$\nu_t = \beta\nu_{t-1} + (1-\beta)G \quad\rightarrow\quad w_t = w_{t-1} - \alpha\nu_t$$
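As a sketch of the update above, here is one Momentum SGD step applied to a toy quadratic (the function, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

def momentum_step(w, nu, grad, alpha=0.1, beta=0.9):
    """One Momentum SGD update: smooth the gradient, then step."""
    nu = beta * nu + (1 - beta) * grad   # nu_t = beta*nu_{t-1} + (1-beta)*G
    w = w - alpha * nu                   # w_t  = w_{t-1} - alpha*nu_t
    return w, nu

# Minimize f(w) = w^2 (gradient 2w) starting from w = 5.0.
w, nu = 5.0, 0.0
for _ in range(200):
    w, nu = momentum_step(w, nu, grad=2 * w)
```

After a few hundred steps `w` has converged close to the minimum at 0; the moving average keeps the trajectory smooth rather than reacting to each raw gradient.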

Together They Make Adam

$$\nu_t = \beta_1\nu_{t-1} + (1-\beta_1)G$$

$$s_t = \beta_2 s_{t-1} + (1-\beta_2)G^2$$

$$w_t = w_{t-1} - \alpha\frac{\nu_t}{\sqrt{s_t + \epsilon}}$$
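The three update rules can be sketched directly in code. Note this follows the formula as written here; the original Adam paper additionally applies bias-correction factors, which are omitted in this minimal sketch (the toy objective and hyperparameters are illustrative assumptions):

```python
import numpy as np

def adam_step(w, nu, s, grad, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: Momentum term nu plus RMSProp term s."""
    nu = beta1 * nu + (1 - beta1) * grad     # smoothed gradient (Momentum)
    s = beta2 * s + (1 - beta2) * grad ** 2  # smoothed squared gradient (RMSProp)
    w = w - alpha * nu / np.sqrt(s + eps)    # learning-rate scaling by sqrt(s)
    return w, nu, s

# Minimize f(w) = w^2 (gradient 2w) starting from w = 5.0.
w, nu, s = 5.0, 0.0, 0.0
for _ in range(1000):
    w, nu, s = adam_step(w, nu, s, grad=2 * w)
```

Early on, both `nu` and `s` are small, so updates are relatively large; as training proceeds, the accumulated moving averages increasingly damp the steps, matching the behavior described above.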


The Effect of Batch Size

  1. Sensitivity to individual data samples
  2. Stability of update values
  3. Memory usage

1. Sensitivity to Individual Data Samples

As the number of samples increases, the mini-batch estimate converges to the population value (its variance shrinks in proportion to 1/B). Conversely, when the number of samples is small, the loss and gradient vary greatly depending on which samples happen to be drawn.

$$\theta_{t+1} \leftarrow \theta_{t} - \epsilon(t) \frac{1}{B} \sum_{b=0}^{B-1} \frac{\partial \mathcal{L}(\theta, \mathbf{m}_b)}{\partial \theta}$$

Here, weights are updated in mini-batch units. The average is taken per mini-batch, and parameters are updated for each mini-batch. This means that the smaller the mini-batch size, the more sensitively it reacts to each individual data point. Conversely, the larger the mini-batch size, the more it is averaged out, capturing the overall characteristics of the mini-batch rather than individual data points.

The smaller the mini-batch size, the more sensitively it reacts to individual data points, leading to unstable learning. However, this can be mitigated to some extent by adjusting parameters such as the learning rate.
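The 1/B averaging effect can be demonstrated empirically. This is a toy setup, assuming a squared-error loss on synthetic data (the distribution parameters and batch sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=10_000)

def minibatch_grad(theta, batch):
    """Gradient of L = mean((theta - x)^2) over one mini-batch."""
    return np.mean(2 * (theta - batch))

def grad_std(batch_size, n_batches=1000, theta=0.0):
    """Spread of the gradient estimate across many sampled mini-batches."""
    grads = [minibatch_grad(theta, rng.choice(data, batch_size))
             for _ in range(n_batches)]
    return float(np.std(grads))
```

Comparing `grad_std(4)` with `grad_std(256)` shows that small batches give far noisier gradient estimates, which is why small-batch training reacts so sensitively to individual samples.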

2. Stability of Update Values

The number of parameter updates per epoch increases as the mini-batch size decreases, causing the update values to become jagged. This can be addressed by taking a moving average, among other methods.

3. Memory Usage

The larger the batch, the more intermediate variables (inputs, activations, gradients) must be held in memory at once. CUDA out-of-memory errors are common when the batch size is too large for the GPU.
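A rough back-of-the-envelope estimate illustrates the linear relationship; this is a simplified sketch for a single dense layer's float32 activations only (real frameworks also store weights, gradients, and optimizer state):

```python
def activation_bytes(batch_size, hidden_dim, dtype_bytes=4):
    """Memory for one layer's float32 activations grows linearly with batch size."""
    return batch_size * hidden_dim * dtype_bytes

# Batch 1024, hidden dimension 4096 -> 16 MiB of activations for one layer.
mib = activation_bytes(1024, 4096) / 2**20
```

Doubling the batch size doubles this term, so halving the batch size is a common first fix for CUDA out-of-memory errors.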