Data Cleansing Methods for SGD-Optimized Models
Paper: Data Cleansing for Models Trained with SGD
arXiv: https://arxiv.org/abs/1906.08473
GitHub implementation: https://github.com/sato9hara/sgd-influence
Overview
When training does not go well, one option is to inspect the training data itself. However, such error analysis requires domain knowledge to reason about which instances (data points) strongly influence training. This paper presents a method for cleansing instances that negatively affect training, without requiring domain knowledge.
Keywords:
- influence estimation
- counterfactual SGD
- Linear Influence Estimation (LIE)
- Kendall's tau
- Jaccard index
2. Preliminaries
A review of SGD.
3. SGD-Influence
At training step t, let θ_t denote the parameters of the ordinary SGD run and θ_t^{-j} the parameters of a run in which instance z_j is removed from every minibatch. SGD-influence is then defined as the difference θ_t^{-j} − θ_t. In other words, if the parameters change substantially when the j-th instance is dropped, that instance has a significant influence on training. Since this takes a reverse perspective of looking back at the data from the training result, the removed-instance run is called Counterfactual SGD (counterfactual: what would have happened had the instance been absent).
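The counterfactual definition can be checked by brute force: retrain with the same instance order but with instance j skipped, and take the parameter difference. A minimal sketch with a one-parameter logistic model; the data, learning rate, and function names are all illustrative, not from the paper's code.

```python
import math
import random

# Brute-force SGD-influence: train twice with the same shuffling seed
# (so both runs see instances in the same order), once on all data and
# once with instance j skipped, then take theta_{-j} - theta.

def sgd(data, skip=None, lr=0.1, epochs=20, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    idx = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(idx)                      # identical order in both runs
        for i in idx:
            if i == skip:
                continue                      # counterfactual: drop instance j
            x, y = data[i]
            p = 1.0 / (1.0 + math.exp(-theta * x))
            theta -= lr * (p - y) * x         # logistic-loss gradient step
    return theta

data = [(1.0, 1), (2.0, 1), (-1.0, 0), (-2.0, 0), (3.0, 0)]  # last point mislabeled
theta_full = sgd(data)
influences = [sgd(data, skip=j) - theta_full for j in range(len(data))]
print(influences)   # the mislabeled instance moves the parameter the most
```

As expected, the mislabeled instance has by far the largest SGD-influence; this is exactly the signal the paper wants to obtain without the N retrainings performed here.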
Linear Influence Estimation (LIE)
For a query vector u, LIE considers the linear influence, i.e. the inner product ⟨u, θ_t^{-j} − θ_t⟩ (θ_t: parameters at step t; θ_t^{-j}: parameters when instance z_j is removed). Substituting the gradient of the prediction function or the gradient of the loss function for u makes this formulation usable; with the latter, one can measure the influence of instance z_j on the loss.
In practice, the change in loss is approximated by a first-order Taylor expansion,
L(θ_t^{-j}) − L(θ_t) ≈ ⟨∇L(θ_t), θ_t^{-j} − θ_t⟩,
and can be computed accordingly.
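The inner-product form is a one-liner. A sketch with made-up numbers, just to fix the sign conventions:

```python
# Linear Influence Estimation: the influence of instance j along a query
# direction u is <u, theta_{-j} - theta>. With u set to the gradient of the
# validation loss, the first-order Taylor expansion gives
#   L(theta_{-j}) - L(theta) ~ <grad L(theta), theta_{-j} - theta>.
# All numbers below are made up for illustration.

theta      = [0.5, -1.2]    # parameters trained on all data
theta_no_j = [0.4, -1.1]    # parameters trained with instance j removed
u          = [0.3, 0.8]     # gradient of the validation loss at theta

diff = [a - b for a, b in zip(theta_no_j, theta)]
linear_influence = sum(ui * di for ui, di in zip(u, diff))
print(linear_influence)     # > 0: removing j is predicted to raise the loss
```

A positive value predicts that removing j increases the loss (j was helpful); a negative value marks j as a candidate for cleansing.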
However, computing this naively would require N rounds of SGD computation (where N is the number of instances), resulting in high computational cost. This paper proposes an efficient method for this.
4. Estimating SGD-Influence
This part is not yet fully understood. After transformation, the update involves the Hessian of the loss (the second derivative).
Eq. (2) is derived by repeatedly applying the recurrence relation. Eqs. (3) and (4) appear to establish error bounds using Lipschitz continuity: https://ja.wikipedia.org/wiki/%E3%83%AA%E3%83%97%E3%82%B7%E3%83%83%E3%83%84%E9%80%A3%E7%B6%9A
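The recurrence idea can be sketched in one dimension. For a quadratic per-instance loss the Hessian is constant and the first-order recurrence is exact, so the propagated estimate matches an explicitly retrained counterfactual run. This assumes the counterfactual update keeps the same 1/|S| normalization when j is dropped from a batch; the paper's exact convention may differ slightly.

```python
# Sketch of the recurrence behind SGD-influence estimation, on a 1-D
# quadratic loss l_i(theta) = 0.5 * (theta - x_i)**2 (Hessian = 1).
# delta tracks theta_cf - theta (counterfactual minus ordinary parameters):
#   j in batch:  delta <- (1 - lr*(m-1)/m) * delta + (lr/m) * grad_j(theta)
#   j not in it: delta <- (1 - lr) * delta

x = [1.0, 2.0, 10.0]          # instance 2 is an outlier
j = 2                         # instance whose influence we track
lr = 0.1
theta = 0.0                   # ordinary SGD run
theta_cf = 0.0                # counterfactual run with j removed (ground truth)
delta = 0.0                   # recurrence estimate of theta_cf - theta

batches = [[0, 1, 2], [0, 1], [1, 2], [0, 2]] * 5   # fixed minibatch schedule
for S in batches:
    m = len(S)
    grad_j = theta - x[j]                     # per-instance gradient at theta
    if j in S:
        delta = (1 - lr * (m - 1) / m) * delta + (lr / m) * grad_j
    else:
        delta = (1 - lr) * delta
    g = sum(theta - x[i] for i in S) / m                  # ordinary step
    g_cf = sum(theta_cf - x[i] for i in S if i != j) / m  # j dropped, same 1/m
    theta -= lr * g
    theta_cf -= lr * g_cf

print(delta, theta_cf - theta)   # the two numbers agree
```

For non-quadratic losses the Hessian varies with θ, so the recurrence is only a first-order approximation, which is where the Lipschitz-based error bounds come in.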
5. Proposed Method for LIE
The computational cost is reduced from the naive approach (one full counterfactual SGD rerun per instance, i.e. N retrainings) to roughly the cost of a single training run plus one estimation pass over the T stored steps, where:
- p: dimensionality of the model parameters (rather than of the data?)
- T: total number of SGD steps
- M: the largest batch size used in SGD
- C (symbol chosen here for convenience): complexity of computing the parameter gradient (determined by the choice of model/optimizer?)
6. Related Studies
Differences from previous similar approaches:
- No need to assume an optimized model (the model at each step of training can be used)
- The loss does not need to be convex; non-convex losses are also acceptable
Compared papers include Understanding Black-box Predictions via Influence Functions (Koh & Liang), among others.
Typical data cleansing approaches such as outlier detection (e.g., Isolation Forest for tabular data, or reconstruction via VAE/GAN for images) may flag data points as outliers, but those outliers do not necessarily have a negative impact on training. In contrast, instances identified as outliers by the proposed method are actually linked to their influence on training.
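The gap between "outlier" and "harmful instance" can be shown with a toy example. A contrived mean-estimator setup (squared loss, z-score outlier test, all numbers invented): two symmetric extremes are flagged as outliers, yet removing either one increases the validation loss, i.e. they are not harmful to training.

```python
# Outlier flags vs. training influence: both flagged points are extreme in
# feature space, but their leave-one-out influence on validation loss is
# positive (removal hurts). Purely illustrative.

train = [0.0, 0.1, -0.1, 8.0, -8.0]       # two symmetric "outliers"
val = [0.0, 0.05, -0.05]

def fit(ys):
    return sum(ys) / len(ys)              # model = mean of training labels

def val_loss(theta):
    return sum((theta - y) ** 2 for y in val) / len(val)

mu = fit(train)
sd = (sum((y - mu) ** 2 for y in train) / len(train)) ** 0.5
flagged = [j for j, y in enumerate(train) if abs(y - mu) / sd > 1.5]

base = val_loss(fit(train))
for j in flagged:                          # leave-one-out influence on val loss
    loo = train[:j] + train[j + 1:]
    print(j, val_loss(fit(loo)) - base)    # positive: removal hurts
```

Here the two outliers cancel each other, so each one individually stabilizes the fit; an outlier detector cannot see this, while an influence-based score does.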
7. Experiments
7.1 Evaluation of LIE
Evaluation was conducted using:
- Kendall's tau
- Jaccard index
These metrics need further investigation.
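Both metrics are simple enough to define in a few lines. Pure-Python sketches (no ties assumed; library versions such as scipy.stats.kendalltau also exist): Kendall's tau compares the ordering of estimated vs. true influences, and the Jaccard index compares the overlap of the two top-k instance sets. The toy numbers are invented.

```python
from itertools import combinations

def kendall_tau(a, b):
    """(concordant pairs - discordant pairs) / total pairs."""
    n = len(a)
    s = sum(1 if (a[i] - a[k]) * (b[i] - b[k]) > 0 else -1
            for i, k in combinations(range(n), 2))
    return s / (n * (n - 1) / 2)

def jaccard(s, t):
    """Overlap of two sets: |intersection| / |union|."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

true_infl = [0.9, 0.1, -0.5, 0.3]   # made-up "true" influences
est_infl  = [0.8, 0.2, -0.4, 0.1]   # made-up estimates

top2_true = sorted(range(4), key=lambda i: -true_infl[i])[:2]   # [0, 3]
top2_est  = sorted(range(4), key=lambda i: -est_infl[i])[:2]    # [0, 1]
print(kendall_tau(true_infl, est_infl))   # 4/6: one of six pairs disagrees
print(jaccard(top2_true, top2_est))       # 1/3: only instance 0 is shared
```

Tau near 1 means the estimated influence ranking matches the true one; the Jaccard index on top-k sets checks whether the most influential instances, the ones a cleansing procedure would remove, are identified correctly.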
Fig. 1, Table 1: The axes are not immediately clear, but the closer the points lie to the y = x line, the more accurately instance influence is being estimated. The proposed method (blue) outperforms the previous method (red, K&L: Koh & Liang's influence functions). Importantly, the method correctly evaluates instance influence in both the convex case (linear logistic regression) and the non-convex case (DNN).
7.2 Evaluation of Data Cleansing
In the case of an autoencoder, it tends to flag images with vivid-colored backgrounds as anomalous, but removing these does not significantly affect training (no improvement). On the other hand, images where the boundary between object and background is blurry are not flagged as anomalous by the autoencoder, yet they strongly influence training (removing them yields an improvement). The proposed method was able to detect this latter type of harmful instance.
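The overall cleansing procedure can be shown in miniature: score each training instance by the change in validation loss when it is left out, drop the harmful ones, and retrain. The "model" here is just the mean of the training labels, so exact leave-one-out retraining is cheap; the paper's contribution is to estimate these scores for SGD-trained models without any retraining. The data and helpers are invented for illustration.

```python
# Data cleansing loop in miniature: negative leave-one-out score means
# removing the instance lowers validation loss, so it gets cleansed.

train = [1.0, 1.1, 0.9, 1.2, 8.0]   # the last label is corrupted
val = [1.0, 1.05, 0.95]

def fit(ys):
    return sum(ys) / len(ys)        # model = mean of training labels

def val_loss(theta):
    return sum((theta - y) ** 2 for y in val) / len(val)

base = val_loss(fit(train))
scores = []
for j in range(len(train)):
    loo = train[:j] + train[j + 1:]
    scores.append(val_loss(fit(loo)) - base)   # negative: removal helps

harmful = [j for j, s in enumerate(scores) if s < 0]
cleaned = [y for j, y in enumerate(train) if j not in harmful]
print(harmful)                                 # only the corrupted instance
print(val_loss(fit(cleaned)) < base)           # cleansing improved val loss
```

Only the corrupted instance gets a negative score, and retraining on the cleaned set improves validation loss, which mirrors the paper's experiment at toy scale.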