
Data Cleansing Methods for SGD-Optimized Models

Data Cleansing for Models Trained with SGD

https://arxiv.org/abs/1906.08473

GitHub implementation https://github.com/sato9hara/sgd-influence

Overview

When training does not go well, one option is to go back and examine the training data. However, error analysis on data requires domain knowledge to reason about which instances (data points) strongly influence training. This paper presents a method for cleansing instances that negatively affect training, without requiring domain knowledge.

Keywords:

  • influence estimation
  • counterfactual SGD
  • Linear Influence Estimation (LIE)
  • Kendall's tau
  • Jaccard index

2. Preliminaries

A review of SGD.

3. SGD-Influence

At training step $t$, let $\theta_{-j}^{[t]}$ denote the parameters obtained when instance $z_j$ is removed from training. SGD-influence is defined as

$\theta_{-j}^{[t]} - \theta^{[t]}$

In other words, if the difference between $\theta$ when the $j$-th instance is dropped and the unmodified $\theta$ is large, that instance has a significant influence on training. Since this takes the reverse perspective of looking back at the data from the results, it is called Counterfactual SGD (counterfactual: what would have happened had the instance been absent).
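The counterfactual run can be sketched with plain NumPy (a toy logistic-regression setup, purely illustrative, not the paper's code): train once on all data and once with instance $z_j$ skipped, keeping the same seed so both runs visit instances in the same order, then take the parameter difference.

```python
import numpy as np

def sgd_train(X, y, skip=None, lr=0.1, epochs=5, seed=0):
    """Plain SGD on the logistic loss; optionally drop one instance (counterfactual run)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # same seed => identical visiting order in both runs
        for i in order:
            if i == skip:
                continue  # counterfactual SGD: instance z_j is never seen
            p = 1.0 / (1.0 + np.exp(-(X[i] @ theta)))
            theta -= lr * (p - y[i]) * X[i]
    return theta

# Toy data (hypothetical): instance 0 is mislabeled on purpose.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = (X[:, 0] > 0).astype(float)
y[0] = 1.0 - y[0]  # flip one label

theta = sgd_train(X, y)
theta_minus_j = sgd_train(X, y, skip=0)
influence = theta_minus_j - theta  # SGD-influence of instance 0
print(np.linalg.norm(influence))
```

Computing this exactly for every instance means retraining $N$ times, which is what the paper's estimator avoids.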

Linear Influence Estimation (LIE)

$L_{-j}^{[T]}(u) = \langle u,\, \theta_{-j}^{[T]} - \theta^{[T]} \rangle$

By substituting the first derivative of the prediction function, $\nabla_{\theta} f(x, \theta)$, or of the loss function, $\nabla_{\theta} l(x, \theta)$, for $u$, this formulation can be put to use. With the latter, one can measure the influence of removing instance $z_j$ on the loss.

In practice, this is approximated via Taylor expansion:

$L_{-j}^{[T]}(\nabla_{\theta} l(x, \theta^{[T]})) \approx l(x, \theta_{-j}^{[T]}) - l(x, \theta^{[T]})$

and can be computed accordingly.
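A minimal numerical check of this first-order Taylor approximation (a hypothetical logistic-loss setup, not from the paper): for a small parameter perturbation standing in for $\theta_{-j}^{[T]} - \theta^{[T]}$, the inner product of the loss gradient with the perturbation closely tracks the actual loss difference.

```python
import numpy as np

def logistic_loss(x, y, theta):
    z = x @ theta
    return np.log1p(np.exp(-z)) if y == 1 else np.log1p(np.exp(z))

def logistic_grad(x, y, theta):
    p = 1.0 / (1.0 + np.exp(-(x @ theta)))
    return (p - y) * x  # gradient of the logistic loss w.r.t. theta

rng = np.random.default_rng(0)
x_val, y_val = rng.normal(size=3), 1
theta = rng.normal(size=3)
theta_mj = theta + 1e-3 * rng.normal(size=3)  # stand-in for the counterfactual parameters

lie = logistic_grad(x_val, y_val, theta) @ (theta_mj - theta)  # linear influence estimate
actual = logistic_loss(x_val, y_val, theta_mj) - logistic_loss(x_val, y_val, theta)
print(lie, actual)  # nearly equal for a small parameter difference
```

The gap between `lie` and `actual` is second order in the parameter difference, which is why the linear estimate is accurate when removing a single instance only perturbs the trajectory slightly.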

However, computing this naively would require N rounds of SGD computation (where N is the number of instances), resulting in high computational cost. This paper proposes an efficient method for this.

4. Estimating SGD-Influence

This part is not yet fully understood. After the transformation, the quantity becomes $\Delta\theta_{-j}$, which involves second derivatives (the Hessian of the loss).

Eq. (2) is derived by repeatedly applying the recurrence relation. Eqs. (3) and (4) appear to establish inequalities based on Lipschitz continuity: https://ja.wikipedia.org/wiki/%E3%83%AA%E3%83%97%E3%82%B7%E3%83%83%E3%83%84%E9%80%A3%E7%B6%9A

5. Proposed Method for LIE

The computational cost is reduced from the original $O(p^2)$ to $O(TM\delta)$:

  • p: dimensionality of the data?
  • T: total SGD steps
  • M: the largest batch size in SGD
  • $\delta$: complexity of computing the parameter gradient (determined by the choice of optimizer?)

Differences from previous similar approaches:

  1. No need to assume an optimized model (the model at each step of training can be used)
  2. The loss does not need to be convex; non-convex losses are also acceptable

Compared papers include: Understanding Black-box Predictions via Influence Functions among others.

Typical data cleansing approaches such as outlier detection (e.g., Isolation Forest for tabular data, or reconstruction via VAE/GAN for images) may flag data points as outliers, but those outliers do not necessarily have a negative impact on training. In contrast, the instances the proposed method flags are selected directly by their estimated influence on training.
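A toy sketch of this point (illustrative, not the paper's experiment): a geometric outlier can be harmless while a mislabeled inlier is harmful, yet a simple distance-based outlier score (a crude stand-in for Isolation Forest) flags only the former.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
X[0] = [6.0, 6.0]; y[0] = 1   # far from the bulk, but correctly labeled (likely harmless)
X[1] = [0.1, -0.2]; y[1] = 0  # looks ordinary, but mislabeled (likely harmful)

# Distance from the data mean as a naive outlier score
score = np.linalg.norm(X - X.mean(axis=0), axis=1)
flagged = set(np.argsort(score)[-5:])  # top-5 "outliers"

# Instance 0 (harmless) is flagged; instance 1 (harmful) is not.
print(0 in flagged, 1 in flagged)
```

An influence-based criterion would instead rank instance 1 highly, because removing it actually changes the trained parameters for the better.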

7. Experiments

7.1 Evaluation of LIE

Evaluation was conducted using Kendall's tau and the Jaccard index. These metrics need further investigation.
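For reference, both metrics are easy to compute from scratch (the influence scores below are hypothetical): Kendall's tau measures rank agreement between true and estimated influence over all instance pairs, while the Jaccard index measures overlap of the top-$k$ sets.

```python
import numpy as np
from itertools import combinations

def kendall_tau(a, b):
    """Fraction of concordant minus discordant pairs (no-ties version), in [-1, 1]."""
    pairs = list(combinations(range(len(a)), 2))
    s = sum(np.sign((a[i] - a[j]) * (b[i] - b[j])) for i, j in pairs)
    return s / len(pairs)

def jaccard_top_k(a, b, k):
    """Overlap of the top-k instances under each score: |A ∩ B| / |A ∪ B|."""
    ta, tb = set(np.argsort(a)[-k:]), set(np.argsort(b)[-k:])
    return len(ta & tb) / len(ta | tb)

true_inf = np.array([0.9, -0.2, 0.5, 0.1, -0.7])  # hypothetical counterfactual values
est_inf  = np.array([0.8, -0.1, 0.6, 0.0, -0.6])  # hypothetical LIE estimates
print(kendall_tau(true_inf, est_inf), jaccard_top_k(true_inf, est_inf, 2))
```

Both metrics equal 1 here because the estimates preserve the true ranking exactly; lower values indicate disagreement in ordering or in the identity of the most influential instances.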

Fig1, Table1: The axes are not immediately clear, but the closer the points lie to the y=x line, the more accurately the instance influence is being evaluated. The proposed method (blue) outperforms the previous method (red, K&L). Importantly, the method correctly evaluates instance influence in both the convex case (linear logistic regression) and the non-convex case (DNN).

7.2 Evaluation of Data Cleansing

In the case of an autoencoder, it tends to flag vivid-colored backgrounds as anomalous, but removing these does not significantly affect training (no improvement). On the other hand, images where the boundary between the object and the background is blurry are not flagged as anomalous by the autoencoder, yet they have a significant influence on training (improvement when removed). The proposed method was able to detect this latter type of anomaly.