Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

Shoaib Ahmed Siddiqui, Nitarshan Rajkumar, Tegan Maharaj, David Krueger, Sara Hooker

Paper

Code

How do we surface interesting subsets from the data distribution?

Modern machine learning research relies on a relatively small set of carefully curated datasets. Even in these datasets, and typically in "untidy" or raw data, practitioners face significant issues of data quality and diversity that can be prohibitively labor-intensive to address. Existing methods for dealing with these challenges tend to make strong assumptions about the particular issues at play, and often require a priori knowledge or metadata such as domain labels. Our work is orthogonal to these methods: we provide a unified and efficient framework for Metadata Archaeology, that is, uncovering and inferring metadata of examples in a dataset. This inferred metadata can bring to light biases and other data issues a posteriori.

We curate different subsets of data that might exist in a dataset (e.g. mislabeled, atypical, or out-of-distribution examples) using simple transformations, and leverage differences in learning dynamics between these curated subsets to infer metadata of interest. We compare the loss trajectories of training examples against those of our curated subsets to identify training examples with similar metadata, and show that this simple approach is on par with far more sophisticated mitigation methods across different tasks: identifying and correcting mislabeled examples, identifying minority-group samples, prioritizing points relevant for training, and enabling scalable human auditing of relevant examples.
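The trajectory comparison described above can be sketched as a nearest-neighbor lookup. This is a minimal illustration rather than the authors' implementation: the function name `infer_metadata`, the choice of Euclidean distance with a majority vote over `k` neighbors, and the synthetic loss curves are all assumptions made for the example.

```python
import numpy as np

def infer_metadata(train_traj, probe_traj, probe_labels, k=5):
    """Assign each training example the majority probe category among its
    k nearest probe trajectories (squared Euclidean distance over epochs).

    train_traj:   (n_train, n_epochs) per-example training losses
    probe_traj:   (n_probe, n_epochs) per-example probe losses
    probe_labels: (n_probe,) integer probe-category ids
    Returns the predicted category ids and the per-category vote fractions.
    """
    # Pairwise squared distances between training and probe trajectories.
    diff = train_traj[:, None, :] - probe_traj[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # Indices of the k closest probes for every training example.
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = probe_labels[nearest]
    n_cat = int(probe_labels.max()) + 1
    # Count votes per category, then pick the most common category.
    counts = np.stack([(votes == c).sum(axis=1) for c in range(n_cat)], axis=1)
    return counts.argmax(axis=1), counts / k

# Synthetic (hypothetical) loss curves: "typical" examples are learned
# quickly, while "random outputs" examples keep a high loss for longer.
rng = np.random.default_rng(0)
epochs = np.arange(10)
typical_curve = 2.0 * np.exp(-0.8 * epochs)
random_curve = 2.0 * np.exp(-0.05 * epochs)

probe_traj = np.concatenate([
    typical_curve + 0.05 * rng.standard_normal((20, 10)),
    random_curve + 0.05 * rng.standard_normal((20, 10)),
])
probe_labels = np.array([0] * 20 + [1] * 20)  # 0 = typical, 1 = random outputs

train_traj = np.stack([typical_curve, random_curve])
pred, vote_fractions = infer_metadata(train_traj, probe_traj, probe_labels, k=5)
print(pred, vote_fractions)
```

The vote fractions can double as a soft score for each category, which is convenient when examples need to be ranked rather than hard-assigned.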

MAP-D (Metadata Archaeology via Probe Dynamics) can surface interesting examples from each of the probe categories: typical, corrupted, atypical, and random outputs (see the paper for a detailed discussion of the probe categories). Example difficulty rises monotonically from the typical to the random outputs category, and the loss profiles are ordered accordingly.

The primary contributions of our work can be summarized as follows:

1. We provide a unified and general framework leveraging the training dynamics of a network to identify and treat different data subsets present in a dataset.

2. We show how this framework can be leveraged to audit large-scale datasets, or even debug model training, with negligible added cost. This is in contrast to prior work where multiple models are typically used to surface these examples.

3. We use our framework to identify and correct mislabeled examples in a dataset. This simple technique is on par with more sophisticated methods developed for this purpose, while extending naturally to an arbitrary number of modes. We also showcase our method's ability to deal with multiple sources of uncertainty, outperforming competing approaches.

4. Finally, we show that our method can identify different interesting subsets such as minority group samples or can even surface examples for prioritized training in a data-efficient manner.
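One way to picture the label-correction step in contribution 3 is to blend the given label with the model's prediction, weighted by how likely the example is to be mislabeled. The sketch below is an illustration under stated assumptions, not the paper's exact procedure: `p_mislabeled` stands for any per-example score in [0, 1] (for instance, the fraction of nearest probe trajectories that belong to a mislabeled probe), and `soft_correct_labels` is a hypothetical helper.

```python
import numpy as np

def soft_correct_labels(given_labels, model_probs, p_mislabeled):
    """Blend one-hot given labels with model predictions.

    given_labels: (n,) integer class ids as stored in the dataset
    model_probs:  (n, n_classes) predicted class probabilities
    p_mislabeled: (n,) assumed probability that each label is wrong
    Returns (n, n_classes) soft targets whose rows sum to 1.
    """
    n_classes = model_probs.shape[1]
    one_hot = np.eye(n_classes)[given_labels]
    w = p_mislabeled[:, None]
    # Trust the given label when w is near 0, the model when w is near 1.
    return (1.0 - w) * one_hot + w * model_probs

given = np.array([0, 2])
model_probs = np.array([[0.9, 0.05, 0.05],
                        [0.1, 0.8, 0.1]])
p_mis = np.array([0.0, 1.0])  # first label trusted, second label distrusted
targets = soft_correct_labels(given, model_probs, p_mis)
print(targets)
```

Because the blend is a convex combination of two probability vectors, the resulting targets remain valid distributions and can be used directly with a cross-entropy loss that accepts soft labels.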

Auditing Datasets

MAP-D can be an effective tool to audit high-dimensional datasets. Below, we plot images from four different probe categories (i.e., typical, atypical, corrupted inputs, and random outputs probe categories) for randomly selected classes from CIFAR-10, CIFAR-100 and ImageNet.

Examples surfaced via the `typical` probe category: Images surfaced through the typical probe category are mostly well-centered images with a typical color scheme, where the only object in the image is the object of interest.

Examples surfaced via the `corrupted` probe category: Images surfaced through the corrupted probe category are slightly more complex than those surfaced through the typical probe, and slightly less complex than atypical examples. They form a natural transition between typical and atypical examples.

Examples surfaced via the `atypical` probe category: Images surfaced through the atypical probe category present the object in unusual settings or from unusual vantage points, or feature a color scheme that differs from the typical variants.

Examples surfaced via the `random outputs` probe category: Images surfaced through the random outputs probe category represent images that would be hard for a human to classify, might contain multiple labels which are appropriate for that image, or might even be out-of-distribution w.r.t. the rest of the data distribution.
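As a rough illustration of how two of these probe categories could be curated with simple transformations, the sketch below builds a corrupted-inputs probe by adding Gaussian noise to images and a random-outputs probe by replacing each label with a different, randomly chosen class. The noise level, the helper names, and the array layout are assumptions for this example; the paper gives the exact transformations used.

```python
import numpy as np

def make_corrupted_probe(images, rng, noise_std=0.1):
    """Add Gaussian noise to images with values in [0, 1]; labels are kept."""
    noisy = images + noise_std * rng.standard_normal(images.shape)
    return np.clip(noisy, 0.0, 1.0)

def make_random_output_probe(labels, n_classes, rng):
    """Replace each label with a random class that differs from the original.

    Adding an offset in [1, n_classes - 1] modulo n_classes guarantees the
    new label is never equal to the old one (a design choice of this sketch).
    """
    offsets = rng.integers(1, n_classes, size=labels.shape)
    return (labels + offsets) % n_classes

rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3))    # stand-in for a batch of images
labels = rng.integers(0, 10, size=8)   # stand-in for 10-class labels

corrupted_images = make_corrupted_probe(images, rng)
random_labels = make_random_output_probe(labels, n_classes=10, rng=rng)
print(corrupted_images.shape, labels, random_labels)
```

Probe examples built this way are trained on alongside the rest of the data, so their loss trajectories can later serve as references for each category.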

ImageNet

[Image grids: examples surfaced by the Typical, Corrupted, Atypical, and Random Outputs probe categories. Classes shown: Toilet Tissue, Ambulance, Assault Rifle, Balloon, Bathtub, Bulletproof Vest, Cannon, Container Ship, Desktop Computer, Digital Watch, Garbage Truck, iPod, Microwave, Milk Can, Missile, Mountain Bike, Plastic Bag, Safe, Tank, Toaster, Whiskey Jug, Street Sign, Pineapple, Banana.]

CIFAR-100

[Image grids: examples surfaced by the Typical, Corrupted, Atypical, and Random Outputs probe categories. Classes shown: Baby, Man, Bicycle, Bowl, Clock, Cloud, Couch, Crab, Dinosaur, Flatfish, Lion, House, Lobster, Porcupine, Bee, Ray, Plate, Sea, Table, Skyscraper, Telephone, Wardrobe.]

Learn More

Pre-computed output images for CIFAR-10/CIFAR-100/ImageNet are available here.

We welcome additional discussion and code contributions on the topic of this work. A comprehensive introduction to the methodology, experimental framework, and results can be found in our paper and open-source code.

Citation

If you use this software, please consider citing:

@article{siddiqui2022metadataarchaeology,
  title={Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics},
  author={Siddiqui, Shoaib Ahmed and Rajkumar, Nitarshan and Maharaj, Tegan and Krueger, David and Hooker, Sara},
  journal={arXiv preprint},
  year={2022}
}
