Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

Shoaib Ahmed Siddiqui, Nitarshan Rajkumar, Tegan Maharaj, David Krueger, Sara Hooker

How do we surface interesting subsets from the data distribution?

Modern machine learning research relies on a relatively small set of carefully curated datasets. Even in these datasets, and typically in 'untidy' or raw data, practitioners face significant issues of data quality and diversity that can be prohibitively labor-intensive to address. Existing methods for dealing with these challenges tend to make strong assumptions about the particular issues at play, and often require a priori knowledge or metadata such as domain labels. Our work is orthogonal to these methods: we provide a unified and efficient framework for Metadata Archaeology -- uncovering and inferring metadata of examples in a dataset. This inferred metadata can bring to light biases and other data issues a posteriori.

We curate different subsets of data that might exist in a dataset (e.g. mislabeled, atypical, or out-of-distribution examples) using simple transformations, and leverage differences in learning dynamics between these curated subsets to infer metadata of interest. We compare loss trajectories against our curated subsets in order to identify training examples with similar metadata, and show that this simple approach is on par with far more sophisticated mitigation methods across different tasks: identifying and correcting mislabeled examples, identifying minority-group samples, prioritizing points relevant for training and enabling scalable human auditing of relevant examples.
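The trajectory comparison described above can be sketched as a nearest-neighbour lookup. This is a minimal illustration, not the released implementation: it assumes per-example losses have already been logged once per epoch, and the function name `knn_probe_assignment`, the Euclidean distance, and the value of `k` are all hypothetical choices made for the example.

```python
import numpy as np

def knn_probe_assignment(train_traj, probe_traj, probe_labels, k=5):
    """Assign each training example the majority probe category among its
    k nearest probe loss trajectories (squared Euclidean distance)."""
    train_traj = np.asarray(train_traj, dtype=float)   # (n_train, n_epochs)
    probe_traj = np.asarray(probe_traj, dtype=float)   # (n_probe, n_epochs)
    probe_labels = np.asarray(probe_labels)            # (n_probe,) integer category ids

    # Pairwise squared distances between training and probe trajectories.
    diffs = train_traj[:, None, :] - probe_traj[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)                # (n_train, n_probe)

    # Category ids of the k closest probe examples for every training example.
    nearest = np.argsort(dists, axis=1)[:, :k]
    neighbor_labels = probe_labels[nearest]            # (n_train, k)

    # Count votes per category and normalise them to vote fractions.
    n_categories = int(probe_labels.max()) + 1
    votes = np.stack(
        [(neighbor_labels == c).sum(axis=1) for c in range(n_categories)], axis=1
    )
    return votes.argmax(axis=1), votes / k

# Toy data (synthetic, for illustration only): category 0 has a fast-decaying
# loss, category 1 has a loss that stays high throughout training.
epochs = np.arange(10)
fast = np.exp(-epochs)
flat = np.full(10, 2.0)
probe_traj = np.stack([fast, fast * 1.1, flat, flat * 0.9])
probe_labels = np.array([0, 0, 1, 1])
train_traj = np.stack([fast * 0.95, flat * 1.05])

pred, vote_frac = knn_probe_assignment(train_traj, probe_traj, probe_labels, k=2)
print(pred)
```

The vote fractions give a rough per-category score for each example, which could be used to rank examples within a category when only a few can be inspected by hand.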

[Figure panels: typical, corrupted, atypical, random outputs]
MAP-D can surface interesting examples from the different probe categories: typical, corrupted, atypical, and random outputs (see the paper for a detailed discussion of the probe categories). Example difficulty rises monotonically from the typical to the random outputs category (loss profiles are sorted accordingly).

The primary contributions of our work can be summarized as follows:

1. We provide a unified and general framework leveraging the training dynamics of a network to identify and treat different data subsets present in a dataset.
2. We show how this framework can be leveraged to audit large-scale datasets, or even debug model training, with negligible added cost. This is in contrast to prior work where multiple models are typically used to surface these examples.
3. We use our framework to identify and correct mislabeled examples in a dataset. This simple technique is on par with more sophisticated methods developed for this purpose, while extending naturally to an arbitrary number of modes. We also showcase our method's ability to deal with multiple sources of uncertainty, outperforming competing approaches.
4. Finally, we show that our method can identify different interesting subsets, such as minority-group samples, and can even surface examples for prioritized training in a data-efficient manner.
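As a hedged illustration of curating a probe subset with a simple transformation, the sketch below builds a mislabeled-style probe by reassigning the labels of a small random subset. The helper name `make_mislabeled_probe`, the probe size, and the choice to force every reassigned label to differ from the original are assumptions made for this example. Drawing labels uniformly at random would be a slightly different transformation, since it can keep the original label by chance.

```python
import numpy as np

def make_mislabeled_probe(labels, num_classes, probe_size, seed=0):
    """Pick `probe_size` examples without replacement and give each one a
    label guaranteed to differ from its original label.

    Returns the probe indices and a full copy of the label array with the
    probe entries overwritten.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    # Indices of the examples that will form the probe subset.
    probe_idx = rng.choice(len(labels), size=probe_size, replace=False)

    # A non-zero offset modulo num_classes always lands on a different class.
    offsets = rng.integers(1, num_classes, size=probe_size)
    new_labels = labels.copy()
    new_labels[probe_idx] = (labels[probe_idx] + offsets) % num_classes
    return probe_idx, new_labels

# Toy labels (synthetic): 100 examples spread evenly over 10 classes.
labels = np.arange(100) % 10
probe_idx, noisy_labels = make_mislabeled_probe(labels, num_classes=10, probe_size=20)
```

Because the probe indices are known, the loss trajectories of these examples can later serve as labeled references for the mislabeled category.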

Auditing Datasets

MAP-D can be an effective tool to audit high-dimensional datasets. Below, we plot images from four different probe categories (i.e., typical, atypical, corrupted inputs, and random outputs probe categories) for randomly selected classes from CIFAR-10, CIFAR-100 and ImageNet.


Examples surfaced via the `typical` probe category: Images surfaced through the typical probe category are mostly well-centered images with a typical color scheme, where the only object in the image is the object of interest.
Examples surfaced via the `corrupted` probe category: Images surfaced through the corrupted probe category have slightly higher complexity than those surfaced through the typical probe category, and slightly lower complexity than atypical examples. They form a natural transition between typical and atypical examples.
Examples surfaced via the `atypical` probe category: Images surfaced through the atypical probe category present the object in unusual settings or from unusual vantage points, or feature differences in color scheme from the typical variants.
Examples surfaced via the `random outputs` probe category: Images surfaced through the random outputs probe category would be hard for a human to classify, might admit multiple appropriate labels, or might even be out-of-distribution w.r.t. the rest of the data distribution.

ImageNet

[Image grid: one example each from the Typical, Corrupted, Atypical, and Random Outputs probe categories for the following classes]

Toilet Tissue, Ambulance, Assault Rifle, Balloon, Bathtub, Bulletproof Vest, Cannon, Container Ship, Desktop Computer, Digital Watch, Garbage Truck, iPod, Microwave, Milk Can, Missile, Mountain Bike, Plastic Bag, Safe, Tank, Toaster, Whiskey Jug, Street Sign, Pineapple, Banana

CIFAR-100

[Image grid: one example each from the Typical, Corrupted, Atypical, and Random Outputs probe categories for the following classes]

Baby, Man, Bicycle, Bowl, Clock, Cloud, Couch, Crab, Dinosaur, Flatfish, Lion, House, Lobster, Porcupine, Bee, Ray, Plate, Sea, Table, Skyscraper, Telephone, Wardrobe

Learn More

Pre-computed output images for CIFAR-10/CIFAR-100/ImageNet are available here.

We welcome additional discussion and code contributions on the topic of this work. A comprehensive introduction to the methodology, experimental framework, and results can be found in our paper and open-source code.

Citation

If you use this software, please consider citing:

@article{siddiqui2022metadataarchaeology,
  title={Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics},
  author={Siddiqui, Shoaib Ahmed and Rajkumar, Nitarshan and Maharaj, Tegan and Krueger, David and Hooker, Sara},
  journal={arXiv preprint},
  year={2022}
}
