High Contrast ML

 

The word contrast comes up a lot in ML. Specifically, there are:

  1. Contrastive Loss - This is a Siamese-style loss: within-class embeddings should be close (geometrically) and between-class embeddings should be far apart. Concretely, this method minimizes the L2 distance between embeddings from the same class and a squared hinge loss on the distance between embeddings from different classes, which penalizes them only when they are closer than a margin (a minimal sketch appears after this list).

  2. Triplet Loss and other generalizations - Triplet loss simply tries to contrast pairs of pairs: there is a desirable pair and an undesirable pair, and the desirable pair should be closer than the undesirable one. This generalizes the idea of class-label-based equivalence classes to just directly classifying pairs of examples (see the sketch after this list).

  3. Noise Contrastive Estimation - This is used for density estimation in a special type of energy-based model, the log-linear model, where the gradient of the log-likelihood of a sample equals the sample's features minus the expected value of the features under the model's own distribution. The basic idea is to draw negative samples from some kind of proposal distribution, assume the partition function is 1, and then do energy-based modeling (a sketch of the objective also follows the list).

  4. Negative Sampling - Make the score of the data higher and the score of the noise words lower in the context of the current word (likewise sketched after the list).

  5. SimCLR - See below.

  6. Contrastive Estimation by Smith and Eisner - Come up with perturbations that destroy meaning so much that it's better to treat them as negative examples.
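
Here is a minimal sketch of the pairwise contrastive loss from item 1, using the common Hadsell-et-al.-style formulation (squared distance for same-class pairs, squared hinge past a margin for different-class pairs); the tensor names and margin value are illustrative.

```python
import torch

def contrastive_loss(z1, z2, same_class, margin=1.0):
    """z1, z2: (B, D) embeddings of the two sides of each pair.
    same_class: (B,) tensor of 1s (same class) and 0s (different class)."""
    d = torch.norm(z1 - z2, p=2, dim=1)                              # pairwise L2 distance
    pos = same_class * d.pow(2)                                      # pull same-class pairs together
    neg = (1 - same_class) * torch.clamp(margin - d, min=0).pow(2)   # push others past the margin
    return 0.5 * (pos + neg).mean()

# Toy usage: 4 random pairs of 16-d embeddings.
z1, z2 = torch.randn(4, 16), torch.randn(4, 16)
same_class = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(contrastive_loss(z1, z2, same_class))
```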
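
And a corresponding sketch of the triplet loss from item 2: the anchor-positive (desirable) pair should be closer than the anchor-negative (undesirable) pair by at least a margin. Names are illustrative; PyTorch also ships this as torch.nn.TripletMarginLoss.

```python
import torch

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = torch.norm(anchor - positive, p=2, dim=1)   # distance of the desirable pair
    d_neg = torch.norm(anchor - negative, p=2, dim=1)   # distance of the undesirable pair
    # Penalize whenever the desirable pair is not closer by at least `margin`.
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

anchor, positive, negative = torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 16)
print(triplet_loss(anchor, positive, negative))
```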
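
For item 3, a rough sketch of the NCE objective under the stated assumption that the partition function is 1, so the unnormalized log-score is used directly as the model's log-density. The helper names (log_score, log_q) and the shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def nce_logits(log_score, log_q, k):
    # Logit that a sample came from the data rather than from the proposal:
    # log s_theta(x) - log(k * q(x)), with the partition function fixed at 1.
    return log_score - (log_q + torch.log(torch.tensor(float(k))))

def nce_loss(logits_data, logits_noise):
    # logits_data: (B,) for real samples; logits_noise: (B, k) for the k
    # negative samples drawn from the proposal per real sample.
    loss_data = F.binary_cross_entropy_with_logits(
        logits_data, torch.ones_like(logits_data), reduction="none")
    loss_noise = F.binary_cross_entropy_with_logits(
        logits_noise, torch.zeros_like(logits_noise), reduction="none").sum(dim=1)
    return (loss_data + loss_noise).mean()

# Toy usage with random scores: batch of 8, k = 5 noise samples each.
k = 5
ld = nce_logits(torch.randn(8), torch.randn(8), k)
ln = nce_logits(torch.randn(8, k), torch.randn(8, k), k)
print(nce_loss(ld, ln))
```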
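
Item 4 is essentially the word2vec flavour of the same idea; here is a minimal sketch of a skip-gram negative-sampling loss, with illustrative names (v_context for the current word's vector, u_pos for the observed word, u_neg for sampled noise words).

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(v_context, u_pos, u_neg):
    # v_context: (B, D), u_pos: (B, D), u_neg: (B, K, D)
    pos_score = (v_context * u_pos).sum(dim=1)                        # raise the data score
    neg_score = torch.bmm(u_neg, v_context.unsqueeze(2)).squeeze(2)   # (B, K) noise scores
    # Maximize sigma(pos_score) and sigma(-neg_score), i.e. lower the noise scores.
    return (-F.logsigmoid(pos_score) - F.logsigmoid(-neg_score).sum(dim=1)).mean()
```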

Generalized contrast and agreement

In general there can be two types of perturbations: ones which preserve meaning/semantics, e.g. small Gaussian noise in images, adding small patches, cropping, or translation; and ones which destroy all meaning, such as sampling from a completely different distribution (the noise distribution), extreme permutation of pixels, or replacement with instances from a different class. Within NLP it is easy to come up with meaning-destroying perturbations, such as permutation or replacement by a different, noisy sentence; however, it is a lot harder to come up with a meaning-preserving perturbation. One way would be back-translation, another would be replacing certain words by their synonyms, or "visual back-translation".

Good methods use both types of perturbations. For example, the SimCLR method maximizes agreement between the representations of meaning-preserving perturbations and minimizes agreement between the representations of random, unrelated data samples, as in the sketch below.
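
A minimal sketch of a SimCLR-style NT-Xent agreement loss, assuming z1 and z2 are the projected representations of two meaning-preserving augmentations of the same batch (row i of z1 and row i of z2 form a positive pair; everything else in the batch acts as the unrelated negatives). Names and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, D) unit vectors
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # a sample is never its own negative
    # The positive for row i is its augmented twin: i+n in the first half, i-n in the second.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    # Cross-entropy maximizes agreement with the positive and minimizes it with the rest.
    return F.cross_entropy(sim, targets)

# Toy usage: two augmented "views" of a batch of 8 samples, 32-d projections.
z1, z2 = torch.randn(8, 32), torch.randn(8, 32)
print(nt_xent(z1, z2))
```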