Labeled data

Source: Wikipedia, the free encyclopedia.

Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with informative tags. For example, a data label might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, or whether a dot in an X-ray is a tumor.
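
As a minimal sketch of the idea, labeling can be pictured as attaching named tags to each unlabeled sample. The `Sample` class, the `add_label` helper, and the file names below are hypothetical and chosen only for illustration; real datasets use many different storage formats.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One piece of data plus zero or more labels (hypothetical structure)."""
    data: str
    labels: dict = field(default_factory=dict)


def add_label(sample, name, value):
    """Return a copy of the sample with one more tag attached."""
    return Sample(sample.data, {**sample.labels, name: value})


# Unlabeled data: only the raw items, no tags yet (made-up file names).
unlabeled = [Sample("photo_001.jpg"), Sample("photo_002.jpg")]

# Labeled data: the same items, each augmented with an informative tag.
labeled = [
    add_label(unlabeled[0], "animal", "horse"),
    add_label(unlabeled[1], "animal", "cow"),
]
```

Because `add_label` returns a new object, the original unlabeled samples stay unchanged, and a sample can carry more than one label by calling the helper again with a different tag name.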

Labels can be obtained by asking humans to make judgments about a given piece of unlabeled data. Labeled data is significantly more expensive to obtain than the raw unlabeled data.
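
When several people judge the same item, their answers have to be combined into one label. The sketch below assumes simple majority voting, which is only one possible aggregation rule; the function name `majority_label` and the judgments are made up for illustration.

```python
from collections import Counter


def majority_label(judgments):
    """Return the most common label and the share of annotators who chose it."""
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(judgments)


# Hypothetical judgments from three annotators per photo.
judgments = {
    "photo_001.jpg": ["horse", "horse", "cow"],
    "photo_002.jpg": ["cow", "cow", "cow"],
}

consensus = {item: majority_label(votes) for item, votes in judgments.items()}
```

The agreement share gives a rough signal of how reliable each label is. In a tie, `Counter.most_common` returns the label that was seen first, so a real pipeline would need an explicit tie-breaking rule.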

Crowdsourced labeled data

In 2006, Fei-Fei Li began research to improve image recognition by significantly enlarging the
training data. The researchers downloaded millions of images from the World Wide Web, and a team of undergraduates started to apply labels for objects to each image. In 2007, Li outsourced the data labelling work to Amazon Mechanical Turk, an online marketplace for digital piece work. The 3.2 million images that were labelled by more than 49,000 workers formed the basis for ImageNet, one of the largest hand-labeled databases for object recognition.[1]

Automated data labelling

After obtaining a labeled dataset, machine learning models can be applied to the data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted for that piece of unlabeled data.[2]
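
The following is a minimal sketch of that workflow, assuming a nearest-centroid classifier as a stand-in for whatever model is actually used. The feature values, the function names `fit_centroids` and `predict`, and the two-feature animal data are all invented for illustration.

```python
from collections import defaultdict
from math import dist


def fit_centroids(features, labels):
    """Average the feature vectors of each label to get one centroid per label."""
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[y].append(x)
    return {
        y: tuple(sum(col) / len(rows) for col in zip(*rows))
        for y, rows in groups.items()
    }


def predict(centroids, x):
    """Guess the label whose centroid is closest to the new, unlabeled point."""
    return min(centroids, key=lambda y: dist(centroids[y], x))


# Made-up labeled data: (height_cm, weight_kg) paired with an animal label.
features = [(160, 500), (165, 550), (140, 700), (145, 750)]
labels = ["horse", "horse", "cow", "cow"]

centroids = fit_centroids(features, labels)

# A new unlabeled sample is presented to the model, which predicts a label.
guess = predict(centroids, (162, 520))
```

Here `guess` is a prediction, not a ground-truth label: its quality depends on how well the labeled examples cover the data the model later sees.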

Data-driven bias

Algorithmic decision-making is subject to programmer-driven bias as well as data-driven bias. Training data that relies on biased labeled data will result in prejudices and omissions in a predictive model. The labeled data used to train a machine learning algorithm needs to be a representative sample to not bias the results.[3] Because the labeled data available to train facial recognition systems has not been representative of a population, underrepresented groups in the labeled data are later often misclassified. In 2018, a study by Joy Buolamwini and Timnit Gebru demonstrated that two facial analysis datasets that have been used to train facial recognition algorithms, IJB-A and Adience, are composed of 79.6% and 86.2% lighter-skinned humans respectively.[4]
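
One simple way to look for this kind of imbalance is to measure each group's share of the labeled data and the model's error rate per group. The sketch below uses invented group names, labels, and predictions; it does not reproduce the figures or the method of the study cited above.

```python
from collections import Counter


def group_shares(groups):
    """Fraction of the dataset that belongs to each group."""
    counts = Counter(groups)
    total = len(groups)
    return {g: n / total for g, n in counts.items()}


def error_rate_by_group(groups, true_labels, predicted_labels):
    """Fraction of misclassified samples within each group."""
    errors = Counter()
    totals = Counter()
    for g, t, p in zip(groups, true_labels, predicted_labels):
        totals[g] += 1
        errors[g] += int(t != p)
    return {g: errors[g] / totals[g] for g in totals}


# Made-up data: 8 samples from one group, 2 from another.
groups = ["lighter"] * 8 + ["darker"] * 2
true_labels = ["face"] * 10
predicted_labels = ["face"] * 8 + ["no_face", "face"]

shares = group_shares(groups)
errors = error_rate_by_group(groups, true_labels, predicted_labels)
```

In this toy data the smaller group makes up 20% of the samples and has a much higher error rate, which is the pattern a representativeness check is meant to surface.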

References

  1. .
  2. Johnson, Leif. "What is the difference between labeled and unlabeled data?", Stack Overflow, 4 October 2013. Retrieved on 13 May 2017. This article incorporates text by lmjohns3 available under the CC BY-SA 3.0 license.
  3. .
  4. .