What is upsampling and downsampling?
In a classification task, there is a high chance for the algorithm to be biased if the dataset is imbalanced. An imbalanced dataset is one in which the number of samples in one class is very higher or lesser than the number of samples in the other class. An example of an imbalanced dataset is the task of classifying 10% and 90% car loans given to a bank’s customers. In this scenario, the algorithm is likely to be biased towards classifying all the 10% loans as good and all the 90% loans as bad. An algorithm that is not biased, however, will always classify a class as good or bad. This means that the decision of whether a loan is good or bad, for example, can’t be determined without determining the probability of each class.
Sampling is a method of data collection where we happen to observe a small subset of the population. One of the biggest problems with sampling is that if it is done in an imbalanced way, then we end up with biased data. To counter such imbalanced datasets, we use a technique called up-sampling and down-sampling. Up-sampling and down-sampling work by taking a small subset of the population, and then applying a correction or weighting process to that subset to ensure that the data collected is reflective of the population as a whole. For example, in an experiment on people’s opinions on different opinions, we might want to find out what percentage of people agree with opinion A and opinion B.
In up-sampling, we randomly duplicate the observations from the minority class in order to reinforce its signal. The most common way is to resample with replacement. This is equivalent to creating a random variable that has mean 0, and variance 1. For example, if we wanted to apply up-sampling on the data in the following table: Class A Class B Class C Number of Observations 5 10 15 Mean 3.5 4 5 Variance 2 2 1Number of Observations resampled with replacement: 5, 10, 15. Mean resampled with replacement: 2.05, 2.75, 3 .5. Variance resampled with replacement: 2, 2, 1. This ensures that the mean and variance of the resampled data are consistent with those of the original data set.
In down-sampling, we randomly remove the observations from the majority class. Thus after up-sampling or down-sampling, the dataset becomes balanced with same number of observations in each class .The final result of up-sampling and down-sampling is to preserve the distribution of the data.. .One way of decreasing the variance in a dataset is to down-sample the data.