3 Concepts Every Data Scientist Must Know Part - 3

1. What is the significance of sampling? Name some sampling techniques.

With large datasets, we often cannot analyze the whole volume at once. Instead, we take a sample that represents the whole population. When drawing the sample, we should select the data so that it is a true representative of the whole dataset.

In statistics, there are two main types of sampling techniques: Probability Sampling and Non-Probability Sampling.

Probability Sampling

Simple Random Sampling, Cluster Sampling, Stratified Sampling.

Non-Probability Sampling

Convenience Sampling, Quota Sampling, Snowball Sampling.
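The three probability sampling techniques named above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the function names, the `seed` parameter, and the customer population with a `city` field are all assumptions made up for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(population, k, seed=0):
    """Pick k items so that every item has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, k)


def stratified_sample(population, key, fraction, seed=0):
    """Draw the same fraction from every stratum (group) defined by key."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


def cluster_sample(population, key, n_clusters, seed=0):
    """Pick whole clusters at random and keep every item inside them."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for item in population:
        clusters[key(item)].append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


# Hypothetical population: 100 customers spread evenly over 4 cities.
cities = ["A", "B", "C", "D"]
population = [{"id": i, "city": cities[i % 4]} for i in range(100)]

random_part = simple_random_sample(population, 10)
stratified_part = stratified_sample(population, lambda p: p["city"], 0.2)
cluster_part = cluster_sample(population, lambda p: p["city"], 2)
```

Stratified sampling keeps the proportion of each city the same as in the population, while cluster sampling keeps every customer from a few randomly chosen cities.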

2. What are Type 1 and Type 2 Errors?

Rejecting a true null hypothesis is known as a Type 1 error. In simple terms, a false positive is a Type 1 error.

Failing to reject a false null hypothesis is known as a Type 2 error. In simple terms, a false negative is a Type 2 error.

A Type 1 error matters most when a false positive is costly. For example, if a man who does not have cancer is marked as positive, the medications given to him might damage his organs.
A Type 2 error matters most when a false negative is costly. For example, an alarm has to be raised when a bank is being burgled. If the system classifies the burglary as a false case, the alarm is not raised in time, resulting in a heavy loss.
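To make the two error types concrete, the sketch below counts them from a list of true labels and predicted labels. It assumes that label 1 means "positive" (for example, disease present or burglary detected), label 0 means "negative", and that the null hypothesis corresponds to the negative case. The function name and the sample labels are made up for illustration.

```python
def error_counts(actual, predicted):
    """Count Type 1 (false positive) and Type 2 (false negative) errors.

    Labels: 1 = positive, 0 = negative (assumed convention).
    """
    # Type 1: truly negative, but predicted positive.
    type1 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    # Type 2: truly positive, but predicted negative.
    type2 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return type1, type2


# Hypothetical labels for 8 cases.
actual = [0, 0, 1, 1, 0, 1, 0, 1]
predicted = [0, 1, 1, 0, 0, 1, 1, 1]

type1, type2 = error_counts(actual, predicted)

# Rates relative to the number of truly negative / truly positive cases.
false_positive_rate = type1 / actual.count(0)
false_negative_rate = type2 / actual.count(1)

print(type1, type2)  # 2 1
print(false_positive_rate, false_negative_rate)  # 0.5 0.25
```

Which of the two rates to minimize depends on the application: in the cancer example the false positive is the costly one, while in the burglary example it is the false negative.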

3. What is the difference between Normalization and Standardization?

Normalization is the process of bringing features into a common range so that the model performs well and is not biased towards any particular feature. For example, suppose a dataset has multiple features, one of which is Age in the range 18 to 60 and another is Salary in the range 20,000 to 20,000,000. The two features differ greatly in magnitude: Age is a two-digit integer, while Salary is several orders of magnitude larger. To bring the features into a comparable range, we need Normalization.

Both Normalization and Standardization are methods of feature scaling. However, they differ in how they transform the data. After Normalization, the data lies in the range 0 to 1. After Standardization, the data is scaled so that its mean is 0 and its standard deviation is 1.
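The difference can be seen by applying both transformations to the Age and Salary features from the example. The sketch below assumes min-max scaling, (x - min) / (max - min), for Normalization and the z-score, (x - mean) / standard deviation, for Standardization. The function names and the sample values are made up for illustration.

```python
import statistics


def min_max_normalize(values):
    """Rescale values to the range 0 to 1: (x - min) / (max - min).

    Assumes the values are not all identical (max != min).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1: (x - mean) / std.

    Uses the population standard deviation; assumes it is not zero.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical values within the ranges mentioned in the text.
ages = [18, 25, 32, 46, 60]
salaries = [20000, 55000, 120000, 900000, 20000000]

ages_normalized = min_max_normalize(ages)
salaries_normalized = min_max_normalize(salaries)
ages_standardized = standardize(ages)
salaries_standardized = standardize(salaries)
```

After either transformation, Age and Salary are on comparable scales, so neither feature dominates simply because of its units.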

Nomidl