Missing Value Treatment Methods in Machine Learning
- Naveen
Delete Missing Value Rows
- Missing values can be handled by deleting the rows or columns that contain null values.
- As a rule of thumb, if more than half of the rows in a column are null, the entire column can be dropped.
- Rows in which one or more column values are null can also be dropped.
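The two deletion rules above can be sketched with pandas. This is a minimal illustration, not a definitive recipe: the DataFrame, its column names, and the 50% threshold are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, 22],
    "income": [50_000, 62_000, np.nan, 58_000, 45_000],
    "notes":  [np.nan, np.nan, np.nan, "ok", np.nan],  # 80% missing
})

# Rule 1: drop columns where more than half of the rows are null.
null_fraction = df.isna().mean()           # fraction of nulls per column
df_cols = df.loc[:, null_fraction <= 0.5]  # keeps "age" and "income"

# Rule 2: drop any remaining row that has at least one null value.
df_clean = df_cols.dropna(axis=0, how="any")

print(df_clean)
```

Dropping the sparse column first matters here: if rows were dropped first, the mostly empty "notes" column would remove almost every row.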
Pros:
- A model trained only on complete rows avoids the noise that estimated values can introduce, which can make it more robust.
- Easy to implement.
Cons:
- Discards a lot of information, since a whole row is dropped even if only one of its values is missing.
- Works poorly when the percentage of missing values is high.
Impute Missing Values
- For numerical columns, a missing value can be replaced by the mean, median, or mode of the remaining values in the column.
- For categorical columns, a missing value can be replaced by the most frequent observation in the column.
- A regression or classification algorithm can be trained on the remaining data and then used to predict the missing values.
- Imputation methods such as KNN or MICE (multiple imputation by chained equations) can also be used to impute missing values.
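A minimal sketch of the first, second, and fourth approaches above, assuming pandas and scikit-learn are installed. The data, the column names, and the choice of `n_neighbors=2` are hypothetical and chosen only to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age":    [25.0, np.nan, 31.0, 40.0, 22.0],
    "income": [50.0, 62.0, np.nan, 58.0, 45.0],
    "city":   ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

# Simple imputation: median or mean for numerical columns,
# most frequent value (mode) for the categorical column.
simple = df.copy()
simple["age"] = simple["age"].fillna(simple["age"].median())
simple["income"] = simple["income"].fillna(simple["income"].mean())
simple["city"] = simple["city"].fillna(simple["city"].mode()[0])

# KNN imputation: each missing numerical value is estimated from
# the rows that are closest on the features that are present.
numeric_cols = ["age", "income"]
knn = KNNImputer(n_neighbors=2)
knn_values = knn.fit_transform(df[numeric_cols])
knn_df = pd.DataFrame(knn_values, columns=numeric_cols, index=df.index)

print(simple)
print(knn_df)
```

`KNNImputer` works on numerical data only, so the categorical column is left out of that step. Whether the median or the mean is the better fill value depends on the column: the median is less sensitive to outliers.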
Pros:
- Preserves all cases by replacing missing data with an estimated value based on the available information.
Cons:
- Doesn’t work well when the percentage of missing values is high. Each technique also has its own disadvantages, so one must choose carefully.
Using Algorithms That Support Missing Values
- Some ML algorithms can work directly with missing values in the dataset, e.g. XGBoost and, depending on the implementation, KNN and Random Forest.
Pros:
- No need to handle missing values column by column, since the algorithm deals with them internally.
Cons:
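- Support depends on the library and the estimator. In scikit-learn, for example, most estimators reject NaN inputs; HistGradientBoostingClassifier and HistGradientBoostingRegressor are among the exceptions.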