Delete Rows or Columns with Missing Values
Missing values can be handled by deleting the rows or columns that contain nulls.
If a column has more than half of its rows as null, the entire column can be dropped.
Rows with one or more null values can also be dropped (see the sketch after the pros and cons below).
Pros:
Removing all records with missing values can make the resulting model more reliable, since incomplete records may add noise to the data.
Easy to implement.
Cons:
Can cause a large loss of information.
Works poorly when the percentage of missing values is high.
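A minimal sketch of the deletion approach using pandas; the DataFrame df and its columns are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries.
df = pd.DataFrame({
    "age":  [25, np.nan, 40, 31, np.nan, 29],
    "city": ["NY", "LA", None, "SF", "LA", None],
    "note": [None, None, None, None, "ok", None],
})

# Drop columns where more than half of the rows are null.
df = df.loc[:, df.isnull().mean() <= 0.5]

# Drop rows that still contain one or more nulls.
df = df.dropna(axis=0)
print(df)
```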
Impute Missing Values
For numerical columns, missing values can be replaced with the mean, median, or mode of the remaining values in the column.
For categorical columns, missing values can be replaced with the most frequent category in the column.
A regression or classification model can be trained on the rows with complete data and then used to predict the missing values.
ML techniques such as KNN imputation or the MICE algorithm can also be used to impute missing values (a sketch follows the pros and cons below).
Pros:
Preserves all cases by replacing missing data with values estimated from the available information.
Cons:
Doesn't work well when the percentage of missing values is high. Each imputation technique has its own drawbacks, so one must choose carefully based on the data.
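A minimal sketch of the imputation strategies above using scikit-learn; the DataFrame and its column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical data with missing entries.
df = pd.DataFrame({
    "age":  [25.0, np.nan, 40.0, 31.0],
    "city": ["NY", "LA", None, "SF"],
})

# Numerical column: fill with the median of the observed values.
df[["age"]] = SimpleImputer(strategy="median").fit_transform(df[["age"]])

# Categorical column: fill with the most frequent category.
df[["city"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]])

# KNN imputation: each missing entry is estimated from the n_neighbors
# most similar rows, measured on the features that are present.
# For a MICE-style approach, see sklearn.impute.IterativeImputer (experimental).
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```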
Using Algorithms That Support Missing Values
Some ML algorithms are robust to missing values in the dataset: for example KNN (with distance measures that skip missing entries), tree-based methods with surrogate splits such as some Random Forest implementations, and gradient boosting libraries such as XGBoost (see the sketch after the pros and cons below).
Pros:
No separate preprocessing is needed for each column, as the algorithm handles missing values internally.
Cons:
Most scikit-learn estimators do not accept missing values, so these algorithms often require a separate library such as XGBoost; scikit-learn's HistGradientBoosting estimators are a notable exception.
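A minimal sketch of training directly on data that still contains missing values, using scikit-learn's HistGradientBoostingClassifier on synthetic data:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Synthetic data; roughly 10% of entries are left missing as np.nan.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan
y = (np.nansum(X, axis=1) > 0).astype(int)

# Each tree split learns a default direction for samples whose feature
# value is missing, so no imputation step is required before fitting.
clf = HistGradientBoostingClassifier().fit(X, y)
print(clf.predict(X[:5]))
```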