Missing Value Treatment Methods in Machine Learning

Delete Missing Value Rows

  • Missing values can be handled by deleting the rows or columns that contain null values.
  • If more than half of the rows in a column are null, the entire column can be dropped.
  • Rows that have a null in one or more columns can also be dropped.
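
The two deletion rules above can be sketched with pandas. This is a minimal illustration, not a definitive recipe: the toy DataFrame, its column names, and the 0.5 threshold are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50000, 62000, np.nan, 58000],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Fraction of null values in each column.
null_fraction = df.isna().mean()

# Keep only the columns where at most half of the rows are null.
df_cols = df.loc[:, null_fraction <= 0.5]

# From what is left, drop every row that still has at least one null.
df_clean = df_cols.dropna(axis=0, how="any")

print(df_clean)
```

Dropping sparse columns first and rows second usually keeps more rows than doing it the other way round, because a mostly-empty column would otherwise cause almost every row to be removed.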

Pros:

  • A model trained only on complete rows never sees estimated (imputed) values, which can make it robust when only a few values are missing and they are missing at random.
  • Easy to implement.

Cons:

  • A lot of information can be lost, since every dropped row or column also discards its non-null values.
  • Works poorly when the percentage of missing values is high.

Impute Missing Values

  • For numerical columns, a missing value can be replaced by the mean, median, or mode of the remaining values in the column.
  • For categorical columns, a missing value can be replaced by the most frequent observation in the column.
  • A regression or classification model can be trained on the rows where the value is present and then used to predict (impute) the missing values.
  • ML-based methods such as KNN imputation or MICE (Multivariate Imputation by Chained Equations) can also be used to impute missing values.
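
The imputation options above could look roughly like this with pandas and scikit-learn. This is a sketch under assumptions: the toy data and column names are made up, the median is picked arbitrarily as the statistic, and scikit-learn's `IterativeImputer` is used as a MICE-style stand-in rather than a dedicated MICE library. It is still marked experimental in scikit-learn, hence the extra enabling import.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
    "city": ["Pune", "Delhi", np.nan, "Pune"],
})
num_cols = ["age", "income"]

# 1) Simple statistics: median for numerical columns ...
df_simple = df.copy()
df_simple[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
# ... and the most frequent value for the categorical column.
df_simple["city"] = df["city"].fillna(df["city"].mode()[0])

# 2) KNN imputation: fill each gap from the nearest rows.
#    In practice the features should be scaled first, since KNN is distance based.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])

# 3) Model-based (MICE-style) imputation: each column with gaps is
#    regressed on the other columns, repeated over several rounds.
mice_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(df[num_cols])

print(df_simple)
```

In a real project the imputers should be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.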

Pros:

  • Preserves all cases by replacing missing data with an estimated value based on the available information.

Cons:

  • Doesn’t work well when the percentage of missing values is high. Each technique also has its own disadvantages, so one must be careful when choosing one.

Use Algorithms That Support Missing Values

  • Some ML algorithms can work with missing values in the dataset directly, for example XGBoost and some implementations of KNN and Random Forest. Whether this holds depends on the specific library and version.

Pros:

  • No separate imputation step is needed for each column, since the algorithm handles missing values itself.

Cons:

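  • Support in the scikit-learn library is limited to a few estimators (for example the histogram-based gradient boosting models); many others raise an error when the input contains NaN.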
