Multicollinearity in Machine Learning Models

In this blog post, we will discuss the concept of multicollinearity in machine learning models. This is a common problem that data scientists encounter when fitting a machine learning model. We will explore what multicollinearity is, why it is a problem, and how to address it. So, let’s dive in!

Multicollinearity

What is Multicollinearity?

Multicollinearity refers to a scenario in which two or more independent variables in a dataset are highly correlated. Correlation measures the strength and direction of the linear relationship between two variables. For example, in a dataset of employee records with one column for the employee's age and another for years of experience, these two variables will very likely be positively correlated: as age increases, years of experience tends to increase as well. Variables can also be negatively correlated, such as age and the number of years left until retirement. When independent variables are strongly correlated with one another, whether positively or negatively, the condition is referred to as multicollinearity.
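To make the idea concrete, here is a minimal sketch that builds a small synthetic employee table and prints its correlation matrix. The column names and numbers are hypothetical, chosen only to illustrate the positive and negative correlations described above.

import numpy as np
import pandas as pd

# Hypothetical employee data, for illustration only
rng = np.random.default_rng(42)
age = rng.integers(22, 60, size=200)
experience = age - 22 + rng.normal(0, 2, size=200)        # tends to rise with age
years_to_retire = 65 - age + rng.normal(0, 1, size=200)   # tends to fall as age rises

employees = pd.DataFrame({
    'age': age,
    'experience': experience,
    'years_to_retire': years_to_retire,
})

# Pairwise Pearson correlations; values close to +1 or -1 flag potential multicollinearity
print(employees.corr())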

Why is Multicollinearity a Problem?

Multicollinearity poses a problem in regression models, such as linear regression or logistic regression, because it affects the interpretation of the model coefficients. The purpose of a regression model is to understand how each independent variable individually impacts the target variable. In a regression equation such as y = m1*x1 + m2*x2 + c, the coefficients are interpreted as follows: for every one-unit increase in x1, y changes by m1, holding all other variables constant; similarly, a one-unit increase in x2 changes y by m2. However, when multicollinearity is present, the coefficient values become unstable and their interpretation becomes unreliable.

To understand why multicollinearity affects the coefficients, let's consider a simple example. Suppose we have a regression equation for sales prediction, where the independent variables are the advertising budget and the production quantity. Now imagine that we accidentally capture both the total advertising budget and the TV advertising budget as separate variables. Since the TV advertising budget is a component of the total advertising budget, these two variables will be highly correlated. This is an example of multicollinearity. When we fit a regression model with these variables, the coefficient values become distorted, because the model cannot separate the individual contributions of two variables that move together, as the short sketch below illustrates.
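Here is a minimal sketch of that effect with synthetic data (the variable names and numbers are illustrative, not from a real dataset). We fit the same linear regression on two random halves of the data; because the two advertising columns are nearly copies of each other, their coefficients can swing noticeably between fits even though the overall predictions remain similar.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
total_ads = rng.uniform(50, 150, n)                  # total advertising budget
tv_ads = 0.6 * total_ads + rng.normal(0, 0.5, n)     # TV budget, almost a copy of the total
production = rng.uniform(100, 300, n)
sales = 5 * total_ads + 2 * production + rng.normal(0, 10, n)

X = np.column_stack([total_ads, tv_ads, production])

# Fit the same model on two different random subsets of the rows
for seed in (1, 2):
    idx = np.random.default_rng(seed).choice(n, size=n // 2, replace=False)
    model = LinearRegression().fit(X[idx], sales[idx])
    print('coefficients (total_ads, tv_ads, production):', np.round(model.coef_, 2))

# The coefficients on total_ads and tv_ads typically differ between the two fits,
# because the model cannot separate two predictors that move together.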

Ways to Address Multicollinearity

Now that we understand the problem of multicollinearity, let’s look at some ways to address it.

1. Removing Correlated Variables

If you have a limited number of features in your dataset, a simple approach is to compute a correlation matrix, which shows the correlation value for every pair of variables. By setting a threshold on the absolute correlation, such as 0.9, you can flag highly correlated pairs. In such cases, it is usually advisable to drop one variable from each highly correlated pair. For example, if x1 and x2 are highly correlated, you can remove one of them to mitigate the multicollinearity problem.
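The helper below is one common way to apply such a threshold with pandas; it is a sketch rather than the post's original code, and it assumes a DataFrame containing only numeric feature columns.

import numpy as np
import pandas as pd

def drop_correlated_features(features, threshold=0.9):
    # Absolute pairwise correlations between all feature columns
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Drop one column from every pair whose correlation exceeds the threshold
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop

# Example usage (features is assumed to be a numeric DataFrame):
# reduced_features, dropped = drop_correlated_features(features, threshold=0.9)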

2. Advanced Regression Techniques

Another approach is to use regularized regression techniques, such as Lasso and Ridge regression. These techniques add a penalty on the size of the coefficients (an L1 penalty for Lasso, an L2 penalty for Ridge), which shrinks them and stabilizes their estimates when predictors are correlated. Lasso can even drive some coefficients to exactly zero, effectively dropping redundant variables. By applying these techniques, you can reduce the impact of multicollinearity on the coefficients and improve the interpretability of the model.

Implementation in Python

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create a sample dataset with multicollinearity
np.random.seed(0)
n_samples = 100
x1 = np.random.rand(n_samples) * 10  # Independent variable 1
x2 = 0.8 * x1 + np.random.randn(n_samples)  # Independent variable 2 with high correlation to x1
y = 2 * x1 + 3 * x2 + np.random.randn(n_samples)  # Target variable

# Create a DataFrame
data = pd.DataFrame({'X1': x1, 'X2': x2, 'Y': y})

# Split the dataset into training and testing sets
X = data[['X1', 'X2']]
y = data['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Calculate Mean Squared Error (MSE) on test data
lr_predictions = lr_model.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_predictions)
print("Linear Regression MSE:", lr_mse)

# Output
# Linear Regression MSE: 0.8247709745620886

# Fit a Lasso Regression model (L1 regularization)
lasso_model = Lasso(alpha=1.0)  # You can adjust the alpha (regularization strength)
lasso_model.fit(X_train, y_train)

# Calculate MSE for Lasso model
lasso_predictions = lasso_model.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_predictions)

print("Lasso Regression MSE:", lasso_mse)

# Output
# Lasso Regression MSE: 1.1267388657834299

# Fit a Ridge Regression model (L2 regularization)
ridge_model = Ridge(alpha=1.0)  # You can adjust the alpha (regularization strength)
ridge_model.fit(X_train, y_train)

# Calculate MSE for Ridge model
ridge_predictions = ridge_model.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_predictions)

print("Ridge Regression MSE:", ridge_mse)

# Output
# Ridge Regression MSE: 0.8343187834718778

Explanation:

  1. In this example, we create a sample dataset with two independent variables, X1 and X2, and a target variable Y. X2 is intentionally highly correlated with X1 to simulate multicollinearity.
  2. We split the dataset into training and testing sets to evaluate the models’ performance.
  3. We fit three different regression models: Linear Regression, Lasso Regression (L1 regularization), and Ridge Regression (L2 regularization), the latter two to address multicollinearity.
  4. We calculate the Mean Squared Error (MSE) on the test data for each model. MSE measures how well the model performs, with lower values indicating better performance. (A short coefficient comparison follows after this list.)
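Because the original concern with multicollinearity is coefficient interpretability, it can also be instructive to look at the coefficients themselves, not just the MSE. The snippet below is a small extension of the example above (it is not part of the original code) and assumes lr_model, lasso_model, and ridge_model have already been fit as shown; under multicollinearity, the regularized models typically yield smaller, more stable coefficients than plain Linear Regression.

# Compare fitted coefficients of the three models
for name, model in [('Linear', lr_model), ('Lasso', lasso_model), ('Ridge', ridge_model)]:
    print(f'{name:>6} coefficients:', np.round(model.coef_, 3))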

Conclusion

In this blog post, we discussed the concept of multicollinearity in machine learning models. We learned that multicollinearity occurs when two or more independent variables are highly correlated, which poses a problem in regression models because it affects the interpretation of the coefficients. To address multicollinearity, we can remove correlated variables or use regularized regression techniques like Lasso and Ridge regression, as illustrated in the Python example above. This helps ensure that the coefficients more faithfully represent the individual impacts of the variables on the target variable. We hope this blog post has given you a clear understanding of multicollinearity and its implications in machine learning models.

If you found this article helpful and insightful, I would greatly appreciate your support. Thank you for taking the time to read this article.
