Machine Learning Interview Questions Part 3
1 – Explain the difference between Variance and R squared?
Variance is a statistical measure of the dispersion of a distribution: how far the values spread out from the mean. It is the average squared deviation from the mean. For a sample, sum the squared differences between each data point and the sample mean, then divide by one less than the number of data points (dividing by the number of data points instead gives the population variance). For example, suppose the average height of adult males in America is 70 inches and the standard deviation is 5 inches. The variance is the square of the standard deviation, 5² = 25 square inches. Because variance is expressed in squared units, the standard deviation is usually easier to interpret.
R squared (R²), also called the coefficient of determination, measures how well a regression model explains the variation in an outcome variable. It is the proportion of the variance in the outcome that the model accounts for: R² = 1 - (sum of squared residuals) / (total sum of squares). Values typically range from 0 (the model explains none of the variance) to 1 (a perfect fit), and R² can be negative when a model fits worse than simply predicting the mean. All else equal, a higher R² means a better fit to the data, although a high R² on training data alone does not guarantee accurate predictions on new data. In short, variance describes the spread of a single variable, while R² describes how much of that spread a model explains.
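As a rough illustration of the two quantities, the sketch below computes a sample variance with Python's standard library and R² with a small helper. The function name r_squared and all the data values are made up for this example.

```python
from statistics import mean, pvariance, variance


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Illustrative heights in inches (not real survey data).
heights = [65, 70, 75]
sample_var = variance(heights)       # divides by n - 1
population_var = pvariance(heights)  # divides by n

# Hypothetical targets and model predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y_true, y_pred)

print(sample_var, population_var, r2)
```

Variance needs only one variable, while R² needs both the observed values and a model's predictions for them.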
2 – Difference between Normalization and Standardization?
Normalization (min-max scaling) rescales each feature to a fixed range, usually 0 to 1, so that features measured on different scales become comparable. Each value is transformed as (x - min) / (max - min), where min and max are computed over the whole feature column, not for one particular observation. Normalization preserves the shape of the original distribution but is sensitive to outliers, because a single extreme value stretches the range and compresses the rest of the data.
Standardization (z-score scaling) transforms a variable to have a mean of zero and a standard deviation of one by subtracting the mean and dividing by the standard deviation: z = (x - mean) / (standard deviation). The result is not bounded to a fixed range. Standardization changes the location and scale of the data but not the shape of its distribution, so it does not turn non-normal data into normally distributed data.
Normalization and Standardization are not interchangeable. Normalization is a common choice when an algorithm expects inputs in a bounded range or when the data does not follow a bell curve, for example image pixel values fed to a neural network. Standardization is usually preferred when features are roughly normally distributed, when outliers are present, or for methods that are sensitive to feature scale and centering, such as SVM, PCA, and regularized linear models.
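The two rescaling formulas can be sketched in a few lines of Python. The helper names min_max_normalize and standardize are made up for this example, and both assume the input contains at least two distinct values (otherwise they divide by zero).

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


data = [10.0, 20.0, 30.0, 40.0, 50.0]  # illustrative feature column
normalized = min_max_normalize(data)
standardized = standardize(data)

print(normalized)
print(standardized)
```

In practice the minimum, maximum, mean, and standard deviation should be computed on the training set only and then reused to transform validation and test data.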
3 – Entropy and information gain in Decision tree algorithm?
A decision tree is a machine learning algorithm used for both classification and regression. It follows a divide-and-conquer principle: it recursively splits the data into smaller subsets based on feature values, which makes the resulting model easy to interpret.
In machine learning, entropy is a measure of the uncertainty, or impurity, in a random variable. For a set of examples with class proportions p_1, ..., p_k, the entropy is H = -(p_1 log2 p_1 + ... + p_k log2 p_k), measured in bits. It is 0 when all examples belong to one class and highest when the classes are evenly mixed.
Information gain is the reduction in entropy achieved by splitting a set of examples on a feature: the entropy of the parent node minus the weighted average entropy of the child nodes.
When building a decision tree, the algorithm computes the information gain of each candidate split, selects the split with the highest gain, and then repeats the process on each child node.
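A minimal sketch of how entropy and information gain could be computed for one candidate split. The function names and the tiny yes/no label lists are invented for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
pure_split = [["yes", "yes"], ["no", "no"]]     # separates the classes perfectly
useless_split = [["yes", "no"], ["yes", "no"]]  # children are as mixed as the parent

print(information_gain(parent, pure_split))
print(information_gain(parent, useless_split))
```

A tree builder would evaluate information_gain for every candidate split and keep the one with the largest value.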
4 – Difference between Gradient boosting and Random Forest?
Gradient boosting is an ensemble technique, also called gradient boosting machines (GBM), that builds a strong predictive model from many weak learners, usually shallow decision trees. Trees are added sequentially, and each new tree is fit to the negative gradient of the loss function with respect to the current predictions (for squared error, these are the residuals), so that it corrects the errors of the trees before it. This amounts to gradient descent in function space. Gradient boosting is often used in predictive modelling problems, for both classification and regression.
Random forest is an ensemble algorithm that trains many decision trees independently, each on a bootstrap sample of the training data and with a random subset of features considered at each split, and then averages their predictions (regression) or takes a majority vote (classification). The main difference is that random forest builds its trees independently and mainly reduces variance, while gradient boosting builds its trees one after another and mainly reduces bias. As a result, gradient boosting is often more accurate when carefully tuned, but it is more sensitive to hyperparameters and more prone to overfitting noisy data.
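The contrast between sequential and independent tree building can be sketched with one-split regression trees ("stumps") on one-dimensional data. This is a toy illustration under simplifying assumptions: squared-error loss, a single feature (so the random feature subsets of a real random forest are not shown), and made-up names (fit_stump, boosted, forest) and data.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D inputs by minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        l, r = mean(left), mean(right)
        err = sum((y - l) ** 2 for y in left) + sum((y - r) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, l, r)
    if best is None:  # only one distinct x value: predict a constant
        c = mean(ys)
        return lambda x: c
    _, t, l, r = best
    return lambda x: l if x <= t else r


def boosted(xs, ys, rounds=20, lr=0.5):
    """Sequential: each stump is fit to the residuals of the ensemble so far."""
    base = mean(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(rounds):
        s = fit_stump(xs, residuals)
        stumps.append(s)
        residuals = [r - lr * s(x) for x, r in zip(xs, residuals)]
    return lambda x: base + lr * sum(s(x) for s in stumps)


def forest(xs, ys, trees=20, seed=0):
    """Independent: each stump is fit to a bootstrap sample; outputs are averaged."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(s(x) for s in stumps)


def mse(model, xs, ys):
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


xs = list(range(10))       # illustrative inputs
ys = [x * x for x in xs]   # illustrative targets
gb, rf = boosted(xs, ys), forest(xs, ys)
print(mse(gb, xs, ys), mse(rf, xs, ys))
```

In boosted, the order of the stumps matters because each one depends on the residuals left by the previous ones. In forest, the stumps could be trained in any order or in parallel.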
5 – Explain optimizers?
Optimizers are the algorithms that machine learning and deep learning models use to tune their parameters. Given a model architecture and a training dataset, an optimizer iteratively updates the parameters (for example, the weights of a neural network) in the direction that reduces the loss, using the gradient of the loss with respect to those parameters. Commonly used optimizers include gradient descent, stochastic gradient descent (SGD), mini-batch SGD, SGD with momentum, RMSprop, and Adam.
The goal of an optimizer is to minimize the loss function by optimizing the parameters of a model.
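The simplest optimizer, plain gradient descent, can be sketched as a loop that repeatedly steps against the gradient. The function name, the toy loss (w - 3)², and the learning rate are assumptions chosen for this example.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly move the parameter w against the gradient of the loss."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w


# Toy loss: loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# The minimum is at w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_star)
```

Optimizers such as SGD with momentum or Adam follow the same loop but change how the update step is computed from the gradient.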
6 – Difference between a multi-label classification problem and a multi-class classification problem?
A multi-label classification problem is one where each example can be assigned several labels at the same time. For example, a product in an online store can be tagged as both "electronics" and "gift idea".
A multi-class classification problem is one where there are more than two possible classes and each example belongs to exactly one of them. For example, a customer review can be rated from one to five stars, and each review receives exactly one rating. A problem with only two mutually exclusive labels, such as happy or not happy with a purchase, is binary classification.
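One way to see the difference is in how the targets are encoded. In the standard usage, a multi-class target has exactly one class per example, while a multi-label target can have any number of labels per example. The tag names and helper functions below are made up for this sketch.

```python
TAGS = ["electronics", "gift idea", "sale"]  # hypothetical label set


def encode_multiclass(label):
    """Multi-class: exactly one class per example (a one-hot vector)."""
    return [1 if t == label else 0 for t in TAGS]


def encode_multilabel(labels):
    """Multi-label: any number of labels per example (a multi-hot vector)."""
    return [1 if t in labels else 0 for t in TAGS]


single = encode_multiclass("sale")
several = encode_multilabel({"electronics", "gift idea"})
print(single, several)
```

A multi-class target vector always sums to 1, whereas a multi-label vector can sum to anything from 0 to the number of labels.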
7 – Goal of A/B Testing?
The goal of A/B testing is to identify the best performing variation in a given scenario. This means that you are comparing two versions of the same thing and trying to figure out which one performs better.
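In order to do this, you need to set up a hypothesis, randomly assign users to each version, collect data, and then use a statistical test to determine whether the observed difference is large enough to conclude that one version is better.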
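One common statistical method for comparing conversion rates is a two-proportion z-test. The sketch below is one possible choice, not the only valid test; the function name and the visitor and conversion counts are invented for illustration, and it assumes both groups are large and have at least one conversion and one non-conversion overall.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: version A converts 100 of 1000 visitors,
# version B converts 150 of 1000 visitors.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(z, p_value)
```

A small p-value (for example, below a significance level of 0.05 chosen in advance) suggests that the difference between the two versions is unlikely to be due to chance alone.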
8 – What are Type I and Type II errors?
Type I and Type II are the two types of errors that can occur in hypothesis testing and, by extension, in the predictions of a classification model.
A Type I error is also known as a false positive. It occurs when the model predicts that something is present when it is not. For example, an email might be marked as spam, but it is really not spam.
A Type II error is also known as a false negative. It occurs when the model fails to detect something that is present. For example, an email might be marked as not spam, but it really should have been flagged as spam.
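Using the standard convention (Type I error = false positive, Type II error = false negative), both error counts can be read off by comparing predictions with the true labels. The function name and the small spam/ham lists are made up for this sketch.

```python
def error_counts(y_true, y_pred, positive="spam"):
    """Return (false positives, false negatives) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    # Type I error: predicted positive, but actually negative.
    false_pos = sum(1 for t, p in pairs if p == positive and t != positive)
    # Type II error: predicted negative, but actually positive.
    false_neg = sum(1 for t, p in pairs if p != positive and t == positive)
    return false_pos, false_neg


y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham"]
type_1, type_2 = error_counts(y_true, y_pred)
print(type_1, type_2)
```

Here the second email is a Type I error (a legitimate email flagged as spam) and the fourth is a Type II error (spam that was missed).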
9 – How to reduce overfitting?
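Overfitting means that a model has learned the noise and quirks of its training data instead of the underlying pattern, so it performs well on the training set but does not generalize to the real world. It is more likely when the model is very complex or the training dataset is small. Common ways to reduce it: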
1- Reduce number of features: One way to reduce overfitting is by reducing the number of features in the model. This is done by removing irrelevant or redundant features.
2- Regularization: Another way to reduce overfitting is by regularization, which involves adding an additional penalty term for complexity in order to prevent overfitting. It can be done through the use of L2 or L1 penalties.
3- Choose a simpler, more robust algorithm: A simpler algorithm will also help reduce overfitting, since it has fewer parameters that can be adjusted to fit the noise in a particular data set.
10 – What is Grid Search and Random Search?
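Grid search is a hyperparameter tuning method that exhaustively evaluates every combination of values from a predefined grid. Each hyperparameter is varied systematically, the model’s performance is measured for each combination (typically with cross-validation), and the best combination is kept. The cost grows quickly with the number of hyperparameters.
Random search samples a fixed number of random combinations of hyperparameter values from specified ranges or distributions and evaluates their performance. Because it does not try every combination, it is usually cheaper than grid search, and it often finds a comparably good combination when only a few of the hyperparameters matter.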
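Both strategies can be sketched with the standard library alone. The score function below is a stand-in for "train the model and return its validation score"; it, the hyperparameter names lr and depth, and their ranges are all assumptions made for this example.

```python
import itertools
import random


def score(params):
    """Hypothetical validation score that peaks at lr=0.1, depth=5."""
    return -((params["lr"] - 0.1) ** 2) - 0.01 * (params["depth"] - 5) ** 2


def grid_search(grid, score_fn):
    """Evaluate every combination in the grid and return the best one."""
    keys = list(grid)
    candidates = [dict(zip(keys, combo)) for combo in itertools.product(*grid.values())]
    return max(candidates, key=score_fn)


def random_search(score_fn, n_iter=20, seed=0):
    """Evaluate n_iter randomly sampled combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [
        {"lr": 10 ** rng.uniform(-2, 0), "depth": rng.randint(3, 7)}  # lr sampled on a log scale
        for _ in range(n_iter)
    ]
    return max(candidates, key=score_fn)


grid = {"lr": [0.01, 0.1, 1.0], "depth": [3, 5, 7]}
best_grid = grid_search(grid, score)
best_random = random_search(score)
print(best_grid, best_random)
```

Grid search here always evaluates 3 x 3 = 9 combinations, while random search evaluates exactly n_iter combinations no matter how many hyperparameters there are.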
11 – How do you ensure you are not overfitting a model?
There are three common ways to prevent overfitting:
1. Regularization: A technique that penalizes complex models by adding a term to the loss function that grows with the magnitude of the model's weights (for example, their L1 or L2 norm).
2. Cross-validation: The process of dividing the data into k subsets (folds), training on all but one fold, testing on the held-out fold, and repeating so that every fold serves as the test set once. Averaging the results shows how well a model will generalize.
3. Early stopping: The process of terminating training with an algorithm before overfitting occurs, based on an evaluation of the performance on data held out from training (a validation set).
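The splitting step of k-fold cross-validation can be sketched as follows. The function name is made up, and the sketch does not shuffle the data first, which is usually advisable unless the order of the samples matters.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train_indices, test_indices) pairs."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `remainder` folds get one extra sample.
        end = start + fold_size + (1 if i < remainder else 0)
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        folds.append((train, test))
        start = end
    return folds


folds = k_fold_indices(10, 3)
for train, test in folds:
    print(train, test)
```

Each sample appears in exactly one test fold, so every data point is used for evaluation once and for training k - 1 times.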
12 – How do you fix high variance in a model?
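The variance of a model is how much its predictions change when it is trained on different samples of the training data. When the model is very sensitive to such changes, we say that it has high variance. High variance is a symptom of overfitting: the model performs well on the training data but poorly on unseen data.
In order to fix high variance in a model, we can reduce the complexity of the model, increase the size of the training data set, add regularization, or combine several models with an ensemble method such as bagging.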
13 – What are Hyperparameters? How do they differ from model parameter?
Hyperparameters are the settings of a machine learning algorithm that are chosen before the training process begins. Unlike model parameters, hyperparameters are not learned from training data. They are tuned outside of training, for example by comparing validation performance across several candidate values. They control the structure of the model and how it learns, and so they indirectly determine the parameters the model ends up with.
The most common hyperparameters include the learning rate, number of hidden layers in a neural network, and dropout rates for deep neural networks.
Different machine learning algorithms have different hyperparameters. For example, the K-nearest neighbor algorithm has the number of neighbors k and the distance metric, while regularized linear regression has the regularization strength.
The difference between hyperparameters and parameters is that parameters, such as the weights of a neural network, are learned from the data and are used directly to compute predictions, while hyperparameters are set by the practitioner and affect how well the trained model performs.
14 – Kernels in SVM and their differences
A kernel function in an SVM takes data as input and implicitly maps it into a higher-dimensional feature space in which the classes may become linearly separable. It does this by computing similarities between pairs of points without explicitly constructing the transformed features (the kernel trick).
Gaussian kernel, or Radial Basis Function (RBF) – a general-purpose kernel, often used when there is no prior knowledge about the data. The similarity between two points decays with the squared distance between them.
Sigmoid kernel – based on the hyperbolic tangent function, which is also used as an activation function for artificial neurons. An SVM with this kernel is comparable to a two-layer perceptron neural network.
Polynomial kernel – represents the similarity of vectors in the training set in a feature space over polynomials of the original variables, which lets the model capture interactions between features.
Linear kernel – the plain dot product of two vectors, used when the data is linearly separable.
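The four kernels can be written as small functions of two vectors. The parameter names gamma, coef0, and degree follow a common convention, but the default values here are arbitrary choices for this sketch.

```python
from math import exp, tanh


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """Plain dot product."""
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) ** degree."""
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * ||x - y||^2): equals 1 for identical points, decays with distance."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """tanh(gamma * <x, y> + coef0)."""
    return tanh(gamma * dot(x, y) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]  # illustrative vectors
print(linear_kernel(u, v), polynomial_kernel(u, v, degree=2))
print(rbf_kernel(u, v), sigmoid_kernel(u, v))
```

Each function returns a single similarity score for a pair of points, which is all the SVM needs in order to train and predict.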
15 – How do you handle categorical data?
One-hot encoding is the most common way to deal with non-ordinal (nominal) categorical data. It consists of creating an additional binary feature for each category of the categorical feature and marking whether each observation belongs (value = 1) or does not belong (value = 0) to that category.
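A minimal sketch of one-hot encoding in plain Python. The function name and the color values are made up for this example, and the categories are sorted alphabetically only to make the column order predictable.

```python
def one_hot_encode(values):
    """Return (categories, rows), with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]
categories, rows = one_hot_encode(colors)
print(categories)
for row in rows:
    print(row)
```

Each row contains exactly one 1, in the column of the category that the observation belongs to.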