Loss Functions and optimizers and its type?
With modelling, there’s a particular goal that the model needs to achieve. It’s just as important to achieve the best possible values of the model parameters as it is to find out what each parameter means in terms of that goal. The loss function (cost function) is minimized, therefore getting unknown values for weight and bias ensures you’re maximizing your target (y). The goal of minimizing the cost function/loss function/sum of squares is to find a model that satisfies the objective in this learning problem. For any model, there are numerous parameters and the structure of the model in prediction or classification is expressed in terms of parameters.
In order to evaluate your model, you first need to find the loss function. This can be done by minimizing it, which can act as a driving force for finding an optimal value for each parameter. The loss function most commonly used in regression is L1, while it is L2 which loses less. Loss functions which are most commonly used in classification are softmax cross entropy or sigmoid cross entropy.
Common loss functions
The following is a list of common loss functions:
tf.contrib.losses.log(predictions, labels, weight = 2.0)
Now you should be convinced that you need to use a loss function to get the best value of each parameter of the model. How can you get the best value?
Your initial idea will be most effective for finding the best values of the parameters by using linear regression. From there you need to figure out your path forward. The optimizer is a way to get the best bang for your buck by taking into account every available factor and finding the direction that offers the most value. It runs through different iterations each time and changes the value until it finds what’s best for you. Suppose you have 16 weight values (w1, w2, w3, …, w16) and 4 biases (b1, b2, b3, b4). Initially you can assume every weight and bias to be zero (or one or any number). The optimizer suggests whether w1 (and other parameters) should increase or decrease in the next iteration while keeping the goal of minimization in mind. After many iterations, w1 (and other parameters) would stabilize to the best value (or values).
TensorFlow and all related deep-learning frameworks have optimizers that slowly change the parameters to minimize the loss function. This enables them to give direction on how to set the weight and bias. Let’s assume you’re using deep learning to help train your networks. In each iteration, a certain number of weight values and bias values are changed in order to get the correct ones and minimize loss.
Selecting the best optimizer for the model to converge fast and to learn weights and biases properly is a tricky task.
So adaptive techniques (adadelta, adagrad, etc.) are good for optimizing complex neural networks. On average, Adam is usually the best option. It’s also the most outperforming adaptive technique in most cases and offers better optimization results. Methods such as SGD, NAG, & momentum aren’t always good for sparse data sets, but ADAPTIVE Learning RATE methods are. This makes it easier for you to implement, so you can spend the time focusing on something else. Additionally, it is probable that A-Learn will achieve the best results with no need for defining a learning rate.