How to Create an Image Classification Model using Hugging Face in 5 Lines

In 2021, most people turn to Hugging Face for natural language processing tasks. But did you know that Hugging Face now also offers image-related solutions? The popular Transformers library can classify images as well. In this blog post, I will show you how to classify an image with a pretrained vision transformer using just a few lines of code. Let’s get started!

Installing Transformers

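To begin, you need to install the Transformers library. The code in this post also relies on PyTorch and Pillow, which come preinstalled on Google Colab. You can install Transformers by running the following command: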

pip install -q transformers

Uploading the Image

Next, you will need to upload the image you want to classify. If you are using Google Colab, you can easily upload the image by clicking on the “Files” button, selecting the image file (in this case, “image.jpg”), and clicking “Open” and then “OK”. Once the file is uploaded, you can proceed to the next step.

Reading the Image

Now, let’s read the uploaded image file as a NumPy array. Hugging Face uses a vision transformer (ViT) to predict the class of the image. The image is split into a sequence of fixed-size, non-overlapping patches. These patches are fed into the Transformer encoder as input, and the class prediction is computed from the encoder’s output. Before we can do that, however, we need to resize, rescale, and normalize the image for the model using the vision transformer feature extractor. In the code snippet below, we use a model that expects 224×224 input images split into patches of size 16×16.

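import numpy as np
from PIL import Image
from transformers import ViTImageProcessor  # successor to the deprecated ViTFeatureExtractor

# Read the uploaded file as an RGB NumPy array of shape (height, width, 3).
image = np.array(Image.open("image.jpg").convert("RGB"))
model_name = "google/vit-base-patch16-224"
feature_extractor = ViTImageProcessor.from_pretrained(model_name)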
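
To make the preprocessing and patching steps more concrete, here is a minimal NumPy sketch of what happens to an image that has already been resized to 224×224. It is an illustration, not the library’s actual implementation: the random array stands in for a real photo, the helper names `normalize` and `to_patches` are made up for this post, and the per-channel mean and standard deviation of 0.5 are what I understand this checkpoint’s preprocessing config to use.

```python
import numpy as np

IMAGE_SIZE = 224
PATCH_SIZE = 16

def normalize(image, mean=0.5, std=0.5):
    """Rescale uint8 pixels to [0, 1], then normalize (assumed mean/std of 0.5)."""
    scaled = image.astype(np.float32) / 255.0
    return (scaled - mean) / std

def to_patches(image, patch_size=PATCH_SIZE):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    height, width, channels = image.shape
    rows, cols = height // patch_size, width // patch_size
    # Carve the image into a rows x cols grid of patch_size x patch_size tiles.
    grid = image.reshape(rows, patch_size, cols, patch_size, channels)
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Flatten each tile into one row of the output.
    return grid.reshape(rows * cols, patch_size * patch_size * channels)

# A random array stands in for a real, already-resized photo.
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(IMAGE_SIZE, IMAGE_SIZE, 3), dtype=np.uint8)

pixels = normalize(fake_image)
patches = to_patches(pixels)
print(patches.shape)  # (196, 768)
```

Each of the 196 rows is one 16×16×3 patch flattened to 768 values. In the real model, a learned linear projection turns each patch into an embedding before the encoder sees it.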

Making Predictions

With the image and feature extractor set up, we can now make predictions. In the code snippet below, we pass the preprocessed image to the model and extract the predictions. The output of the model is a tensor of logits of shape 1×1000, where each element is the unnormalized score for one of the 1,000 ImageNet classes. To get the class name, we find the index of the maximum value in the output tensor and look it up in the model’s id2label mapping.

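import torch
from transformers import ViTForImageClassification

# Preprocess the image into a 1×3×224×224 PyTorch tensor.
input_image = feature_extractor(images=image, return_tensors="pt")["pixel_values"]
model = ViTForImageClassification.from_pretrained(model_name)
with torch.no_grad():  # inference only, no gradients needed
    logits = model(pixel_values=input_image).logits
predicted_class_id = torch.argmax(logits, dim=-1)
predicted_class_name = model.config.id2label[predicted_class_id.item()]
print(predicted_class_name)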


In this example, the predicted class is “Football”, which matches the object present in the image.
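
Note that the raw model outputs are logits, not probabilities. If you want confidence scores or the top few candidates, you can apply a softmax. The sketch below does this in plain Python on a handful of made-up logits and labels (the values and the `top_k` helper are purely illustrative, not output from the model); on the real tensor, `torch.softmax(logits, dim=-1)` performs the same computation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, id2label, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Made-up logits and labels, for illustration only.
example_logits = [2.0, 0.5, -1.0, 4.0]
example_labels = {0: "tennis ball", 1: "golf ball", 2: "teapot", 3: "soccer ball"}

for label, prob in top_k(example_logits, example_labels):
    print(f"{label}: {prob:.3f}")
```

Softmax does not change which class wins, so the top entry always agrees with the argmax of the logits; it only makes the scores easier to interpret.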


In conclusion, Hugging Face not only provides amazing solutions for natural language processing tasks but has also ventured into computer vision. In this blog post, we demonstrated how to create an image classification model using the vision transformer from Hugging Face. I hope you found this tutorial informative.
