In this post, I use Python, Pandas & Matplotlib to analyze and answer business questions about 12 months worth of sales data. The data contains hundreds of thousands of electronics store purchases broken down by month, product type, cost, purchase address, etc. The dataset can be downloaded here. In this analysis, Iâm using jupyter notebook.
First, we will import the necessary library like pandas, numpy, seaborn and matplotlib.

We use panda to read the csv file and create a dataframe from it. Now the first task is to merge all 12 monthsâ worth of sales data (12 csv files) into a single csv file. To do that, we need to import new library called os.
We need os library to read all csv filesâ title and call it using for loop. As you can see below.

It will take a little longer time because of heavy computation. But once itâs done. You will see a new csv file contains all 12 months data in your directory.
After we create this new csv file, you can delete the previous code (if you want) and we will use this file to answer all of the problems.
Now we only use this code to read all of 12 months data.

Now, we will check NaN values. There are NaN values in our data. You could spot on of NaN value in our dataset. Now we need to clean up the data by dropping rows of NaN.
We isnull().sum() to check the NaN values. Now, weâre gonna remove it from our dataframe using .dropna() method.

dropna() method is successful, but we get a new issue here. There are values âOrâ and in our Quantity Ordered columns one rows has âQuantity Orderedâ value in our data.
We need to drop this âOrâ and âQuantity Orderedâ rows just simply change the equal sign (â==â) to not equal sign (â!=â).

the next task is to convert these columns to the correct type (âQuantity Orderedâ is integer and âPrice Eachâ is float). Weâre gonna use pd.to_numeric() method to convert them to numeric.

We will create âMonthâ and âSalesâ column. Now, are we ready to answer the question? Not yet, we need obviously one more column called âSalesâ Column. How can we get that? We get âSalesâ by multiplying âQuantity Orderedâ and âPrice Eachâ values. Letâs create it.


Question 1. What was the best month for Sales? How much was earned that month?
We can easily answer it by we using groupby(âMonthâ).sum() method.


Question 2. Which city has highest number of sales?
we will be adding one new âcityâ column to answer this question.


Question 3. What time should we display advertisements to increase customer likelihood of buying product?
If weâre gonna use our data to answer this question, we need to aggregate the period in 24 hours distribution. In âOrder Dateâ column, there are times data. We could extract it like we did before. But to make it more consistent, we need to convert the âOrder Dateâ Column into date time object. Weâre gonna use pd.to_datetime() method.
It will take a little bit longer because of the heavy calculation. Now we can create a new column called âHourâ contain the extraction of âOrder_Dateâ data. We only need the hours data, so we can extract them by doing this.
Now we can answer the third question, what time should we display advertisements to maximize likelihood of customerâs buying product? To answer this, weâre gonna group it by the hours and counting all of the orders.


We can see that âOrder IDâ indicate the transaction. So by grouping the product by the Order ID, we are able to know what products are often sold together. Weâre gonna use duplicated() method to find a duplicate values of âOrder IDâ.
Now we want to create a column called âGroupedâ that contain example Google Phone and Wired Headphone (transaction 17650) at the same line. Weâre gonna use the transform() method to join values from two rows into a single row.
Itâs good, but we have an issue here. We have the same order at least twice because we did merge them in every situation in groupby without dropping the duplicate values. Now letâs drop the rows with duplicate values.
Now, we need to count the pair of products. We need new libraries because they have all we need to count all the combination of products bundle. Weâre gonna use itertools and collections libraries.
Itâs too messy, letâs just showing the top 10 data

