Data Cleaning in Python

Data cleaning is one of the most essential tasks when handling datasets. Raw data is often messy: jumbled, incomplete, duplicated, or full of values that are of no use. It needs to be cleaned before moving on to the next task, such as Machine Learning, because most models expect consistent, complete input. So, in this tutorial, we’ll look at ways to clean data in Python using the pandas library.

The pandas library is imported in the usual way, i.e. import pandas as pd

Now let’s move on. 🙂

First of all, we import the dataset.

We use a dataset of customers with the following columns:

  • Name
  • Age
  • Gender
  • Education
  • Country
  • Income
  • Purchase frequency
  • Spending
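As a minimal sketch of this step, the snippet below reads a tiny stand-in table with the columns listed above. The inline CSV text, the sample rows and the variable name data are assumptions made up for illustration; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real customer file (assumption).
csv_text = """Name,Age,Gender,Education,Country,Income,Purchase frequency,Spending
Alice,34,Female,Bachelor,USA,52000,0.4,7200.5
Bob,45,Male,Master,UK,61000,0.7,9100.0
Chen,29,Male,PhD,China,48000,0.2,5020.425
"""

# read_csv accepts a file path or any file-like object such as StringIO.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # first rows of the table
print(data.shape)    # (number of rows, number of columns)
```

Looking at head() and shape right after loading is a quick way to confirm that the columns were parsed as expected.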

Now we move to our next tasks.

Handling Null Values

If the dataset contains null (NaN) values, we can check for them with the function dataframe.isnull(). It returns a table of the same shape as the dataset, in which each cell is True if the value is missing and False otherwise.

In our dataset, every cell shows False because there are no NaN values in it. If there were any, we could fill them using the dataframe.fillna() function. You can refer to this in the Pandas Cheatsheet on our blog.
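Since our dataset has no missing values, here is a small sketch on a made-up table that does. The column names, the sample values and the choice of the median as the fill value are all assumptions for illustration; the right fill strategy depends on the data.

```python
import numpy as np
import pandas as pd

# Made-up table with two missing values (assumption for illustration).
data = pd.DataFrame({
    "name": ["Alice", "Bob", "Chen"],
    "age": [34, np.nan, 29],
    "income": [52000.0, 61000.0, np.nan],
})

# Boolean table: True where a value is missing.
print(data.isnull())

# Summing the booleans gives the number of missing values per column.
missing_per_column = data.isnull().sum()
print(missing_per_column)

# Fill each numeric column with its own median (one possible strategy).
filled = data.fillna({
    "age": data["age"].median(),
    "income": data["income"].median(),
})
print(filled)
```

Counting missing values per column with isnull().sum() is usually easier to read than the full True/False table once the dataset has more than a few rows.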

Checking the Datatypes of the Elements in the Dataset

Next, we check the datatypes of the columns in the dataset, which matters particularly for our Machine Learning tasks. The datatype of each column is given by the attribute data.dtypes

It is important to note that most Machine Learning models need numeric columns, such as int64 or float64, because they do not accept string values and raise an error when given them. For this reason, we sometimes need to convert a column’s datatype, or encode it as numbers, before it can be used in a model. Leading and trailing white space in string values should also be removed, since it can cause errors or make identical values look different.
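The following sketch shows one way these three fixes could look: stripping white space, converting text to numbers, and encoding a text column as integer codes. The table is made up, and category codes are just one simple encoding among several; neither is taken from the original dataset.

```python
import pandas as pd

# Made-up table: stray spaces in "gender", numbers stored as text in "income".
data = pd.DataFrame({
    "gender": [" Male", "Female ", "Male"],
    "income": ["52000", "61000", "48000"],
})
print(data.dtypes)  # both columns start out as text

# Remove leading and trailing white space from the string column.
data["gender"] = data["gender"].str.strip()

# Convert the text column to a numeric datatype.
data["income"] = pd.to_numeric(data["income"])

# Encode the text column as integer codes (categories are sorted alphabetically).
data["gender_code"] = data["gender"].astype("category").cat.codes

print(data.dtypes)
print(data)
```

Stripping white space before encoding matters here: without it, " Male" and "Male" would be treated as two different categories.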

Finding Duplicates

Another part of data cleaning is identifying duplicate rows, which can then be removed from the dataset. To find duplicates, we use the function dataframe.duplicated().

The output shows True for each row that repeats an earlier row, and False otherwise.
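Here is a short sketch of finding and then removing duplicates, using a made-up table in which the third row repeats the first. The sample data and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up table: the last row is an exact copy of the first.
data = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice"],
    "age": [34, 45, 34],
})

# True for every row that repeats an earlier row.
dup_mask = data.duplicated()
print(dup_mask)

# Number of duplicate rows.
n_duplicates = int(dup_mask.sum())
print(n_duplicates)

# Keep the first occurrence of each row and renumber the index.
deduped = data.drop_duplicates().reset_index(drop=True)
print(deduped)
```

By default both duplicated() and drop_duplicates() treat the first occurrence as the original and compare all columns; the subset and keep parameters change that behaviour.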

Removing Columns

If a column is not needed, we can remove it using the function dataframe.drop(to_drop, inplace=True, axis=1), where to_drop is the name of the column, or a list of names, to remove.
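As a small sketch, the snippet below drops one unneeded column from a made-up table. The column names and the contents of to_drop are assumptions; the columns= keyword used here is equivalent to passing axis=1.

```python
import pandas as pd

# Made-up table with an extra "notes" column we do not need.
data = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [34, 45],
    "notes": ["n/a", "n/a"],
})

# Columns to remove; a list allows dropping several at once.
to_drop = ["notes"]

# Returns a new DataFrame without the listed columns.
data = data.drop(columns=to_drop)
print(data.columns.tolist())
```

Assigning the result back, instead of using inplace=True, keeps the step explicit and makes it easy to chain with other operations.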

Detecting the Outliers

Values that deviate considerably from the rest of the data are considered outliers. Put without the science jargon, they are data points so far out of range that they are probably misreads.

Like duplicates, they should generally be removed. Let’s pull up our dataset and look for an outlier.

Here we described the ‘spending’ column of the dataset to get its summary statistics. We can see that min is 5020.425000 and max is 25546.500000, which suggests there are no extreme values and the dataset is balanced enough to move on. Where outliers do exist, they need to be identified and removed to get better results in Predictive Analysis; otherwise, the results can be inaccurate. We can also check for outliers by plotting a box and whisker diagram using the Seaborn data visualization library.
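To show what this could look like when an outlier is present, here is a sketch on a made-up ‘spending’ column with one extreme value. It flags outliers with the 1.5 × IQR rule, which is the same rule a box and whisker diagram uses by default for its whiskers; that rule, the sample values and the variable names are assumptions, not the method used on the original dataset.

```python
import pandas as pd

# Made-up spending values; the last one is deliberately extreme.
data = pd.DataFrame({
    "spending": [5020.0, 7200.0, 9100.0, 11050.0, 12400.0, 98000.0],
})

# Summary statistics: count, mean, std, min, quartiles, max.
print(data["spending"].describe())

# Interquartile range (IQR) and the usual 1.5 * IQR fences.
q1 = data["spending"].quantile(0.25)
q3 = data["spending"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Rows outside the fences are treated as outliers.
is_outlier = (data["spending"] < lower) | (data["spending"] > upper)
outliers = data[is_outlier]
cleaned = data[~is_outlier]

print(outliers)
print(cleaned)
```

If Seaborn is installed, sns.boxplot(x=data["spending"]) draws the matching box and whisker diagram, with the flagged value appearing as a separate point beyond the whisker.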

Wind-up

The techniques above are only the basics of data cleaning; others depend on the type of dataset. If the dataset is balanced with no outliers, as it was in our example, we can move on. Otherwise, we should handle it accordingly by dealing with any missing values, duplicate values or outliers. Other steps include making datatypes consistent and anything else needed to make the data smooth enough for future activities such as Machine Learning and Predictive Analysis. I hope you got an insight into data cleaning and why it is important for Machine Learning and related tasks.

I hope you find this article helpful. Please give your feedback in the comments below, or through my email immadshahid@gmail.com.

Reach out to me on Facebook, Instagram, Twitter and LinkedIn for more updates.

🙂
