{"id":459,"date":"2023-06-10T01:28:09","date_gmt":"2023-06-09T20:28:09","guid":{"rendered":"https:\/\/immadshahid.com\/?p=459"},"modified":"2023-06-10T03:00:21","modified_gmt":"2023-06-09T22:00:21","slug":"data-cleaning-in-python","status":"publish","type":"post","link":"https:\/\/immadshahid.com\/blog\/data-cleaning-in-python\/","title":{"rendered":"Data Cleaning in Python"},"content":{"rendered":"\n<p>Data Cleaning is one of the most essential tasks while handling datasets. Often, the data is messy: jumbled up, incomplete, duplicated, or simply not useful. Data needs to be cleaned before proceeding to the next task, Machine Learning. Machine Learning requires clean, consistent data in order to work well. So, in this tutorial, we&#8217;ll be looking at the ways to clean data in Python using the pandas library.<\/p>\n\n\n\n<p>We start by importing the pandas library, i.e. <code>import pandas as pd<\/code>.<\/p>\n\n\n\n<p>Now we move on. \ud83d\ude42<\/p>\n\n\n\n<p>First of all, we import the dataset:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img fetchpriority=\"high\" decoding=\"async\" src=\"https:\/\/immadshahid.com\/wp-content\/uploads\/2023\/06\/image-15-1024x524.png\" alt=\"\" class=\"wp-image-460\" width=\"598\" height=\"305\" srcset=\"https:\/\/immadshahid.com\/blog\/wp-content\/uploads\/2023\/06\/image-15-1024x524.png 1024w, https:\/\/immadshahid.com\/blog\/wp-content\/uploads\/2023\/06\/image-15-300x154.png 300w, https:\/\/immadshahid.com\/blog\/wp-content\/uploads\/2023\/06\/image-15-768x393.png 768w, https:\/\/immadshahid.com\/blog\/wp-content\/uploads\/2023\/06\/image-15.png 1045w\" sizes=\"(max-width: 598px) 100vw, 598px\" \/><\/figure>\n<\/div>\n\n\n<p>We took a dataset of customers, which has the following columns:<\/p>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Name<\/li>\n\n\n\n<li>Age<\/li>\n\n\n\n<li>Gender<\/li>\n\n\n\n<li>Education<\/li>\n\n\n\n<li>Country<\/li>\n\n\n\n<li>Income<\/li>\n\n\n\n<li>Purchase frequency<\/li>\n\n\n\n<li>Spending<\/li>\n<\/ul>\n\n\n\n<p>Now we move to our next tasks.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Handling Null Values<\/h2>\n\n\n\n<p>To check whether our dataset contains Null (NaN) values, we can use the function <code>dataframe.isnull()<\/code>. It returns a table of the same shape as the dataset, showing True wherever a value is NaN and False everywhere else, as shown below:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" src=\"https:\/\/immadshahid.com\/wp-content\/uploads\/2023\/06\/image-16.png\" alt=\"\" class=\"wp-image-461\" width=\"462\" height=\"281\" srcset=\"https:\/\/immadshahid.com\/blog\/wp-content\/uploads\/2023\/06\/image-16.png 850w, https:\/\/immadshahid.com\/blog\/wp-content\/uploads\/2023\/06\/image-16-300x183.png 300w, https:\/\/immadshahid.com\/blog\/wp-content\/uploads\/2023\/06\/image-16-768x469.png 768w\" sizes=\"(max-width: 462px) 100vw, 462px\" \/><\/figure>\n<\/div>\n\n\n<p>In the example above, the output shows False throughout because there are no NaN values in the dataset. If there were any, we could fill them by using the <code>dataframe.fillna()<\/code> function. You can refer to this in the <a href=\"https:\/\/immadshahid.com\/pandas-cheat-sheet\/\" data-type=\"URL\" data-id=\"https:\/\/immadshahid.com\/pandas-cheat-sheet\/\">Pandas Cheatsheet<\/a> in our blog. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Checking the Datatypes of the Elements in the Dataset<\/h2>\n\n\n\n<p>Next, we check the datatypes of the elements in the dataset, which is particularly important for our Machine Learning tasks. 
Below are the datatypes of the elements of the dataset, found by using the attribute <code>data.dtypes<\/code>:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"352\" height=\"250\" src=\"https:\/\/immadshahid.com\/wp-content\/uploads\/2023\/06\/image-17.png\" alt=\"\" class=\"wp-image-462\" srcset=\"https:\/\/immadshahid.com\/blog\/wp-content\/uploads\/2023\/06\/image-17.png 352w, https:\/\/immadshahid.com\/blog\/wp-content\/uploads\/2023\/06\/image-17-300x213.png 300w\" sizes=\"(max-width: 352px) 100vw, 352px\" \/><\/figure>\n<\/div>\n\n\n<p>It is important to note that most Machine Learning models expect numeric columns, such as the int64 or float64 datatypes, and will usually raise an error when given string values. For this reason, we sometimes need to change a column&#8217;s datatype or encode it before using it in our Machine Learning models. Stray white space in string values should also be removed, as it can cause errors later on.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Finding Duplicates<\/h2>\n\n\n\n<p>Another part of data cleaning is identifying duplicate rows, which can then be removed from the dataset with <code>dataframe.drop_duplicates()<\/code>. To find the duplicates, we use the function <code>dataframe.duplicated()<\/code>.<\/p>\n\n\n\n<p>In the output, True marks a row that duplicates an earlier row, and False marks a row that does not.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Removing Columns<\/h2>\n\n\n\n<p>If a column is not needed, we can remove it by using the function <code>dataframe.drop(to_drop, inplace=True, axis=1)<\/code>, where <code>to_drop<\/code> is the column name, or list of column names, to remove.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Detecting the Outliers<\/h2>\n\n\n\n<p>Numbers that considerably deviate from the statistical mean are considered outliers. 
Put simply, without the statistical jargon, they are data points so far out of range that they are probably misreads.<\/p>\n\n\n\n<p>They should be removed, just like duplicates. Let&#8217;s pull up our dataset and look for outliers.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/immadshahid.com\/wp-content\/uploads\/2023\/06\/image-18.png\" alt=\"\" class=\"wp-image-463\" width=\"388\" height=\"269\" srcset=\"https:\/\/immadshahid.com\/blog\/wp-content\/uploads\/2023\/06\/image-18.png 517w, https:\/\/immadshahid.com\/blog\/wp-content\/uploads\/2023\/06\/image-18-300x208.png 300w\" sizes=\"(max-width: 388px) 100vw, 388px\" \/><\/figure>\n<\/div>\n\n\n<p>Here we described the &#8216;spending&#8217; column of the dataset, which gives the above result. The min is <code>5020.425000<\/code> and the max is <code>25546.500000<\/code>, which suggests there are no extreme values and that the dataset is balanced enough to move on. Where outliers do exist, we need to identify and remove them; otherwise, Predictive Analysis can give inaccurate results. We can also check for outliers by plotting a box and whisker diagram using the Seaborn data visualization library.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Wind-up<\/h2>\n\n\n\n<p>The techniques above are just some basic techniques for data cleaning. Other techniques depend on the type of dataset. If the dataset is balanced with no outliers, as it was in our example, we can move on; otherwise, we should handle the dataset accordingly by dealing with missing values, duplicate values, and any outliers. 
Other steps include making the datatypes consistent, along with anything else needed to prepare the data for later activities such as Machine Learning and Predictive Analysis. I hope you gained some insight into Data Cleaning and why it is important for Machine Learning and related tasks. <\/p>\n\n\n\n<p>I hope you find this article helpful. Please give your feedback in the comments below, or through my email immadshahid@gmail.com.<\/p>\n\n\n\n<p>Reach out to me on Facebook, Instagram, Twitter and LinkedIn for more updates.<\/p>\n\n\n\n<p>\ud83d\ude42 <\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data Cleaning is one of the most essential tasks while handling datasets. Often, the data&hellip;<\/p>\n","protected":false},"author":1,"featured_media":465,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[130,124,143,128,138,92,54,139,141,21,121,136,25,129,125,127,137,122,123],"tags":[142,144,131,126],"class_list":["post-459","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cheat-sheet","category-data-analytics","category-data-clean9ing","category-data-science","category-data-visualization","category-earning-and-learning","category-education","category-graphs","category-histograms","category-internet","category-machine-learning","category-matplotlib","category-media","category-pandas","category-predictive-analysis","category-python","category-seaborn","category-sklearn","category-training-a-dataset","tag-data-cleaning","tag-machine-learning","tag-pandas","tag-python"],"_links":{"self":[{"href":"https:\/\/immadshahid.com\/blog\/wp-json\/wp\/v2\/posts\/459","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/immadshahid.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/immadshahid.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"
https:\/\/immadshahid.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/immadshahid.com\/blog\/wp-json\/wp\/v2\/comments?post=459"}],"version-history":[{"count":1,"href":"https:\/\/immadshahid.com\/blog\/wp-json\/wp\/v2\/posts\/459\/revisions"}],"predecessor-version":[{"id":464,"href":"https:\/\/immadshahid.com\/blog\/wp-json\/wp\/v2\/posts\/459\/revisions\/464"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/immadshahid.com\/blog\/wp-json\/wp\/v2\/media\/465"}],"wp:attachment":[{"href":"https:\/\/immadshahid.com\/blog\/wp-json\/wp\/v2\/media?parent=459"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/immadshahid.com\/blog\/wp-json\/wp\/v2\/categories?post=459"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/immadshahid.com\/blog\/wp-json\/wp\/v2\/tags?post=459"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}