Handle your missing data like a pro!
I’m sure you’ve heard that data scientists spend around 70% of their time in data cleaning. Basically, this means dealing with duplicated data, misspellings, transforming the data, looking for outliers, checking whether you have up to date data among others.
One very common feature of data cleaning that you will come across is Missing Values. This blog will look into what to do when you have missing rows of data.
The knowledge of how to deal with missing values is important, as using the wrong approach when you come across missing values can potentially lead to wrong insights or skew the results of your analysis.
Over the next few blogs, I will be writing about dealing with missing data because it will be a very long post if I put that into one blog post.
To start, this will be about one of the methods: deleting the records. (with example code.)
- DROPPING/DELETING THE MISSING ROWS
This is perhaps the most simple way of dealing with missing data. Basically, if you dont have the data, then why bother? The advantage here is, this approach is simple and straightforward.
When to use this method
- When you have a large data set — When your data set has millions and millions of rows of data, deleting a few rows with blank features will not really make a dent. However, if the data set is small, this is not advisable as that row represents a very important piece of insight.
- When most of the records are missing — Sometimes, you will have a situation where you have a lot of information missing. For example if you have most entries missing. If that is the case, it is then very difficult to
How Its done
Deleting rows/columns from your data set is done using the dropna() method. Lets see how that’s done.
- To begin with, I have imported some dummy data with some NaN/Null values.
- The 2nd last row has no data on the age, the city of residence, the weight and diabetes information. This presents once case in which it will make sense to drop that row of data.
Here, we can use the dropna() method to drop the rows with null values. Lets see what happens.
If you are keen, you will notice that both the rows with the ID 3 & 8 have been dropped. This is because the method dropna() by default removes all the rows that have any NaN/Null values in any of the entries. Basically, if the row has any null value at any data point, the row will be deleted.
This is because, by default, the dropna() method has a parameter how=’any’. So, by default it deletes all rows with the any missing data at any point. If you want to view the different parameters, you can do so here.
However, it is not always ideal to delete all the rows with any missing data. Say the most important information you need is the weight and diabetes information. And you need to set up your data such that if any of these pieces of information is missing, you can drop that row. This is where you will use another parameter in the dropna() method called the subset.
Here is how it works.
What the subset parameter above does is to drop the rows with any null values in the specified columns. As you can see, the row with ID 3 has a NaN value in the City of residence column but it is not deleted. So, if you want to drop the rows that are missing data in a specific column, use the subset parameter.
Also, if you want to drop a column, you just need to change the axis = 1 or axis=’columns’.
Finally, you have to realize that while you are doing all this, the original document is not overwritten. All it does is to provide a temporary copy of the dataframe. This is what it looks like if you check the data frame.
If you want to overwrite the original document, you’ll need to change a different parameter called the inplace .
By default, it is set to false. You can change that to true in order to overwrite the original document with the changes that you are making.
However, with the inplace=True parameter, the data is overwritten as seen below.