Pandas: Delete rows where column value is null
Data preprocessing is a crucial part of data analysis, and there are several methods for handling missing data. How effectively you handle it can determine the accuracy of your model.
I enjoy using the sklearn-pandas library when possible, as DataFrameMapper bridges the gap between scikit-learn's transformers and pandas data structures. Here, we will use it to avoid imputation on a particular column and instead remove rows with missing values in that column from the dataset.
Do note that whether to remove data or substitute values should be decided on a case-by-case basis; the factors involved in that decision are beyond the scope of this tutorial.
Deleting rows where column[xxx] value is null
I received the following question regarding working on a dataset in pandas: "Hello, I'm using the sklearn-pandas.DataFrameMapper to preprocess my data. I prefer not to impute values for a particular column. Instead, I want to exclude any rows where the column has a null value. How would I achieve this?"
I will answer this in a beginner-friendly way with in-depth comments explaining each line of code. If you're more advanced, you're totally free to ignore the comments.
For the sake of this tutorial, let df be the name of your original data frame, df2 a temporary data frame, df_clean your filtered dataset, final_df your final dataset, and xxx the name of the column from which you would like to find and exclude empty or null values.
To start off, we will need to use pandas, as DataFrameMapper alone does not support this.
What does "impute" mean?
In data preprocessing, imputation refers to the replacement of null or missing values with substituted values. Often, you might substitute a missing or null value with the column's mean, median, or mode. However, depending on the dataset, you might instead want to throw out rows containing missing or null values in a particular column, as shown in this tutorial.
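To make the contrast concrete, here is a minimal sketch of mean imputation in plain pandas, using a hypothetical dummy data frame (the column name xxx follows the convention used throughout this tutorial):
import pandas as pd
import numpy as np
# hypothetical dummy data: one numeric column with a missing value
df = pd.DataFrame({'xxx': [1.0, np.nan, 3.0]})
# mean imputation: replace the NaN with the column mean (here, 2.0)
df_imputed = df.fillna({'xxx': df['xxx'].mean()})
This tutorial takes the other route: rather than filling the gap, we discard the row entirely.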
Filtering before removing null/empty values
This method offers a straightforward, easy-to-understand solution that is best used on smaller datasets. On larger datasets containing numerous null values, it performs unnecessary computations on rows that will eventually be discarded. This is inefficient, as processing power is wasted on data that won't contribute to further analysis.
Keeping Track of Variables
| Variable | Definition |
|---|---|
| df | the original dataframe |
| df2 | temporary dataframe |
| df_clean | filtered dataset |
| final_df | final dataset |
| xxx | column to filter |
It's a good learning opportunity to take a look at and discuss this code anyway:
df2 = mapper.fit_transform(df)
# this is where the transformation takes place; mapper is the
# DataFrameMapper instance defined in "Transforming the data" below.
# you can rename df2 to df if you're confident enough in your
# work to overwrite the original df dataframe
df2_filtered = df2[~df2['xxx'].isnull()]
# here we iterate over the entire dataset AGAIN to remove
# rows that contain null or empty values in column xxx
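Putting the two steps together, here is a minimal, self-contained sketch of this transform-then-filter approach; the dummy data frame is an assumption for illustration. Note that recent versions of StandardScaler disregard NaNs when fitting and pass them through when transforming, which is why the null rows survive all the way to the filtering step:
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
# hypothetical dummy data with a null value in column xxx
df = pd.DataFrame({'xxx': [10.0, np.nan, 30.0, 50.0]})
mapper = DataFrameMapper([
    (['xxx'], StandardScaler()),  # the list selector gives the scaler a 2-D input
], df_out=True)
df2 = mapper.fit_transform(df)            # first pass: scales every row, nulls included
df2_filtered = df2[~df2['xxx'].isnull()]  # second pass: drops the null rows
print(df2_filtered)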
Removing null/empty values
To more efficiently exclude rows where column xxx has a null value, simply do the exclusion first. After completing the exclusion step, your dataset may be considerably smaller, which allows the transformation process to iterate over the dataset much more quickly.
import pandas as pd
df_clean = df.dropna(subset=['xxx'])
# here we're dropping all rows where column xxx
# has a null value and storing it in a new data frame
# called df_clean
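As a quick sanity check, here is a small sketch of what dropna(subset=[...]) does on hypothetical dummy data: only rows with a null in xxx are removed, while nulls in other columns are left alone:
import pandas as pd
import numpy as np
# hypothetical dummy data with nulls in two different columns
df = pd.DataFrame({'xxx': [1.0, np.nan, 3.0],
                   'yyy': [np.nan, 5.0, 6.0]})
df_clean = df.dropna(subset=['xxx'])
print(df_clean)  # keeps rows 0 and 2; row 0 still has a NaN in yyy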
Transforming the data
Now let's import DataFrameMapper from sklearn_pandas and StandardScaler from sklearn.preprocessing, map the data, and create our final dataframe.
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
#importing libraries
mapper = DataFrameMapper([
    (['xxx'], StandardScaler()),
], df_out=True)
# Initialize DataFrameMapper with the necessary tuples.
# Here we apply StandardScaler to the xxx column; using a list
# selector (['xxx']) hands the scaler the 2-D input it requires
final_df = mapper.fit_transform(df_clean.copy())
# df_out=True tells the mapper to output a dataframe, and
# final_df is the name of that frame; note we transform df_clean,
# the dataset with the null rows already removed
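For completeness, here is a minimal end-to-end sketch of the whole recipe on hypothetical dummy data: drop the null rows first, then fit and transform only what remains:
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
# hypothetical dummy data; column xxx contains one null value
df = pd.DataFrame({'xxx': [10.0, np.nan, 30.0, 50.0]})
# step 1: exclude rows where xxx is null
df_clean = df.dropna(subset=['xxx'])
# step 2: standardize only the surviving rows
mapper = DataFrameMapper([
    (['xxx'], StandardScaler()),
], df_out=True)
final_df = mapper.fit_transform(df_clean.copy())
print(final_df)  # three standardized rows; the null row is gone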
Model building depends heavily on effective data preprocessing. Data scientists often find that the larger part of their job is simply cleaning data.
DataFrameMapper is just one tool in your toolbelt for achieving this. Don't be afraid to experiment and practice on a dummy set in order to learn new preprocessing strategies.