Data Science and the responsibilities that come with it! Any guesses as to what I am trying to convey? You might say it's all about analysing data. Well, that's not quite the whole story.
Pause for a minute and imagine the amount of data being made available today. Is it in lakhs or millions? No, it is far more than that, and sorting data on such a large scale leads to high levels of complexity. Therefore, Data Cleaning is considered the first and foremost step towards laying a significant foundation in the field of Data Science.
Now, coming back to the main question: what is data cleaning all about? Does it mean removing unwanted data to make space for new information? No. So, what does the term “Data Cleaning” actually signify?
Data Cleaning mainly refers to the procedure of preparing data for analysis and analytical problems by transforming existing datasets, which may contain noisy data, duplications, distortions, missing values, or problematic, complex and inconsistent records. Its main goal is a consistent and systematized dataset that offers a simple, straightforward path to accurate results: all incorrect entries are cleaned and filtered out, since no organization would prefer to keep incomplete, unsound and unreliable data. That is why it is suggested to clean data before moving ahead with predicting results from a machine learning model.
Steps involved in Data Cleaning:
- Variable identification
- Univariate analysis
- Missing values treatment
- Outlier treatment
- Variable transformation
- Variable creation
- Variable Identification
This step refers to identifying the variables present in the dataset, classifying each one as an input (predictor) or output (target) variable, and noting its data type.
Example: Let’s say we want to predict how many people from the Nagpur district are going to cast their votes in the upcoming elections.
First of all, we’ll need to identify all the variables (input and output) and their data types as well.
Fig: Variables available in the dataset
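As a rough sketch of this step in pandas, here is how input and output variables and their data types might be identified. The column names and the tiny voter table below are made-up assumptions for illustration, not real election data:

```python
import pandas as pd

# Hypothetical voter dataset for the Nagpur example; all values are invented.
df = pd.DataFrame({
    "age": [23, 45, 31, 60],          # input, continuous
    "gender": ["M", "F", "F", "M"],   # input, categorical
    "ward": ["A", "B", "A", "C"],     # input, categorical
    "will_vote": [1, 0, 1, 1],        # output (target)
})

target = "will_vote"                              # the output variable
inputs = [c for c in df.columns if c != target]   # everything else is an input

# Identify the data type of each variable.
print(df.dtypes)
print("Inputs:", inputs, "| Target:", target)
```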
- Univariate Analysis
This step refers to inspecting all the variables present in the dataset individually, i.e. on a one-by-one basis. The method varies depending upon the type of the individual variable:
Continuous: Statistical methods (summary statistics such as mean, median and spread) and plots such as histograms or box plots are used for analysing these variables.
Categorical: A frequency table or a bar chart can be used in this case.
Fig: Univariate Analysis
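A minimal sketch of both cases in pandas, assuming a small made-up dataset with one continuous and one categorical variable:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 45, 31, 60, 38],               # continuous
    "gender": ["M", "F", "F", "M", "F"],       # categorical
})

# Continuous variable: summary statistics.
age_stats = df["age"].describe()   # count, mean, std, min, quartiles, max
print(age_stats)

# Categorical variable: frequency table (counts per category).
gender_freq = df["gender"].value_counts()
print(gender_freq)
```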
- Bi-Variate Analysis
When two variables are present in a dataset, the relationship between them is explored through Bi-Variate Analysis.
The main advantage is that Bi-Variate Analysis can be used for both
continuous and categorical variables.
Fig: Bi-Variate Analysis
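Two common bi-variate tools can be sketched in pandas: a correlation coefficient for a pair of continuous variables, and a cross-tab for a pair of categorical ones. The data below is an illustrative assumption:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 45, 31, 60, 38, 52],
    "income": [20, 50, 35, 70, 40, 60],   # in thousands; values invented
    "gender": ["M", "F", "F", "M", "F", "M"],
    "voted":  ["yes", "no", "yes", "yes", "no", "yes"],
})

# Continuous vs continuous: Pearson correlation coefficient.
corr = df["age"].corr(df["income"])
print("age-income correlation:", round(corr, 2))

# Categorical vs categorical: a two-way frequency (cross-tab) table.
ct = pd.crosstab(df["gender"], df["voted"])
print(ct)
```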
- Missing Values Treatment
When some important values are not available in the training dataset, the behaviour of and relationships between the other variables in the dataset cannot be determined reliably.
Therefore, Missing Values Treatment starts with analysing the total number of missing values, which can then be filled in or removed.
Fig: Analysing the no. of rows having missing data from a dataset.
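A sketch of counting missing values per column and per row in pandas, with one simple treatment (median imputation) shown; the dataset is a made-up example:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":  [23, np.nan, 31, 60, np.nan],
    "ward": ["A", "B", None, "A", "C"],
})

# Count missing values per column.
missing_per_col = df.isnull().sum()
print(missing_per_col)

# Count rows that contain at least one missing value.
rows_with_missing = df.isnull().any(axis=1).sum()
print("rows with missing data:", rows_with_missing)

# One simple treatment: fill continuous gaps with the column median.
df["age"] = df["age"].fillna(df["age"].median())
```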
- Outlier Treatment
Outliers can lead to wrong estimations. Hence, keeping an eye on them is of prime importance: with outliers present, the pattern of the variables in a dataset can easily change, resulting in misleading patterns throughout.
Types of Outliers:
Univariate: These outliers are found when the distribution of a single variable is considered on its own.
Multi-variate: These outliers appear across several dimensions at once, in combinations of variables. The joint distribution pattern needs to be understood in order to find multivariate outliers.
Fig: Univariate v/s Multi-variate Outliers
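One common univariate rule for spotting outliers is the 1.5 × IQR criterion: flag any point lying more than 1.5 interquartile ranges beyond the quartiles. A sketch in pandas, with illustrative numbers:

```python
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 98])   # 98 is an obvious outlier

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())
```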
- Variable Transformation
This process refers to replacing a variable with a function of itself, such as its logarithm or square root.
When does this process take place?
When the values of certain variables need to be changed for the user’s needs and understanding. Variables in a dataset are often present on different scales, and transformation changes the scale of a variable or the shape of its distribution without altering the information it carries.
Transformation also allows for converting non-linear relations to linear relations. Thus, complexity decreases and analysing the information becomes easier.
Fig: Variable Transformation
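As an example, a log transformation reshapes a right-skewed variable into a more symmetric one. A sketch with invented income figures:

```python
import numpy as np
import pandas as pd

# A right-skewed variable (e.g. income); values are illustrative only.
income = pd.Series([12, 15, 18, 25, 40, 90, 400])

# The log transformation compresses the long right tail.
log_income = np.log(income)

# Skewness drops noticeably after the transform.
print("skew before:", round(income.skew(), 2))
print("skew after: ", round(log_income.skew(), 2))
```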
- Variable Creation
The process of generating new variables from the set of already available variables is termed Variable Creation.
Creating derived variables: New variables are created by applying different types of functions to the existing ones.
Creating dummy variables: Used to convert a categorical variable into numeric variables. Each dummy variable is an indicator of one category and can take only two values: 0 or 1.
Fig: Variable Creation method for creating the adjustment report
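Both kinds of variable creation can be sketched in pandas; the column names and bins below are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [23, 45, 31, 60],
    "ward": ["A", "B", "A", "C"],
})

# Derived variable: a new column computed from an existing one.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                         labels=["young", "middle", "senior"])

# Dummy variables: each ward category becomes a 0/1 indicator column.
dummies = pd.get_dummies(df["ward"], prefix="ward", dtype=int)
df = df.join(dummies)
print(df)
```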