DATA CLEANING

Data Science and the responsibilities associated with it! Any guesses as to what
I am trying to convey? Many would say it is all about analysing data. Well, that
is not quite the whole story.

Pause for a minute and imagine the amount of data being made available today.
Is it in lakhs or millions? No, it is far more than that, and sorting data on
such a scale leads to high levels of complexity. That is why Data Cleaning is
considered the first and foremost step towards laying a solid foundation in the
field of data science.

Now, coming back to the main question: what is data cleaning all about? Does it
mean removing unwanted data to make space for new information? No, it does not.
So what does the term “Data Cleaning” actually signify?

Data Cleaning is the procedure of preparing data for analysis by transforming
existing datasets that may contain noisy data, duplications, distortion,
missing values, or complex and inconsistent records. Its main goal is to
produce uniform, well-organized datasets that yield accurate results in a
simple and straightforward way: incorrect entries must be cleaned and filtered
out, since no organization wants to work with incomplete, unsound and
unreliable data. That is why it is recommended to clean data before moving
ahead with predicting results from a model.

Steps involved in Data Cleaning:

- Variable identification
- Univariate analysis
- Bi-variate
- Missing values treatment
- Outlier treatment
- Variable transformation
- Variable creation
Variable Identification

This step refers to identifying each variable that stores values in the
dataset, and classifying it as an input (predictor) or an output (target)
variable. Accordingly, the data type of each variable is also identified.

Example: Suppose we want to predict how many people from the Nagpur district
are going to cast their votes in the upcoming elections.

First of all, we need to identify all the variables (input and output) and
their data types as well.
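
The identification step above can be sketched with pandas. The column names and sample values below are assumptions made for illustration, not taken from a real voter dataset:

```python
import pandas as pd

# A small, hypothetical sample of the voter dataset described above.
df = pd.DataFrame({
    "age":       [34, 58, 23, 45],          # input, continuous
    "gender":    ["M", "F", "F", "M"],      # input, categorical
    "ward":      ["A", "B", "A", "C"],      # input, categorical
    "will_vote": [1, 0, 1, 1],              # output (target)
})

# Identify each variable's data type.
print(df.dtypes)

# Separate the input (predictor) variables from the output (target).
target = "will_vote"
inputs = [c for c in df.columns if c != target]
print("Inputs:", inputs)
print("Output:", target)
```

In practice this is the moment to verify that each column's inferred dtype matches its intended role, e.g. that a categorical code was not silently read as a number.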

Fig: Variables available in the dataset

Univariate Analysis

This step refers to inspecting all the variables present in the dataset
individually, i.e. one by one. The methods vary depending on the type of the
individual variable, which can be either:

Continuous: summarized with statistical measures (mean, median, standard
deviation) and visualized with histograms or box plots.

Categorical: summarized with a frequency table and visualized with a bar
chart.
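
A minimal sketch of both kinds of univariate analysis, using hypothetical columns (`age` continuous, `ward` categorical):

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [34, 58, 23, 45, 61, 29],        # continuous
    "ward": ["A", "B", "A", "C", "B", "A"],  # categorical
})

# Continuous variable: summary statistics (count, mean, std, quartiles).
age_stats = df["age"].describe()
print(age_stats)

# Categorical variable: frequency table (a bar chart would plot these counts).
ward_freq = df["ward"].value_counts()
print(ward_freq)
```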

Fig: Univariate Analysis

Bi-Variate Analysis

When two or more variables are present in a dataset, bi-variate analysis
examines how they relate to each other, i.e. it studies the relationship
between a pair of variables.

Its main advantage is that bi-variate analysis can be applied to both
continuous and categorical variables.
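
A small sketch of two common bi-variate techniques, again on hypothetical columns: a Pearson correlation for a continuous–continuous pair, and a cross-tabulation for a categorical–categorical pair:

```python
import pandas as pd

df = pd.DataFrame({
    "age":       [34, 58, 23, 45, 61, 29],
    "income":    [40, 75, 28, 55, 80, 35],   # hypothetical, in thousands
    "ward":      ["A", "B", "A", "C", "B", "A"],
    "will_vote": [1, 0, 1, 1, 0, 1],
})

# Continuous vs continuous: Pearson correlation coefficient.
corr = df["age"].corr(df["income"])
print(f"age/income correlation: {corr:.2f}")

# Categorical vs categorical: a cross-tabulation of counts.
table = pd.crosstab(df["ward"], df["will_vote"])
print(table)
```

A scatter plot of `age` against `income` would show the same relationship visually.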

Fig: Bi-Variate Analysis

Missing Values Treatment

When important values are missing from the training dataset, the relationships
between the other values and variables cannot be analysed appropriately.

Missing-value treatment therefore starts by counting the missing values, after
which they can be filled in or the affected rows removed.
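
Counting and then filling missing values can be sketched as follows; mean imputation for the numeric column and mode imputation for the categorical one are just two common choices among several (another is simply dropping the affected rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [34, np.nan, 23, 45, np.nan],
    "ward": ["A", "B", None, "C", "B"],
})

# Count the missing values in each variable.
missing = df.isna().sum()
print(missing)

# Fill numeric gaps with the mean, categorical gaps with the mode.
df["age"] = df["age"].fillna(df["age"].mean())
df["ward"] = df["ward"].fillna(df["ward"].mode()[0])
print(df)
```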

Fig: Analysing the number of rows having missing data in a dataset

Outlier Treatment

Outliers can lead to wrong estimates, so keeping an eye on them is of prime
importance: with outliers present, the pattern of the variables in a dataset
can easily change, producing misleading results.

Types of Outliers:

Univariate: found when the distribution of a single variable is considered on
its own.

Multi-variate: found in the joint, multi-dimensional distribution of several
variables. The distribution pattern needs to be understood in order to find
multi-variate outliers.
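
Univariate outlier detection is often done with the inter-quartile-range (IQR) rule, sketched below on a hypothetical series; multi-variate outliers would need a joint measure such as the Mahalanobis distance instead:

```python
import pandas as pd

# Incomes (in thousands) with one obvious univariate outlier: 1000.
s = pd.Series([40, 42, 38, 45, 41, 39, 1000])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers)
```

Once flagged, an outlier can be removed, capped at the fence values, or investigated further before any decision is made.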

Fig: Univariate v/s Multi-variate Outliers

Variable Transformation

This process refers to re-expressing a variable through a function of its
values.

When does this process take place? When the values of certain variables are
changed to suit the user’s needs and understanding. Variables in a dataset
often sit on very different scales, and a transformation (for example a
logarithm or square root) re-expresses a variable on a more convenient scale.

Transformation also allows non-linear relations to be converted into linear
relations. Thus, complexity decreases and analysing the information becomes
simpler.
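
The "non-linear to linear" point can be demonstrated with a synthetic example: for an exponential relationship, the correlation of the raw values is below 1, but after a log transformation of `y` the relationship becomes exactly linear:

```python
import numpy as np

# A non-linear (exponential) relationship: y = 2 * exp(0.5 * x).
x = np.arange(1, 9, dtype=float)
y = 2.0 * np.exp(0.5 * x)

# Correlation with the raw values is below 1 because the relation is curved.
raw_corr = np.corrcoef(x, y)[0, 1]

# A log transformation of y makes the relationship exactly linear:
# log(y) = log(2) + 0.5 * x.
log_corr = np.corrcoef(x, np.log(y))[0, 1]

print(f"corr(x, y)     = {raw_corr:.3f}")
print(f"corr(x, log y) = {log_corr:.3f}")
```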

Fig: Variable Transformation

Variable Creation

The process of generating new variables from the set of already available
variables is termed Variable Creation.

Methods available:

Creating derived variables: new variables are created by applying different
types of functions to existing ones.

Creating dummy variables: used to convert a categorical variable into numeric
variables. Each dummy variable represents one category of the predictor and
can take only two values: 0 or 1.
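
Both methods can be sketched with pandas on hypothetical columns: a derived variable computed by a function of `age`, and dummy (0/1) variables generated from the categorical column `ward`:

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [34, 58, 23, 45],
    "ward": ["A", "B", "A", "C"],
})

# Derived variable: apply a function to an existing column.
df["age_decade"] = df["age"] // 10

# Dummy variables: one 0/1 column per category of 'ward'.
dummies = pd.get_dummies(df["ward"], prefix="ward", dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df)
```

Each row has exactly one 1 among the `ward_*` columns, marking its category.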

Fig: Variable Creation method for creating the adjustment report