DATA CLEANING
Data Science and the responsibilities associated with it! Any guesses as to what
I am trying to convey? You might say it’s all about analysing data. Well! That’s
not the whole story.
Pause for a minute or so and start imagining the amount of data which is being
made available today. Is it in lakhs or millions of records? No, it isn’t. It
is far more than that, and sorting such data on a large scale leads to high levels
of complexity. Therefore, Data Cleaning is considered the first and foremost step
towards laying down a significant foundation in the field of data science.
Now, coming back to the main question: what is data cleaning all about?
Does it mean removing unwanted data to make space for new information? No is
the answer. So, what does the term “Data Cleaning” actually signify?
Data Cleaning is the procedure of preparing data for analysis and analytical
problems by transforming existing datasets, which may contain noisy data,
duplications, distortions, missing values, and problematic, complex or
inconsistent records. The main goal of data cleaning is to obtain an equalized
and systematized dataset that offers a simple and straightforward way to
generate accurate results. It ensures that all the incorrect records are
cleaned and filtered out, since no organization would prefer to work with
incomplete, unsound and unreliable data. That is why it is suggested to clean
data before moving ahead with predicting results from the machine.
Steps involved in Data Cleaning:

  1. Variable identification
  2. Univariate analysis
  3. Bi-variate analysis
  4. Missing values treatment
  5. Outlier treatment
  6. Variable transformation
  7. Variable creation
  1. Variable Identification
    This step refers to identifying the variables present in the dataset and the
    values they store. Each variable is matched against the input and output of
    the problem to determine its category, and its data type is identified
    accordingly.
    Example: Let’s say we want to predict how many people from the Nagpur
    district are going to cast their votes in the upcoming elections.
    First of all, we’ll need to identify all the variables (input and output) and
    their data types as well, as sketched in the code after the figure below.
    Fig: Variables available in the dataset
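To make this concrete, here is a minimal pandas sketch. The file name voters.csv and the column names (will_vote as the output variable, the rest as inputs) are assumptions made purely for illustration:

```python
import pandas as pd

# Hypothetical voter dataset; the file and column names are assumed
df = pd.read_csv("voters.csv")

# Identify every variable and its data type
print(df.dtypes)

# Separate the input (predictor) variables from the output (target) variable
target = "will_vote"                      # assumed output variable
predictors = [col for col in df.columns if col != target]

# Determine each predictor's category: categorical vs. continuous
categorical = [col for col in predictors if df[col].dtype == "object"]
continuous = [col for col in predictors if col not in categorical]
print("Categorical variables:", categorical)
print("Continuous variables:", continuous)
```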
  2. Univariate Analysis
    This step refers to inspecting each variable present in the dataset
    individually, i.e. on a one-by-one basis. The method varies depending upon
    the type of the individual variable. Variables can be either:
    • Continuous: statistical summaries (mean, median, standard deviation and so
    on), along with histograms or box plots, are used for analysing the variable.
    • Categorical: a frequency table or a bar chart can be used in this case.
    Fig: Univariate Analysis
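A rough sketch of both cases, reusing the assumed voters.csv from above and treating age as continuous and gender as categorical (both column names are assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("voters.csv")

# Continuous variable: summary statistics (count, mean, std, quartiles, ...)
print(df["age"].describe())
df["age"].plot(kind="hist", title="age")
plt.show()

# Categorical variable: frequency table and bar chart
print(df["gender"].value_counts())
df["gender"].value_counts().plot(kind="bar", title="gender")
plt.show()
```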
  3. Bi-Variate Analysis
    If there are two or more variables present in a dataset, how are they related
    to each other? Finding the relationship between a pair of variables is what
    bi-variate analysis carries out.
    The main advantage is that bi-variate analysis can be used for both
    continuous and categorical variables, in any combination of the two.
    Fig: Bi-Variate Analysis
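One common tool per combination, again on the assumed voters.csv (the column income is another assumption):

```python
import pandas as pd

df = pd.read_csv("voters.csv")

# Continuous vs. continuous: correlation coefficient between two variables
print(df["age"].corr(df["income"]))

# Categorical vs. categorical: cross-tabulation (two-way frequency table)
print(pd.crosstab(df["gender"], df["will_vote"]))

# Continuous vs. categorical: compare the distribution across groups
print(df.groupby("gender")["age"].describe())
```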
  4. Missing Values Treatment
    When some important values are not available in the training dataset, the
    behaviour of, and relationships between, the other variables present in the
    dataset cannot be analysed appropriately.
    Therefore, missing values treatment starts with analysing the total number
    of missing values; the affected rows can then be dropped or the gaps filled
    in (imputed).
    Fig: Analysing the no. of rows having missing data from a dataset.
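A minimal sketch of counting and then treating missing values, assuming the same hypothetical columns as before:

```python
import pandas as pd

df = pd.read_csv("voters.csv")

# Count the missing values per variable
print(df.isnull().sum())

# Count the rows that have at least one missing value
print(df.isnull().any(axis=1).sum())

# Two common treatments: drop incomplete rows, or impute the missing values
df_dropped = df.dropna()
df_imputed = df.fillna({"age": df["age"].median(),          # numeric: median
                        "gender": df["gender"].mode()[0]})  # categorical: mode
```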
  5. Outlier Treatment
    Outliers can cause wrong estimations. Hence, keeping an eye on the outliers
    is of prime importance, because with outliers present, the apparent pattern
    of the variables in a dataset can easily change, resulting in different kinds
    of patterns altogether.
    Types of Outliers:
    • Univariate: these outliers are found when the distribution of only one
    variable is taken into account.
    • Multi-variate: these outliers are found when several variables are
    considered jointly. The joint distribution pattern needs to be understood in
    order to find multi-variate outliers.
    Fig: Univariate v/s Multi-variate Outliers
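The text does not prescribe a detection method; one common choice, sketched here under the same assumptions, is the interquartile-range (IQR) rule for univariate outliers, with capping as a simple treatment:

```python
import pandas as pd

df = pd.read_csv("voters.csv")

# Univariate outlier detection on 'age' via the IQR rule
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["age"] < lower) | (df["age"] > upper)]
print(len(outliers), "univariate outliers found in 'age'")

# A simple treatment: cap (winsorize) the values at the IQR fences
df["age_capped"] = df["age"].clip(lower, upper)
```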
  6. Variable Transformation
    This process refers to replacing a variable with a function of itself, i.e.
    changing its values or its scale.
    When does this process take place?
    • When the values of certain variables are changed to suit the user’s needs
    and understanding. Variables in a dataset often sit on different scales;
    standardizing them is a linear rescaling, so it changes the scale of a
    variable but not the shape of its distribution.
    • When a non-linear relation needs to be converted into a linear one.
    Transformation allows this, so complexity decreases and analysing the
    information becomes simpler.
    Fig: Variable Transformation
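A brief sketch of both transformations on the assumed columns (np.log1p is used rather than a plain logarithm so that zero values do not cause errors):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("voters.csv")

# Log transformation: can turn a non-linear relation into a roughly linear one
df["log_income"] = np.log1p(df["income"])

# Standardization: linear rescaling to mean 0 and standard deviation 1
# (the scale changes, the shape of the distribution does not)
df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std()
```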
  7. Variable Creation
    The process of generating new variables from the set of already available
    variables is termed variable creation.
    Methods available:
    • Creating derived variables: here new variables are created from the
    existing ones by applying different types of functions.
    • Creating dummy variables: used to convert a variable which is categorical
    in nature into numeric variables. Each dummy variable created from the
    categorical predictor can take up only two values: 0 or 1.
    Fig: Variable Creation method for creating the adjustment report
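A short sketch of both methods, one derived variable and one set of dummy variables, using the assumed columns from earlier:

```python
import pandas as pd

df = pd.read_csv("voters.csv")

# Derived variable: apply a function to an existing variable
df["is_senior"] = (df["age"] >= 60).astype(int)

# Dummy variables: one 0/1 column per category of a categorical variable
dummies = pd.get_dummies(df["gender"], prefix="gender", dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df.head())
```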