What are the different ways to handle missing values?

Missing Values

  • Missing values are values that are not stored or not present in the given database.
  • An element can be absent for various reasons, such as corrupt data, failure to load the information, or incomplete extraction.
  • In the database, missing values appear as blank, null, or NaN entries.
  • Handling such missing values is one of the major challenges in preparing data.
  • To make handling missing data easier, it is important to analyze each column that has missing values carefully, so that the reasons behind the missing values can be found.
  • This helps in developing an appropriate strategy for handling them.
  • In general, various strategies are used to handle missing values.
  • Some of them are as follows:

    • Deleting Rows or Columns with missing values
    • Replacing missing values With Mean/Median/Mode
    • Imputation method for categorical columns
    • Predicting The Missing Values
    • Using Algorithms Which Support Missing Values
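
Before choosing a strategy, each column can be analyzed for how much of it is missing. The following is a minimal sketch using pandas; the DataFrame and its column names are hypothetical toy data, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; None and np.nan both mark missing entries.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", None, "Mumbai", None],
    "income": [50000, 62000, np.nan, 58000],
})

# Number of missing entries per column.
missing_count = df.isna().sum()

# Fraction of missing entries per column (mean of a boolean mask).
missing_fraction = df.isna().mean()

print(missing_count)
print(missing_fraction)
```

Columns with a high missing fraction are candidates for deletion, while columns with only a few gaps are usually better candidates for imputation.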

1] Deleting Rows or Columns with missing values

  • Missing values can be handled by deleting the rows or columns that contain null values.
  • If more than half of the rows in a column are null, the entire column can be dropped.
  • Rows that have null values in one or more columns can also be dropped.
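
As a rough illustration, both kinds of deletion could be done with pandas as follows. The data is made up, and the 0.5 threshold simply mirrors the "more than half" rule above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 8.0],  # 75% of the rows are null
    "c": [10.0, 20.0, 30.0, 40.0],
})

# Drop columns in which more than half of the rows are null.
mostly_null = df.columns[df.isna().mean() > 0.5]
df_cols = df.drop(columns=mostly_null)

# Then drop every remaining row that still has at least one null.
df_rows = df_cols.dropna(axis=0, how="any")

print(df_rows)
```

Dropping the mostly-null column first keeps more rows than calling `dropna` on the original frame, which would have left only the last row here.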

Advantages -

  • Training on data from which all missing values have been removed can give a more robust and accurate model, provided enough data remains.
  • Deleting a row or column that carries no specific information is harmless because it has no weightage in the model.

Disadvantages -

  • Information is lost, possibly a large amount of it.
  • Works poorly when the percentage of missing values is high (for example, 30%) compared with the complete database.

2] Replacing missing values With Mean/Median/Mode

  • Missing values in columns with numeric continuous values can be replaced with the mean, median, or mode of the remaining values in the column.
  • This method avoids the loss of data caused by the previous method.
  • Replacing missing values with an approximation such as the mean, median, or mode is a statistical approach to handling them.
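
A small sketch of mean and median replacement with pandas, on a made-up `price` column that contains one outlier:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing entries and one outlier.
s = pd.Series([10.0, 12.0, np.nan, 11.0, 100.0, np.nan], name="price")

# mean() and median() skip NaN by default, so they use the observed values only.
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

The outlier pulls the mean far above most of the observed values, which is why the median is often preferred for skewed columns.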

Advantages -

  • It prevents the large data loss that removing complete rows and columns may cause.
  • It is a very useful approach for small-sized databases.

Disadvantages -

  • Applicable only when attributes are numeric.
  • Possibility of data leakage if the statistic is computed on the full dataset before the train/test split.
  • Can distort the variance of the column and introduce bias.
  • Usually works poorly compared to multiple-imputation methods.
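
One common way to limit the data-leakage risk is to compute the fill value on the training rows only and reuse it for the test rows. The sketch below assumes a hypothetical split into `train` and `test` frames that has already been made.

```python
import numpy as np
import pandas as pd

# Hypothetical, already-split data.
train = pd.DataFrame({"x": [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({"x": [np.nan, 100.0]})

# Learn the fill value from the training rows only.
train_mean = train["x"].mean()

# Reuse it for both splits, so test values never influence the statistic.
train_filled = train["x"].fillna(train_mean)
test_filled = test["x"].fillna(train_mean)

print(train_mean, test_filled.tolist())
```

Had the mean been taken over both frames together, the test value 100.0 would have shifted the value used to fill the training data.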

3] Imputation method for categorical columns

  • When the missing values are in a categorical column (holding strings or numeric codes), they can be replaced with the most frequent category.
  • If the number of missing values is very large, they can be replaced with a new category instead.
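
Both options can be sketched with pandas as follows. The `colour` column and the `"Unknown"` label are illustrative choices, not prescribed by the method.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries.
colour = pd.Series(["red", "blue", None, "red", None, "green"], name="colour")

# Option 1: replace missing entries with the most frequent category.
# mode() ignores missing values and returns the most common value(s).
most_frequent = colour.mode()[0]
filled_mode = colour.fillna(most_frequent)

# Option 2: treat "missing" as a category of its own.
filled_new = colour.fillna("Unknown")

print(filled_mode.tolist())
print(filled_new.tolist())
```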

Advantages -

  • It prevents the large data loss that removing complete rows and columns may cause.
  • It is a very useful approach for small-sized databases.
  • Prevents the loss of data by adding a unique category.

Disadvantages -

  • Used only for categorical attributes.
  • A new category adds features to the model during encoding, which may result in poor performance.
  • Filling with the most frequent category reduces the variance of the column.

4] Predicting The Missing Values

  • In this method, the other features that have no nulls are used to predict the missing values.
  • A regression or classification model is used for the prediction, depending on whether the column with missing values is continuous or categorical.
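
For a continuous column, the idea can be sketched with a simple least-squares line fitted by NumPy. This is only an illustration under the assumption of one complete feature `x` and a roughly linear relationship; in practice any regression or classification model could take its place.

```python
import numpy as np
import pandas as pd

# Hypothetical data: x is complete, y has two missing entries.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, np.nan, 8.0, np.nan],
})

known = df[df["y"].notna()]
unknown = df[df["y"].isna()]

# Fit y = slope * x + intercept on the rows where y is observed.
slope, intercept = np.polyfit(known["x"].to_numpy(), known["y"].to_numpy(), deg=1)

# Predict y for the rows where it is missing and write the values back.
df.loc[unknown.index, "y"] = slope * unknown["x"] + intercept

print(df)
```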

Advantages -

  • Often gives better results than the previous methods.
  • Takes into consideration the covariance between the column with missing values and the other columns.
  • Can give unbiased estimates of the model parameters.

Disadvantages -

  • The predicted values are only a proxy for the true values.
  • Bias also arises when an incomplete conditioning set is used for a categorical variable.

5] Using Algorithms Which Support Missing Values

  • Not all machine learning algorithms support missing values, but some ML algorithms are robust to missing values in the dataset.
  • The k-NN algorithm can ignore a column in the distance measure when a value is missing.
  • Naive Bayes can also support missing values when making a prediction.
  • Random Forest works well on non-linear and categorical data. It adapts to the data structure, taking into consideration high variance or bias, and produces better results on large datasets.
  • These algorithms can be used when the dataset contains null or missing values.
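
The k-NN point above can be illustrated with a small pure-Python sketch. The function names are hypothetical and this is not any library's implementation; it only shows a Euclidean distance that skips coordinates missing in either point.

```python
import math

def distance_ignoring_missing(a, b):
    """Euclidean distance over the coordinates observed in both points."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # no shared coordinates: treat as infinitely far
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))

def nearest_label(query, points, labels):
    """1-nearest-neighbour prediction using the distance above."""
    distances = [distance_ignoring_missing(query, p) for p in points]
    return labels[distances.index(min(distances))]

# Hypothetical training points; None marks a missing coordinate.
points = [(1.0, 1.0), (5.0, None), (9.0, 9.0)]
labels = ["low", "mid", "high"]

prediction = nearest_label((5.5, 2.0), points, labels)
print(prediction)
```

Skipping coordinates makes points with fewer observed values look closer than they may really be, so this sketch is a simplification; a practical version would usually rescale the distance by the number of coordinates compared.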

Advantages -

  • No need to handle missing values for each column as machine learning algorithms will handle them efficiently.

Disadvantages -

  • It can be a very time-consuming process.
  • For k-NN, a distance function must be chosen (Euclidean, Manhattan, etc.), and these do not always yield a robust result.