Handle Missing Data – Create and Manage Batch Processing and Pipelines

In Exercise 5.5 you worked with a CSV file that contained null values. In that exercise, the null values were handled by removing the affected records from the dataset. Depending on the data you are analyzing for business insights, removing records can affect the outcome of the data processing. An important step before deciding how to handle missing data is to determine why the data is missing. The cause typically falls into one of the following three categories:

Missing at random (MAR): The missing data can be accounted for using other variables—for example, a disruption during a meditation session or the readjustment of the BCI on the head.
Missing completely at random (MCAR): The missing data is not related to the observation being studied—for example, the BCI ran low on battery and missed or wrongly logged some readings.
Missing not at random (MNAR): The missing data is caused by an unwillingness or inability to provide—for example, a test subject does not want their brain waves captured.

Missing data that falls into the MNAR category is the most problematic, because deleting the entire reading can introduce bias into the model. The following table contains some readings with missing data. Assume they are missing due to an MNAR scenario; for example, the subject moved the electrode to scratch their head for a brief moment.

When missing data falls into the MAR or MCAR scenarios, it is generally acceptable to delete the reading because the reason for the missing data is identifiable. In the case of MNAR, however, the usual approach is imputation: instead of removing the data, you replace it with a substitute value. There are many ways to derive the substitute value, ranging from an educated guess to more involved techniques such as linear interpolation, which estimates the missing value from the neighboring readings. A method in the middle range of complexity is to use the mean, median, or mode of the observed values as the substitute. Each of these leaves the statistic it is based on unchanged for the session; for example, filling gaps with the mean does not change the session mean. The median value per reading and frequency has been added to the table and will not materially affect the final outcome.
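As a minimal sketch of mean, median, or mode imputation, the following Python snippet replaces missing measurements with a statistic computed from the observed ones. The `impute` helper and the `alpha` values are hypothetical and are not taken from the exercise data; `None` is assumed to mark a missing reading.

```python
from statistics import mean, median, mode


def impute(values, strategy="median"):
    """Return a copy of values with None replaced by a summary statistic.

    Hypothetical helper for illustration; strategy is "mean", "median", or "mode".
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    # Compute the fill value from the observed measurements only.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical ALPHA readings; None marks a missing measurement.
alpha = [4.0, None, 6.0, 5.0, None, 9.0]

median_filled = impute(alpha, "median")
mean_filled = impute(alpha, "mean")
```

With these sample values, filling with the median keeps the column median at its observed value, and filling with the mean keeps the column mean at its observed value, because every inserted value equals the statistic it was derived from.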

Note that if you had removed the two rows where the ALPHA readings were missing, it would have affected THETA considerably: the median for THETA would change from 99.938 to 107.136.
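The side effect of deleting whole rows can be illustrated with a small Python sketch. The readings below are made-up values, not the ones from the table, and `column_median` is a hypothetical helper; the point is only that dropping a row because ALPHA is missing also discards that row's THETA value.

```python
from statistics import median

# Hypothetical readings (not the table values); None marks a missing ALPHA.
readings = [
    {"ALPHA": 5.0, "THETA": 90.0},
    {"ALPHA": None, "THETA": 95.0},
    {"ALPHA": 6.0, "THETA": 100.0},
    {"ALPHA": None, "THETA": 98.0},
    {"ALPHA": 7.0, "THETA": 120.0},
]


def column_median(rows, column):
    """Median of the observed (non-None) values in one column."""
    return median(r[column] for r in rows if r[column] is not None)


# Option 1: delete every row that has a missing value.
complete = [r for r in readings if None not in r.values()]

# Option 2: keep every row and fill missing ALPHA with the ALPHA median.
alpha_fill = column_median(readings, "ALPHA")
imputed = [
    {**r, "ALPHA": alpha_fill if r["ALPHA"] is None else r["ALPHA"]}
    for r in readings
]

print(column_median(readings, "THETA"))  # THETA median over all rows
print(column_median(complete, "THETA"))  # THETA median after row deletion
print(column_median(imputed, "THETA"))   # THETA median after imputation
```

In this sample, row deletion shifts the THETA median, whereas imputing ALPHA leaves every THETA value in place and so leaves its median unchanged.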

Study guide for Exam DP-203: Data Engineering

Bill Mettler
