Revert Data to a Previous State – Create and Manage Batch Processing and Pipelines

As mentioned numerous times in this book, batch jobs and pipelines fail. Sometimes a failure has no impact on the data in your data lake, but in other cases the failure can cause serious problems. For example, if the parameters for a DELETE command are wrong, if the insertion of data only partially completes, or if the creation of a delta table fails due to a type mismatch, your data can be seriously corrupted. As you learned in Chapter 4, Exercise 4.4, point‐in‐time restore (PITR) depends on a snapshot for creating a restore point. For dedicated SQL pools, the creation of restore points happens automatically. Therefore, if your pipeline runs result in the corruption of data, you can follow the steps described in Exercise 4.4 to recover from the most recent restore point. Another important concept to remember from Chapter 4 is the recovery point objective (RPO), which defines the maximum acceptable amount of data loss, expressed as the time between the last restore point and a failure. The final concept to call out, which was introduced in Chapter 4 and pertains to regressions, is soft delete in the context of an ADLS container. This is a built‐in data protection feature that prevents data loss when files or folders are unintentionally deleted.
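As a brief illustration, soft delete is enabled on the storage account that backs the data lake. The following is a minimal sketch, assuming the azure-storage-blob Python SDK and a placeholder connection string (neither taken from the book's exercises), of how blob soft delete might be switched on programmatically; the seven-day retention window is only an example and should be aligned with your RPO.

```python
# Sketch only: enable blob soft delete on the storage account behind the data lake.
# The connection string and retention period are placeholders, not values from this book.
from azure.storage.blob import BlobServiceClient, RetentionPolicy

service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")

# Keep deleted blobs recoverable for 7 days; choose a window that matches your RPO.
service.set_service_properties(
    delete_retention_policy=RetentionPolicy(enabled=True, days=7)
)
```

The same protection can also be configured through the Azure portal on the storage account's data protection settings.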

In Exercise 3.15 you created an Azure Databricks delta lake. A delta lake has numerous beneficial capabilities. For example, ACID transactions and time travel are both relevant features when it comes to preventing regressions or recovering from them. When a database transaction is ACID compliant, it is guaranteed either to complete successfully or to be rolled back completely; in either case the data cannot be left in a corrupted state. The time travel feature uses the versionAsOf option as a pointer to previous versions of delta tables. These versions are useful for determining the impact of failed activities run against the tables and provide the option to restore from a previous version, as illustrated in the sketch that follows this paragraph. This book has discussed recovering data and rolling back data, which is synonymous with regressing to a previous state, but what happens if you cannot roll back and instead need to repair the data in place? That is where data cleansing comes in.
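Here is a brief sketch of what time travel and a version restore might look like in a Databricks notebook, where the spark session is predefined. The table path and version number are hypothetical and not taken from the book's exercises.

```python
# Sketch only: inspect a delta table's history, read an earlier version with
# time travel, and restore the table to that version. Path and version are examples.
from delta.tables import DeltaTable

path = "/mnt/datalake/delta/brainwaves"  # hypothetical table location

# List the table's versions and the operations that created them
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)

# Read the table as it existed at version 3 using the versionAsOf option
previous = spark.read.format("delta").option("versionAsOf", 3).load(path)

# Compare against the current version before deciding to roll back
print(previous.count(), spark.read.format("delta").load(path).count())

# If version 3 is the last known-good state, restore the table to it
DeltaTable.forPath(spark, path).restoreToVersion(3)
```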

Data Cleansing

Cleansing data means removing or modifying content in your data lake that exists in an unwanted state. Duplicates, format issues, spelling errors, corrupted data, and old or irrelevant data are all examples of data you would not want in your data lake. How does this data end up there? There can be many reasons, such as human error, malicious exploits, or bad data arriving from an upstream data source. Another common cause of corruption is an exception somewhere along your batch job solution path. When your batch job solution begins to fail, the problem can grow quickly, especially if you notice the failures only after a few days or weeks. If the batch jobs have downstream dependencies and run multiple times per day, then you likely have a very big problem, with a lot of corrupted data. Fixing that problem will require some serious time, effort, diligence, and patience.

How you resolve this depends heavily on the data, but recall the “Handle Missing Data” section, where you learned about MAR, MCAR, and MNAR. When data falls into the MAR or MCAR category, it is generally safe to delete it without significantly impacting your business insights and EDA. However, if your data is MNAR, then imputation, the approach you learned about earlier, is helpful. If that approach does not meet your requirements, there are other options, but they all depend on your knowing what the data should look like. If you do not know the target you are working toward, it is highly improbable that you will hit it. Begin by creating a query, or a file, that returns the data in the form you expect. Then run the inverse of that query, so that only the data you do not want is returned. After reviewing that dataset, you can use the same query to either delete or modify the unwanted rows, cleansing the data and making it useful again. Data quality is one of the core data management disciplines discussed in Chapter 5 (refer to Figure 5.41). There is no quick and simple solution to cleansing your data. The amount of data in the Big Data industry is so huge that cleansing it manually is not realistic. If your organization can afford and utilize tools that specialize in this activity, you should use them; otherwise, you might find yourself in a frustrating, no‐win situation.
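The following is a minimal sketch of the inverse-query approach just described, assuming a Databricks notebook and a delta table of readings; the table name, column name, and validity rule are hypothetical placeholders rather than objects from the book's exercises.

```python
# Sketch only: define what valid data looks like, invert the rule to surface
# unwanted rows, then either delete or impute them. Names and ranges are examples.
from pyspark.sql import functions as F

df = spark.read.table("brainwaves")  # hypothetical delta table

# The rule describing data in the form you expect
valid = F.col("VALUE").isNotNull() & F.col("VALUE").between(0.0, 100.0)

# The inverse of the rule returns only the rows you do not want
suspect = df.filter(~valid)
suspect.show(20, truncate=False)  # review before changing anything

# Option 1: delete the unwanted rows from the delta table
spark.sql(
    "DELETE FROM brainwaves WHERE VALUE IS NULL OR VALUE NOT BETWEEN 0.0 AND 100.0"
)

# Option 2: impute instead of delete, for example with the mean of the valid rows
mean_value = df.filter(valid).agg(F.avg("VALUE")).first()[0]
cleansed = df.withColumn(
    "VALUE", F.when(valid, F.col("VALUE")).otherwise(F.lit(mean_value))
)
```

Whether you delete or impute, keep the reviewed result set from the inverse query; it documents exactly which rows were changed and why.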
