Handle Late‐Arriving Data – Create and Manage Batch Processing and Pipelines

There are many reasons why data might arrive late in your ADLS container or data warehouse: limited network bandwidth, which slows the throughput of data; insufficient processing power to complete an upstream job in a timely manner; or the outright failure of an upstream job. In a cloud environment, throttling may also have been enforced due to an unexpected increase in consumption. Because there are so many causes of late-arriving data, there are also many approaches to managing it. Two scenarios are discussed here, but, as you would expect, the right solution depends on your situation.

Delay a Pipeline Trigger

When you have a pipeline that is scheduled to run at a specific time and the data it should be processing arrives after the scheduled time, the data will not be processed. If the expected data is not present, the pipeline might fail and stop. If you do not plan for late‐arriving data, code may run and perform unexpected and unwanted changes to your data. A method for handling late‐arriving data in Azure Synapse Analytics is to set the delay time found within a tumbling window trigger. You can see the Delay property in Figure 6.37. Setting this property to 5 minutes means the trigger waits and processes data that arrives up to 5 minutes after the trigger's scheduled start time. Choosing the delay amount depends on your knowledge of the delays that can affect dependent upstream jobs. The frequency of the trigger is also useful for determining the delay. If the trigger recurs hourly, a delay longer than 30 minutes is risky, because the pipeline's run time plus the delay could overlap the next window's start time. Ultimately, the right value depends on your current requirements and trigger settings.
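As a minimal sketch, the definition below mirrors the JSON shape of a Synapse/Data Factory tumbling window trigger with the Delay property set. The trigger name, pipeline name, and start time are hypothetical; the delay is expressed as an hh:mm:ss timespan string.

```python
# Illustrative tumbling window trigger definition (the JSON shape used by
# Azure Synapse Analytics / Azure Data Factory). All names and times here
# are example values, not ones from the exercises.
import json

tumbling_window_trigger = {
    "name": "TransformTrigger",  # hypothetical trigger name
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",                  # window recurs hourly
            "interval": 1,
            "startTime": "2024-01-01T00:00:00Z",  # hypothetical start date
            "delay": "00:05:00",      # wait 5 minutes for late-arriving data
            "maxConcurrency": 1,
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "TransformPipeline",  # hypothetical pipeline
                "type": "PipelineReference",
            }
        },
    },
}

print(json.dumps(tumbling_window_trigger["properties"]["typeProperties"],
                 indent=2))
```

With this configuration, each hourly window begins processing 5 minutes after its scheduled boundary, giving slow upstream jobs a small grace period without overlapping the next window.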

Fact and Dimension Tables

You were first introduced to fact and dimension tables when they appeared in a star schema in Chapter 3. Table 3.5 listed the different slowly changing dimension (SCD) types, which range from one to six. In Exercise 4.9 and Exercise 4.10 you created dimension tables, and you used them in Exercise 5.1, where you created a stored procedure that used the content of the dimension tables to generate the data in the fact table. That scenario was relatively simple, because there were no changes to the dimension tables between Exercise 4.10 and Exercise 5.1; therefore, the validity dates on the dimension tables remained static. Now imagine a scenario where the fact table data is generated before an update to a dimension table arrives, because the upstream job that produces the update was delayed.

Late‐arriving fact table data is not as vexing as late‐arriving dimension data because, as you learned in Exercise 5.1, dimension tables are used to enrich the fact data. It is therefore imperative that dimension tables always be processed successfully before fact tables. There are a few approaches to managing this scenario. The first is simply to create dependencies between the pipelines that load the fact and dimension tables, ensuring that all the work that needs to be done on the dimension tables is complete before performing the fact table transaction load. Another technique, covered in the last section, is to use a default or calculated value when data in the dimension table is not yet available. For example, if you are using SCD Type 6 and the start date or end date is missing, your code can substitute the current date or a hard‐coded date. Because you know the default or hard‐coded value, you can come back at a later time, use that value as a filter, and update the fact table data with the new value, if required. Lastly, you can detect whether fact data has arrived early. Early arrival can be just as impactful as late arrival, especially if the pipeline trigger runs based on a storage event or custom event.

Bill Mettler
