The design aspects of incremental loading were covered in Chapter 3, in the section “Design for Incremental Loading.” As shown in Figure 3.3, the serving layer is part of the lambda architecture. This section discusses the batch layer, which provides data to the serving layer along the cold path. As a quick refresher, remember that in all cases you want your data loading, movement, and transformation to be as efficient as possible. With that in mind, loading all the data from a data source into a data analytics datastore on a regular basis is not the most efficient approach. Instead, you want to transfer only the newly added or modified data, a process known as incrementally loading data.
Why is an incremental data load more efficient than a complete transfer of all source data? There are a few reasons. The first is that the speed and performance of an incremental load remain consistent over time. The more data you need to migrate, the longer the transfer takes and the more compute resources it requires. As time progresses, the amount of data grows; every day brings new data, which means ever longer transfers. Transferring the bulk of the data once, and then only the new and modified data after that initial load, results in a consistent amount of data being transferred on each run. Another reason is that the job runs faster, which makes sense because the less data you move, the less time it takes. Lastly, touching less data reduces the possibility of error. If you move all your data each time and an exception causes the transfer to fail, there can be significant downstream impacts. If you instead load only the changes, the impact of a failed or partial run is more manageable, because both the time to repair and the amount of data that must be retransmitted are much smaller. As discussed in the following sections, numerous approaches are useful for performing incremental loads along the cold path.
Watermarks
A watermark is a column in a source table that contains an identifier used to determine the current location/state/status of incremental data movement. The watermark value is commonly a datetime stamp but can also be an incrementing key. There are a few implementation methods you might consider concerning watermarks. The first is similar to the hashed column approach discussed previously. In that scenario each row has an additional column that is used to determine whether the row has been updated since the previous run. Instead of using a hash value, you might make that column contain a datetime, recording when the row was inserted or last updated. If the datetime in that column is greater than the last time the job ran, you can conclude that the row needs to be part of the incremental load. If your source data is a relational database that has an incrementing primary key, then that key can be used as a watermark. When your batch job runs, the first step in the code can be to access a reference table that stores the primary key of the last row used in the most recent job execution. For example, the value retrieved at the beginning of the batch run might be something like 1110. Then, any row having a primary key greater than that number would be included in the incremental load. The primary key of the last row transferred, for example, 2323, would then be written to the reference table and used as the starting point (plus 1) the next time the batch job is triggered. In Exercise 6.10 you inserted the following two rows into the READING table:
+------------+--------------------------+--------+
| READING_ID | READING_DATETIME         | VALUE  |
+------------+--------------------------+--------+
| 5626757    | 20220710 04:37:35.394 PM | 22.454 |
| 5626758    | 20220710 04:37:35.445 PM |  4.849 |
+------------+--------------------------+--------+
In this example, both scenarios mentioned previously would work. There is a READING_DATETIME value that can be compared to the last batch runtime to determine whether the row should be included. There is also a READING_ID, an incrementing primary key on the READING table, which can be used to make the same determination.
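To make the primary key watermark pattern concrete, the following is a minimal sketch written in Python against an in-memory SQLite database so it runs stand-alone. Only the READING table and its two rows come from the example above; the WATERMARK reference table, its layout, and the starting value 5626756 are assumptions made for illustration, not the schema used in the exercises, and in practice the source would be your operational relational database.

import sqlite3

# Stand-alone illustration: an in-memory SQLite database plays the role of the
# source relational database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE READING (
        READING_ID        INTEGER PRIMARY KEY,
        READING_DATETIME  TEXT,
        VALUE             REAL
    );
    -- Hypothetical reference table that stores the last primary key
    -- transferred by the most recent batch run (the watermark).
    CREATE TABLE WATERMARK (
        TABLE_NAME  TEXT PRIMARY KEY,
        LAST_ID     INTEGER
    );
    INSERT INTO WATERMARK VALUES ('READING', 5626756);
    INSERT INTO READING VALUES (5626757, '20220710 04:37:35.394 PM', 22.454);
    INSERT INTO READING VALUES (5626758, '20220710 04:37:35.445 PM', 4.849);
""")

# Step 1: read the watermark captured by the previous batch run.
(last_id,) = conn.execute(
    "SELECT LAST_ID FROM WATERMARK WHERE TABLE_NAME = 'READING'"
).fetchone()

# Step 2: the incremental load includes only rows beyond the watermark.
new_rows = conn.execute(
    "SELECT READING_ID, READING_DATETIME, VALUE FROM READING WHERE READING_ID > ?",
    (last_id,),
).fetchall()
print(f"{len(new_rows)} row(s) to transfer: {new_rows}")

# Step 3: after a successful transfer, persist the highest key that was moved
# so the next run starts from that point + 1.
if new_rows:
    max_id = max(row[0] for row in new_rows)
    conn.execute(
        "UPDATE WATERMARK SET LAST_ID = ? WHERE TABLE_NAME = 'READING'",
        (max_id,),
    )
    conn.commit()

The datetime variation works the same way: store the timestamp of the last successful run instead of the last primary key, and filter on READING_DATETIME rather than READING_ID.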
Last Modified Date
A very useful piece of metadata that files expose is the date when they were last modified. You can access this metadata and use it to determine whether a file was created or modified since the last time the batch job ran, and therefore whether it belongs in the incremental load. To achieve that, complete Exercise 6.11.
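Before working through the exercise, the following is a minimal, stand-alone sketch of the underlying logic in plain Python. The folder path, the *.csv file pattern, and the last_run_time value are hypothetical placeholders; in a real pipeline the last run time would come from a reference table or the orchestrator's run history, and the file listing would come from your storage service rather than a local directory.

from datetime import datetime, timezone
from pathlib import Path

# Hypothetical values: the folder holding the source files and the time the
# previous batch job ran.
source_folder = Path("/data/landing/readings")
last_run_time = datetime(2022, 7, 10, 16, 0, 0, tzinfo=timezone.utc)

# Keep only the files created or modified since the last batch run; the file
# system's last-modified metadata acts as the watermark here.
changed_files = [
    path
    for path in source_folder.glob("*.csv")
    if datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc) > last_run_time
]

for path in changed_files:
    print(f"Include in incremental load: {path.name}")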