Design and Develop Slowly Changing Dimensions – Create and Manage Batch Processing and Pipelines – Study guide for Exam DP-203: Data Engineering

The design of an SCD table was discussed in detail in Chapter 3, in the section “Design the Serving Layer.” The notion of a star schema was also introduced in this same context, which is where you most often find dimension tables. Refer to Figure 3.14 for an illustration of what a star schema often looks like, and refer to Table 3.5 for an overview of the different SCD types. The building, development, and implementation of an SCD table was performed in Chapter 4, specifically in Exercise 4.10 and the section “Implement Logical Data Structures.” Refer to Table 4.6 for a reference to the figures in this book that represent each SCD type. There is also a reference to SCDs in the “Handle Late‐Arriving Data” section in this chapter. You need to know the internals and use cases of SCD types for the DP‐203 exam.

This section falls into the parent section of “Design and Develop a Batch Processing Solution.” This means there needs to be some kind of content around anything unique or different when compared to the serving layer or logical data structures. However, there are no unique design considerations in the context of batching when compared to logical structures or the serving layer. The implementation, however, would be different. In Exercise 4.10 you manually updated a Type 2 SCD dimension table; however, in a batch scenario you would include that query in a scheduled batch job that checks for changes and applies them when necessary. All of the terms you have learned related to batching—such as duplicate data, missing data, late‐arriving data, upserting data, triggering batch jobs and pipelines, dependencies, and retention—now must be considered. The more you learn, the more complicated things become, as you begin to recognize that your design and implementation has many decision points. Additionally, you begin to recognize each decision has upstream and downstream implications that can require a review on their impact resulting in a different decision.

As a review, consider the following Type 2 SCD table, which you created in Exercise 4.10:

CREATE TABLE [brainwaves].[DimELECTRODE] (
[ELECTRODE_ID]   INT NOT NULL   IDENTITY(1,1),
[SK_ID]          NVARCHAR (50) NOT NULL,
[ELECTRODE]      NVARCHAR (50) NOT NULL,
[LOCATION]       NVARCHAR (50) NOT NULL,
[START_DATE]     DATETIME2      NOT NULL,
[END_DATE]       DATETIME2,
[IS_CURRENT]     INT NOT NULL)

A Type 2 SCD table has the capability to provide versioning. With the existence of the START_DATE and END_DATE, along with the SK_ID and IS_CURRENT flag, it is possible to know which row is current and when previous rows were valid. Remember that dimension tables hold reference data used by the fact tables to transform itself into data that is useful for finding business insights. If dimension table data changes but you want to transform data using a previous version of the reference data on the dimension table, you can use a Type 2 SCD table. Finally, the SK_ID is a surrogate key that is helpful in the unique identification of a row. There is also the concept of a business key that could be used instead of a surrogate key, where a business key is something like an identification number used only within your company for the purpose of uniqueness.

Study guide for Exam DP-203: Data Engineering

Design and Develop Slowly Changing Dimensions – Create and Manage Batch Processing and Pipelines

Bill Mettler

Leave a Reply Cancel reply

Related Posts

Implement Version Control for Pipeline Artifacts – Create and Manage Batch Processing and Pipelines

Trigger Batches – Create and Manage Batch Processing and Pipelines

Implement Azure Synapse Link and Query the Replicated Data – Create and Manage Batch Processing and Pipelines

Bill Mettler

Leave a Reply Cancel reply