Handle Security and Compliance Requirements – Create and Manage Batch Processing and Pipelines

The in‐depth coverage of security‐related concepts is coming in Chapter 8, “Keeping Data Safe and Secure.” The different aspects of security are vast, but here you will get a short introduction to the following topics:

  • Networking
  • Endpoint security
  • Backup and recovery

While provisioning your Azure Batch account in Exercise 6.1, you may have noticed the Networking tab. You were instructed to leave the default value of All Networks. The other two options, Selected Networks and Disabled, provide additional security features. For example, the Azure Batch endpoint you use to trigger and configure your nodes and jobs is by default globally discoverable. Discoverable does not mean openly accessible, however: you still need an access key or some other form of authentication to configure nodes and manually trigger batch jobs. Had you chosen Disabled, the endpoint would be removed from public DNS, so it could not be reached publicly even with an access key. Disabling the public endpoint and making it private is a feature of endpoint security. The other option, Selected Networks, restricts access to the Azure Batch account endpoint by IP address range. The restriction is typically applied to VMs within a shared VNet or VMs in a peered VNet.
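To make the authentication requirement concrete, the following is a minimal sketch, not taken from Exercise 6.1, of authenticating against the Batch endpoint with an access key and manually triggering a job. It assumes a recent version of the azure-batch Python SDK (where the client accepts a batch_url parameter); the account name, key, endpoint URL, job ID, and pool ID are placeholder assumptions.

```python
# Sketch: the Batch endpoint may be publicly discoverable, but every call to it
# must still be authenticated, for example with a shared access key.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Placeholder values -- substitute your own account name, key, and endpoint.
ACCOUNT_NAME = "<batch-account-name>"
ACCOUNT_KEY = "<access-key-from-the-portal>"
ACCOUNT_URL = "https://<batch-account-name>.<region>.batch.azure.com"

# Without a valid key (or another accepted credential), requests to the
# endpoint fail even though the endpoint itself resolves publicly.
credentials = SharedKeyCredentials(ACCOUNT_NAME, ACCOUNT_KEY)
batch_client = BatchServiceClient(credentials, batch_url=ACCOUNT_URL)

# Manually trigger a batch job on an existing pool.
job = batchmodels.JobAddParameter(
    id="manual-test-job",
    pool_info=batchmodels.PoolInformation(pool_id="<existing-pool-id>"),
)
batch_client.job.add(job)
```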

When creating backups, keep in mind that they contain exactly the same data as your live production database, so they need to be protected just as carefully as the database itself. Backups are stored in an Azure storage account whose access can be restricted with role-based access control (RBAC). Additionally, backups are encrypted in transit and at rest, so even if a backup were leaked to the public, it is very unlikely that anyone could decrypt and consume the data. There are many aspects to security, many of which will be covered in Chapter 8 in the context of Microsoft Defender for Cloud.
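As an illustration of RBAC-governed access to a backup location, here is a minimal sketch (an assumption, not an example from the text) that lists blobs in a backup container using an Azure AD identity rather than an account key. The storage account URL and container name are placeholders, and the caller is assumed to hold an appropriate role such as Storage Blob Data Reader.

```python
# Sketch: access to the backup container is controlled by RBAC role assignments
# on the storage account, and traffic to the endpoint is encrypted with TLS.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Placeholder account and container names.
ACCOUNT_URL = "https://<storage-account>.blob.core.windows.net"
CONTAINER = "database-backups"

# DefaultAzureCredential resolves to the caller's Azure AD identity; only
# identities granted a suitable role on the account can list or read blobs.
service = BlobServiceClient(account_url=ACCOUNT_URL,
                            credential=DefaultAzureCredential())
container = service.get_container_client(CONTAINER)

# Enumerate the stored backups; the blobs themselves are encrypted at rest.
for blob in container.list_blobs():
    print(blob.name, blob.last_modified)
```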

Design and Create Tests for Data Pipelines

As your pipelines become more complex and the data analytics solution you created becomes more critical to your company, testing also becomes critical. Testing should focus not only on newly created pipelines and data flows but also on changes to existing capabilities. When testing a new pipeline, you must consider performance, latency, and whether the results of the data ingestion and transformation meet expectations, and all three should be validated without any unexpected programming logic exceptions. Exceptions can and often do happen, but you need recovery paths in place to protect your data. When you make updates to existing pipelines, the same points remain valid; in addition, you might have built in upstream and downstream dependencies. Each of those scenarios must be tested to ensure that no issue can corrupt data or delay the processing and delivery of data to consumers. Later chapters provide an in‐depth look at monitoring, troubleshooting, and optimizing your data analytics processing on Azure—specifically, Chapter 9, “Monitoring Azure Data Storage and Processing,” and Chapter 10, “Troubleshoot Data Storage Processing.”
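One way to exercise those checks is an end-to-end test that triggers a pipeline run, enforces a latency budget, and asserts a successful outcome. The sketch below is a hedged illustration, not a prescribed approach: it assumes an Azure Data Factory pipeline, the azure-mgmt-datafactory and azure-identity packages, and placeholder subscription, resource group, factory, pipeline names, and time thresholds.

```python
# Sketch: a pytest-style test that runs a pipeline, waits for completion within
# a latency budget, and fails the test if the run does not succeed.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"
PIPELINE_NAME = "<pipeline-name>"
MAX_RUNTIME_SECONDS = 900  # illustrative latency budget for the test


def test_pipeline_completes_within_budget():
    client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    # Trigger the pipeline and capture the run identifier.
    run = client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME)
    start = time.time()

    # Poll until the run finishes or the latency budget is exceeded.
    while True:
        status = client.pipeline_runs.get(
            RESOURCE_GROUP, FACTORY_NAME, run.run_id
        ).status
        if status not in ("Queued", "InProgress"):
            break
        assert time.time() - start < MAX_RUNTIME_SECONDS, "latency budget exceeded"
        time.sleep(30)

    # Unexpected exceptions inside the pipeline surface here as a failed status.
    assert status == "Succeeded"
```

A complementary test would compare row counts or checksums of the transformed output against expected values, confirming that the results of ingestion and transformation meet expectations rather than only that the run finished.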
