Azure Databricks
When you perform data analytics in Azure Databricks, you are using an Apache Spark cluster. It is also possible to use Apache Spark pools in Azure Synapse Analytics, as you did in Exercise 6.3, where you created and ran a Spark Job Definition activity. In Exercise 6.4 you created a notebook named IdentifyBrainwaveScenario in an Azure Databricks workspace; the notebook imported some brain wave readings, in Avro format, from a blob container. The file contained brain wave readings that had been filtered by predetermined frequency ranges for a given session, like ClassicalMusic. The data was converted to Delta format, and the results were rendered to the output window. This notebook, like the Spark job, was then triggered manually from an Azure Synapse Analytics pipeline. You now know, therefore, that it is possible to trigger a notebook that exists and runs on an Azure Databricks Apache Spark cluster. When the notebook is triggered from Azure Synapse Analytics, the available trigger types are scheduled, tumbling window, and custom event. Azure Databricks includes some scheduling capabilities of its own, but they are not as sophisticated as those in Azure Synapse Analytics. You have read a bit about these capabilities and have seen some of the features in Figure 6.16 and Figure 6.19; in all cases, however, the execution of the notebook was manual. Complete Exercise 6.8, where you will schedule the execution of an Azure Databricks notebook using a workflow job. To successfully complete this exercise, you must have already completed Exercise 6.4.
- Download the Jupyter notebook file named IdentifyBrainwaveScenario.ipynb from the Chapter06/Ch06Ex04 directory on GitHub: https://github.com/benperk/ADE.
- Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Databricks workspace you created in Exercise 3.14 ➢ click the Launch Workspace button on the Overview blade ➢ select Workspace from the Azure Databricks Workspace navigation menu ➢ select the down arrow next to your user ID ➢ select Import from the menu ➢ and then upload the Jupyter file.
- Select Workflows from the navigation menu ➢ click the Create Job button ➢ enter a task name (I used IdentifyBrainwaveScenario) ➢ leave Notebook as the Type and Workspace as the Source ➢ select the notebook you imported from the Path text box pop‐up menu ➢ select the edit icon to the right of Shared_job_cluster in the Cluster text box ➢ expand the Advanced options ➢ enter the following syntax into the Spark configuration section ➢ and then replace the access key with the one for your Azure Blob Storage account, as you did in Exercise 6.4.
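The configuration value itself is not reproduced here. As a minimal sketch, assuming the notebook reads from the storage account over the wasbs driver as in Exercise 6.4, a cluster Spark configuration entry that grants access with an account key takes the following form, where <storage-account-name> and <access-key> are placeholders for your own values:

```
fs.azure.account.key.<storage-account-name>.blob.core.windows.net <access-key>
```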
- Consider reducing the cluster size and number of instances to save costs ➢ click Confirm ➢ and then click Create.
- Click the Add Schedule button ➢ select the Scheduled radio button from the Trigger Type section ➢ check the Show Cron Syntax check box ➢ and then enter the following syntax. Figure 6.41 shows the configuration.
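The exact expression is not reproduced here. Note that Databricks job schedules use Quartz cron syntax, whose fields are second, minute, hour, day of month, month, and day of week. As an illustration, the following hypothetical expression runs the job every day at 8:00 AM; adjust it to whatever schedule you prefer:

```
0 0 8 * * ?
```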

FIGURE 6.41 Azure Databricks scheduled trigger
- Click the Save button ➢ select Workflows from the navigation menu ➢ and then select the job you just created. After a few minutes, you will see something similar to Figure 6.42.

FIGURE 6.42 Azure Databricks scheduled trigger log
- Click the Pause button to stop the job.
As mentioned, the capabilities for scheduling jobs in Azure Databricks are basic; only manual and scheduled execution is supported. If you need more sophisticated scheduling capabilities, such as tumbling window or custom event triggers, you can trigger the Azure Databricks notebook from an Azure Synapse Analytics pipeline.
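As a rough sketch of what that looks like, a Synapse pipeline runs the notebook through a Databricks Notebook activity that references an Azure Databricks linked service; any of the pipeline trigger types can then be attached to that pipeline. The activity name, notebook path, and linked service name below are placeholders, not values from the exercises:

```
{
  "name": "RunIdentifyBrainwaveScenario",
  "type": "DatabricksNotebook",
  "typeProperties": {
    "notebookPath": "/Users/<your-user-id>/IdentifyBrainwaveScenario"
  },
  "linkedServiceName": {
    "referenceName": "<your-databricks-linked-service>",
    "type": "LinkedServiceReference"
  }
}
```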