The Settings tab of a sink in a data flow includes an attribute named Batch Size. This setting applies in the context of columnar storage and caching, which is the realm of structured file formats such as Parquet. The Batch Size default is determined by platform logic, so unless you have a reason to modify it, leave it empty. The value controls how much data is loaded into memory before being written to the destination datastore, which improves performance because reading data from memory is faster than reading it from disk. Use some caution when choosing the size, and consider the amount of memory you have available: if the value is larger than the available memory, you will get out of memory (OOM) exceptions, and those exceptions will cause the data flow and/or pipeline to fail.
Bulk Loading
Chapter 2 introduced the BULK INSERT statement, which looks something like the following:
BULK INSERT [dbo].[READINGCSV]
FROM 'csharpguitar-brainjammer-pow-PlayingGuitar-0911.csv'
What you did not see in Chapter 2 was the BATCHSIZE argument. If the value is not set, it defaults to zero, which means the entire file is sent to the destination in a single transaction. For file size, performance, or network throttling reasons, you may want or need to change this behavior. If the value is set to, for example, 1000, then that many rows are sent to the destination as a batch, and the transmission continues in batches of 1,000 rows until there is no more data to send.
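To illustrate, the same statement with the BATCHSIZE argument added might look something like the following sketch; any other WITH options your Chapter 2 file requires (for example, field and row terminators) are omitted here:
BULK INSERT [dbo].[READINGCSV]
FROM 'csharpguitar-brainjammer-pow-PlayingGuitar-0911.csv'
WITH (BATCHSIZE = 1000)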
Source Code
There are also techniques for batching database transactions in code. Recall from Chapter 2 that JSON brainjammer brain wave files were loaded into the Azure SQL database. The source code is located in the Chapter02/Source Code directory on GitHub; the file name is JSONToAzureSQL.cs. That code sent an INSERT statement to Azure SQL for each reading, using a loop like the following:
foreach (var reading in brainwaves.Session.POWReading)
This code would be enhanced by including a transaction. Beginning a transaction before the foreach loop and then committing it after the loop completes reduces latency, because the roughly 25 individual inserts are handled by the database as a single unit of work and committed together. That is better than sending roughly 25 INSERT statements to Azure SQL and committing each one individually, which is the behavior without a transaction.
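A minimal sketch of that enhancement follows. It assumes an open SqlConnection named connection (System.Data.SqlClient or Microsoft.Data.SqlClient) and a hypothetical helper named BuildInsertCommand that produces the parameterized INSERT for a single reading; neither name comes from the Chapter 2 code.
using (SqlTransaction transaction = connection.BeginTransaction())
{
    foreach (var reading in brainwaves.Session.POWReading)
    {
        // BuildInsertCommand is a hypothetical helper, not part of JSONToAzureSQL.cs
        SqlCommand command = BuildInsertCommand(reading, connection);
        command.Transaction = transaction;  // enlist the INSERT in the transaction
        command.ExecuteNonQuery();
    }
    // All of the inserts are committed together as a single unit of work
    transaction.Commit();
}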
This is a valid and useful form of batching together SQL statements that decreases both latency and network traffic.
Configure Batch Retention
In Exercise 6.2 you added a Custom activity to an Azure Synapse Analytics pipeline that triggered an Azure Batch job. The Settings tab for the Custom activity includes a property named Retention Time in Days. As you may recall, each time a batch job is triggered, the application package (the source code that performs the work on your data) is copied to the node. In addition to the application package, stderr.txt and stdout.txt log files are written and stored on the node. By default, those files persist on the node for 30 days. Depending on the frequency of changes and what the code in the application package does, the number of files and the amount of space they consume on the node can become significant. Figure 6.54 shows how the files can be viewed via the Azure portal.

FIGURE 6.54 Configure batch retention
If the node runs out of space, the batch job can fail. Therefore, if you need to free up space, you can shorten the retention period of the files: setting the Retention Time in Days property (aka retentionTimeInDays) to a value less than the default makes that space available sooner. The opposite is also true: when you need to keep the data for longer periods due to governance requirements, you can increase the retention period.
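In the pipeline's JSON definition, this setting corresponds to the retentionTimeInDays type property of the Custom activity. The following trimmed sketch shows where the property sits; the activity name, command, and linked service reference are placeholders rather than values from Exercise 6.2:
{
    "name": "CustomActivityName",
    "type": "Custom",
    "linkedServiceName": {
        "referenceName": "AzureBatchLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "command": "cmd /c run.bat",
        "retentionTimeInDays": 7
    }
}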