PolyBase is a powerful technology that works behind the scenes to enable the execution of SQL queries against data files, for example, files in CSV format. In the Azure Synapse Analytics context, you use the PolyBase technology on serverless SQL pools that host external tables (refer to Table 3.7). Chapter 2 explained CETAS (CREATE EXTERNAL TABLE AS SELECT), which is the approach for creating external tables. Refer to Figure 2.12 and Exercise 4.11, where you created an external table named SampleBrainwaves on the built-in serverless SQL pool that targeted a Parquet file. Figure 4.33 illustrates how this looks in Azure Synapse Analytics, along with the result set of the query that used PolyBase to render it.
Read from and Write to a Delta Table
As mentioned in Chapter 3, a delta lake is a storage and management software layer that provides numerous benefits, including ACID transactions, time travel, and upserts. Delta tables reside within your delta lake and benefit from that software layer. In Exercise 3.15 and Exercise 5.4, you created, populated, and queried delta tables by executing Python code similar to the following snippet:
# Read the CSV file into a DataFrame
df = spark.read \
  .option("header", "true").csv("/FileStore/tables/brainwavesMeditation.csv")
# Write the DataFrame to storage in delta format
df.write.mode("overwrite").format("delta").save("/FileStore/data/2022/03/14")
# Load the delta files into a new DataFrame, then display and count the rows
brainwaves = spark.read.format("delta").load("/FileStore/data/2022/03/14")
display(brainwaves)
print(brainwaves.count())
The code first reads a CSV file stored on the Spark cluster's file system into a DataFrame, and then writes that DataFrame to storage in delta format. Passing delta to the format() method instructs the platform to organize the data accordingly. Once the data is written in delta format, you can use the following code snippet to create a delta table over those files and query it:
# Create a delta table over the delta files written earlier, then query it
display(spark \
  .sql("CREATE TABLE BRAINWAVES USING DELTA LOCATION '/FileStore/data/2022/03/14'"))
display(spark.table("BRAINWAVES").select("*").limit(10))
print(spark.table("BRAINWAVES").select("*").count())
The SQL statement utilizes the USING clause along with the DELTA keyword to instruct the platform to create the table in delta format. You can then select and manipulate the data using standard SQL commands and queries.
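For example, the following is a minimal sketch of a standard SQL aggregation and a delta time travel read, which ties back to the time travel capability mentioned earlier. It assumes the BRAINWAVES table and delta location created previously; the SCENARIO column is hypothetical and only for illustration.
# Aggregate the delta table with standard SQL (SCENARIO is a hypothetical column)
display(spark.sql("SELECT SCENARIO, COUNT(*) AS READINGS FROM BRAINWAVES GROUP BY SCENARIO"))
# Time travel: read the first version (version 0) of the delta files written earlier
previousVersion = spark.read.format("delta").option("versionAsOf", 0).load("/FileStore/data/2022/03/14")
print(previousVersion.count())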
Manage Batches and Pipelines
As you work through the exercises in this book, you might have trouble remembering what exactly each pipeline or activity does. This is not uncommon and is why good documentation and processes to support the management of your batch jobs and pipelines are necessary to keep your data analytics solution healthy. Figures 6.11, 6.15, and 6.18, and diagrams in general, were created to provide a reference to what is happening in a pipeline and how it all links together. Writing pseudocode or documenting the sequential steps a batch job performs is a useful technique for helping other team members understand what is happening. For example, the brainjammer-batch.exe batch job does the following (an illustrative code sketch follows the list):
- Retrieves input from the command line
- Builds the connection object for the ADLS container
- Connects to the ADLS container and retrieves the files
- Loops through each brainjammer brain wave session file
- Calculates a single median frequency value for each electrode
- Uploads a new JSON file
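To illustrate what those steps might look like in code, here is a minimal, hypothetical Python sketch. It is not the actual implementation (the real batch job is the C# Program.cs file referenced in the next paragraph); the command-line arguments, folder layout, JSON field names (readings, electrode, frequency), and authentication via DefaultAzureCredential are all assumptions, and it relies on the azure-storage-file-datalake and azure-identity packages.
import argparse
import json
import statistics

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# 1. Retrieve input from the command line (argument names are hypothetical)
parser = argparse.ArgumentParser()
parser.add_argument("--account-url", required=True)   # e.g., https://<account>.dfs.core.windows.net
parser.add_argument("--container", required=True)
parser.add_argument("--input-path", required=True)
parser.add_argument("--output-path", required=True)
args = parser.parse_args()

# 2. Build the connection object for the ADLS container
service = DataLakeServiceClient(account_url=args.account_url, credential=DefaultAzureCredential())
container = service.get_file_system_client(file_system=args.container)

# 3. Connect to the ADLS container and retrieve the session files
session_files = [p for p in container.get_paths(path=args.input_path) if not p.is_directory]

# 4. Loop through each brainjammer brain wave session file
for path in session_files:
    session = json.loads(container.get_file_client(path.name).download_file().readall())

    # 5. Calculate a single median frequency value for each electrode
    #    (assumes each reading contains an electrode name and a frequency value)
    values_by_electrode = {}
    for reading in session.get("readings", []):
        values_by_electrode.setdefault(reading["electrode"], []).append(reading["frequency"])
    medians = {electrode: statistics.median(values) for electrode, values in values_by_electrode.items()}

    # 6. Upload a new JSON file containing the per-electrode median values
    output_file = container.get_file_client(f"{args.output_path}/{path.name.split('/')[-1]}")
    output_file.upload_data(json.dumps(medians), overwrite=True)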
Reading these six steps takes less time than reviewing the source code to get an idea of what is going on in the executable. The source code is in the Program.cs file in the Chapter06/Ch06Ex01 directory. Managing your data analytics solution requires, in many respects, more effort than designing and building it. Therefore, it behooves you to build management capabilities into your solution. The following sections discuss some additional techniques for managing your batches and pipelines.