Handle Duplicate Data – Create and Manage Batch Processing and Pipelines

The impact of having multiple identical rows in a table or database depends on your requirements. In most cases, however, duplicate rows are not desired. When you run aggregations that use SUM, AVG, COUNT, and so on, a duplicate can skew the resulting measurement. Therefore, you do want to remove duplicates at some phase in your data transformation pipeline. If you can remove them before ingestion, that is even better, but it is not always achievable. A very important requirement for managing and removing duplicate data is having a way to identify a duplication. Consider the following brain wave–reading JSON file as an example:

[ "ReadingDate": "2021-09-12T09:00:19.1072902+02:00",
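Only the first attribute of the listing appears above. A minimal sketch of such a file, assuming an array of readings that carry only the ReadingDate and Counter attributes discussed below (every value other than the first timestamp is hypothetical, chosen to match the duplicate scenarios described in the next paragraph), might look like this:

[
  { "ReadingDate": "2021-09-12T09:00:19.1072902+02:00", "Counter": 11 },
  { "ReadingDate": "2021-09-12T09:00:20.2183417+02:00", "Counter": 11 },
  { "ReadingDate": "2021-09-12T09:00:21.3291508+02:00", "Counter": 12 },
  { "ReadingDate": "2021-09-12T09:00:21.3291508+02:00", "Counter": 12 }
]

The second reading shares a Counter with the first but has a different ReadingDate, while the last two readings are identical on both attributes; these are the two cases discussed next.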

The data provides a few opportunities for identifying whether a row is duplicated. There is a very precise ReadingDate value, which should not occur more than once in the file. A Counter that attempts to uniquely identify each reading is also included. Notice that the JSON data includes two entries with different ReadingDate values that have the same Counter value of 11. Another two readings share an identical ReadingDate and Counter. Using both of these attributes together enables you to filter out duplicates, whereas using either one individually may not. Should the reading with a different ReadingDate be included in or excluded from the ingestion? Whether these rows are duplicates is a decision made by a data expert, someone who understands how the data is collected. Once you are able to make the functional decision on what constitutes a duplicate, you can write the query or create the transformation instructions to exclude or include it. There are numerous approaches for removing duplicates, some of which are discussed here.
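As a sketch of that kind of check, assuming the readings have already been loaded into a staging table named [brainwaves].[TmpREADING] with ReadingDate and Counter columns (the table name is illustrative, not an actual schema from this chapter), a grouping query can surface candidate duplicates before you decide how to handle them:

-- Find combinations of ReadingDate and Counter that occur more than once
SELECT [ReadingDate], [Counter], COUNT(*) AS Occurrences
FROM   [brainwaves].[TmpREADING]
GROUP BY [ReadingDate], [Counter]
HAVING COUNT(*) > 1;

Rows returned by this query are the candidates the data expert would then classify as true duplicates or as legitimate repeated readings.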

Azure Synapse Analytics

When you load data into an Azure Synapse Analytics dedicated SQL pool table, you will find that primary key and unique constraints are not enforced. If such constraints were enforced and a process attempted to insert a duplicate row, the insertion would fail, which would prevent duplicate data from ever landing in the table. It was determined, however, that extract, transform, and load (ETL) use cases for dedicated SQL pool tables very often include duplicated data, and throwing an error on every duplicate would be very costly in terms of performance and throughput, so that behavior was avoided. The consequence is that you need to take preventive action yourself if the existence of duplicate data impacts your data analytics pipeline and the insights you gather from it.

There are many methods for removing or preventing duplicates, for example, the MERGE SQL command, which was introduced in Chapter 3. Using the MERGE command with the WHEN MATCHED clause results in duplicates being updated rather than inserted. Another approach is to use SQL commands like JOIN or UNION. Both of those scenarios require multiple tables, for example, a fact table and a temporary table; you can then JOIN or UNION the two tables to find the duplicates and remove them. A further approach, available as a feature in both Azure Synapse Analytics and Azure Data Factory, is to use a data flow. Capabilities within a data flow can prevent the duplicate from being ingested at all, which negates the need to use MERGE, JOIN, or UNION to remove data afterward.
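As a sketch of the MERGE approach, assuming a hypothetical fact table [brainwaves].[FactREADING] and a staging table [brainwaves].[TmpREADING] that share ReadingDate, Counter, and a measurement column here called Value (all of these names are illustrative, not the actual schema), the statement below updates rows that already exist and inserts only the new ones:

-- Illustrative only: table and column names are assumptions, not an actual schema
MERGE INTO [brainwaves].[FactREADING] AS target
USING [brainwaves].[TmpREADING] AS source
    ON  target.[ReadingDate] = source.[ReadingDate]
    AND target.[Counter]     = source.[Counter]
WHEN MATCHED THEN
    -- An incoming duplicate overwrites the existing row instead of adding a second copy
    UPDATE SET target.[Value] = source.[Value]
WHEN NOT MATCHED THEN
    -- Genuinely new readings are inserted as usual
    INSERT ([ReadingDate], [Counter], [Value])
    VALUES (source.[ReadingDate], source.[Counter], source.[Value]);

Whether the WHEN MATCHED branch should update, do nothing, or log the duplicate is exactly the functional decision described earlier, and it belongs to whoever understands how the readings are produced.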

Bill Mettler
