Handling Duplicates in Azure Blob Storage Extracts for Zuora Data

Explains how to handle duplicate records in Azure Blob Storage extracts of Zuora data.

Zuora exports to Azure Blob Storage are append-only: an update to a record is written as a new row in the Parquet files rather than overwriting the existing row. As a result, querying these files directly can return several rows for the same record, which look like duplicates.

Duplicates appear when multiple updates occur for the same subscription.id between extract runs. Each update produces its own row, so multiple rows exist for that subscription.id.

This is expected behavior and does not indicate errors or data loss.

Dealing with duplicates

Deduplicate before querying:

Apply merge logic based on the primary key (subscription.id).
You can apply merge logic by using Azure Synapse, Spark, or other query engines on top of Blob Storage.
Use views or queries to keep only the latest version of each record.
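
The "keep only the latest version" step above can be sketched in plain Python. This is a minimal illustration, not Zuora's or Azure's implementation: the column names id and updated_date, the sample rows, and the helper latest_versions are assumptions made for the example, so substitute whichever key and last-modified columns your extract actually contains.

```python
from datetime import datetime


def latest_versions(rows, key="id", version="updated_date"):
    """Return one row per key, keeping the row with the greatest version value.

    `key` and `version` are hypothetical column names chosen for this sketch.
    """
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # ">=" means that on a timestamp tie, the row appended later wins.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return list(latest.values())


# Made-up rows standing in for what an append-only extract might contain:
# sub-1 was updated once, so it appears twice.
rows = [
    {"id": "sub-1", "status": "Draft", "updated_date": datetime(2024, 1, 1, 9, 0)},
    {"id": "sub-2", "status": "Active", "updated_date": datetime(2024, 1, 1, 9, 5)},
    {"id": "sub-1", "status": "Active", "updated_date": datetime(2024, 1, 2, 14, 30)},
]

deduped = latest_versions(rows)
```

In a SQL-based engine such as Synapse or Spark SQL, the same idea is commonly written as a view that ranks rows with ROW_NUMBER() partitioned by the key and ordered by the last-modified column descending, then keeps only rank 1.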

Feed into a downstream warehouse (if available):

Load the Parquet files into Synapse, Snowflake, or Redshift.
Perform upsert/merge to produce a clean, single-version view.
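
The upsert pattern can be illustrated with Python's built-in sqlite3 module. SQLite is used here only as a local stand-in so the sketch runs anywhere; Synapse, Snowflake, and Redshift each have their own MERGE or upsert syntax. The table name subscription, the columns status and updated_date, the helper load_batch, and the sample rows are all assumptions for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the downstream warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscription (id TEXT PRIMARY KEY, status TEXT, updated_date TEXT)"
)

# Insert new keys; for existing keys, overwrite only when the incoming row is newer.
# ISO-8601 timestamps of the same format compare correctly as text.
UPSERT = """
INSERT INTO subscription (id, status, updated_date)
VALUES (:id, :status, :updated_date)
ON CONFLICT(id) DO UPDATE SET
    status = excluded.status,
    updated_date = excluded.updated_date
WHERE excluded.updated_date > subscription.updated_date
"""


def load_batch(conn, rows):
    """Apply one batch of extracted rows to the target table in a single transaction."""
    with conn:
        conn.executemany(UPSERT, rows)


# Made-up rows standing in for the contents of the Parquet files.
batch = [
    {"id": "sub-1", "status": "Draft", "updated_date": "2024-01-01T09:00:00"},
    {"id": "sub-1", "status": "Active", "updated_date": "2024-01-02T14:30:00"},
    {"id": "sub-2", "status": "Active", "updated_date": "2024-01-01T09:05:00"},
]
load_batch(conn, batch)
load_batch(conn, batch[:1])  # replaying an older row leaves the newer version intact

result = conn.execute("SELECT id, status FROM subscription ORDER BY id").fetchall()
```

Guarding the update with a timestamp comparison keeps the load idempotent: reprocessing an older file does not roll a record back to a stale version.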

Always apply deduplication or merge logic before consumption to ensure data correctness in analytics.