Handling Duplicates in Google Cloud Storage Extracts for Zuora Data

Explains how to handle duplicates in GCS extracts for Zuora Data

Zuora exports to Google Cloud Storage (GCS) are append-only: when a record is updated, the new version is appended to the Parquet files and the earlier rows remain in place. As a result, querying these files directly can return rows that look like duplicates.

Duplicates appear because multiple updates to the same record between extracts result in multiple rows with the same primary key.

This is expected behavior and does not indicate errors or data loss.

Dealing with duplicates

Deduplicate before querying:

Merge records on the primary key (for example, subscription.id) so that only the latest version of each record is retained.
You can do this with BigQuery external tables, Spark, or another query engine on top of GCS.
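
The "keep only the latest version" rule can be sketched in plain Python as follows. This is a minimal illustration, not Zuora's or any query engine's implementation: the column names id and updated_date and the sample rows are assumptions made for the example, so substitute the primary key and last-updated column that your extract actually contains.

```python
from datetime import datetime


def keep_latest(rows, key="id", version="updated_date"):
    """Return one row per key, keeping the row with the greatest version value.

    `key` and `version` are hypothetical column names chosen for this sketch.
    On a tie, the row that appears later in the input wins.
    """
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical rows as they might appear after two updates to "sub-1".
rows = [
    {"id": "sub-1", "status": "Draft", "updated_date": datetime(2024, 1, 1)},
    {"id": "sub-2", "status": "Active", "updated_date": datetime(2024, 1, 1)},
    {"id": "sub-1", "status": "Active", "updated_date": datetime(2024, 1, 2)},
]

deduped = keep_latest(rows)
for row in deduped:
    print(row["id"], row["status"], row["updated_date"].date())
```

In a SQL or Spark engine the same idea is usually expressed by ranking rows within each primary key by the update timestamp and keeping the top-ranked row.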

Feed into a downstream warehouse (if available):

Load Parquet files into BigQuery, Redshift, or Snowflake.
Perform upsert/merge to maintain a clean, single-version view.
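
The upsert step can be sketched as follows. An in-memory SQLite database stands in for the warehouse purely so the example runs locally; BigQuery, Redshift, and Snowflake each have their own merge syntax, so treat the SQL as illustrative. The table name subscription, its columns, and the sample batches are assumptions made for the example.

```python
import sqlite3

# Insert a new row, or overwrite the existing row for the same id, but only
# when the incoming version is at least as recent as the stored one.
# Table and column names are hypothetical.
UPSERT_SQL = """
INSERT INTO subscription (id, status, updated_date)
VALUES (?, ?, ?)
ON CONFLICT(id) DO UPDATE SET
    status = excluded.status,
    updated_date = excluded.updated_date
WHERE excluded.updated_date >= subscription.updated_date
"""


def merge_batch(conn, rows):
    """Upsert one extracted batch of (id, status, updated_date) tuples."""
    with conn:  # commit on success, roll back on error
        conn.executemany(UPSERT_SQL, rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscription ("
    "id TEXT PRIMARY KEY, status TEXT, updated_date TEXT)"
)

# First extract: two new records.
merge_batch(conn, [
    ("sub-1", "Draft", "2024-01-01T00:00:00"),
    ("sub-2", "Active", "2024-01-01T00:00:00"),
])
# Second extract: a newer version of sub-1 and a stale version of sub-2.
merge_batch(conn, [
    ("sub-1", "Active", "2024-01-02T00:00:00"),
    ("sub-2", "Draft", "2023-12-31T00:00:00"),
])

final_rows = conn.execute(
    "SELECT id, status, updated_date FROM subscription ORDER BY id"
).fetchall()
print(final_rows)
```

Comparing timestamps inside the upsert keeps the table correct even if batches are loaded out of order. The timestamps here are ISO 8601 strings in a single format, which is what makes the string comparison safe.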

Because the export to GCS is append-only, deduplication or merge logic is required before the data can be used for accurate analytics.