Bulk data extraction from Zuora using AQuA

Last updated
Save as PDF

This article describes the best practices for using AQuA to extract data from Zuora and to load the data into an external data warehouse.

The following steps are recommended:

Initial data extract:
1. Query historical data
2. Load historical data into the external data warehouse
Incremental data extracts:
1. Query changed data
2. Load changed data into the external data warehouse

Best practices when querying data from Zuora

To extract data using AQuA, Zuora recommends the following steps:

Query each data source individually
Batch multiple queries into a single job
Filter by UpdatedDate only
Use AQuA stateless queries and track your own high water mark
Extract deleted data in each query

Query each data source individually

When extracting data from Zuora with AQuA, Zuora recommends that you avoid querying fields from the joined objects of a data source. You should restrict your query to only include the fields from the base object of the data source and the IDs of the parent objects. Once loaded into your external repository, you can use the query language of the external repository to reconstruct the dependent relationships.

For example:

Not recommended	Recommended
SELECT id, rateplancharge.id, rateplan.id AS rateplanid, tier, subscription.status FROM RatePlanChargeTier WHERE ...	SELECT id, rateplancharge.id, ... FROM RatePlanChargeTier WHERE ... SELECT id, rateplan.id, ... FROM RatePlanCharge WHERE ... SELECT id, subscription.id, ... FROM RatePlan WHERE ...

Not recommended

Recommended

SELECT id,
       rateplancharge.id,
       rateplan.id AS rateplanid,
       tier,
       subscription.status
FROM RatePlanChargeTier
WHERE ...

SELECT id,
       rateplancharge.id,
       ...
FROM RatePlanChargeTier
WHERE ...

SELECT id,
       rateplan.id,
       ...
FROM RatePlanCharge
WHERE ...

SELECT id,
       subscription.id,
       ...
FROM RatePlan
WHERE ...

Batch multiple queries into a single job

AQuA allows you to submit multiple queries in a single job. All queries in a job are performed in a single database transaction, against a consistent snapshot of your transactional data. This ensures referential integrity between data sources even though you are querying each data source individually.

Filter by UpdatedDate only

If you include either of the following column types in a WHERE clause, the query may require a full scan of your data:

An unindexed column
A column in a join table

If you include either of these column types in a WHERE clause, the response time is unpredictable.

The UpdatedDate column is indexed in all data sources. By filtering on the UpdatedDate column, Zuora's system can efficiently extract just the rows required by the query without scanning a full table.

You can add needed columns to perform additional filtering after you load data into your external data warehouse.

For example:

Not recommended ( filter with un-indexed columns) Recommended (filter with the indexed column - UpdatedDate)

Not recommended ( filter with un-indexed columns)	Recommended (filter with the indexed column - `UpdatedDate`)
SELECT ... FROM RatePlanChargeTier WHERE RatePlanCharge.Name = 'Discount' AND (Subscription.Status = 'Active' OR Subscription.Status = 'Cancelled') AND RatePlanCharge.EffectiveEndDate > 'today - 2 day'	SELECT ... FROM RatePlanChargeTier WHERE UpdatedDate >= {startTime} and UpdatedDate < {endTime}

SELECT 
...
FROM RatePlanChargeTier
WHERE RatePlanCharge.Name = 'Discount'
AND   (Subscription.Status = 'Active' OR Subscription.Status = 'Cancelled')
AND   RatePlanCharge.EffectiveEndDate > 'today - 2 day'

SELECT 
...
FROM RatePlanChargeTier
WHERE UpdatedDate >= {startTime} and UpdatedDate < {endTime}

Use AQuA stateless queries and track your own high water mark

While AQuA supports stateful queries, customers prefer to use the high water mark approach described in this article. Zuora highly recommends that you use the stateless mode for bulk data extraction. This is the approach that Zuora plans to evolve in future APIs.

Use AQuA stateless queries. Make sure that you set version to 1.0 in your AQuA request. See Stateless and Stateful Modes for more information.
Track your own high water mark for data replication.

The high water mark is a timestamp that specifies the latest update time for each extraction. That is, the WHERE clause for every query should be in the following format:

WHERE UpdatedDate >= {last high water mark} and UpdatedDate < {new high water mark}

Each time you run your query job, set the new high water mark to at least 2 hours earlier than the current time. It might take a long time for some transactions to be completed. By offsetting the high water mark, you will be sure to include any updates that have not been completed at the time your query is executed.

For example, if the current time is 2019-05-11T06:30:00, you can issue a query with the following WHERE clause:

WHERE UpdatedDate >= '2019-05-10T04:00:00' AND UpdatedDate < '2019-05-11T04:00:00'

For the initial extract, choose a date earlier than you started using Zuora as the last high water mark, such as 2000-01-01.

Extract deleted data in each query

For each query, make sure that you set the forceExport to true within deleted.

See the following example for a batch of queries in stateless mode:

{
        "format": "csv",
        "version": "1.0",
        "name": "replication-client",
        "useQueryLabels": true,
        "encrypted": "none",
        "queries": [
            {
                "name": "query",
                "query": "select Id from account where UpdatedDate >= '2018-01-03T00:00:00' and UpdatedDate < '2018-01-04T00:00:00'",
                "type": "zoqlexport",
                "apiVersion": "96.0",
                "deleted": {
                    "column": "is_deleted",
                    "format": "Boolean",
                    "forceExport": true
                }
            },
            {
                "name": "query",
                "query": "select Id from subscription where UpdatedDate >= '2018-01-03T00:00:00' and UpdatedDate < '2018-01-04T00:00:00'",
                "type": "zoqlexport",
                "apiVersion": "96.0",
                "deleted": {
                    "column": "is_deleted",
                    "format": "Boolean",
                    "forceExport": true
                }
            }
        ]
}

Best practices when loading data into your data warehouse

You can load data into your data warehouse using an initial historical extract and load, followed by incremental data updates using AQuA.

Each of your incremental extractions only contains records that were created, updated or deleted during the time interval specified by the WHERE clause of each query. To load the extracted data into your data warehouse, you must merge the incremental data with previously loaded data.

Query and staging tables

To perform the merge, you need two tables for each Zuora data source in your data warehouse:

A staging table for loading the incremental data. Each time you load an incremental extraction, you must truncate this table and insert the incremental data into this table.
A query table containing the full replica of the data source. Data from the staging table is merged into the query table as part of each load operation.

Naming convention of the query and staging tables

Define a consistent naming convention of the query and staging tables for all the data sources. For example:

Data source	Query table name	Staging table name
Account	account	account_staging
InvoiceItem	invoiceitem	invoiceitem_staging

Table schema

To facilitate data loading, Zuora recommends that:

The query and staging tables of each data source should have exactly the same schema
You should use exactly the same column order in both the table schema and the ZOQL query
The first column of the query or staging table should be of boolean or integer type and named is_deleted, to capture rows associated with objects that have been deleted in Zuora

Data loading steps

To complete a single load operation of a data source, Zuora recommends the following steps:

Delete all data from the staging table
Load incremental batch into the staging table
Delete from the query table any row that has the same Id as a row in the staging table
Insert into the query table all the rows whose is_deleted is false in the staging table

Zuora recommends that you script all the steps for all data sources so that you can perform them in a single transaction. In this way, your data warehouse clients always query against a consistent snapshot of the complete data set.

You can keep track of how fresh the data in the data warehouse is by maintaining a data freshness table of data sources and their last update dates.

Delete data from staging table

Use the following SQL syntax to delete all data from the staging table:

TRUNCATE account_staging

Note that TRUNCATE is preferable to DELETE because it can free up system resources.

Load incremental batch into staging table

When loading the data source records in the incremental batch into the staging table, Zuora recommends that you use a performant batch load technology supported by your database rather than issuing an INSERT statement row by row.

Note that is_deleted must be one column of the table, and one CSV field returned by AQuA. This helps you correctly handle deleted records.

Delete changed or deleted data from query table

In the query table, delete any row that has the same Id as a row in the staging table. In other words, you should delete all the records of the objects that were changed or deleted during the query interval of this incremental batch.

See the following SQL syntax:

DELETE FROM account WHERE id IN (SELECT id FROM account_staging)

Insert new and updated records

In the query table, insert all the new and updated records from the staging table. Make sure that you exclude the deleted records in the staging table.

See the following SQL syntax:

INSERT INTO account (SELECT * FROM account_staging WHERE deleted = false)

Bulk data extraction using other Zuora APIs

Note that Zuora does not recommend using Zuora Data Query API for bulk extraction use cases. Data Query API is intended for operational use cases needing a limited amount of near real-time data.

In the future, Zuora is planning to develop a Bulk Query API for bulk extraction use cases. Bulk Query API will enable customers to extract data from a warehouse that has a delayed copy of Zuora data.