Snapshot retention

Each Source can be configured either to keep only the latest snapshot or to merge all snapshots together each time a snapshot is taken. Choosing the right policy depends on the destination connected via Data Flow.

Last snapshot only

This is the default snapshot retention policy. When a snapshot is taken, newly extracted data from the third-party service (such as Google Analytics or Salesforce CRM) fully replaces the data from all previous snapshots; all historic data is erased. In SQL, this operation is equivalent to TRUNCATE followed by INSERT.

Existing data within the source

Date         Sessions   Pageviews
2019-01-01   51         123


Data in the newly taken snapshot

Date         Sessions   Pageviews
2019-01-02   47         102


Source content after a new snapshot is taken

Date         Sessions   Pageviews
2019-01-02   47         102
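The replace semantics above can be sketched with a small Python/sqlite3 example. The table and column names are illustrative only, not part of the product; SQLite has no TRUNCATE statement, so an unqualified DELETE plays the same role:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (date TEXT, sessions INTEGER, pageviews INTEGER)")

# Existing data from a previous snapshot
conn.execute("INSERT INTO source VALUES ('2019-01-01', 51, 123)")

# "Last snapshot only": erase all historic data, then insert the new snapshot
# (TRUNCATE followed by INSERT in standard SQL terms).
conn.execute("DELETE FROM source")
conn.execute("INSERT INTO source VALUES ('2019-01-02', 47, 102)")

print(conn.execute("SELECT * FROM source").fetchall())
# → [('2019-01-02', 47, 102)]
```

Only the row from the newest snapshot survives, matching the tables above.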

Merge all snapshots

Each time a snapshot is taken, its content is merged with the content already existing in the Source (a merge of all previous snapshots). This is particularly useful when long-term data collection is required or when data in the external system is retained only for a limited time (e.g., Instagram Stories analytics data lasts only 24 hours). In SQL, this operation is equivalent to INSERT.

Existing data within the source

Date         Sessions   Pageviews
2019-01-01   51         123


Data in the newly taken snapshot

Date         Sessions   Pageviews
2019-01-02   47         102


Source content after a new snapshot is taken

Date         Sessions   Pageviews
2019-01-01   51         123
2019-01-02   47         102
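The merge semantics can be sketched the same way (again with illustrative table and column names): the new snapshot is simply appended with a plain INSERT, so every previous snapshot is preserved:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (date TEXT, sessions INTEGER, pageviews INTEGER)")

# Existing data from a previous snapshot
conn.execute("INSERT INTO source VALUES ('2019-01-01', 51, 123)")

# "Merge all snapshots": the new snapshot is appended (a plain INSERT),
# so data from all previous snapshots remains in the Source.
conn.execute("INSERT INTO source VALUES ('2019-01-02', 47, 102)")

print(conn.execute("SELECT * FROM source ORDER BY date").fetchall())
# → [('2019-01-01', 51, 123), ('2019-01-02', 47, 102)]
```

Both rows survive, matching the tables above.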

Choosing the right snapshot retention

Destination: Data warehouse, Database, Fileserver

When the destination of your Data Flow is storage like BigQuery, Redshift, or MySQL, the right snapshot retention setup in 99% of cases is Last snapshot only. This ensures that each time a snapshot is taken, only its content (the most current data) is inserted into the storage via the Data Flow. Setting the policy to Merge all snapshots will bloat your storage with duplicates.

Destination: Dashboarding app

When you are using a direct connection to dashboarding apps like Google Data Studio, PowerBI, or Tableau via Data Flow, the right setup is less clear-cut and depends strongly on the nature of the data and your dashboard configuration.

Timeseries data

When you plan long-term collection of time-series data, Merge all snapshots is usually the right setup. It is particularly useful when only "lifetime" values of the metrics are available (e.g., the Facebook Post connector), allowing you to take daily snapshots of the values and append them to the time series. Although the connector cannot obtain historic data for you, you can gradually build your own time series using this setup.

Some connectors provide data with daily granularity (e.g., the Facebook Page connector) and can therefore obtain historic data as well. In such cases, use Merge all snapshots in conjunction with the Yesterday date range. Each day, the system will automatically append "yesterday's" data to the time series.
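Conceptually, the daily job behaves like this sketch. The `fetch_yesterday` helper is hypothetical; a real connector would query the third-party API for the Yesterday date range instead of returning fixed values:

```python
from datetime import date, timedelta

def fetch_yesterday(today):
    # Hypothetical stand-in for a connector extraction with the "Yesterday"
    # date range; the metric values here are fixed for the example.
    yesterday = today - timedelta(days=1)
    return {"date": yesterday.isoformat(), "sessions": 47, "pageviews": 102}

# The Source already holds previously merged snapshots (the growing time series).
source = [{"date": "2019-01-01", "sessions": 51, "pageviews": 123}]

# With "Merge all snapshots", each daily run appends yesterday's row
# instead of replacing the whole Source.
source.append(fetch_yesterday(date(2019, 1, 3)))

print([row["date"] for row in source])
# → ['2019-01-01', '2019-01-02']
```

Run daily, this pattern extends the series by one row per day, gradually building the history that the connector alone could not backfill.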

Most recent values

When you plan to see only the most recent values, e.g. the total number of views of your YouTube videos to date or the value of your Shopify orders in the current month, use the Last snapshot only retention.

Limitations

Although Sources can act as permanent data storage, they were not designed as a replacement for dedicated data warehousing solutions like BigQuery, Snowflake, or Redshift. However, in the vast majority of cases (such as streaming data directly to BI/dashboarding apps like Data Studio, PowerBI, or Tableau), they provide great flexibility without the need to provision dedicated data storage systems. As a rule of thumb, when you expect a source to require more than 0.5M rows, we strongly recommend streaming the data via Data Flows directly to an external data storage system.