Snapshot Retention
Replace
This is the default snapshot retention policy. When a snapshot is taken, the newly extracted data from the third-party service (such as Google Analytics or Salesforce CRM) fully replaces the data from all previous snapshots, i.e., all historic data is erased. In SQL, this operation is equivalent to a TRUNCATE followed by an INSERT.
Existing data within the source
Date | Sessions | Pageviews |
---|---|---|
2019-01-01 | 51 | 123 |
Data in the newly taken snapshot
Date | Sessions | Pageviews |
---|---|---|
2019-01-02 | 47 | 102 |
Source content after a new snapshot is taken
Date | Sessions | Pageviews |
---|---|---|
2019-01-02 | 47 | 102 |
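Using the example above, the Replace policy corresponds roughly to the following SQL sketch (the table name source_data is hypothetical; a Source's actual storage is managed internally):

```sql
-- Erase all previously snapshotted rows, then load only the new snapshot.
TRUNCATE TABLE source_data;

INSERT INTO source_data (date, sessions, pageviews)
VALUES ('2019-01-02', 47, 102);
```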
Append
Each time a snapshot is taken, the newly extracted data is merged with the data already existing in the Source (a merge of all previous snapshots). This is particularly useful when long-term data collection is required or when data in the external system lasts only for a limited time (e.g., Instagram Stories analytics data is available for only 24 hours). In SQL, this operation is equivalent to an INSERT.
Existing data within the source
Date | Sessions | Pageviews |
---|---|---|
2019-01-01 | 51 | 123 |
Data in the newly taken snapshot
Date | Sessions | Pageviews |
---|---|---|
2019-01-02 | 47 | 102 |
Source content after a new snapshot is taken
Date | Sessions | Pageviews |
---|---|---|
2019-01-01 | 51 | 123 |
2019-01-02 | 47 | 102 |
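Again as a rough SQL sketch (the same hypothetical source_data table as above), the Append policy keeps existing rows and simply adds the new snapshot alongside them:

```sql
-- Existing rows are kept; the new snapshot is inserted next to them.
INSERT INTO source_data (date, sessions, pageviews)
VALUES ('2019-01-02', 47, 102);
```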
Choosing the Right Snapshot Retention
Destination: Data warehouse, Database, Fileserver
When the destination of your Data flow is a storage system like BigQuery, Redshift, or MySQL, the right snapshot retention setup in 99% of cases is Replace (last snapshot only). This ensures that each time a snapshot is taken, only its content (the most recent data) is inserted into the storage via the Data flow. Setting the policy to Append (merge all snapshots) will bloat your storage with duplicates.
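If you suspect the wrong policy has already produced duplicates in your warehouse, a quick check like this can reveal them (table and column names are hypothetical and depend on your destination schema):

```sql
-- Dates loaded more than once indicate snapshots were appended
-- rather than replaced before being written to the destination.
SELECT date, COUNT(*) AS copies
FROM warehouse_table
GROUP BY date
HAVING COUNT(*) > 1;
```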
Destination: Dashboarding app
When you are using a direct connection to dashboarding apps like Google Data Studio, Power BI, or Tableau via a Data flow, the right setup is less clear-cut and strongly depends on the nature of the data and your dashboard configuration.
Time-series data
When you plan a long-term collection of time-series data, Append (merge all snapshots) is usually the right setup. It is particularly useful when only "lifetime" values of the metrics are available (e.g., the Facebook Post connector), allowing you to take daily snapshots of the values and attach them to the time series. Even though the connector cannot obtain historic data for you, you can gradually build your own time series with this setup, as the sketch below shows.
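Once such snapshots accumulate, daily increments can be derived downstream from the lifetime totals, for example with a window function. A minimal sketch, assuming a hypothetical post_snapshots table with a snapshot_date column and a lifetime_views metric:

```sql
-- Turn appended "lifetime" snapshots into a daily time series by
-- subtracting each day's total from the previous day's total.
SELECT
  snapshot_date,
  lifetime_views - LAG(lifetime_views) OVER (ORDER BY snapshot_date)
    AS daily_views
FROM post_snapshots
ORDER BY snapshot_date;
```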
Some connectors provide data with daily granularity (e.g., the Facebook Page connector) and can therefore obtain historic data as well. In such cases, use Append in conjunction with the Yesterday date range; the system will then automatically attach "yesterday's" data to the time series each day.
Most recent values
When you only need the most recent values, e.g., the total number of views of your YouTube videos to date or the value of your Shopify orders in the current month, use the Replace (last snapshot only) retention.
Limitations
Though Sources can act as permanent data storage, they were not designed as a replacement for dedicated data warehousing solutions like BigQuery, Snowflake, or Redshift. However, in the vast majority of cases (such as when you need to stream data directly to BI/dashboarding apps like Data Studio, Power BI, or Tableau), they provide great flexibility without the need to provision a dedicated data storage system. As a rule of thumb, when you expect a source to require more than 0.5M rows, we strongly recommend streaming the data via Data flows directly to external data storage.