Data Sampling
- 2 Minutes to read
In data analysis, sampling is the practice of analyzing a subset of all data in order to uncover meaningful information in the larger data set.
Data sampling is a general statistical technique, but you are most likely to have encountered it in Google Analytics (Universal Analytics and Google Analytics 4) and Google Search Console.
For example, to estimate the number of trees in a 100-acre area where the trees are distributed fairly uniformly, you could count the trees in 1 acre and multiply by 100, or count the trees in half an acre and multiply by 200. Either way, you get a reasonable estimate for the entire 100 acres.
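The tree-counting estimate can be written out as a short calculation. This is only a sketch: the function name `estimate_total` and the sample counts (42 trees in 1 acre, 21 trees in half an acre) are made up for illustration.

```python
def estimate_total(sample_count: float, sample_area: float, total_area: float) -> float:
    """Scale a count observed in a sample area up to the full area."""
    if sample_area <= 0 or sample_area > total_area:
        raise ValueError("sample_area must be positive and no larger than total_area")
    scale = total_area / sample_area  # 100 for a 1-acre sample, 200 for half an acre
    return sample_count * scale


# Hypothetical counts: 42 trees in 1 acre, 21 trees in half an acre.
one_acre_estimate = estimate_total(42, 1, 100)
half_acre_estimate = estimate_total(21, 0.5, 100)
```

The estimate is only as good as the uniformity assumption: if trees cluster in one corner of the area, a single acre can be far from typical.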
Google Analytics and Data Sampling
In analysis, data sampling means taking a small portion of the whole dataset and analyzing it for trends or for verifying hypotheses.
As Google Analytics needs to process large amounts of data quickly, while maintaining accuracy, it randomly samples a slice of your data.
The main problem is that this sample doesn't represent the entire picture, which often leads to uncertainty and inconsistencies in your reports: the sample may or may not reflect the true nature of your data.
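To see why a sample may or may not reflect the full data, here is a small simulation using only Python's standard library. The session data is synthetic, and the 10% conversion rate, the sample size, and the seed are arbitrary assumptions. It does not reproduce how Google Analytics actually selects its sample.

```python
import random


def sampled_rate(population, sample_size, rng):
    """Estimate the conversion rate from a random sample drawn without replacement."""
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers

# Synthetic sessions: 1 = converted, 0 = did not convert.
population = [1] * 10_000 + [0] * 90_000
true_rate = sum(population) / len(population)

# Draw several independent 1,000-session samples and compare each with the true rate.
estimates = [sampled_rate(population, 1_000, rng) for _ in range(5)]
for estimate in estimates:
    print(f"sampled rate: {estimate:.3f} (true rate: {true_rate:.3f})")
```

Each run of the loop draws a different sample, so the estimated rates scatter around the true rate rather than matching it exactly. That scatter is the uncertainty a sampled report carries.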
Sampling Thresholds
Default reports are not subject to sampling.
Ad-hoc queries of your data are subject to the following general thresholds for sampling:
- Analytics Standard: 500k sessions at the property level for the date range you are using
- Analytics 360: 100M sessions at the view level for the date range you are using
- Queries may include events, custom variables, and custom dimensions and metrics. All other queries have a threshold of 1M
- Historical data is limited to up to 14 months (on a rolling basis)
- There are certain cardinality limits:
  - Daily processed tables. In Universal Analytics, the limit is 50,000 rows. In Google Analytics 360, it's 75,000 rows.
  - Multi-day processed tables. In Universal Analytics, the limit is 100,000 rows. In Google Analytics 360, it's 150,000 rows.
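The general session thresholds above can be turned into a rough pre-check. This is a minimal sketch under stated assumptions: the names `SAMPLING_THRESHOLDS` and `may_be_sampled` are hypothetical, only the two general thresholds from the list are encoded, and the real service may start sampling at fewer sessions.

```python
# General ad-hoc query thresholds, in sessions for the selected date range.
SAMPLING_THRESHOLDS = {
    "standard": 500_000,   # Analytics Standard, property level
    "360": 100_000_000,    # Analytics 360, view level
}


def may_be_sampled(sessions_in_range: int, tier: str = "standard") -> bool:
    """Return True if the session count exceeds the general threshold for the tier."""
    try:
        threshold = SAMPLING_THRESHOLDS[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None
    return sessions_in_range > threshold
```

For example, `may_be_sampled(600_000)` returns `True` for a Standard property, while `may_be_sampled(600_000, "360")` returns `False`.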
In some circumstances, you may see fewer sessions sampled. This can result from the complexity of your Analytics implementation, the use of view filters, query complexity for segmentation, or some combination of those factors. According to the official Google documentation, Google Analytics makes a best effort to sample up to the thresholds described above, but it's normal to sometimes see slightly fewer sessions returned for an ad-hoc query.
Google Analytics 4
The default reports (under the Reports snapshot tab) are not sampled. You're free to add any secondary dimensions, segments, or filters. The reports will remain unsampled. Sampling may occur when you create an advanced analysis, such as cohort analysis, exploration, segment overlap, funnel analysis, etc.
For more information, see the official Google Analytics documentation.
Data Sampling in Google Search Console
The situation in Google Search Console is similar to that in Google Analytics.
When you group by page and/or query, Google Search Console may drop some data in order to be able to calculate results in a reasonable time using a reasonable amount of computing resources.
To get exact data, group only by date, device, and/or country. That way, both the totals and the granular records will match exactly what you see in the GSC UI.
On the other hand, if you need to see trends per page and/or query, you can still run that analysis, but keep in mind that the totals won't match up completely.
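As an illustration, the two cases can be expressed as request bodies for the Search Console API's Search Analytics query, which accepts `startDate`, `endDate`, and a `dimensions` list. The sketch below only builds the bodies and does not call the API. The helper names, the `EXACT_DIMENSIONS` set, and the dates are hypothetical, and the "totals match" rule simply encodes the guidance in this section.

```python
# Dimensions that, per the guidance above, do not cause data to be dropped.
EXACT_DIMENSIONS = {"date", "device", "country"}


def build_query(start_date: str, end_date: str, dimensions: list[str]) -> dict:
    """Build a request body for a Search Analytics query (not sent anywhere here)."""
    return {
        "startDate": start_date,
        "endDate": end_date,
        "dimensions": list(dimensions),
    }


def totals_should_match(dimensions: list[str]) -> bool:
    """True when grouping uses only date, device, and/or country."""
    return set(dimensions) <= EXACT_DIMENSIONS


exact_query = build_query("2024-01-01", "2024-01-31", ["date", "device"])
trend_query = build_query("2024-01-01", "2024-01-31", ["date", "page", "query"])
```

Here `totals_should_match(exact_query["dimensions"])` is `True`, while the same check on `trend_query` is `False`, a reminder that per-page or per-query totals are approximate.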
For more information, see the official Google Search Console documentation.