Data Duplication

Check the Records in Your Source

If a single record contains multiple values for a field that typically holds only one value (for example, two email addresses, phone numbers, or orders), the record will appear more than once in the table. The second row carries the second set of values, but because its other fields are identical to the first row, it looks like a duplicate.
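The effect can be illustrated with a minimal sketch. The `flatten` helper and the sample record are hypothetical and only mimic how a multi-value field might be expanded into rows; they are not Dataddo's actual implementation.

```python
def flatten(record, multi_field):
    """Emit one row per value of `multi_field`, copying all other fields."""
    return [{**record, multi_field: value} for value in record[multi_field]]


# Hypothetical source record with two email addresses.
record = {
    "customer_id": 42,
    "name": "Ann",
    "email": ["ann@work.example", "ann@home.example"],
}

rows = flatten(record, "email")
for row in rows:
    print(row)
```

Both rows share the same `customer_id` and `name`, so a query that ignores the `email` column will see them as duplicates.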

Check Your Snapshotting Policy

Some combinations of snapshotting settings can lead to duplicates. Make sure you choose the right combination of date range and sync frequency.

Example
Your date range is the last 7 days, but your snapshotting frequency is daily. In this case, the last 7 days' worth of data is inserted every day, so each day's data ends up in the table up to 7 times. As a result, the amount of data will be about 7 times larger than it should be.
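The arithmetic in this example can be checked with a small simulation. This is a sketch under stated assumptions: `simulate_append` is a hypothetical helper that models a daily sync which appends one row per day in the date range, and the dates are arbitrary.

```python
from collections import Counter
from datetime import date, timedelta


def simulate_append(start, runs, window_days):
    """Model `runs` daily syncs, each appending the last `window_days` days."""
    table = []
    for run in range(runs):
        run_day = start + timedelta(days=run)
        for back in range(1, window_days + 1):
            table.append(run_day - timedelta(days=back))
    # Count how many copies of each day ended up in the table.
    return Counter(table)


day = date(2024, 1, 10)
copies_7 = simulate_append(date(2024, 1, 1), runs=30, window_days=7)[day]
copies_1 = simulate_append(date(2024, 1, 1), runs=30, window_days=1)[day]
print(copies_7, copies_1)
```

With a 7-day range, the sampled day is stored 7 times; with a range of one day (yesterday), it is stored once.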

When configuring snapshotting, consider the type of destination the data is for:

  • Storage, database, warehouse - choose Replace for sync type
  • Dashboarding apps - choose Append for sync type

Check Your Date Range

The date range set when creating a source is not meant for loading historical data, so setting it to, for example, the last 30 days is not correct. The most common setting is Yesterday, so that each daily run pulls yesterday's data and adds it to the previously pulled data.

If you need to load the last 30 days, create a source with the date range set to Yesterday, and then use the Ad-hoc data load to load the extra 29 days.

If you keep a 30-day date range, set the snapshot keeping policy to Append, and keep all the snapshots, Dataddo creates 29 days of duplicates every day.

Data Storages

Check Your Write Mode

If the write mode for your destination is INSERT, it may cause duplicates in the table, because every write adds new rows even when the same records already exist. You can select the UPSERT write mode instead.
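The difference between the two write modes can be sketched with an in-memory SQLite database. This is only an illustration: the `customers` table, the `load` helper, and the use of SQLite's `INSERT OR REPLACE` as a stand-in for UPSERT are assumptions, not a description of how Dataddo writes to your destination.

```python
import sqlite3


def load(rows, mode):
    """Write the same batch twice and return the resulting row count."""
    conn = sqlite3.connect(":memory:")
    if mode == "upsert":
        # A primary key lets the database recognise an existing record.
        conn.execute(
            "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT)"
        )
        sql = "INSERT OR REPLACE INTO customers VALUES (?, ?)"
    else:
        # No key: every write becomes a new row.
        conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
        sql = "INSERT INTO customers VALUES (?, ?)"
    for _ in range(2):  # two sync runs delivering the same batch
        conn.executemany(sql, rows)
    count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    conn.close()
    return count


batch = [(1, "ann@example.com"), (2, "bob@example.com")]
insert_count = load(batch, "insert")
upsert_count = load(batch, "upsert")
print(insert_count, upsert_count)
```

In this sketch the plain insert doubles the two-row batch to four rows, while the keyed write keeps it at two.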

Incorrect Unique Columns for UPSERT

If you choose the UPSERT write mode, make sure to select the correct Unique Columns. Otherwise, instead of updating the existing records and inserting only the new ones, the system will insert the existing records again, because no match was found on the unique column.

Example
If you choose Customer ID as a unique column, the system will first check whether the record already exists and then update the existing ones if needed. However, if you select, for example, Login Time as the unique column, its value will differ between syncs, so no match is found and the record will be duplicated.
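The Customer ID versus Login Time example can be sketched as follows. The `upsert` function is a simplified, hypothetical model of matching on unique columns, and the field names and values are invented for illustration.

```python
def upsert(table, rows, unique_columns):
    """Update rows whose unique-column values match; insert the rest."""
    index = {tuple(r[c] for c in unique_columns): i for i, r in enumerate(table)}
    for row in rows:
        key = tuple(row[c] for c in unique_columns)
        if key in index:
            table[index[key]] = row  # existing record: update in place
        else:
            index[key] = len(table)  # no match: insert as a new record
            table.append(row)
    return table


# The same customer arrives in two consecutive syncs.
first_sync = [{"customer_id": 1, "login_time": "2024-01-01T08:00", "plan": "free"}]
second_sync = [{"customer_id": 1, "login_time": "2024-01-02T09:30", "plan": "pro"}]

by_customer = upsert(upsert([], first_sync, ["customer_id"]), second_sync, ["customer_id"])
by_login = upsert(upsert([], first_sync, ["login_time"]), second_sync, ["login_time"])
print(len(by_customer), len(by_login))
```

Keyed on `customer_id`, the second sync updates the single existing row. Keyed on `login_time`, the values never match, so the customer ends up in the table twice.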


