---
title: "Data Duplication: Understanding and Resolving Common Issues"
slug: "data-duplication"
description: "Resolve data duplication issues easily and quickly in Dataddo. Select the right write mode, choose the correct unique columns, and adjust snapshotting policy."
tags: ["Resource"]
updated: 2024-02-22T14:03:57Z
published: 2024-02-22T14:03:57Z
---

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.dataddo.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Data Duplication: Understanding and Resolving Common Issues

**Data duplication**, or **data redundancy**, is a common challenge within databases and storage systems, characterized by the presence of identical copies of data across different locations. This can occur both intentionally, for backup and redundancy purposes, and unintentionally, through errors in data entry or integration processes. However, the implications of unchecked data duplication are far-reaching, affecting various aspects of data management and utilization.

Here are some potential issues which might cause data duplicates:

- [Duplicate records in source](/docs/data-duplication#duplicate-records-in-source)
- [Incorrect snapshotting policy](/docs/data-duplication#incorrect-snapshotting-policy)
- [Conflicting date range](/docs/data-duplication#conflicting-date-range)
- For **data storages**:
  - [Incompatible write mode](/docs/data-duplication#incorrect-write-mode)
  - [Wrong unique columns when using upsert](/docs/data-duplication#wrong-unique-columns-for-upsert)

## Duplicate Records in Source

Duplicate records often stem from source data anomalies.

1. The source table has duplicit records.
2. Multiple values for typically singular fields get flagged as duplicates. For example, when the same person registers twice, each time with a different email/phone number.

## Incorrect Snapshotting Policy

Incorrect [snapshot keeping policy](/docs/extraction#snapshot-keeping-policy) can inadvertently cause data duplicities. Ensure the right combination of [date range](/docs/setting-date-range) and [sync frequency](/docs/extraction#extraction-frequency).

**Example**: Then the date range is set to the **last 7 days** with a **daily sync frequency**, the system will insert the last 7 days' data every day. This results in data being replicated sevenfold.

To optimize snapshotting:

- For storages, databases, warehouses: Select **Replace** as the snapshotting policy.
- For dashboarding apps: Select **Append** snapshotting policy.

## Conflicting Date Range

Date range during ***data source*** configuration is by default set to **Yesterday** to extract the pervious day's data every day.

If you want to [load historical data](/docs/data-backfilling), the best practice is to create a dedicated additional ***data source***.

## Data Storages: Write Modes and Unique Columns

### Incompatible Write Mode

If your ***destination*** is a data storage, selecting the **insert** write mode may lead to duplicates.

Instead, use the [**upsert** write mode](/docs/data-storages#write-modes), which updates existing records and inserts new ones.

### Wrong Unique Columns for Upsert

Make sure to select the right **unique column(s)** for upsert write mode.

**Example**: You have a database of customer interactions. `Customer ID` would be a good choice for a unique column because each customer can have only one ID. If you use `Login Time`, there will be duplicates.

**1. `Customer ID` as unique column** `Customer ID` is a stable and unique identifier for each customer. When new data comes in, the **upsert** process checks if a record with the same `Customer ID` already exists. If it does, the system updates that existing record with the new information. If not, it inserts a new record. This approach effectively avoids duplication because `Customer ID` uniquely identifies each customer.

**2. `Login Time` as unique column** `Login Time` is used as the unique identifier. However, `Login Time` varies with each login, meaning it's unlikely to be the same across multiple records. As a result, the **upsert** process may not find a matching `Login Time` in the existing records. This leads to the system treating most incoming data as new information. As new records will be created each time, it will result in data duplication. This unique column fails to recognize and update existing customer records because `Login Time` does not serve as a consistent and unique identifier like `Customer ID`.

Choosing a unique column like `Customer ID`, which consistently and uniquely identifies each record, is crucial for **upsert** to ensure data is accurately updated without duplication.

Upsert inserts new rows and updates existing ones in the destination. By ensuring that the data in the table is updated if a matching record exists, or a new one is inserted, upsert prevents data duplication.
