Data Storages: A Technical Overview
Dataddo enables data delivery to a wide array of data storage technologies, covering not only traditional databases such as SQL Server, MariaDB, and PostgreSQL, but also modern Data Warehouses and Data Lakes like AWS Redshift, Google BigQuery, and Azure Synapse. Our support further extends to file storage solutions like AWS S3 and protocol-based data transfers such as SFTP, ensuring comprehensive coverage of all your data management needs.
Find the whole list of supported storage platforms on our website.
Architecture Considerations
The data delivery process to Data Storages is typically initiated once the data extraction from the source is complete. However, you can decouple this writing process from extraction by setting a custom write trigger.
The data written to your storage always consists of the full content from the new data extraction. This ensures a complete, up-to-date dataset for your analyses and reports.
With Dataddo, you have the flexibility to choose from various data writing modes to ensure consistency in your data storage. This allows you to tailor the data delivery approach to best match your data management and analysis needs, ensuring accuracy and relevance of your stored data.
Automatic Table Setup
Automatic Table Setup is a key feature of Dataddo that greatly simplifies the process of configuring your data storage. Rather than manually defining the schema, Dataddo automatically establishes an optimized table structure during the initial setup, which includes column names, data types, and indices. However, it's crucial to understand that Dataddo does not automatically update the table schema post-creation. This is a protective measure designed to prevent unintended changes that could affect your data's consistency. As such, any subsequent modifications to the table structure must be manually carried out, allowing for precise control over your data environment.
The Automatic Table Setup feature plays a significant role in the initial configuration of your data storage, particularly when using specific write modes like Upsert. When a new table is created, this feature not only designs an optimized table structure but also incorporates necessary indexing during the first data load. However, it's important to note that if Upsert mode is chosen for an already existing table, Dataddo cannot retroactively apply this automatic indexing.
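To illustrate, the structure Dataddo generates during the first load might resemble the sketch below. The table and column names are hypothetical and the actual schema depends on your data source; the key point is that the unique index on the Upsert Composite Key is created together with the table.

```sql
-- Hypothetical schema generated by Automatic Table Setup for an
-- invoicing dataset (names and types are illustrative only).
CREATE TABLE invoices (
    invoice_id VARCHAR(64) NOT NULL,
    customer   VARCHAR(255),
    amount     DECIMAL(12,2),
    issued_at  TIMESTAMP
);

-- When the Upsert write mode is selected, a unique index on the
-- Upsert Composite Key is added during the first data load so that
-- later loads can match existing rows.
CREATE UNIQUE INDEX idx_invoices_upsert_key ON invoices (invoice_id);
```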
Adaptive Data Ingestion
Dataddo smartly adjusts to the volume of your data, employing the most suitable method for data loading to your specific data warehouse or database. For smaller data volumes, Dataddo utilizes batch insert queries to maintain efficiency. As the data volume increases, it automatically switches to loading via files in object storage. In cases where the data volume is extremely large, Dataddo can leverage streaming to manage the data load effectively.
Each data warehouse or database platform may require specific techniques for optimal data loading, and Adaptive Data Ingestion is well-equipped to handle this diversity. Furthermore, leveraging some of these data loading techniques may necessitate additional setup or elevated permission levels. For instance, to maximize SQL Server's ability to load data through object storage, users may need to undertake supplementary configuration.
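As a rough sketch of the difference, a small extraction might arrive as a batched insert statement, while a large one would typically be staged as a file and bulk-loaded. The statements below are illustrative only; the exact mechanism and syntax vary by platform.

```sql
-- Smaller volumes: a single batched insert statement.
INSERT INTO invoices (invoice_id, customer, amount)
VALUES
    ('INV-001', 'Acme Corp', 1200.00),
    ('INV-002', 'Globex',     750.50);

-- Larger volumes: bulk load from a staged file in object storage
-- (PostgreSQL syntax shown; Redshift uses COPY ... FROM 's3://...',
-- BigQuery uses a load job, and so on).
COPY invoices FROM '/staging/extract_2024-01-05.csv' WITH (FORMAT csv);
```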
Write Modes
In Dataddo, we provide diverse write modes to help tailor how data is transferred into your chosen destination. These modes include Insert, Insert Ignore, Truncate Insert, Upsert, and Delete. Each mode serves a unique purpose, providing flexibility to accommodate various data handling strategies, whether you are looking to simply add new records, update existing ones, or handle potential duplication. Understanding these write modes can help you efficiently manage your data transfer process and maintain the integrity of your data.
Insert
The Insert mode is the default write setting that appends new data to the existing records in the destination. It's perfect for situations requiring continuous accumulation of data. However, it's important to note that if your data extraction includes overlapping data - instances where new data duplicates information already stored from previous extractions - it may lead to duplicate entries in your dataset.
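In SQL terms, Insert simply appends every extracted row, so re-extracting an overlapping time window writes the same rows again. A minimal sketch, using a hypothetical invoices table:

```sql
-- First extraction (covering Jan 1-5) is appended.
INSERT INTO invoices (invoice_id, amount) VALUES ('INV-001', 1200.00);

-- A later extraction (covering Jan 3-8) overlaps the first one, so
-- INV-001 is appended again, producing a duplicate row.
INSERT INTO invoices (invoice_id, amount) VALUES ('INV-001', 1200.00);
```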
To prevent duplicates in the case of overlapping data, consider using the Upsert mode. Upsert ensures that existing records are not duplicated, even when the same data is extracted multiple times.
Insert Ignore
New data is inserted while errors are ignored: if a record contains a value of the wrong data type, that record is omitted. This write mode is rarely used.
Example
The data type of the "Amount" column is numeric (integer or float). However, one field contains "Hello Gabi", which is a string value. When Insert Ignore is used, this row is omitted when writing to the database.
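The closest native analogue is MySQL's INSERT IGNORE, sketched below with a hypothetical invoices table. Note that native behavior differs by platform (MySQL, for instance, may coerce an invalid value rather than skip the row); in Dataddo's Insert Ignore mode, the offending record is simply omitted.

```sql
-- Sketch of Insert Ignore behavior. The second row's "amount" value
-- is the string 'Hello Gabi', which fails numeric validation; in
-- Dataddo's Insert Ignore mode that record is omitted from the load
-- instead of failing the whole batch.
INSERT IGNORE INTO invoices (invoice_id, amount)
VALUES
    ('INV-001', 1200.00),
    ('INV-002', 'Hello Gabi');
```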
Truncate Insert
Truncate Insert removes all previously existing content in the table before inserting newly extracted data.
Double-check before selecting this write mode, as it will remove all contents of the target table prior to data load.
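Conceptually, each load performs the equivalent of the following (illustrative sketch with a hypothetical invoices table):

```sql
-- All previously existing rows are removed first...
TRUNCATE TABLE invoices;

-- ...then the full content of the new extraction is inserted.
INSERT INTO invoices (invoice_id, amount)
VALUES ('INV-001', 1200.00), ('INV-002', 750.50);
```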
Upsert
The Upsert mode operates by inserting new rows and updating existing ones. To use it, you need to specify an Upsert Composite Key, which can be a single column or a combination of columns.
With this mode, Dataddo verifies the data in the destination prior to inserting new data. If a record with the same Upsert Composite Key already exists, the existing data in the table is updated; if not, a new record is inserted. The Upsert mode thus effectively prevents data duplication.
Upsert Composite Key
Refers to a set of one or more attributes (columns) that uniquely identify each record within a table. When configuring the Upsert mode for data storage, you will be prompted to define such a key.
For example, suppose the Invoice ID serves as the Upsert Composite Key: Dataddo locates the overlapping record in the target table using this key and updates the corresponding record in the destination, while all other records are inserted as new entries.
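Using PostgreSQL syntax as an illustration, an upsert keyed on Invoice ID behaves roughly as follows; the unique index on the key column is what makes the conflict detection possible:

```sql
-- Assumes a unique index/constraint on the Upsert Composite Key, e.g.:
-- CREATE UNIQUE INDEX idx_invoices_upsert_key ON invoices (invoice_id);

-- PostgreSQL-style upsert: a row whose invoice_id already exists is
-- updated in place; any other row is inserted as a new record.
INSERT INTO invoices (invoice_id, customer, amount)
VALUES ('INV-001', 'Acme Corp', 1350.00)
ON CONFLICT (invoice_id)
DO UPDATE SET customer = EXCLUDED.customer,
              amount   = EXCLUDED.amount;
```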
MySQL, PostgreSQL, MSSQL, Redshift & Vertica
When creating a data flow to MySQL, PostgreSQL, MSSQL, Redshift, or Vertica, the structure of the target table, including the necessary indexing, is established automatically during the first data load when the UPSERT write mode is selected.
It's crucial to note that this automatic indexing does not apply when the UPSERT write mode is switched on for an already existing table. In such a case, Dataddo cannot automatically create the necessary indexes, which may lead to the UPSERT operation not functioning as expected. Therefore, to utilize UPSERT properly, ensure it is selected during the initial setup of a new table.
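For reference, the indexing created automatically on a new table is of the following kind (hypothetical table and column names; whether adding such an index manually to an existing table is sufficient for your setup is best confirmed with Dataddo support):

```sql
-- The kind of unique index Dataddo creates on the Upsert Composite
-- Key during the first load of a newly created table.
CREATE UNIQUE INDEX idx_invoices_upsert_key ON invoices (invoice_id);
```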
Delete
The Delete mode is a specialized write option that deletes specific rows in your destination. This action is performed based on a designated delete composite key. Rows that have a matching value in this delete composite key are removed. This mode is crucial for maintaining data accuracy in certain use cases, but caution is advised as incorrect usage can lead to unintended data loss.
Delete Composite Key
Refers to a set of one or more attributes (columns) that uniquely identify each record within a table. When configuring the Delete mode for data storage, you will be prompted to define such a key.
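In SQL terms, Delete mode removes destination rows whose Delete Composite Key values match those present in the newly extracted data. A minimal sketch with a hypothetical invoices table:

```sql
-- Rows whose delete composite key (here invoice_id) matches a value
-- in the new extraction are removed from the destination table.
DELETE FROM invoices
WHERE invoice_id IN ('INV-001', 'INV-002');
```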
Troubleshooting
Data Duplicates in the Storage
Data duplicates are most often caused by an inappropriate write mode.
By default, the Insert write mode is used, meaning all newly extracted data is appended to the table. However, this might not always be the optimal setup: if there is data overlap between extracts, it will result in duplicates. The solution is to use one of the following write modes:
1. Truncate Insert. This write mode removes all content in the target table prior to data insertion.
2. Upsert. This write mode inserts new rows and updates existing ones. To perform this correctly, it is necessary to set a unique key representing one or multiple columns.