The Databricks Lakehouse Platform combines the best of data lakes and data warehouses, simplifying the modern data stack and eliminating data silos. Built on open standards and open source, the platform provides a common approach to data management, security, and governance, enabling businesses to operate more efficiently, innovate faster, and achieve the full potential of their analytics and AI initiatives.
Prerequisites
- You have a running Databricks SQL warehouse.
In general, there are two main ways to set up automatic data loading to Databricks using Dataddo.
Using intermediate object storage such as AWS S3 or Azure Blob Storage. We recommend this option when you need to write large volumes of data at a low frequency (e.g. more than 1M rows once a day). You will need to:
- Configure the flows to these destinations using the Parquet format and
- Configure the Auto Loader in Databricks Delta Lake.
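On the Databricks side, the Auto Loader setup for this option might look roughly like the sketch below. This is an illustrative PySpark fragment that must run on a Databricks cluster (where `spark` is predefined); the bucket paths, checkpoint location, and table name are placeholders, not values from this article:

```python
# Minimal Auto Loader sketch for ingesting Parquet files written by Dataddo.
# Runs inside a Databricks notebook or job, where `spark` is provided.
# All paths and names below are placeholders.
(spark.readStream
    .format("cloudFiles")                         # Auto Loader source
    .option("cloudFiles.format", "parquet")       # Dataddo writes Parquet
    .load("s3://my-bucket/dataddo-exports/")      # placeholder bucket path
 .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/dataddo/")
    .trigger(availableNow=True)                   # process new files, then stop
    .toTable("hive_metastore.default.my_table"))  # placeholder target table
```

With `availableNow=True`, the stream picks up any files that arrived since the last run and then stops, which pairs well with a once-a-day, high-volume load.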
Using Databricks as a direct destination. The Dataddo Databricks writer uses an SQL layer, which means no further configuration is required on the Databricks side. We recommend this option when you need to load relatively low volumes of data at a high frequency, or to achieve CDC-style data replication.
Authorize the Connection to Databricks
Create an SQL Warehouse
- Log in to the Databricks workspace.
- Click on SQL Warehouses in the sidebar.
- Enter a Name for the warehouse and accept the default warehouse settings.
- Click on Create.
Configure the Access for the SQL Warehouse
- In the Databricks workspace click on SQL and then SQL Warehouses.
- Choose the warehouse and navigate to the Connection Details tab.
- Copy the full DSN connection string; you will need to provide it to Dataddo.
- In the Authorizers tab, click on Authorize New Service and select Databricks.
- Select that you want to connect via DSN connection string.
- You will be asked to fill in the following fields:
- DSN Connection String: The value obtained during SQL warehouse access configuration step.
- Catalog: Sets the initial catalog name for the connection. The default value is hive_metastore.
- Schema: Sets the initial schema name. The default value is default.
- Save the authorization details.
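The exact format of the DSN connection string depends on the Databricks ODBC driver and is not spelled out in this article. Purely as an illustration, an ODBC-style, semicolon-delimited `Key=Value` string could be inspected with a small helper like this (the field names shown are hypothetical examples, not Dataddo requirements):

```python
def parse_dsn(dsn: str) -> dict:
    """Split a 'Key=Value;Key=Value' string into a dict.

    Hypothetical helper: assumes an ODBC-style semicolon-delimited
    DSN where every non-empty part contains exactly one '='.
    """
    pairs = (part.split("=", 1) for part in dsn.split(";") if part.strip())
    return {key.strip(): value.strip() for key, value in pairs}

# Example with placeholder values:
fields = parse_dsn("Host=example.cloud.databricks.com;HTTPPath=/sql/1.0/warehouses/abc123")
```

This can be a quick sanity check that the string you copied from the Connection Details tab contains the host and HTTP path you expect.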
Create a New Databricks Destination
- Under the Destinations tab, click on the Create Destination button and select Databricks from the list.
- Select your account from the drop-down menu.
- Name your destination and click on Save to create your destination.
If you need to authorize a new account, click on Add New Account in the drop-down menu during authorizer selection and follow the on-screen prompts. You can also go to the Authorizers tab and click on Authorize New Service.
Creating a Flow to Databricks
- Navigate to Flows and click on Create Flow.
- Click on Connect Your Data to add your sources.
- Click on Connect Your Data Destination to add the destination.
- Choose the Write mode and fill in the other required information.
- Check the Data Preview to see if your configuration is correct.
- Name your flow, and click on Create Flow to finish the setup.
Table Naming Convention
When naming your table, please make sure the table name:
- Is all in lowercase
- Starts with a lowercase letter or an underscore
- Contains only lowercase letters, numbers, and underscores
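The naming rules above can be captured in a short validation sketch. Note that the exact character set Dataddo accepts is an assumption here (lowercase letters, digits, and underscores):

```python
import re

# Assumed rule set: all lowercase; starts with a lowercase letter or
# underscore; contains only lowercase letters, digits, and underscores.
TABLE_NAME_RE = re.compile(r"^[a-z_][a-z0-9_]*$")

def is_valid_table_name(name: str) -> bool:
    """Return True if `name` satisfies the naming convention above."""
    return bool(TABLE_NAME_RE.fullmatch(name))
```

For example, `my_table` and `_staging` would pass, while `MyTable` (uppercase), `1table` (leading digit), and `my-table` (hyphen) would be rejected.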
File partitioning refers to the practice of dividing large datasets into smaller, more manageable segments or partitions based on specific criteria, such as values in a particular column or range of values. Each partition contains a subset of the data that shares common attributes or characteristics. File partitioning is commonly used to improve data organization, query performance, and data management.
Dataddo supports file partitioning during flow creation. If you use daily partitioning, for example, Dataddo will create a new file every day, e.g. file_20xx0101, file_20xx0102, etc.
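The daily naming pattern shown above can be sketched as follows (the `file_` prefix and `YYYYMMDD` date format are taken from the example; the helper itself is illustrative, not part of Dataddo):

```python
from datetime import date, timedelta

def daily_partition_name(prefix: str, day: date) -> str:
    """Build a daily partition file name such as file_20240101."""
    return f"{prefix}_{day:%Y%m%d}"

# Names for three consecutive daily partitions:
names = [daily_partition_name("file", date(2024, 1, 1) + timedelta(days=i))
         for i in range(3)]
```

Each day's export lands in its own file, which keeps individual files small and makes date-range queries and reprocessing of a single day straightforward.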