Databricks
The Databricks Lakehouse Platform combines the best of data lakes and data warehouses, simplifying the modern data stack and eliminating data silos. Built on open standards and open source, the platform provides a common approach to data management, security, and governance, enabling businesses to operate more efficiently, innovate faster, and achieve the full potential of their analytics and AI initiatives.
Prerequisites
- You have a running Databricks SQL warehouse.
Architecture Considerations
In general, there are two main ways to set up automatic data loading to Databricks using Dataddo.
Using intermediate object storage such as Amazon S3 or Azure Blob Storage. We recommend this option when you need to write large volumes of data at low frequencies (e.g. more than 1M rows once a day). You will need to
- Configure the flows to these destinations using the Parquet format, and
- Configure Auto Loader in Databricks Delta Lake (see the sketch after this list).
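For reference only (the detailed setup is covered in the Databricks Auto Loader documentation and the corresponding Dataddo articles), a minimal Auto Loader sketch for picking up Parquet files written by a Dataddo flow could look like the following. It assumes it runs in a Databricks notebook where `spark` is predefined; the bucket paths and the target table name are placeholders.

```python
# Minimal Auto Loader sketch: ingest Parquet files written by a Dataddo flow into a Delta table.
# All S3 paths and the table name below are placeholders -- adjust to your storage and catalog.
(
    spark.readStream
    .format("cloudFiles")                                                   # Auto Loader source
    .option("cloudFiles.format", "parquet")                                 # Dataddo writes Parquet files
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/dataddo") # schema tracking location
    .load("s3://my-bucket/dataddo-export/")                                 # folder the flow writes into
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/dataddo")
    .trigger(availableNow=True)                                             # process new files, then stop
    .toTable("dataddo_export")                                              # target Delta table
)
```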
Using Databricks as a direct destination. The Dataddo Databricks writer uses an SQL layer, which means no further configuration is required on the Databricks side. We recommend this option when you need to load relatively low volumes of data at high frequencies or to achieve CDC-style data replication.
This page applies when using Databricks as a direct destination for your Dataddo flow. If you are considering loading the data via Amazon S3 or Azure Blob Storage, please navigate to the corresponding articles.
Authorize Connection to Databricks
In Databricks
Create an SQL Warehouse
- Log in to the Databricks workspace.
- Click on SQL Warehouses in the sidebar.
- Click on Create SQL Warehouse, enter a Name for the warehouse, and accept the default warehouse settings.
- Click on Create.
Configure Access for the SQL Warehouse
- In the Databricks workspace, click on SQL and then SQL Warehouses.
- Choose the warehouse and navigate to the Connection Details tab.
- Copy the full DSN connection string; you will need to provide it to Dataddo (an optional connectivity check is sketched below).
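Optionally, you can confirm the warehouse is reachable before pasting the connection details into Dataddo. The sketch below uses the databricks-sql-connector Python package (installed with pip install databricks-sql-connector); the hostname and HTTP path come from the same Connection Details tab, the access token is a Databricks personal access token, and all values shown are placeholders.

```python
# Optional sanity check outside of Dataddo: verify the SQL warehouse responds.
# Requires: pip install databricks-sql-connector
# Hostname, HTTP path, and token below are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abcdef1234567890",
    access_token="dapi-placeholder-token",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())  # a single returned row confirms the warehouse is up and reachable
```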
In Dataddo
- In the Authorizers tab, click on Authorize New Service and select Databricks.
- Select the option to connect via a DSN connection string.
- You will be asked to fill in the following fields:
- DSN Connection String: The value obtained in the SQL warehouse access configuration step above.
- Catalog: Sets the initial catalog name for the connection. The default value is hive_metastore.
- Schema: Sets the initial schema name. The default value is default.
- Save the authorization details.
Create a New Databricks Destination
- On the Destinations page, click on the Create Destination button and select Databricks from the list.
- Select your authorizer from the drop-down menu.
- Name your destination and click on Save.
To create a new authorizer instead, click on Add new Account in the drop-down menu during authorizer selection and follow the on-screen prompts. You can also go to the Authorizers tab and click on Add New Service.
Creating a Flow to Databricks
- Navigate to Flows and click on Create Flow.
- Click on Connect Your Data to add your source(s).
- Click on Connect Your Data Destination to add the destination.
- Choose the write mode and fill in the other required information.
- Check the Data Preview to see if your configuration is correct.
- Name your flow and click on Create Flow to finish the setup.
Table Naming Convention
When naming your table, please make sure the table name (see also the validation sketch below):
- Is all in lowercase
- Starts with a lowercase letter or an underscore
- Contains only
- Letters
- Numbers
- Underscores
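As a quick illustration, the hypothetical helper below checks a candidate name against these rules; it is not part of Dataddo, just a convenience sketch.

```python
import re

# Hypothetical helper mirroring the naming rules above: all lowercase, starting with a
# lowercase letter or an underscore, and containing only letters, numbers, and underscores.
TABLE_NAME_PATTERN = re.compile(r"[a-z_][a-z0-9_]*")

def is_valid_table_name(name: str) -> bool:
    return TABLE_NAME_PATTERN.fullmatch(name) is not None

print(is_valid_table_name("web_analytics_2024"))  # True
print(is_valid_table_name("WebAnalytics"))        # False: contains uppercase letters
print(is_valid_table_name("2024_report"))         # False: starts with a number
```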
Troubleshooting
Failed Databricks Action
ERROR MESSAGE
Action failed: stream transfer: write data from stream: initializing writer instance: connecting to database: pinging database server: databricks: execution error: failed to execute query: context deadline exceeded
Databricks clusters may shut down after some time in idle mode and take a while to restart. This error can occur when the cluster restart time overlaps with a Dataddo action, causing the action to time out.
To avoid this, make sure Dataddo actions are scheduled during the cluster's uptime.
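If rescheduling is not practical, one possible workaround (sketched below, not a Dataddo feature) is to start the SQL warehouse shortly before the scheduled action using the Databricks SQL Warehouses REST API. The host, warehouse ID, and token are placeholders, and the endpoint should be verified against the current Databricks API documentation.

```python
# Wake the SQL warehouse a few minutes before a scheduled Dataddo action so its startup
# time does not overlap with the write. Host, warehouse ID, and token are placeholders.
import requests

DATABRICKS_HOST = "adb-1234567890123456.7.azuredatabricks.net"
WAREHOUSE_ID = "abcdef1234567890"
TOKEN = "dapi-placeholder-token"

response = requests.post(
    f"https://{DATABRICKS_HOST}/api/2.0/sql/warehouses/{WAREHOUSE_ID}/start",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()  # a 200 response means the start request was accepted
```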