Databricks

The Databricks Lakehouse Platform combines the best of data lakes and data warehouses, simplifying the modern data stack and eliminating data silos. Built on open standards and open source, the platform provides a common approach to data management, security, and governance, enabling businesses to operate more efficiently, innovate faster, and achieve the full potential of their analytics and AI initiatives.

Prerequisites

Architecture Consideration

In general, there are two main ways to set up automatic data loading to Databricks using Dataddo.

  • Using intermediate object storage such as AWS S3 or Azure Blob Storage. We recommend this option when you need to write large volumes of data at low frequency (e.g. more than 1M rows once a day). You will need to:

    • Configure the flows to these destinations using the Parquet format, and
    • Configure Auto Loader in Databricks Delta Lake.
  • Using Databricks as a direct destination. The Dataddo Databricks writer uses an SQL layer, which means no further configuration on the Databricks side is required. We recommend this option when you need to load relatively low volumes of data at high frequency or to achieve CDC-style data replication.
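
The recommendation above can be expressed as a simple rule of thumb. The sketch below is illustrative only: the function name, the return labels, and the default thresholds are assumptions taken from the "more than 1M rows once a day" example, not limits enforced by Dataddo or Databricks.

```python
def recommend_load_path(
    rows_per_load: int,
    loads_per_day: int,
    *,
    large_batch_rows: int = 1_000_000,  # assumption: taken from the "1M rows" example
    low_frequency_loads: int = 1,       # assumption: "once a day" counts as low frequency
) -> str:
    """Return "object_storage" or "direct" for a planned flow (hypothetical helper)."""
    is_large_batch = rows_per_load > large_batch_rows
    is_low_frequency = loads_per_day <= low_frequency_loads
    if is_large_batch and is_low_frequency:
        # Large, infrequent batches: stage Parquet files in S3 / Azure Blob Storage
        # and let Auto Loader ingest them.
        return "object_storage"
    # Smaller, frequent loads or CDC-style replication: write directly via the SQL layer.
    return "direct"


if __name__ == "__main__":
    print(recommend_load_path(rows_per_load=5_000_000, loads_per_day=1))
    print(recommend_load_path(rows_per_load=20_000, loads_per_day=24))
```

Treat the thresholds as starting points and adjust them to your own data volumes and sync schedule.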

This page applies when you use Databricks as a direct destination for your Dataddo flow. If you are considering loading the data via AWS S3 or Azure Blob Storage, please navigate to the corresponding articles.

Authorize Connection to Databricks

In Databricks

Create an SQL Warehouse

  1. Log in to the Databricks workspace.
  2. Click on SQL Warehouses in the sidebar.
  3. Enter a Name for the warehouse and accept the default warehouse settings.
  4. Click on Create.

Configure the Access for the SQL Warehouse

  1. In the Databricks workspace, click on SQL and then SQL Warehouses.
  2. Choose the warehouse and navigate to the Connection Details tab.
  3. Copy the full DSN connection string. You will need to provide it to Dataddo.

In Dataddo

  1. In the Authorizers tab, click on Authorize New Service and select Databricks.
  2. Select that you want to connect via DSN connection string.
  3. You will be asked to fill the following fields:
    1. DSN Connection String: The value obtained in the SQL warehouse access configuration step.
    2. Catalog: Sets the initial catalog name for the connection. The default value is hive_metastore.
    3. Schema: Sets the initial schema name. The default value is default.
  4. Save the authorization details.
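
To make the role of the three fields concrete, here is a minimal sketch of how they fit together. The `DatabricksAuthorization` class is a hypothetical container, not part of any Dataddo or Databricks API; only the default values (`hive_metastore` and `default`) come from the steps above. It assumes the usual Databricks three-level naming, `catalog.schema.table`, to show where an unqualified table name would resolve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabricksAuthorization:
    """Hypothetical mirror of the authorizer form fields (illustration only)."""

    dsn: str                          # value copied from the Connection Details tab
    catalog: str = "hive_metastore"   # default initial catalog named in this article
    schema: str = "default"           # default initial schema named in this article

    def __post_init__(self) -> None:
        # The DSN is the only field without a default, so reject blank input early.
        if not self.dsn.strip():
            raise ValueError("DSN connection string must not be empty")

    def qualified_table(self, table: str) -> str:
        """Return the fully qualified name a table would have under these settings."""
        return f"{self.catalog}.{self.schema}.{table}"


if __name__ == "__main__":
    # "placeholder-dsn" stands in for the real string; its format is not shown here.
    auth = DatabricksAuthorization(dsn="placeholder-dsn")
    print(auth.qualified_table("orders"))
```

If you leave Catalog and Schema at their defaults, a table named `orders` would be expected under `hive_metastore.default`.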

Create a New Databricks Destination

  1. On the Destinations page, click on the Create Destination button and select the destination from the list.
  2. Select your account from the drop-down menu.
  3. Name your destination and click on Save to create your destination.

Need to authorize another connection?

Click on Add new Account in the drop-down menu during authorizer selection and follow the on-screen prompts. You can also go to the Authorizers tab and click on Add New Service.

Creating a Flow to Databricks

  1. Navigate to Flows and click on Create Flow.
  2. Click on Connect Your Data to add your sources.
  3. Click on Connect Your Data Destination to add the destination.
  4. Choose the write mode and fill in the other required information.
  5. Check the Data Preview to see if your configuration is correct.
  6. Name your flow, and click on Create Flow to finish the setup.

Table Naming Convention

When naming your table, please make sure the table name:

  • Is all in lowercase
  • Starts with a lowercase letter or an underscore
  • Contains only
    • Letters
    • Numbers
    • Underscores
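
The rules above can be checked before you create a flow. The following is a minimal sketch; `is_valid_table_name` is a hypothetical helper, and it assumes "letters" means the ASCII letters a to z.

```python
import re

# First character: lowercase letter or underscore.
# Remaining characters: lowercase letters, digits, or underscores.
_TABLE_NAME = re.compile(r"[a-z_][a-z0-9_]*")


def is_valid_table_name(name: str) -> bool:
    """Return True if `name` follows the table naming convention described above."""
    # fullmatch requires the whole string to match, so trailing junk is rejected.
    return _TABLE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("orders_2024", "_staging", "Orders", "2024_orders", "my-table"):
        print(candidate, is_valid_table_name(candidate))
```

Names such as `orders_2024` and `_staging` pass, while `Orders` (uppercase), `2024_orders` (starts with a digit), and `my-table` (contains a hyphen) are rejected.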

