Tips and Best Practices for Organizing your Databricks Workspace


Are you tired of sifting through a cluttered Databricks Workspace to find the notebook or cluster you need? Do you want to optimize your team’s productivity and streamline your workflow? Look no further! In this guide, we’ll share valuable Tips and Best Practices for Organizing your Databricks Workspace like a pro. Whether you’re a seasoned Databricks user or just getting started, these tips will help you keep your Workspace tidy, efficient, and easy to navigate. So let’s get started and revolutionize the way you work with Databricks!

The Databricks workspace offers a wide range of tools and services for data engineering, data science, and machine learning tasks. By creating folders and notebooks for different projects, using naming conventions and tags, setting up access control, keeping your workspace clean, and using templates, you can create a workspace that is easy to navigate and supports all of these workloads.

Table of contents:

  1. What is Databricks Workspace?
  2. What is Databricks Workspace Organization?
  3. Databricks Workspace Isolation Strategies
  4. Tips and Best Practices for Organizing your Databricks Workspace
  5. Conclusion

What is Databricks workspace?

A Databricks workspace is a cloud-based platform that allows teams to work together on data engineering and data science projects in a collaborative, integrated, and scalable environment. It is structured and organized to support efficient and effective collaboration, development, and deployment of those projects.

Databricks Workspace provides several benefits for data teams, including:

  1. Collaboration: It enables teams to work collaboratively on data projects in a centralized location.
  2. Scalability: It is a scalable platform for managing large datasets and complex workloads.
  3. Automation: It allows users to automate data processing tasks using job scheduling and other features.
  4. Security: It provides several security features, including access controls, data encryption, and compliance certifications.
  5. Ease of Use: It supports several programming languages, making it easy for users to work with data.

What is Databricks Workspace Organization?

Workspace organization is a way to organize your projects, files, and other resources within your Databricks workspace. The main goal of workspace organization is to give structure to your work and make it easy to find and manage your project files.

At the highest level, you have your workspace, which is associated with a cloud storage account where all data for the workspace is stored. Within the workspace, you can create multiple notebooks, libraries, and folders to organize your projects.

Figure: Workspace organization workflow

At the top sits the E2 master account (AWS) or the subscription object (Azure Databricks). On AWS, Databricks provides a single E2 account per organization, giving a unified pane of visibility and control across all workspaces. Administration is therefore centralized, with the ability to enable SSO, audit logs, and Unity Catalog.

Within the top-level account, you can typically create 20–50 workspaces per account on Azure, with a hard limit applying on AWS. This guidance exists because of the administrative overhead that comes with a growing number of workspaces: managing collaboration, access, and security across hundreds of workspaces can become a difficult task.

General Consideration for Workspace Organization

When organizing an enterprise workspace in Databricks, it is important to consider the following general factors:

  1. Resource allocation: Determine how much compute each team or group should have access to, and ensure that these resources are allocated efficiently.
  2. Folder structure: Organize resources into logical folders that reflect the organizational structure of the enterprise. Use folders to manage access to resources and control resource usage.
  3. Security: Implement a robust security model to protect sensitive data and ensure that only authorized users can access critical resources.
  4. Naming conventions: Establish a clear and consistent naming convention for notebooks, clusters, libraries, and other resources to ensure that they are easy to find and identify.
  5. Monitoring and optimization: Monitor resource usage and performance metrics to identify areas for optimization and ensure that workloads are running efficiently.

Databricks Workspace Isolation Strategies

In Databricks, workspace isolation strategies aim to create a logical separation of resources and data, which helps you improve security, collaboration, and management of resources within the organization.

In Databricks we have two common strategies:

  1. Line of Business (LOB)-based isolation Strategy
  2. Product-based isolation strategy

1. Line of Business (LOB)-based Isolation Strategy

This strategy involves creating separate Databricks workspaces for different lines of business or business units within an organization. Each workspace would contain all the notebooks, libraries, and other resources related to a specific business unit. This approach helps to keep the resources and data of each business unit separate and isolated from other business units, which can be important for security and compliance reasons.

In the LOB-based workspace strategy, each functional unit receives its own workspaces, including development, staging, and production workspaces. Code is written and tested in DEV, then promoted to STG, and finally lands in PRD.

In the above diagram, each LOB has a separate admin and cloud account with one workspace in each environment (DEV/STG/PRD). All the workspaces fall under the same Databricks account and leverage the same Unity Catalog. Some variations include sharing cloud accounts, using a separate DEV/STG/PRD cloud account, or creating separate external metastores for each LOB.

Here we have some of the benefits of the LOB approach:

  1. It allows each business unit to focus on its specific needs without distraction from other departments.
  2. Within each workspace, team members can collaborate more easily on projects and share knowledge and resources.
  3. You can protect sensitive data and prevent unauthorized access, since each workspace has its own access controls and user accounts.

The LOB approach also has some limitations:

  1. Creating multiple workspaces can increase the cost of running a Databricks environment.
  2. Managing multiple workspaces can be difficult and time-consuming.
  3. It can limit cross-functional collaboration and knowledge-sharing among teams.

Best practices for building LOB-based Lakehouses:

  1. Make sure that users only have access to what they need by using fine-grained controls. Only a few people should have access to the production environment. Identify your identity provider and sync users and groups with the Lakehouse.
  2. Be aware of the limits of the cloud and Databricks platforms, such as the number of workspaces, API rate limits, and throttling.
  3. Use a standardized catalog to control access to assets and allow reuse of materials. Unity Catalog is an example of a catalog that lets you manage tables and workspace objects.
  4. You should share the data between different business groups using secure data-sharing methods.

2. Product-based Isolation Strategy

This strategy involves creating separate Databricks workspaces for different products or projects within an organization. Each workspace would contain all the notebooks, libraries, and other resources related to a specific product or project. This approach helps to keep the resources and data of each product or project separate and isolated from other products or projects, which can be important for collaboration and resource management reasons.

The product-based strategy is similar to the LOB strategy; the main difference is that the product-based approach offers more flexibility. Here we isolate top-level projects, giving each a separate production environment, and mix in a shared development environment to avoid workspace proliferation and make the reuse of assets simpler.

Here we have some benefits of the product-based approach:

  1. Each workspace focuses on its specific product or service, which helps innovation and improves productivity.
  2. A sandbox workspace offers more freedom and less automation than DEV workspaces.
  3. Resources and workspaces can be shared across products.

Best practices for building workspaces for different products:

  1. Use containers to isolate products from each other. By deploying each product in its own container, you can ensure that each product has its own set of resources and dependencies.
  2. Break your applications into smaller applications. This reduces the impact of any failure and makes it easier to update or modify different parts of an application without affecting others.
  3. Each product should be accessible only by the appropriate users. Implement role-based access control (RBAC) to ensure that each user has access only to the products or resources they need.
  4. Use a standardized set of APIs to reduce complexity when interacting with different products.
  5. Keep an eye on the performance of each product. This will help you optimize the system efficiently.

Tips and Best Practices for Organizing your Databricks Workspace

Effective workspace organization in Databricks requires careful planning and communication among team members. Teams should establish best practices for naming conventions, folder structures, and version control to ensure consistency and reduce confusion. Teams should also establish policies for access and permissions to ensure that only authorized users can view or modify data and code. Below we have some valuable tips and best practices for organizing your Databricks Workspace:

1. Use ‘Format SQL’ / ‘Format Python’ for formatting your code

Format SQL or Format Python makes your (SQL or Python) code more readable and organized. This will help you to identify and fix errors in the code more easily. Additionally, formatting your code also makes it easier for other team members to read and understand your code, which can improve collaboration and productivity. You should use this feature as much as possible.

You can find the “Format SQL code” option in the “Edit” section. Or you can simply press Ctrl + Shift + F.

You can also go to a particular cell and format just that cell. After using the “Format SQL” or “Format Python” option, your code is rewritten with consistent indentation, spacing, and line breaks.
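For illustration, here is a small sketch (the DataFrame and column names are hypothetical) of the kind of change “Format Python” makes to a cell:

# Hypothetical cell BEFORE formatting: cramped spacing and inconsistent style
from pyspark.sql import SparkSession
spark=SparkSession.builder.getOrCreate()
df=spark.createDataFrame([(1,250),(2,-40)],["customer_id","amount"])
df_filtered=df.filter(df["amount"]>0).select("customer_id","amount")

# The same cell AFTER "Format Python": consistent spacing around operators and commas
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 250), (2, -40)], ["customer_id", "amount"])
df_filtered = df.filter(df["amount"] > 0).select("customer_id", "amount")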

2. Select programming languages

Selecting the right programming language for your Databricks workspace is important to ensure that your workload is efficient, reliable, and scalable. The languages available also depend on the cluster access mode:

Spark, the underlying processing engine of Databricks, is developed in Scala and optimized for distributed computing, so Scala has native support in Spark. We therefore recommend the Scala programming language where performance matters, as Scala code is generally seen to run faster than equivalent Python or SQL code.

3. Directly View the content of a file

Directly viewing files is useful for debugging code, as you can easily examine the contents of the files your code is working with to identify problems. There are better ways to view the contents of a file than loading the whole dataset into a DataFrame, which is a costly operation.

Here is an example of directly viewing the content of a file by using the display function and the dbutils utility:

display(dbutils.fs.head("/path/to/file.csv"))

In this example, the display function is used to show the contents of the CSV file located at “/path/to/file.csv”, and the dbutils.fs.head() function reads only the beginning of the file. Reading the file with the DataFrame API just to view it would be a much more expensive operation.
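As a follow-up, dbutils.fs.head also accepts a second argument that limits how many bytes are read, which is handy for previewing large files (the path below is a placeholder):

# Preview only the first 1,000 bytes of a potentially large file
print(dbutils.fs.head("/path/to/file.csv", 1000))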

4. Use AutoComplete to avoid Typographical errors

AutoComplete is a helpful feature in Databricks that can help you avoid typographical errors by providing suggestions to complete your code as you type. For example, when I pressed Tab after typing the first two letters, it started suggesting the available commands.

By using AutoComplete to select the correct method name, you can avoid typographical errors and improve the accuracy of your code.

5. Use Key Vault for Storing Access Keys

You should avoid hardcoding any sensitive information in your code. Instead, it is advisable to store such information, including storage account keys, database usernames, database passwords, and other sensitive data, in a key vault. By doing so, you can enhance the security of your data and avoid any potential risks. In Databricks, you can access the key vault via a secret scope to securely retrieve sensitive information when needed.

I have written detailed blogs on this topic:

1. How to create Azure Key Vault-backed secret scope?

2. How to create and use Databricks-backed secret scope?
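As a minimal sketch of how this looks in a notebook (the scope and key names below are hypothetical placeholders), secrets are fetched at runtime instead of being hardcoded:

# List the secret scopes available in the workspace
display(dbutils.secrets.listScopes())

# Fetch a secret from a scope; the value is redacted if you try to print it
db_password = dbutils.secrets.get(scope="my-keyvault-scope", key="sql-db-password")

# Use the secret where needed, e.g. as a JDBC connection property, without ever
# exposing it in the notebook source or output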

6. Use Azure Data Lake Gen 2 Storage

Using Data Lake Storage helps an organization effectively manage large volumes of data while providing scalability, integration, cost savings, security, and performance benefits. It is recommended to use ADLS Gen 2 for large enterprises. ADLS Gen 2 supports these features:

  • A hierarchical namespace allows ADLS Gen2 to provide file system performance at object storage scale.
  • HDFS-compatible storage, well suited for big data workloads like Databricks.
  • Storage optimized for big data analytics workloads.
  • Better storage retrieval performance.
  • Low cost for analytics.
  • Designed for enterprise big data analytics.
  • Supports open-source platforms as well.


Here is an example of how to configure access to Azure Data Lake Storage from Databricks using Python and OAuth 2.0 with a service principal, so that you can read and write DataFrames:

service_credential = dbutils.secrets.get(scope="<scope>",key="<service-credential-key>")

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")adls_path)

Similarly, you can use a SAS token to access Data Lake Gen 2:

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "<token>")

Or you can use account access keys:

spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
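Once access is configured with any of the options above, you can read and write DataFrames against abfss:// paths. Here is a minimal sketch, with <storage-account> as above, <container> as an additional placeholder for your container name, and a small hypothetical DataFrame:

# Hypothetical DataFrame used only for illustration
df = spark.createDataFrame([(1, "2022-01-01"), (2, "2022-01-02")], ["id", "txn_date"])

# Write the DataFrame to ADLS Gen2 using the configured credentials
df.write.mode("overwrite").parquet(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/raw/transactions")

# Read it back to verify
display(spark.read.parquet(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/raw/transactions"))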

7. Notebook Organization

Organizing your notebooks in a Databricks workspace can help you stay organized and work more efficiently. Suppose you have multiple teams working in a single Databricks workspace. In such a situation, it is advisable to create separate folders for each team or project, and to delete unused notebooks that are no longer needed.

Organizing your notebooks in Databricks improves the efficiency of your workflow and makes it easier to find and collaborate on your notebooks.
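For example, a simple team-based layout might look like this (folder names are purely illustrative; /Shared and /Users are the standard workspace folders):

/Shared
    /data-engineering
        /project-a        (ETL notebooks for project A)
        /common-utils     (reusable helper notebooks)
    /data-science
        /project-b        (model training and evaluation notebooks)
/Users
    /<user>@<company>.com (personal scratch area for experiments)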

8. Use the ‘Advisor’ option

The Advisor option helps users to optimize their queries and improve query performance. The Advisor uses machine learning algorithms to analyze query execution plans and suggest improvements based on best practices and past performance data.

To enable the ‘Advisor’ option, go to ‘User Settings’ and turn on the toggle next to the ‘Databricks Advisor’ option.

For example, here is the Advisor recommendation for using the DBIO cache for a query.

9. Use ADF for invoking Databricks notebooks

Azure Data Factory (ADF) is a cloud-based data integration service that allows you to create, schedule, and orchestrate workflows that move and transform data. It is easier to establish notebook dependencies in ADF than in Databricks itself. It is convenient to debug a series of notebook invocations in the ADF pipeline.

ADF helps you simplify data integration by providing a visual interface for designing, scheduling, and managing data integration workflows, which makes it easier to create complex pipelines. ADF provides native integration with Databricks, which eliminates the need for custom code and reduces the time and effort required to integrate Databricks with other data sources. In Azure, Databricks is a first-party service, and you can connect, ingest, and transform data in a single workflow.

10. Notebook Chaining

Notebook chaining is a process of connecting multiple notebooks to form a sequence or pipeline of data processing tasks. This allows users to break down complex data processing workflows into smaller, more manageable steps.

Notebook chaining is accomplished by using the ‘%run’ command in a notebook to execute another notebook. When the ‘%run’ command is executed, the specified notebook is run and any output generated by that notebook is returned to the calling notebook. This output can then be used as input for subsequent notebooks in the pipeline.

It is recommended to include all commonly used operations, such as read/write on the Data Lake, SQL Database, etc., in one notebook. The same notebook can be used to set the Spark configuration, mount the ADLS path to DBFS, fetch secrets from the secret scope, and so on.

Notebook chaining will save you time and effort by avoiding duplication of code, making your workflow more efficient, and improving code quality.

Let’s assume you have two notebooks, Notebook A and Notebook B. You want to execute Notebook A first and then use its output as the input to Notebook B.

In Notebook A, you need to save the output data that you want to pass to Notebook B. You can do this by writing the output data to a file or database.

# code in Notebook A
# read input data
df = spark.read.csv("input_data.csv", header=True)

# perform data transformations
df_transformed = df.filter("column1 > 0")

# write output data to a DBFS path that Notebook B will read
df_transformed.write.csv("dbfs:/FileStore/output_data.csv", mode="overwrite", header=True)

In Notebook B, you need to read the output data from Notebook A and use it as the input for further processing.

# code in Notebook B

# read the output data written by Notebook A
df = spark.read.csv("dbfs:/FileStore/output_data.csv", header=True)

# perform further data transformations
df_transformed = df.groupBy("column2").count()

# write output data to a file
df_transformed.write.csv("final_output_data.csv", mode="overwrite", header=True)

To automate the execution of Notebook A followed by Notebook B, you can use the Databricks Jobs feature: create a driver notebook that runs both notebooks in sequence using the %run command, and schedule that driver notebook as a job.

%run "/path_to/Notebook A"
%run "/path_to/Notebook B"

You can schedule the job to run at a specific time, or trigger it manually as needed. This is a basic example of notebook chaining in Databricks, but the approach can be extended to chain together multiple notebooks in more complex data workflows.
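Besides %run, which executes the other notebook in the same context, you can also chain notebooks with dbutils.notebook.run, which runs the target notebook as a separate ephemeral job and lets you pass parameters and receive a return value. A minimal sketch (the path, timeout, and parameter name are hypothetical):

# In the driver notebook: run Notebook A with a 600-second timeout, passing a
# parameter that Notebook A can read via dbutils.widgets.get("run_date")
result = dbutils.notebook.run("/path_to/Notebook A", 600, {"run_date": "2022-01-01"})

# Notebook A can return a value with dbutils.notebook.exit("<some status or path>"),
# which is available here as the string stored in `result`
print(result)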

11. Use Widget Variables

Widget variables allow users to create interactive widgets that can be used to adjust parameters or input data directly from the Databricks notebook interface. Widget variables also let your Databricks code receive configuration details: when a notebook is invoked from an ADF pipeline, the values of pipeline variables are passed into the notebook’s widget variables.

Let’s assume you have a dataset of customer transactions, and you want to filter the data based on the transaction date using a Databricks Widget variable.

STEP 1. First, you need to create the Databricks Widget variable. In the notebook cell, add the following code:

# create widget for transaction date
dbutils.widgets.text("transaction_date", "2022-01-01", "Transaction Date (yyyy-mm-dd)")

This will create a text box widget with the label “Transaction Date (yyyy-mm-dd)” and a default value of “2022-01-01”.

STEP 2. Next, you can use the widget variable to filter the transaction data. In the next cell, add the following code:

# get transaction date from widget variable
transaction_date = dbutils.widgets.get("transaction_date")

# read transaction data
df = spark.read.csv("transaction_data.csv", header=True)

# filter data by transaction date
df_filtered = df.filter(f"transaction_date >= '{transaction_date}'")

# display filtered data
display(df_filtered)

This code uses the dbutils.widgets.get function to retrieve the value of the “transaction_date” widget variable. It then reads in the transaction data from a CSV file and filters the data based on the transaction date using the filter function. Finally, it displays the filtered data using the display function.

STEP 3. Now you can test the widget variable by running the notebook and entering a new transaction date in the text box. The notebook will automatically update the filtered data based on the new input.

12. Use a Sandbox environment

A sandbox environment in Databricks Workspace is a separate area where users can experiment and develop code without affecting the production environment. It provides a safe and isolated environment where users can try out new features, test code, and experiment with different configurations.

A sandbox environment provides a place where users can make changes and test them without risking production workloads. This will reduce the risk of error and issues as well as provide a controlled environment for testing new ideas and approaches.

13. Establish a COE (Center of Excellence) team

Establishing a COE team in Databricks Workspace can help ensure that the platform is used effectively and efficiently across the organization while minimizing risks and maximizing benefits. It can also help foster a culture of innovation and collaboration, and enable the organization to fully leverage the power of Databricks for data processing, analysis, and reporting. A COE establishes guidelines for data privacy and security, ensuring that sensitive information is protected at all times.

14. Use a Log Analytics Workspace

Using a Log Analytics workspace with your Databricks workspace helps you monitor and analyze logs generated by various Databricks services, such as clusters, jobs, and notebooks. By analyzing these logs, you can gain insights into the performance and usage of your Databricks environment, identify issues, and troubleshoot problems.

Monitoring Databricks resources will help you to choose the right size for your cluster and virtual machine (VM). Each VM has limits that affect how well a job runs on Azure Databricks. To see how much a cluster is being used, you can install a Log Analytics Agent on each cluster node to stream data to an Azure Log Analytics Workspace.

15. Do not store data in the default DBFS (Databricks File System)

Every workspace comes with a default DBFS used to store libraries, scripts, and more. You should not store important data in the default DBFS, because if you delete the workspace, the default DBFS and all its contents will also be deleted permanently.

To avoid such issues, store important data outside the default DBFS root, in dedicated storage locations with appropriate access controls and permissions. This helps ensure that data is secure, organized, and easily accessible.
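As a minimal sketch of the difference (the storage account, container, and table paths are hypothetical), data that must survive the workspace should be written to external cloud storage rather than the default DBFS root:

# Risky: anything written under the default DBFS root is lost if the workspace is deleted
# df.write.format("delta").save("/FileStore/tables/customers")

# Safer: write to external storage (ADLS Gen2 here) that exists independently of the workspace
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/curated/customers")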

Conclusion

Databricks is a platform that provides a collaborative workspace for data teams. By following the above best practices, you will be able to organize, secure, and scale your Databricks workspace. Using a structured folder hierarchy, version control, appropriate data storage, job scheduling, and access controls will help you collaborate effectively, automate tasks, and manage data-related workloads efficiently.
