As a data and AI engineer, you are tasked with ensuring that all operations run smoothly. But how do you ensure that the information stored in Azure Databricks is managed correctly? The answer lies in Unity Catalog, which provides users with a central catalog of tables, views, and files for easy retrieval. In this blog post, we’ll demystify what the Azure Databricks Unity Catalog really does and discuss best practices for using it for governance within your organization’s data & analytics environment.
What is Azure Databricks Unity Catalog and what are its benefits
Azure Databricks Unity Catalog is a fully managed, unified governance solution for the data and AI assets in your Azure Databricks environment. Its significance derives from two core capabilities: search & discovery and data lineage tracing. Unity Catalog acts as a central control point for seamless access to all organizational data across the Azure environment. This helps improve privacy, reduces the cost and effort of manually adhering to data governance processes, and enables teams to perform analytics tasks faster. Unity Catalog also provides valuable insights into the structure of the organization’s data ecosystem, helping users make better decisions about how resources are utilized.
Unity Catalog uses a hierarchical structure: the metastore is the top-level container, each metastore holds catalogs, each catalog holds schemas, and each schema holds tables and views. This is explained in the diagram:
Multiple workspaces can be attached to a single Unity Catalog metastore, with each workspace’s users working against specific catalogs according to the permissions assigned to them.
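Concretely, this hierarchy is what gives every object its address: because a metastore contains catalogs, a catalog contains schemas, and a schema contains tables, any table can be reached by a three-level name. A minimal SQL sketch of the idea (the catalog, schema, and table names here are illustrative, not part of the deployment below):

-- The three levels of the hierarchy map onto the three-level name:
-- <catalog>.<schema>.<table>
CREATE CATALOG IF NOT EXISTS sales_catalog;               -- level 1: catalog (lives in the metastore)
CREATE SCHEMA IF NOT EXISTS sales_catalog.finance;        -- level 2: schema
CREATE TABLE IF NOT EXISTS sales_catalog.finance.invoices -- level 3: table
  (id INT, amount DOUBLE);
SELECT * FROM sales_catalog.finance.invoices;             -- fully qualified query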
Unity Catalog uses a component called an access connector (which can be created via the Azure Portal or the Azure CLI). Unity Catalog reaches the underlying data sources through the access connector’s managed identity; for example, the identity is granted the Storage Blob Data Contributor role on the storage account. Azure Databricks uses that same managed identity via the connector, which in turn has direct access to the data source. This simplifies the overall access approach without sacrificing security.
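Once the connector’s managed identity holds the Storage Blob Data Contributor role, the storage can be surfaced in Unity Catalog through a storage credential and an external location. A hedged SQL sketch (it assumes a storage credential named metastorecred backed by the connector’s identity, as created in the deployment script later in this post; the path is illustrative):

-- Register the ADLS path as an external location, secured by the credential
CREATE EXTERNAL LOCATION IF NOT EXISTS landing_zone
  URL 'abfss://metastore@unitydemo7.dfs.core.windows.net/landing'
  WITH (STORAGE CREDENTIAL metastorecred);

-- Users granted READ FILES on the location can then browse it directly
LIST 'abfss://metastore@unitydemo7.dfs.core.windows.net/landing';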
How to deploy the Databricks Unity Catalog in less than 5 minutes
Deploying an Azure Databricks Unity Catalog is a relatively straightforward process. First, make sure you have the appropriate permissions to deploy Unity Catalog. Then create the necessary Databricks services and connect to them through Azure AD authentication. Once those prerequisites are in place, you can deploy the desired catalog using the Databricks CLI or SQL notebooks. Following these steps will help ensure a successful deployment in the shortest possible time:
1. Install the Databricks CLI on your local machine by following the instructions in the Databricks documentation.
2. Install the Azure CLI by following these instructions from Microsoft.
3. Create a Databricks workspace if you do not already have one.
4. Install the jq utility (used below to parse JSON output from the CLI).
5. Create a Databricks token by running the command below, and store the token value somewhere safe because it will be used later when connecting to Databricks.
databricks tokens create --lifetime-seconds 129600 --comment "My comment."
6. Connect to your Databricks workspace using the Databricks CLI by running the command below; it will ask for the token saved in step 5.
databricks configure --token
7. Now open PowerShell ISE and copy the script below. Do not worry, we can still run the Azure CLI commands in PowerShell ISE with minor modifications. Feel free to update the values of the variables used in the script.
#1. Set these variable values
$ResourceGroup="unity"
$StorageAccountName="unitydemo7"
$location="eastus"
$container="metastore"
$metastoreName="demoMeta"
$DatabricksConnectorName="UnityAccessConnector"
$storageaccounturi="abfss://"+$container+"@"+$StorageAccountName+".dfs.core.windows.net/"
$storageCredName="metastorecred"
$catalogName="quickstart_catalog"
$schemaName="quickstart_schema"
# How to find the workspace ID:
# open the workspace URL, for example: https://adb-3054609905645524.4.azuredatabricks.net/?o=3054609905645524#
# the last part of the URL, o=3054609905645524#, is the workspace ID
$databricksWS=3054609905645524

#2. Connect to the Azure environment
az login

#3. Create a resource group
az group create --location $location --name $ResourceGroup

#4. Create the storage account and store its resource ID for later use.
# Note that we enable the hierarchical namespace; -o tsv strips the JSON quotes from the output.
$storageAccountID=az storage account create -n $StorageAccountName -g $ResourceGroup -l $location --sku Standard_LRS --enable-hierarchical-namespace --query "id" -o tsv

#4.1 (Optional) If you forgot to enable HNS, use this command to enable the hierarchical namespace
az storage account hns-migration start --type upgrade --name $StorageAccountName --resource-group $ResourceGroup

#5. Retrieve a storage access key for the storage account
$key=az storage account keys list -g $ResourceGroup -n $StorageAccountName --query "[1].value" -o tsv

#6. Create the storage container using the key retrieved earlier
az storage container create --account-name $StorageAccountName --name $container --account-key $key

#7. Create the Databricks access connector and get its resource ID and principal ID
$DatabricksAccessConnectorUri=az databricks access-connector create --resource-group $ResourceGroup --name $DatabricksConnectorName --location $location --identity-type SystemAssigned --query "id" -o tsv
$DatabricksAccessConnectorPrincipalId=az databricks access-connector create --resource-group $ResourceGroup --name $DatabricksConnectorName --location $location --identity-type SystemAssigned --query "identity.principalId" -o tsv

#8. Grant the Databricks access connector the "Storage Blob Data Contributor" role on the storage account
az role assignment create --role "Storage Blob Data Contributor" --assignee $DatabricksAccessConnectorPrincipalId --scope $storageAccountID

#9. Create the metastore in Unity Catalog (-r makes jq return the raw value without quotes)
$metastoreid=databricks unity-catalog metastores create --name $metastoreName --region eastus --storage-root $storageaccounturi | jq -r '.metastore_id'

#10. Assign the workspace to the metastore
databricks unity-catalog metastores assign --workspace-id $databricksWS --metastore-id $metastoreid

#11. Create the Unity Catalog storage credential and capture its credential ID
$credentialID=databricks unity-catalog storage-credentials create --name $storageCredName --az-mi-access-connector-id $DatabricksAccessConnectorUri | jq -r '.azure_managed_identity.credential_id'

#11.1 (Optional) If the credential already exists, use this command to look it up instead
$credentialID=databricks unity-catalog storage-credentials get --name $storageCredName | jq -r '.id'

#12.0 Update the permissions on the storage credential
#12.1 Save this content in a file named PrevilegeforStorageCred.json and reference its path in the next command.
# Feel free to scope the permissions to specific users; the values used here grant all users
# all the permissions, which is not appropriate for a real production environment.
{
  "changes": [
    {
      "principal": "account users",
      "add": [
        "CREATE_EXTERNAL_LOCATION",
        "CREATE_EXTERNAL_TABLE",
        "READ_FILES",
        "WRITE_FILES"
      ]
    }
  ]
}

#12.2 Update the storage credential permissions from the JSON file saved in 12.1
databricks unity-catalog permissions update --storage-credential $storageCredName --json-file "C:\Users\kalpa\OneDrive\Important Scripts\PrevilegeforStorageCred.json"

#13. Assign the storage credential to the metastore created earlier
databricks unity-catalog metastores update --id $metastoreid --storage-root-credential-id $credentialID

#14. Create a catalog
databricks unity-catalog catalogs create --name $catalogName

#15. Create a schema
databricks unity-catalog schemas create --catalog-name $catalogName --name $schemaName

### Some useful commands for Unity Catalog
#1. Get the details of the current metastore
databricks unity-catalog metastores get-summary
#2. List the workspaces
databricks workspace ls
#3. List your tokens
databricks tokens list
Note: You need appropriate admin permissions to perform the above steps; check with your Databricks admin if you are missing any.
8. Now that the catalog and schema are created, you can create a table, insert records into it, and view it using the SQL commands below.
--1. Set the current schema
USE quickstart_schema;

--2. Create a table in the schema
CREATE TABLE IF NOT EXISTS quickstart_table
  (columnA Int, columnB String) PARTITIONED BY (columnA);

--3. Insert two records into the managed Delta table
INSERT INTO TABLE quickstart_table VALUES (1, "one"), (2, "two");

--4. View all tables in the schema
SHOW TABLES IN quickstart_schema;

--5. Describe this table
DESCRIBE TABLE EXTENDED quickstart_table;

--6. Query the table using the three-level namespace
SELECT * FROM quickstart_catalog.quickstart_schema.quickstart_table;

--6.1 Another way: set the default catalog, then query using the schema and table name
USE CATALOG quickstart_catalog;
SELECT * FROM quickstart_schema.quickstart_table;

--6.2 Set the default catalog and the default schema, then query using only the table name
USE CATALOG quickstart_catalog;
USE quickstart_schema;
SELECT * FROM quickstart_table;
Alternatively, you can use the following SQL commands after step 13 of the PowerShell script above; they create the catalog and schema in SQL instead of via the CLI.
--1. Create a new catalog
CREATE CATALOG IF NOT EXISTS quickstart_catalog;

--2. Set the current catalog
USE CATALOG quickstart_catalog;

--3. Show all catalogs in the metastore
SHOW CATALOGS;

--4. Grant CREATE SCHEMA, CREATE TABLE, CREATE VIEW, and USE CATALOG permissions to all users on the account
--   This also works for other account-level groups and individual users
GRANT CREATE SCHEMA, CREATE TABLE, CREATE VIEW, USE CATALOG ON CATALOG quickstart_catalog TO `account users`;

--5. Check grants on the quickstart catalog
SHOW GRANT ON CATALOG quickstart_catalog;

--6. Create a new schema in the quickstart catalog
CREATE SCHEMA IF NOT EXISTS quickstart_schema
  COMMENT "A new Unity Catalog schema called quickstart_schema";

--7. Show schemas in the selected catalog
SHOW SCHEMAS;

--8. Describe a schema
DESCRIBE SCHEMA EXTENDED quickstart_schema;

--9. Set the current schema
USE quickstart_schema;

--10. Create a table in the schema
CREATE TABLE IF NOT EXISTS quickstart_table
  (columnA Int, columnB String) PARTITIONED BY (columnA);

--11. Insert two records into the managed Delta table
INSERT INTO TABLE quickstart_table VALUES (1, "one"), (2, "two");

--12. View all tables in the schema
SHOW TABLES IN quickstart_schema;

--13. Describe this table
DESCRIBE TABLE EXTENDED quickstart_table;

--14. Query the table using the three-level namespace
SELECT * FROM quickstart_catalog.quickstart_schema.quickstart_table;

--15. Another way: set the default catalog, then query using the schema and table name
USE CATALOG quickstart_catalog;
SELECT * FROM quickstart_schema.quickstart_table;

--16. Set the default catalog and the default schema, then query using only the table name
USE CATALOG quickstart_catalog;
USE quickstart_schema;
SELECT * FROM quickstart_table;
How to use the Databricks Unity Catalog in your projects
To use Databricks Unity Catalog in your projects, you will first need access to a Databricks workspace. Once you have access, you can follow these steps:
1. Go to the Databricks workspace and navigate to the Catalog tab.
2. In the Catalog tab, you can create, view, and manage tables, databases, and other data assets.
- To create a new table, select the catalog and schema in SQL and then run a CREATE TABLE statement defining the column structure. You can also create an external table pointing to data stored outside Databricks (see the sketch after this list).
- To view an existing table, click on the “Tables” button and select the table you want to view. You can also search for tables using the search bar at the top of the Catalog tab.
- To view the metadata for a table, click on the table and then click on the “Edit” button. You can add or modify the table’s properties and tags, and also add or modify the schema for the table.
- You can also set permissions on a table or catalog; for example, the walkthrough above granted all privileges on the quickstart catalog to all account users, something you would scope down in production.
- You can also use the Databricks API to interact with the catalog programmatically.
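As referenced above, here is a minimal sketch of the two table variants: a managed table, where Unity Catalog owns both data and metadata, and an external table, whose data stays at a path outside Databricks. The orders names and the external path are illustrative, and the path assumes an external location like the one sketched earlier:

-- Managed table: data is stored in the metastore’s managed storage
CREATE TABLE IF NOT EXISTS quickstart_catalog.quickstart_schema.orders
  (order_id INT, status STRING);

-- External table: only the metadata lives in Unity Catalog
CREATE TABLE IF NOT EXISTS quickstart_catalog.quickstart_schema.orders_ext
  (order_id INT, status STRING)
  LOCATION 'abfss://metastore@unitydemo7.dfs.core.windows.net/landing/orders';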
Tips and tricks for working with the catalog
Here are some tips and tricks for working with Databricks Unity Catalog:
- Use tags to organize your tables: You can add tags to tables to help you organize and search for them more easily. For example, you could tag all tables related to a specific project or department with a specific tag (see the SQL sketch after this list).
- Utilize the search function: The search function in the Catalog tab allows you to quickly find tables based on their name, description, or tags. This can be especially useful if you have a large number of tables in your workspace.
- Use the table lineage feature: Databricks allows you to view the lineage of a table, including the source and destination of data, as well as the transformations that were applied to it. This can be helpful in understanding where your data came from and how it was processed.
- Utilize the schema management feature: Unity Catalog allows you to manage the schema of your tables, including adding new columns, modifying existing columns, and removing columns. This can be useful if you need to change the structure of your data over time. Note that these schema changes are made with SQL, Python, or Scala code in notebooks (see the sketch after this list).
- Use the Databricks API: The Databricks API allows you to interact with the Catalog programmatically, which can be useful for automating tasks such as creating or modifying tables, or for building custom applications that interact with the Catalog.
- Use Databricks notebooks to work with the catalog: You can use notebooks to work with catalog data, create visualizations, perform data science and machine learning tasks, and share your work with others.
- Use the Catalog as your single source of truth: As much as possible, use the Catalog as the single source of truth for your data. This will make it easier to understand where your data is coming from, how it’s being used, and who’s responsible for it.
- Use the Databricks access control feature: Databricks allows you to control access to your data by setting up fine-grained access controls, such as allowing only certain users or groups to view or modify certain tables or databases (a sketch follows below).
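A few of the tips above expressed as SQL, as promised. This is a sketch with illustrative table, tag, and group names; in particular, tag support (SET TAGS) may depend on your Databricks runtime version, so verify it against the current documentation:

-- Tag a table so it can be organized and searched by project
ALTER TABLE quickstart_catalog.quickstart_schema.quickstart_table
  SET TAGS ('project' = 'quickstart', 'department' = 'analytics');

-- Evolve the schema over time by adding a column
ALTER TABLE quickstart_catalog.quickstart_schema.quickstart_table
  ADD COLUMNS (columnC TIMESTAMP);

-- Fine-grained access control: grant read access to one group only
GRANT SELECT ON TABLE quickstart_catalog.quickstart_schema.quickstart_table
  TO `data_analysts`;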
FAQs about Azure Databricks Unity Catalog
1. What is the Azure Databricks Unity Catalog?
Answer: Azure Databricks Unity Catalog is a feature of the Azure Databricks platform that allows you to manage and discover data assets, such as tables and databases, across your organization.
2. How does Unity Catalog differ from the traditional Hive metastore?
Answer: Unity Catalog supersedes the workspace-local Hive metastore, providing a more user-friendly interface, more advanced search and discovery capabilities, and improved support for non-Hive data sources. It also integrates with other Azure services such as Data Factory and Data Lake Storage.
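The difference is easy to see in a Unity Catalog-enabled workspace, where the legacy Hive metastore remains reachable as a catalog named hive_metastore (the table names below are illustrative):

-- Legacy tables sit under the workspace-local hive_metastore catalog
SELECT * FROM hive_metastore.default.legacy_table;

-- Unity Catalog tables use the governed three-level namespace shared across workspaces
SELECT * FROM quickstart_catalog.quickstart_schema.quickstart_table;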
3. How can I access Unity Catalog?
Answer: You can access Unity Catalog by navigating to the Catalog tab in your Azure Databricks workspace.
4. How can I create a new table in Unity Catalog?
Answer: You can create a new table in Unity Catalog by writing SQL that defines the table’s format and structure. You can also create an external table pointing to data stored outside Databricks.
5. Can I use Unity Catalog to manage my data outside of Azure Databricks?
Answer: Yes, Unity Catalog can manage data stored in other locations, such as Azure Data Lake Storage, by creating external tables that point to that data.
6. Can I use Unity Catalog to manage my data’s metadata?
Answer: Yes, Unity Catalog allows you to manage the metadata of your data assets, such as adding or modifying properties and tags, and adding or modifying a table’s schema.
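For example, comments and table properties can be managed directly in SQL; a small sketch with illustrative values:

-- Attach a human-readable description to the table
COMMENT ON TABLE quickstart_catalog.quickstart_schema.quickstart_table
  IS 'Quickstart demo table from the deployment walkthrough';

-- Add or modify custom table properties
ALTER TABLE quickstart_catalog.quickstart_schema.quickstart_table
  SET TBLPROPERTIES ('owner_team' = 'data-platform');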
7. How can I search and discover data assets in Unity Catalog?
Answer: You can search and discover data assets in Unity Catalog by using the search function in the Catalog tab, by browsing the available tables and databases, or by using the table lineage feature to view a table’s lineage.
8. Can I control access to my data in Unity Catalog?
Answer: Yes, Azure Databricks allows you to control access to your data with fine-grained access controls, such as allowing only certain users or groups to view or modify specific tables or databases.
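For instance, access can be granted at the schema level and withdrawn for a single sensitive table; the group name below is a placeholder:

-- Allow a group to use and read everything in one schema
GRANT USE SCHEMA, SELECT ON SCHEMA quickstart_catalog.quickstart_schema
  TO `data_analysts`;

-- Revoke access to one table and verify the result
REVOKE SELECT ON TABLE quickstart_catalog.quickstart_schema.quickstart_table
  FROM `data_analysts`;
SHOW GRANTS ON TABLE quickstart_catalog.quickstart_schema.quickstart_table;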
Conclusion
In conclusion, Databricks Unity Catalog offers a powerful set of features that make working with data easier and more efficient. By using tags to organize tables, using the search function, viewing table lineage, managing schemas, working with the catalog through the API and notebooks, treating the Catalog as a single source of truth, and taking advantage of Databricks’ access control features, any organization can make its data easier to understand and leverage.