Databricks is a data analytics platform built on the popular open-source Apache Spark analytics and data processing engine. Azure Databricks is the fully managed version of Databricks on Azure, a premium offering that brings you an enterprise-grade, secure, cloud-based Big Data and Machine Learning platform.
Data can be ingested into Azure Databricks in a variety of ways. For real-time Machine Learning projects, you can ingest data through a wide range of technologies, including Kafka, Event Hubs, or IoT Hubs. In addition, you can ingest batches of data using Azure Data Factory from a variety of data stores, including Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse, which can then be processed by the Spark-based engine within Databricks.
In this article, we are going to connect Databricks to Azure Data Lake Store.
Here is the step-by-step approach to establishing connectivity from Databricks to the Data Lake:
STEP 1: Create a Resource Group
Azure Data Lake can manage its encryption key on its own, or we can store the key in Azure Key Vault. Here we are going to create a Data Lake Store and an Azure Key Vault to store the Data Lake key.
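If you prefer scripting over the portal, the resource group can also be created with the Az PowerShell module. This is just a sketch; the resource group name and region below are placeholders, not values taken from this walkthrough:

# Sign in and create a resource group to hold everything used in this article
Connect-AzAccount
New-AzResourceGroup -Name "rg-databricks-datalake" -Location "East US"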
STEP 2: Create an Azure Data Lake Store and a Key Vault, which will store the Data Lake key.
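For reference, a rough PowerShell equivalent of this step is shown below. The Data Lake and Key Vault names match the ones used later in this article, while the resource group and region are the placeholder values from the sketch above:

# Create the Data Lake Store (Gen1) account
New-AzDataLakeStoreAccount -ResourceGroupName "rg-databricks-datalake" -Name "yourdatalakename" -Location "East US"

# Create the Key Vault that will hold the Data Lake encryption key
New-AzKeyVault -ResourceGroupName "rg-databricks-datalake" -Name "myownkeyvaultfordatalake" -Location "East US"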
STEP 3: Create a Key to access the Azure Data Lake.
Here I have used the key name mykey, but you can use any name.
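The same key can also be created from PowerShell. This sketch assumes the vault name used earlier in this article:

# Create a software-protected key named mykey in the Key Vault
Add-AzKeyVaultKey -VaultName "myownkeyvaultfordatalake" -Name "mykey" -Destination "Software"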
STEP 4: After creating the key, the Data Lake Store cannot use the key in the Key Vault until you grant it access, so the portal will show this error:
So you need to grant access, either through the portal or by using PowerShell. Here are the PowerShell cmdlets:
Connect-AzAccount
Get-AzSubscription

# Copy the subscription id from the Get-AzSubscription output and use it in the command below
Select-AzSubscription -Subscriptionid YOURSUBSCRIPTIONID

# The ObjectId is the identity of the Data Lake Store account that needs access to the key
Set-AzKeyVaultAccessPolicy -VaultName myownkeyvaultfordatalake -ObjectId bbdd521b-b7ab-4074-b244-5e619654e1fd -PermissionsToKeys encrypt,decrypt,get
Once the permission is granted, it will look like this.
STEP 5: Create an App Registration in Azure AD. Go to Azure AD, click App registrations, and choose the supported account types. Please note that you can give any placeholder URL in the Redirect URI option.
Once the App Registration is complete, it will show you this screen. Copy the Application ID and Directory ID to a notepad because we will use them later in the Databricks workspace.
The service principal will look like this.
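If you would rather script the app registration, a minimal PowerShell sketch is shown below. The display name is a placeholder I have made up, and the exact property names (for example AppId vs. ApplicationId) vary slightly between Az module versions:

# Register an application and create its service principal
$app = New-AzADApplication -DisplayName "databricks-adls-app"
$sp  = New-AzADServicePrincipal -ApplicationId $app.AppId

# The values you will paste into the Databricks notebook later
$app.AppId                    # Application (client) ID
(Get-AzContext).Tenant.Id     # Directory (tenant) ID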
STEP 6: Create the client secret and save it in the notepad.
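The client secret can also be generated from PowerShell. This is a sketch that assumes the $app variable from the previous sketch and a recent, Microsoft Graph-based Az.Resources module; copy the secret value right away because it cannot be retrieved later:

# Create a client secret on the app registration, valid for one year
$cred = New-AzADAppCredential -ObjectId $app.Id -EndDate (Get-Date).AddYears(1)
$cred.SecretText              # the client secret value to save in your notepad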
STEP 7: Now create a Databricks cluster and make sure the workspace is on the Premium plan. Once the cluster is up and running, create a Python notebook and paste the code below. There are several ways to provide the service principal credentials created in the previous steps; here we place them directly in a configuration dictionary. Please note the Client ID, client secret, and Directory ID taken from the previous steps.
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": "<Application ID from STEP 5>",
           "dfs.adls.oauth2.credential": "<Client secret from STEP 6>",
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<Directory ID from STEP 5>/oauth2/token"}
STEP 8: Before using the Data Lake from Databricks, we need to mount it.
# Mount the ADLS
dbutils.fs.mount(
    source = "adl://yourdatalakename.azuredatalakestore.net",
    mount_point = "/mnt",
    extra_configs = configs)
STEP 9: Verify that the Data Lake contents are accessible through Databricks.
Suppose you have a RAW folder in the Data Lake and this folder contains a CSV file named export.csv. It can then be accessed like this.
# Access the file
dbutils.fs.ls("/mnt/RAW/export.csv")
Now let’s create a table from it.
%sql
-- Create a SQL table and import the data
DROP TABLE IF EXISTS incident_data;
CREATE TABLE incident_data
USING CSV
OPTIONS (
  path "/mnt/RAW/export.csv",
  header "true",
  inferSchema "true"
)
Now browse the results in Databricks:
%sql
-- Browse the data
SELECT * FROM incident_data