From Slow to Go: How to Optimize Databricks Performance Like a Pro

Are you tired of waiting around for your big data to process? It’s time to take matters into your own hands and optimize your Databricks performance like a pro! With the right tips and tricks, you can transform sluggish data processing into lightning-fast insights. In this guide, we’ll show you how to go from slow to go with Databricks performance optimization. Get ready to supercharge your big data processing and unlock the full potential of your business’s data-driven decisions!

Table of Contents

  1. Adaptive Query Execution
  2. How to handle Data Skew?
  3. How to prevent the spill

Adaptive Query Execution

Adaptive Query Execution is a feature of Databricks that dynamically adjusts query execution based on the characteristics of the data and the resources available in the cluster. Here’s an example of how Adaptive Query Execution can improve query performance in Databricks:

Let’s say we have a large dataset containing customer orders from an e-commerce website. We want to analyze the data to identify the top-selling products by category. We write a SQL query to group the data by product category, sum the sales amounts, and then sort the results by the sales amounts in descending order:

SELECT product_category, SUM(sales_amount) as total_sales
FROM orders
GROUP BY product_category
ORDER BY total_sales DESC

Initially, Databricks runs the query using a hash aggregation algorithm, which works well when the data fits into memory. However, if the dataset is too large to fit into memory, the hash aggregation algorithm can become very slow and may even cause out-of-memory errors.

With Adaptive Query Execution, Databricks can dynamically switch to a different algorithm, such as a sort-based aggregation, if the hash aggregation algorithm becomes too slow. The sort-based aggregation algorithm uses disk storage to sort the data before aggregating it, which can be slower than the hash-based algorithm but can handle larger datasets.

If the dataset is too large to fit into memory or the aggregation is taking too long, Databricks can also automatically adjust the number of shuffle partitions at runtime to improve parallelism and reduce the time it takes to complete the query.

In this way, Adaptive Query Execution can improve query performance and prevent out-of-memory errors by dynamically adjusting query execution based on the characteristics of the data and the resources available in the cluster.
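If you want to confirm whether Adaptive Query Execution is actually being applied to a query like the one above, one simple check is to look at the physical plan. Here’s a minimal sketch in Scala (assuming a Spark 3.x Databricks Runtime and the orders table from the example); when AQE is enabled, the plan is wrapped in an AdaptiveSparkPlan node:

// Build the same aggregation as a DataFrame and inspect its physical plan
val topSelling = spark.sql("""
  SELECT product_category, SUM(sales_amount) AS total_sales
  FROM orders
  GROUP BY product_category
  ORDER BY total_sales DESC
""")

// With spark.sql.adaptive.enabled=true the plan root shows AdaptiveSparkPlan
topSelling.explain()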

How to enable Adaptive Query Execution

To enable Adaptive Query Execution in Databricks, follow these steps:

  1. Open your Databricks workspace and go to the cluster where you want to enable adaptive query execution.
  2. Click on the “Advanced Options” tab.
  3. Scroll down to the “Spark” section and find the “Spark Config” field.
  4. In the “Spark Config” field, add the following configuration property: spark.sql.adaptive.enabled=true
  5. Click on “Confirm” to save the changes.
  6. Restart the cluster for the changes to take effect.

Once you have enabled adaptive query execution, Spark will automatically optimize your queries based on the data and resources available. This can improve the performance of your queries and reduce the time it takes to execute them.
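If you’d rather not edit the cluster configuration, you can also turn Adaptive Query Execution on for the current session from a notebook. Here’s a minimal sketch (Spark 3.x, where these settings are available):

// Enable Adaptive Query Execution for the current Spark session
spark.conf.set("spark.sql.adaptive.enabled", "true")

// Optional companion settings: coalesce small shuffle partitions and let AQE
// split skewed partitions during joins
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")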

How to handle Data Skew?

Data skew is a common problem in big data processing where certain partitions or data elements contain significantly more data than others. This can cause performance issues and can even lead to out-of-memory errors. Here’s an example of data skew in Databricks and a solution to address it:

Let’s say we have a large dataset of customer orders, and we want to calculate the total revenue by product category. We write a SQL query to group the data by product category and sum the sales amounts:

SELECT product_category, SUM(sales_amount) as total_sales
FROM orders
GROUP BY product_category
ORDER BY total_sales DESC

However, upon running this query, we notice that some of the partitions have significantly more data than others, leading to a data skew problem. For example, the “Electronics” category might have 10 times more transactions than the “Books” category, causing the query to take much longer to complete.

One solution to this problem is to use a technique called data repartitioning, where we redistribute the data across partitions to achieve a more even distribution. In Databricks, we can use the repartition() function to achieve this. Here’s an example:

// Spark SQL functions used below
import org.apache.spark.sql.functions.{desc, sum}

// Repartition the data into 100 partitions to spread the rows more evenly
val orders = spark.table("orders").repartition(100)

// Group the data by product category and sum the sales amounts
val result = orders.groupBy("product_category")
  .agg(sum("sales_amount").as("total_sales"))
  .orderBy(desc("total_sales"))

In this example, we use the repartition() function to redistribute the data into 100 partitions. This should help to reduce the data skew problem and improve query performance. Note that the optimal number of partitions will depend on the size of the data, the resources available in the cluster, and other factors.
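Before and after repartitioning, it can help to check how the rows are actually spread across partitions. Here’s a minimal sketch using spark_partition_id() (assuming the orders table from the example above):

import org.apache.spark.sql.functions.spark_partition_id

// Count the rows in each partition to see how uneven the current layout is
spark.table("orders")
  .groupBy(spark_partition_id().as("partition"))
  .count()
  .orderBy("count")
  .show(100)

// After repartitioning into 100 partitions the counts should be roughly equal
spark.table("orders")
  .repartition(100)
  .groupBy(spark_partition_id().as("partition"))
  .count()
  .orderBy("count")
  .show(100)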

How to use a combiner to handle Skew

Suppose we have a large dataset of online retail transactions, and we want to compute the total revenue generated by each customer. The data is stored in a table called transactions with the following schema:

root
 |-- transaction_id: integer (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- price: double (nullable = true)

We can use the following SQL query to compute the total revenue generated by each customer:

SELECT customer_id, SUM(quantity * price) AS revenue
FROM transactions
GROUP BY customer_id

However, this query may suffer from data skewness if a small number of customers generate a large portion of the revenue. This can cause a single worker node to handle a large amount of data and slow down the query.

To address this issue, we can use a combiner to perform a partial aggregation of the data before shuffling it over the network. The combiner can operate on the intermediate data produced by the mapper to reduce the amount of data being sent over the network.

Here’s an example of how we can use a combiner in Databricks to handle data skewness:

-- Combiner step: compute a partial revenue per customer within buckets
-- derived from the transaction ID, so each customer's rows are split into
-- several smaller groups before the final aggregation
CREATE OR REPLACE TEMPORARY VIEW revenue_combiner AS
SELECT customer_id,
       transaction_id % 10 AS bucket,
       SUM(quantity * price) AS partial_revenue
FROM transactions
GROUP BY customer_id, transaction_id % 10;

-- Final step: combine the partial results into the total revenue per customer
SELECT customer_id, SUM(partial_revenue) AS total_revenue
FROM revenue_combiner
GROUP BY customer_id

In this example, the revenue_combiner view performs a combiner-style partial aggregation: it splits each customer’s transactions into ten buckets based on the remainder of the transaction ID when divided by 10 and computes a partial revenue per (customer, bucket). Even a very heavy customer is therefore processed as several smaller groups instead of one huge group.

The final query then combines these partial results, summing the bucket-level revenues to produce the total revenue for each customer.

By using the combiner, we can reduce the amount of data being shuffled over the network and improve the performance of the query, especially in the case of data skewness.
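The same two-stage pattern can also be expressed with the DataFrame API. Here’s a minimal sketch, assuming the same transactions table as above:

import org.apache.spark.sql.functions.{col, expr, sum}

// Stage 1: combiner-style partial aggregation per (customer_id, bucket),
// where the bucket is derived from the transaction ID
val partialRevenue = spark.table("transactions")
  .groupBy(col("customer_id"), expr("transaction_id % 10").as("bucket"))
  .agg(sum(expr("quantity * price")).as("partial_revenue"))

// Stage 2: combine the partial sums into the final total per customer
val totalRevenue = partialRevenue
  .groupBy("customer_id")
  .agg(sum("partial_revenue").as("total_revenue"))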

Handling skew with salting

We’ll reuse the scenario from the previous section: the transactions table of online retail purchases and the query that computes the total revenue generated by each customer:

SELECT customer_id, SUM(quantity * price) AS revenue
FROM transactions
GROUP BY customer_id

As before, this query may suffer from data skewness if a small number of customers generate a large portion of the rows, leaving a single worker node to handle far more data than the others.

To address this issue, we can use salting, which adds a random prefix (or an extra column) to each key so that a single hot key is split into many smaller groups. Here’s an example of how we can use salting in Databricks to handle data skewness:

-- Step 1: tag each row with a random salt between 0 and 99, then compute a
-- partial sum per (customer_id, salt) so a hot customer is split into up to
-- 100 smaller groups
WITH salted AS (
  SELECT customer_id,
         CAST(FLOOR(RAND() * 100) AS INT) AS salt,
         quantity * price AS line_revenue
  FROM transactions
),
partial AS (
  SELECT customer_id, salt, SUM(line_revenue) AS revenue
  FROM salted
  GROUP BY customer_id, salt
)
-- Step 2: remove the salt by summing the partial results per customer
SELECT customer_id, SUM(revenue) AS revenue
FROM partial
GROUP BY customer_id

In this example, we use the RAND() and FLOOR() functions to generate a random salt between 0 and 99 for each row and first aggregate by (customer_id, salt). Because a hot customer’s rows are spread across up to 100 salt values, no single task has to process all of them.

The outer query then removes the salt by summing the partial results per customer, producing the same totals as the original query while keeping the work distributed evenly across the nodes.

By using salting, we can reduce the amount of data being handled by each worker node and improve the performance of the query, especially in the case of data skewness.
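For completeness, here’s a minimal sketch of the same salting approach with the DataFrame API, again assuming the transactions table from the example:

import org.apache.spark.sql.functions.{expr, floor, rand, sum}

// Stage 1: attach a random salt (0-99) to every row, then compute a partial
// sum per (customer_id, salt)
val saltedPartial = spark.table("transactions")
  .withColumn("salt", floor(rand() * 100))
  .groupBy("customer_id", "salt")
  .agg(sum(expr("quantity * price")).as("revenue"))

// Stage 2: remove the salt by summing the partial results per customer
val revenueByCustomer = saltedPartial
  .groupBy("customer_id")
  .agg(sum("revenue").as("revenue"))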

Hope you like this blog. Please refer to my other Databricks performance optimization blogs:
Boost Databricks Performance for Maximum Results
The Fast Lane to Big Data Success: Mastering Databricks Performance Optimization
Turbocharge Your Data: The Ultimate Databricks Performance Optimization Guide
