Window functions in CrateDB

In this tutorial, you will learn how to use window functions in CrateDB.

CrateDB is an open-source distributed database designed for scalability and performance. One of the features of CrateDB is its support for window functions. Unlike regular aggregate functions like SUM and COUNT, which aggregate data across the entire result set, window functions operate on a subset of rows defined by the window specification. Window functions help perform calculations that depend on the rows’ order or involve comparisons with other rows in the same result set. Some everyday use cases for window functions include calculating running totals and averages and ranking rows based on their values.

In this tutorial, you will learn how to

  • define a window function
  • use partitioning and ordering to specify a window
  • perform standard calculations and aggregations

So let’s dive in!

Defining a Window Function

To declare a function a window function, use the OVER clause to define the window, or in other words, define which rows will take part in the function computation. If the OVER clause is left empty, the window is defined as all the rows in the result set.

For example, let’s use COUNT(*) as a window function in an employees database. If you leave the OVER clause empty, COUNT(*) will be computed over the entire database of employees.

SELECT first_name,
       last_name,
       designation,
       COUNT(*) OVER()
FROM employees
ORDER BY last_name
LIMIT 100;

PARTITION BY

PARTITION BY is a clause that divides the result set into partitions or groups based on the values of one or more columns. The window function is then applied to each partition separately rather than the entire result set.

Let’s say you want to calculate the number of employees with the same designation. Then, define the COUNT(*) computation window as a partition with the same designation, using OVER(PARTITION BY designation).

SELECT first_name,
       last_name,
       designation,
       COUNT(*) OVER(PARTITION BY designation)
FROM employees
ORDER BY last_name
LIMIT 100;

ORDER BY

With ORDER BY, you can specify criteria for processing the rows within each partition, such as ascending or descending values of a particular column.

For instance, say you want to get the employee’s salary rank among people with the same job designation. You can keep the same designation partition as the previous example, but now order it by salary and use the DENSE_RANK() function to get the employee’s salary rank in this window.

Moreover, you can define your window using the WINDOW clause. Using named windows in your queries simplifies your code, allows you to reuse window specifications across multiple queries, and improves query readability.

SELECT first_name,
       last_name,
       designation,
       salary,
       DENSE_RANK() OVER w
FROM employees
WINDOW w as (PARTITION BY designation ORDER BY salary DESC)
ORDER BY last_name
LIMIT 100;

Performing standard calculations and aggregations

Now that you are familiar with defining windows, let’s have an overview of the functions you can use as window functions and then check out some practical examples.

Aggregations

Aggregate functions calculate aggregate values over the rows defined by the window. These functions include:

  • SUM: Calculates the sum of a column over the set of rows in the window.
  • AVG: Calculates the average of a column over the set of rows in the window.
  • MIN: Returns the minimum value of a column over the set of rows in the window.
  • MAX: Returns the maximum value of a column over the set of rows in the window.
  • COUNT: Counts the number of rows in the window.
  • MIN_BY, MAX_BY: allows you to find the minimum or maximum value in a given column based on the values in another column. Check out our blog post introducing this feature in CrateDB 5.2

Check out our Aggregation documentation for more details on all supported functions.

General-purpose functions

On the other hand, general-purpose functions perform various operations on the set of rows defined by the window. These functions include:

  • ROW_NUMBER: Returns the current row number within its window.
  • FIRST_VALUE: Returns the argument value to the window’s first row.
  • LAST_VALUE: Returns the argument value to the window’s last row.
  • NTH_VALUE: Returns the argument value to the window’s nth row, or NULL if the nth row does not exist.
  • RANK: Returns the rank of each row in the window based on the values in the ORDER BY clause. Subsequent tied rows get the same rank, and the potential rank of the next row gets incremented by each tie. Because of this, ranks may not be sequential.
  • DENSE_RANK: Returns the dense rank of each row in the window based on the values in the ORDER BY clause. Unlike RANK, it always returns sequential rank values.
  • LEAD: Returns the value of a column from a row n rows ahead of the current row in the window.
  • LAG: Returns the value of a column from a row that is n rows behind the current row in the window.

Check out our window-functions documentation for more details on these functions.

Examples

  • Year-over-Year growth

Consider a table with the year and the sales for a company. Suppose you want to calculate the previous and next sales and the year-over-year growth in sales. Here, the LAG function comes in handy to figure out the previous sales, the LEAD function returns the following sales, and you can use the LAG function in calculating your YoY growth.

SELECT
  year,
  sales,
  LAG(sales) OVER w prev_sales,
  LEAD(sales) OVER w next_sales,
  (sales / LAG(sales) OVER w - 1)*100 as year_over_year_growth
FROM (VALUES
     (2018, 100.5),
     (2019, 150.2),
     (2020, 200.7),
     (2021, 180.1),
     (2022, 220))
AS sales (year, sales) 
WINDOW w AS (ORDER BY year) 
LIMIT 100;

  • Percentage of sales for each product in each year

To calculate the percentage of sales for each product each year, you can use the SUM function to calculate the total sales on each year window and then divide the product’s sales by the total sales for that year.

  • Percentage of total sales for each product over the entire period

Similarly, you can use SUM to calculate the total sales of that product and divide the product’s sales by this result.

SELECT 
  year, 
  product, 
  sales,
  (sales / SUM(sales) OVER(PARTITION BY year))*100 AS percentage_of_sales_on_year,
  (sales / SUM(sales) OVER(PARTITION BY product))*100 AS percentage_of_total_sales_of_product
FROM (VALUES
     (2018, 'A', 110.5),
     (2019, 'A', 120),
     (2020, 'A', 124.2),
     (2018, 'B', 70),
     (2019, 'B', 68.1),
     (2020, 'B', 65.9),
     (2018, 'C', 100),
     (2019, 'C', 101.3),
     (2020, 'C', 102))
AS product_sales (year, product, sales) LIMIT 100;

Wrap Up

In summary, window functions are a valuable tool for data analysis in CrateDB. By defining partitions and orderings, you can perform complex calculations and aggregations over subsets of data, making it easier to gain insights and understand patterns in our data.

Reference

Window functions

Aggregation

4 Likes