In this tutorial, you will learn how to use window functions in CrateDB.
CrateDB is an open-source distributed database designed for scalability and performance. One of the features of CrateDB is its support for window functions. Unlike regular aggregate functions like
COUNT, which aggregate data across the entire result set, window functions operate on a subset of rows defined by the window specification. Window functions help perform calculations that depend on the rows’ order or involve comparisons with other rows in the same result set. Some everyday use cases for window functions include calculating running totals and averages and ranking rows based on their values.
In this tutorial, you will learn how to
- define a window function
- use partitioning and ordering to specify a window
- perform standard calculations and aggregations
So let’s dive in!
Defining a Window Function
To declare a function a window function, use the
OVER clause to define the window, or in other words, define which rows will take part in the function computation. If the
OVER clause is left empty, the window is defined as all the rows in the result set.
For example, let’s use
COUNT(*) as a window function in an
employees database. If you leave the
OVER clause empty,
COUNT(*) will be computed over the entire database of
SELECT first_name, last_name, designation, COUNT(*) OVER() FROM employees ORDER BY last_name LIMIT 100;
PARTITION BY is a clause that divides the result set into partitions or groups based on the values of one or more columns. The window function is then applied to each partition separately rather than the entire result set.
Let’s say you want to calculate the number of employees with the same designation. Then, define the
COUNT(*) computation window as a partition with the same designation, using
OVER(PARTITION BY designation).
SELECT first_name, last_name, designation, COUNT(*) OVER(PARTITION BY designation) FROM employees ORDER BY last_name LIMIT 100;
ORDER BY, you can specify criteria for processing the rows within each partition, such as ascending or descending values of a particular column.
For instance, say you want to get the employee’s salary rank among people with the same job designation. You can keep the same designation partition as the previous example, but now order it by salary and use the
DENSE_RANK() function to get the employee’s salary rank in this window.
Moreover, you can define your window using the
WINDOW clause. Using named windows in your queries simplifies your code, allows you to reuse window specifications across multiple queries, and improves query readability.
SELECT first_name, last_name, designation, salary, DENSE_RANK() OVER w FROM employees WINDOW w as (PARTITION BY designation ORDER BY salary DESC) ORDER BY last_name LIMIT 100;
Performing standard calculations and aggregations
Now that you are familiar with defining windows, let’s have an overview of the functions you can use as window functions and then check out some practical examples.
Aggregate functions calculate aggregate values over the rows defined by the window. These functions include:
SUM: Calculates the sum of a column over the set of rows in the window.
AVG: Calculates the average of a column over the set of rows in the window.
MIN: Returns the minimum value of a column over the set of rows in the window.
MAX: Returns the maximum value of a column over the set of rows in the window.
COUNT: Counts the number of rows in the window.
MAX_BY: allows you to find the minimum or maximum value in a given column based on the values in another column. Check out our blog post introducing this feature in CrateDB 5.2
Check out our Aggregation documentation for more details on all supported functions.
On the other hand, general-purpose functions perform various operations on the set of rows defined by the window. These functions include:
ROW_NUMBER: Returns the current row number within its window.
FIRST_VALUE: Returns the argument value to the window’s first row.
LAST_VALUE: Returns the argument value to the window’s last row.
NTH_VALUE: Returns the argument value to the window’s nth row, or
NULLif the nth row does not exist.
RANK: Returns the rank of each row in the window based on the values in the
ORDER BYclause. Subsequent tied rows get the same rank, and the potential rank of the next row gets incremented by each tie. Because of this, ranks may not be sequential.
DENSE_RANK: Returns the dense rank of each row in the window based on the values in the
ORDER BYclause. Unlike
RANK, it always returns sequential rank values.
LEAD: Returns the value of a column from a row n rows ahead of the current row in the window.
LAG: Returns the value of a column from a row that is n rows behind the current row in the window.
Check out our window-functions documentation for more details on these functions.
- Year-over-Year growth
Consider a table with the
year and the
sales for a company. Suppose you want to calculate the previous and next sales and the year-over-year growth in sales. Here, the
LAG function comes in handy to figure out the previous sales, the
LEAD function returns the following sales, and you can use the
LAG function in calculating your YoY growth.
SELECT year, sales, LAG(sales) OVER w prev_sales, LEAD(sales) OVER w next_sales, (sales / LAG(sales) OVER w - 1)*100 as year_over_year_growth FROM (VALUES (2018, 100.5), (2019, 150.2), (2020, 200.7), (2021, 180.1), (2022, 220)) AS sales (year, sales) WINDOW w AS (ORDER BY year) LIMIT 100;
- Percentage of sales for each product in each year
To calculate the percentage of sales for each product each year, you can use the
SUM function to calculate the total sales on each year window and then divide the product’s sales by the total sales for that year.
- Percentage of total sales for each product over the entire period
Similarly, you can use
SUM to calculate the total sales of that product and divide the product’s sales by this result.
SELECT year, product, sales, (sales / SUM(sales) OVER(PARTITION BY year))*100 AS percentage_of_sales_on_year, (sales / SUM(sales) OVER(PARTITION BY product))*100 AS percentage_of_total_sales_of_product FROM (VALUES (2018, 'A', 110.5), (2019, 'A', 120), (2020, 'A', 124.2), (2018, 'B', 70), (2019, 'B', 68.1), (2020, 'B', 65.9), (2018, 'C', 100), (2019, 'C', 101.3), (2020, 'C', 102)) AS product_sales (year, product, sales) LIMIT 100;
In summary, window functions are a valuable tool for data analysis in CrateDB. By defining partitions and orderings, you can perform complex calculations and aggregations over subsets of data, making it easier to gain insights and understand patterns in our data.