Storing and querying hierarchical data in CrateDB

CrateDB provides a great way to store and query large amounts of data quickly and efficiently. But, how do you store data with hierarchical links between records, such as a file system?

In this article, we will look at how to store and query hierarchical data in CrateDB and illustrate our solutions with a concrete example.

What is hierarchical data?

Hierarchical data is characterized by the parent-child relationship. This means that one child has only one parent while a parent may have one or more children. A typical example is directory relationships in the file system. A directory may have one parent directory and one or more child directories. We store the information about each directory, but what we need to additionally model in a database is the hierarchical relationship between entries.

Representing hierarchical data in CrateDB

CrateDB does not support recursive queries, but CrateDB offers relatively simple solutions for storing and querying hierarchical data. In the following sections, we will discuss two solutions for handling hierarchical structures in CrateDB based on a full-text analyzer and user-defined function.

Fulltext analyzer

CrateDB uses analyzers for creating full-text indexes. Analyzers take the content of the field and split it into tokens based on the specified tokenizer. CrateDB comes with several pre-built analyzers and tokenizers. To store and query hierarchical data we will create an analyzer based on the path_hierarchy tokenizer, which is illustrated by the following statement:

CREATE ANALYZER path_hierarchy_analyzer (
	TOKENIZER path_hierarchy
);

Path hierarchy tokenizer takes the input in the following format:

/usr/lib/system

and produces the following tokens:

/usr
/usr/lib
/usr/lib/system

User-defined function (UDF)

CrateDB supports the definition of user-defined functions which enables more flexibility for further data processing. Functions can be written in the JavaScript language and must have a name that is different from built-in functions available in CrateDB. Instead of using the path_hierarchy tokenizer, the following UDF can be used to parse inserted path into subpaths:

CREATE OR REPLACE FUNCTION path_hierarchy(full_path TEXT, tokenzier TEXT)
RETURNS ARRAY(STRING)
LANGUAGE JAVASCRIPT
AS
'function path_hierarchy(full_path, tokenizer) {
    let splitted = full_path.split(tokenizer).filter(x => x != "");
    return splitted.map((x,i) => tokenizer + splitted.slice(0,i+1).join(tokenizer))
}';

There are no semantic differences between the two solutions: considering the example from above, the user-defined function produces the same output for a given input.

Querying hierarchical data

To illustrate how to query hierarchical data, let’s create a table for storing some filesystem information. To generate subpaths we will use an analyzer with the path_hierarchy tokenizer.

CREATE TABLE system (
  name TEXT,
  path TEXT,
  properties OBJECT,
  INDEX idx_path USING FULLTEXT (path) WITH (analyzer = 'path_hierarchy_analyzer')
);

The idx_path is generated full-text index using the above-defined analyzer. Now, let’s insert sample data:

INSERT INTO system (name, path, properties) VALUES 
  ('bin', '/usr/bin', '{"size": 300}'),
  ('lib', '/usr/lib', '{"size": 500}'),
  ('lib', '/usr/lib/system', '{"size": 300}'),
  ('lib', '/usr/lib/introspection', '{"size": 200}'),
  ('lib', '/usr/lib/log', '{"size": 0}'),
  ('lib', '/usr/lib/ruby', '{"size": 100}');

Full-text index on a path column enables querying subpaths. The following query selects all children for the /usr/lib path:

SELECT * FROM system WHERE idx_path = '/usr/lib';

Another way to generate subpaths is to use the path_hierarchy user-defined function. In this case, we should first generate a new column as follows:

ALTER TABLE system ADD COLUMN subpaths GENERATED ALWAYS AS path_hierarchy("path", '/');

As a result, the subpaths column contains an array of all subpaths generated from the path value. For instance, for /usr/lib/system path the generated column will contain the following values:

/usr
/usr/lib
/usr/lib/system

It is important to mention that adding generated columns is possible only on an empty table. In this example, make sure that the table doesn’t include any records if you want to extend it with a new column.

Finally, to query all children for the /usr/lib path use the ANY operator on the subpaths column. The statement looks like this:

SELECT * FROM system WHERE '/usr/lib' = ANY(subpaths) LIMIT 100;

Summary

In this short tutorial, we illustrated how to store and query hierarchical data in CrateDB. There are at least two possible solutions: a full-text analyzer with path_hierarchy tokenizer or a user-defined function. Although both solutions work well with hierarchical data, consider that UDFs have to be deterministic: meaning that they must always return the same result value when called with the same argument values. The reason is that CrateDB might cache results of a UDF and reuse values if the function is called with the same arguments.

Storing and querying hierarchical data is a relatively simple task in CrateDB. If you liked this tutorial and want to learn more about CrateDB, visit our official documentation and check other tutorials in the community.

5 Likes