Modeling Hierarchical Tree Data in PostgreSQL

March 4, 2022

(updated March 2, 2024)

Database

Background

You have completed 12,841 Medium questions on Leetcode and can write DFS and BFS blindfolded. You know all the tricks in the book. Your arborist skills seemingly know no bounds. But have you ever tried to store one of your trees into a database?

In this article, we go over some of the different options for storing a tree (or any kind of hierarchical data structure) into a database and more specifically PostgreSQL. If you are not using Postgres, note that the bulk of the concepts would also apply to other SQL databases.

Update:I wrote a new article that goes into more detail about which approach to pick depending on the use case.

Published on March 4, 2022
Do's and Don'ts of Storing Large Trees in PostgreSQL

In this article, we use the example of a geographical dataset containing a hierarchy of countries, states and cities.

For example, our hierarchical dataset would be a big tree where the root is the World, and its descendants are trees like this:

        United States
         /          \
     Texas         California
     /   \           /     \
    /     \         /       \
   /       \       /         \
Dallas  Austin  San Fran.   Los Angeles

The article covers:

  • Static tree - hardcoding each node type without much abstraction or generic logic. You need to know your tree structure ahead of time for this one.
  • Dynamic tree with recursive Common Table Expression (CTE) queries - each node has a foreign key pointing to its parent and we use CTE queries to query subtrees, find nodes and whatnot.
  • Label tree (more commonly referred to as ltree) - each node has a materialized path from the root to itself. E.g. World/USA/Texas/Dallas would be the path for node Dallas.

Hierarchical data

Hierarchical data is common. Product categories nested into one another. Sneakers are footwear which are apparel. German movies are European movies and can further be split up into Comedy, Fiction, etc. Another example would be organizational charts with your boss at the top and you further down (馃様) but hopefully not at the very bottom of the pyramid.

Let's take the example of a simplistic geographical administrative tree that represents a hierarchy of countries, states and cities. We could throw in continents as well, but the idea is the same so for the sake of brevity, we leave continents out.

Approach #1: Static tree schema

If we are confident that our tree won't require the different node types to be treated generically, and that we won't need dynamic traversals or unbounded depth changes, we can create one table per node type.

Here, we can create one table for countries, one table for states, one table for cities. If we had continents, we would need to make another table for that node type.

SQL
CREATE DATABASE world WITH ENCODING 'UTF8';

CREATE TABLE countries (
	name 	   TEXT PRIMARY KEY,
    population BIGINT
);

CREATE TABLE states (
	name 	    TEXT PRIMARY KEY,
    population  BIGINT,
    country 	TEXT REFERENCES countries(name)
);

CREATE TABLE cities (
	name 	    TEXT PRIMARY KEY,
    population  BIGINT,
    state 	    TEXT REFERENCES states(name)
);

INSERT INTO countries(name, population) VALUES
    ('United States', 329500000),
    ('Australia', 25690000),
    ('Spain', 47350000);

INSERT INTO states (name, country, population) VALUES
    ('Texas', 'United States', 29000000),
    ('California', 'United States', 39510000),
    ('New South Wales', 'Australia', 8166000),
    ('Victoria', 'Australia', 6681000),
    ('Comunidad de Madrid', 'Spain', 6642000);

INSERT INTO cities (name, state, population) VALUES
    ('Dallas', 'Texas', 1331000),
    ('Austin', 'Texas', 950807),
    ('Houston', 'Texas', 2310000),
    ('Los Angeles', 'California', 3967000),
    ('San Francisco', 'California', 874961),
    ('San Diego', 'California', 1410000),
    ('Sydney', 'New South Wales', 5312000),
    ('Newcastle', 'New South Wales', 322278),
    ('Melbourne', 'Victoria', 5078000),
    ('Geelong', 'Victoria', 253269),
    ('Madrid', 'Comunidad de Madrid', 3223000),
    ('M贸stoles', 'Comunidad de Madrid', 207095);

One nice thing we get out of the box with this design is that integrity constraints are easy to enforce. states must belong to a country and cities must belong to a state. Not much ambiguity there.

But this design has limitations. If you changed your mind tomorrow and reckoned that you actually want to support neighborhoods, streets or some other hierarchical connections between different types of nodes, then you would need to make changes to your table schemas and / or add new tables.

If you are unlucky, you might find out that your problem domain is not as static as you thought it was, For example, with administrative geographical divisions, you might find out that there are disputed territories that are claimed by multiple countries. With our current design, we are unable to connect a city to multiple states so we would be stuck or have to somehow work around the issue with a bunch of little hacks.

You also have the potential downside of having redundancy requiring duplicated application code. Here, we have the population field that is duplicated across every table - in a real-world scenario, we could have dozens of overlapping columns. Shared attributes across different models that serve a similar purpose could be a hint that extracting a base table might be a good idea.

Another major limitation is the lack of polymorphism. By that I mean that each table is treated as an entirely distinct entity and that their common qualities are not taken advantage of. Imagine you are building a location picker that lets users pick whatever location granularity they want. For example, you let them pick any node that is a country or a city. You would have to UNION every table and return that to the client through your API.

If the user picks say Melbourne and sends the primary key for that code to your application backend, your server will need to somehow figure out which table that primary key is from. For example, if Melbourne has id = 36; the frontend code sends { id: 36 } to your backend, and now your backend has to figure out which table that id is from (countries, states, cities) or rely on client-side provided type information, which would still need to be checked by your backend.

Query: Get ancestors

It is relatively easy to get the entire hierarchy from root to node using a combination of JOIN and UNION sets. However, you can already get a feel that query complexity (or at least its length) will increase proportionally to the number of tables because there is nothing dynamic going on there. You need to type out the structure you expect for each level explicitly.

You can see how this could become an issue if you had a tree of depth = 20 or with an exploding depth that is unbounded. That gets unwieldy really quickly.

SQL
SELECT
    cities.name,
    ARRAY[
        countries.name,
        states.name,
        cities.name
    ] AS ancestors
	FROM cities
	LEFT JOIN states
	ON cities.state = states."name"
	LEFT JOIN countries
	ON states.country = countries."name"
UNION ALL
SELECT
	states.name,
	ARRAY[
	    countries.name,
	    states.name
    ] AS ancestors
	FROM states
	LEFT JOIN countries
	ON states.country = countries.name
UNION ALL
SELECT
	countries.name,
	ARRAY[
	    countries.name
    ] AS ancestors
	FROM countries;

Query: Get descendants

To get descendant nodes given a country or state root node, we need to create the hierarchical paths and then check if it contains the root node. There are probably better ways of doing that (probably involving doing away with the ARRAY). You could also create a MATERIALIZED VIEW to make the set easier to query (and more performant).

But you get the point, the query becomes complex and this is all fairly headache-inducing. If you want to be able to use tree operations and traversal queries, then you are better off modeling the data as an actual tree.

SQL
SELECT
	countries.name AS country,
	h.name AS descendant,
	h.fqn
FROM countries
LEFT JOIN (
	SELECT
	    cities.name,
	    ARRAY[
	        countries.name,
	        states.name,
	        cities.name
	    ] AS fqn
		FROM cities
		LEFT JOIN states
		ON cities.state = states."name"
		LEFT JOIN countries
		ON states.country = countries."name"
	UNION ALL
	SELECT
		states.name,
		ARRAY[
		    countries.name,
		    states.name
	    ] AS fqn
		FROM states
		LEFT JOIN countries
		ON states.country = countries.name
	UNION ALL
	SELECT
		countries.name,
		ARRAY[
		    countries.name
	    ] AS fqn
		FROM countries
) AS h
ON countries.name = ANY(h.fqn);

Approach #2: Dynamic tree with Recursive Common Table Expression (CTE)

In this approach, we extract a base type (location) and model the data as a generic tree in which each node has a reference to its parent node (or NULL if they are root nodes).

For example, Dallas has a foreign key to Texas, which itself has a foreign key to United States, which has no parent node (parent = NULL).

SQL
CREATE DATABASE world WITH ENCODING 'UTF8';

CREATE TYPE admin_level AS ENUM ('country', 'state', 'city');

CREATE TABLE locations (
	name 	      TEXT PRIMARY KEY,
    population    BIGINT,
    location_type admin_level,
    parent        TEXT REFERENCES locations(name)
);

INSERT INTO locations(name, parent, population, location_type) VALUES
    ('United States', NULL, 329500000, 'country'),
    ('Australia', NULL, 25690000, 'country'),
    ('Spain', NULL, 47350000, 'country'),
    ('Texas', 'United States', 29000000, 'state'),
    ('California', 'United States', 39510000, 'state'),
    ('New South Wales', 'Australia', 8166000, 'state'),
    ('Victoria', 'Australia', 6681000, 'state'),
    ('Comunidad de Madrid', 'Spain', 6642000, 'state'),
    ('Dallas', 'Texas', 1331000, 'city'),
    ('Austin', 'Texas', 950807, 'city'),
    ('Houston', 'Texas', 2310000, 'city'),
    ('Los Angeles', 'California', 3967000, 'city'),
    ('San Francisco', 'California', 874961, 'city'),
    ('San Diego', 'California', 1410000, 'city'),
    ('Sydney', 'New South Wales', 5312000, 'city'),
    ('Newcastle', 'New South Wales', 322278, 'city'),
    ('Melbourne', 'Victoria', 5078000, 'city'),
    ('Geelong', 'Victoria', 253269, 'city'),
    ('Madrid', 'Comunidad de Madrid', 3223000, 'city'),
    ('M贸stoles', 'Comunidad de Madrid', 207095, 'city');
We are using an ENUM to represent the type of each node here, but we could have also used a foreign key to another location_types lookup table.

This approach is much closer to a tree data structure and allows for much more dynamic traversal operations. It also allows for some form of polymorphism because every location shares a common table.

It allows for recursive operations using Common Table Expression (CTE). With the right indexes in place, recursive queries perform very well even over large datasets.

Some complexity results from extracting a base table as it may require additional work to model subtype-specific fields (e.g. country may need a flag or national_anthem attribute for example).

While the ins and outs of modeling inheritance in a database is beyond the scope of this article, here are some references on how you could go about modeling subtype-specific attributes:

  • Single table inheritance. Add the fields of all location types to locations table and set the fields that are not relevant to the specific node to NULL.
  • Concrete table inheritance. Create an additional table for each location type and possibly duplicate some data that is also present in locations.
  • Class table inheritance. Common fields are stored in the supertype (locations) while type-specific fields are stored in their type-specific table, mapped through the use of a shared primary key. An implementation variant of this is called exclusive arc.
  • JSONB column. This is just a column that contains binary-encoded JSON. You serialize the location and store it in there. Not an ideal solution as it could severely restrict querying performance and flexibility on the properties stored in that column.

Query: Get ancestors

We use a recursive CTE query. For each level in the tree, we append the ancestor we find into an array and go up one level until we hit the root node.

SQL
WITH RECURSIVE locations_cte(name, parent, ancestors) AS (
    SELECT
    	locations.name,
    	locations.parent,
    	ARRAY [locations.name]
        FROM locations
        WHERE locations.parent IS NULL
    UNION ALL
    SELECT
    	locations.name,
    	locations.parent,
    	array_append(locations_cte.ancestors, locations.name)
        FROM locations_cte,
             locations
        WHERE locations.parent = locations_cte.name
)
SELECT name, ancestors
FROM locations_cte;

This schema feels a lot more natural as it opens up the possibility for tree operations.

This modeling of the tree also naturally improves flexibility and dynamism. Adding a new layer or sibling node types only requires adding a new location_type.

Query: Get descendants

We use a recursive CTE query here too, which gives us rows that are then easy to query - if you want the descendants of say Australia, then just add WHERE root = 'Australia' and you are good to go.

SQL
WITH RECURSIVE locations_cte(root, name, parent) AS (
    SELECT
        name AS root,
        name,
        parent
        FROM locations
        WHERE parent IS NULL
    UNION
        SELECT
        	locations_cte.root,
        	locations.name,
        	locations.parent
            FROM locations
            JOIN locations_cte
            ON locations.parent = locations_cte.name
)
SELECT *
FROM locations_cte;

Approach #3: Label tree (ltree) materialized path

This approach makes use of a label tree, also known as ltree.

This ltree is an implementation of a materialized path which describes the path from a root to a node on a tree. This technique reduces the complexity of tree traversal queries to simple wildcard sort of expressions. Queries involving ltree operations are called lquery.

For example, the expression Texas.* matches any path that starts with the label Texas; whereas *.Texas.* will match any path that contains Texas. There are a ton of other operators that you can use - refer to the Postgres documentation on ltree.

The ltree also supports B-tree indexing - as well as GiST.

SQL
CREATE DATABASE world_tree WITH ENCODING 'UTF8';

CREATE TYPE admin_level AS ENUM ('country', 'state', 'city');

CREATE TABLE world (
	name 	      TEXT PRIMARY KEY,
	label 		  TEXT CHECK (label ~* '^[A-Za-z0-9_]{1,255}$'),
    population    BIGINT,
    location_type admin_level,
    path          ltree
);

CREATE INDEX path_idx ON world USING btree(path);

INSERT INTO world(name, label, population, location_type, path) VALUES
    ('United States', 'United_States', 329500000, 'country', 'United_States'),
    ('Australia', 'Australia', 25690000, 'country', 'Australia'),
    ('Spain', 'Spain', 47350000, 'country', 'Spain'),
    ('Texas', 'Texas', 29000000, 'state', 'United_States.Texas'),
    ('California', 'California', 39510000, 'state', 'United_States.California'),
    ('New South Wales', 'New_South_Wales', 8166000, 'state', 'Australia.New_South_Wales'),
    ('Victoria', 'Victoria', 6681000, 'state', 'Australia.Victoria'),
    ('Comunidad de Madrid', 'Comunidad_de_Madrid', 6642000, 'state', 'Spain.Comunidad_de_Madrid'),
    ('Dallas', 'Dallas', 1331000, 'city', 'United_States.Texas.Dallas'),
    ('Austin', 'Austin', 950807, 'city', 'United_States.Texas.Austin'),
    ('Houston', 'Houston', 2310000, 'city', 'United_States.Texas.Houston'),
    ('Los Angeles', 'Los_Angeles', 3967000, 'city', 'United_States.California.Los_Angeles'),
    ('San Francisco', 'San_Francisco', 874961, 'city', 'United_States.California.San_Francisco'),
    ('San Diego', 'San_Diego', 1410000, 'city', 'United_States.California.San_Diego'),
    ('Sydney', 'Sydney', 5312000, 'city', 'Australia.New_South_Wales.Sydney'),
    ('Newcastle', 'Newcastle', 322278, 'city', 'Australia.New_South_Wales.Newcastle'),
    ('Melbourne', 'Melbourne', 5078000, 'city', 'Australia.Victoria.Melbourne'),
    ('Geelong', 'Geelong', 253269, 'city', 'Australia.Victoria.Geelong'),
    ('Madrid', 'Madrid', 3223000, 'city', 'Spain.Comunidad_de_Madrid.Madrid'),
    ('M贸stoles', 'Mostoles', 207095, 'city', 'Spain.Comunidad_de_Madrid.Mostoles');

The main limitation of this approach is that the ltree assumes that there is only one path from the root to any given node. If your tree is more akin to a funky graph than to a tree.

Although the design can be adapted to an arbitrary graph by extracting the path field out of the table and into a closure table with columns PK(node), path such that each node can have multiple paths there. This would require to execute the lquery on the closure table and then JOIN the matching paths with the actual locations table.

Another (worse) option would be to use the path as a primary key for a node, such that each node would be duplicated as many times as it has paths. But that unnecessarily adds redundancy and risks getting nodes incorrectly updated.

Another thing to consider is that the application code must take care of building the path. Building the path requires having access to the entire tree, which limits the possibility of executing isolated operations on parts of the tree.

Finally, if paths are likely to change frequently in your dataset, the materialized path approach may not be the best for you as you will need to update every node in the subtree whenever one of its ancestors changes.

Query: Get ancestors

Query complexity using an ltree is significantly reduced. lquery semantics takes a bit of practice to get used to but is not too crazy.

In this query, we are simply finding the path for the node whose label = 'Madrid'. Then we just find all nodes whose path is an ancestor to that path (e.g. A.B: A is an ancestor to node with path A.B).

SQL
SELECT name, path
FROM world
WHERE path @> (
    SELECT path
    FROM world
    WHERE label = 'Madrid' -- node whose ancestors to find
);

Query: Get descendants

The query is almost the same as the last one, except for the <@ operator, which matches descendants rather than ancestors.

SQL
SELECT name, path
FROM world
WHERE path <@ (
    SELECT path
    FROM locations_with_ltree
    WHERE label = 'Victoria'
);

Summary

There are many ways to store a tree structure into a database, none is ideal and you will have to see what makes the most sense given your use case. Good database design is what will work for you.

(Another technique I have not covered here but that I might add in the future is the technique of the Nested Sets technique.)