Modeling Hierarchical Tree Data in PostgreSQL
March 4, 2022
(updated March 2, 2024)
Background
You have completed 12,841 Medium questions on Leetcode and can write DFS and BFS blindfolded. You know all the tricks in the book. Your arborist skills seemingly know no bounds. But have you ever tried to store one of your trees into a database?
In this article, we go over some of the different options for storing a tree (or any kind of hierarchical data structure) into a database and more specifically PostgreSQL. If you are not using Postgres, note that the bulk of the concepts would also apply to other SQL databases.
Update:I wrote a new article that goes into more detail about which approach to pick depending on the use case.
In this article, we use the example of a geographical dataset containing a hierarchy of countries, states and cities.
For example, our hierarchical dataset would be a big tree where the root is the World, and its descendants are trees like this:
United States
/ \
Texas California
/ \ / \
/ \ / \
/ \ / \
Dallas Austin San Fran. Los AngelesThe article covers:
- Static tree - hardcoding each node type without much abstraction or generic logic. You need to know your tree structure ahead of time for this one.
- Dynamic tree with recursive Common Table Expression (CTE) queries - each node has a foreign key pointing to its parent and we use CTE queries to query subtrees, find nodes and whatnot.
- Label tree (more commonly referred to as
ltree) - each node has a materialized path from the root to itself. E.g.World/USA/Texas/Dallaswould be thepathfor nodeDallas.
Hierarchical data
Hierarchical data is common. Product categories nested into one another. Sneakers are footwear which are apparel. German movies are European movies and can further be split up into Comedy, Fiction, etc. Another example would be organizational charts with your boss at the top and you further down (馃様) but hopefully not at the very bottom of the pyramid.
Let's take the example of a simplistic geographical administrative tree that represents a hierarchy of countries, states and cities. We could throw in continents as well, but the idea is the same so for the sake of brevity, we leave continents out.
Approach #1: Static tree schema
If we are confident that our tree won't require the different node types to be treated generically, and that we won't need dynamic traversals or unbounded depth changes, we can create one table per node type.
Here, we can create one table for countries, one table for states, one table for cities. If we had continents, we would need to make another table for that node type.
CREATE DATABASE world WITH ENCODING 'UTF8';
CREATE TABLE countries (
name TEXT PRIMARY KEY,
population BIGINT
);
CREATE TABLE states (
name TEXT PRIMARY KEY,
population BIGINT,
country TEXT REFERENCES countries(name)
);
CREATE TABLE cities (
name TEXT PRIMARY KEY,
population BIGINT,
state TEXT REFERENCES states(name)
);
INSERT INTO countries(name, population) VALUES
('United States', 329500000),
('Australia', 25690000),
('Spain', 47350000);
INSERT INTO states (name, country, population) VALUES
('Texas', 'United States', 29000000),
('California', 'United States', 39510000),
('New South Wales', 'Australia', 8166000),
('Victoria', 'Australia', 6681000),
('Comunidad de Madrid', 'Spain', 6642000);
INSERT INTO cities (name, state, population) VALUES
('Dallas', 'Texas', 1331000),
('Austin', 'Texas', 950807),
('Houston', 'Texas', 2310000),
('Los Angeles', 'California', 3967000),
('San Francisco', 'California', 874961),
('San Diego', 'California', 1410000),
('Sydney', 'New South Wales', 5312000),
('Newcastle', 'New South Wales', 322278),
('Melbourne', 'Victoria', 5078000),
('Geelong', 'Victoria', 253269),
('Madrid', 'Comunidad de Madrid', 3223000),
('M贸stoles', 'Comunidad de Madrid', 207095);One nice thing we get out of the box with this design is that integrity constraints are easy to enforce. states must belong to a country and cities must belong to a state. Not much ambiguity there.
But this design has limitations. If you changed your mind tomorrow and reckoned that you actually want to support neighborhoods, streets or some other hierarchical connections between different types of nodes, then you would need to make changes to your table schemas and / or add new tables.
If you are unlucky, you might find out that your problem domain is not as static as you thought it was, For example, with administrative geographical divisions, you might find out that there are disputed territories that are claimed by multiple countries. With our current design, we are unable to connect a city to multiple states so we would be stuck or have to somehow work around the issue with a bunch of little hacks.
You also have the potential downside of having redundancy requiring duplicated application code. Here, we have the population field that is duplicated across every table - in a real-world scenario, we could have dozens of overlapping columns. Shared attributes across different models that serve a similar purpose could be a hint that extracting a base table might be a good idea.
Another major limitation is the lack of polymorphism. By that I mean that each table is treated as an entirely distinct entity and that their common qualities are not taken advantage of. Imagine you are building a location picker that lets users pick whatever location granularity they want. For example, you let them pick any node that is a country or a city. You would have to UNION every table and return that to the client through your API.
If the user picks say Melbourne and sends the primary key for that code to your application backend, your server will need to somehow figure out which table that primary key is from. For example, if Melbourne has id = 36; the frontend code sends { id: 36 } to your backend, and now your backend has to figure out which table that id is from (countries, states, cities) or rely on client-side provided type information, which would still need to be checked by your backend.
Query: Get ancestors
It is relatively easy to get the entire hierarchy from root to node using a combination of JOIN and UNION sets. However, you can already get a feel that query complexity (or at least its length) will increase proportionally to the number of tables because there is nothing dynamic going on there. You need to type out the structure you expect for each level explicitly.
You can see how this could become an issue if you had a tree of depth = 20 or with an exploding depth that is unbounded. That gets unwieldy really quickly.
SELECT
cities.name,
ARRAY[
countries.name,
states.name,
cities.name
] AS ancestors
FROM cities
LEFT JOIN states
ON cities.state = states."name"
LEFT JOIN countries
ON states.country = countries."name"
UNION ALL
SELECT
states.name,
ARRAY[
countries.name,
states.name
] AS ancestors
FROM states
LEFT JOIN countries
ON states.country = countries.name
UNION ALL
SELECT
countries.name,
ARRAY[
countries.name
] AS ancestors
FROM countries;Query: Get descendants
To get descendant nodes given a country or state root node, we need to create the hierarchical paths and then check if it contains the root node. There are probably better ways of doing that (probably involving doing away with the ARRAY). You could also create a MATERIALIZED VIEW to make the set easier to query (and more performant).
But you get the point, the query becomes complex and this is all fairly headache-inducing. If you want to be able to use tree operations and traversal queries, then you are better off modeling the data as an actual tree.
SELECT
countries.name AS country,
h.name AS descendant,
h.fqn
FROM countries
LEFT JOIN (
SELECT
cities.name,
ARRAY[
countries.name,
states.name,
cities.name
] AS fqn
FROM cities
LEFT JOIN states
ON cities.state = states."name"
LEFT JOIN countries
ON states.country = countries."name"
UNION ALL
SELECT
states.name,
ARRAY[
countries.name,
states.name
] AS fqn
FROM states
LEFT JOIN countries
ON states.country = countries.name
UNION ALL
SELECT
countries.name,
ARRAY[
countries.name
] AS fqn
FROM countries
) AS h
ON countries.name = ANY(h.fqn);Approach #2: Dynamic tree with Recursive Common Table Expression (CTE)
In this approach, we extract a base type (location) and model the data as a generic tree in which each node has a reference to its parent node (or NULL if they are root nodes).
For example, Dallas has a foreign key to Texas, which itself has a foreign key to United States, which has no parent node (parent = NULL).
CREATE DATABASE world WITH ENCODING 'UTF8';
CREATE TYPE admin_level AS ENUM ('country', 'state', 'city');
CREATE TABLE locations (
name TEXT PRIMARY KEY,
population BIGINT,
location_type admin_level,
parent TEXT REFERENCES locations(name)
);
INSERT INTO locations(name, parent, population, location_type) VALUES
('United States', NULL, 329500000, 'country'),
('Australia', NULL, 25690000, 'country'),
('Spain', NULL, 47350000, 'country'),
('Texas', 'United States', 29000000, 'state'),
('California', 'United States', 39510000, 'state'),
('New South Wales', 'Australia', 8166000, 'state'),
('Victoria', 'Australia', 6681000, 'state'),
('Comunidad de Madrid', 'Spain', 6642000, 'state'),
('Dallas', 'Texas', 1331000, 'city'),
('Austin', 'Texas', 950807, 'city'),
('Houston', 'Texas', 2310000, 'city'),
('Los Angeles', 'California', 3967000, 'city'),
('San Francisco', 'California', 874961, 'city'),
('San Diego', 'California', 1410000, 'city'),
('Sydney', 'New South Wales', 5312000, 'city'),
('Newcastle', 'New South Wales', 322278, 'city'),
('Melbourne', 'Victoria', 5078000, 'city'),
('Geelong', 'Victoria', 253269, 'city'),
('Madrid', 'Comunidad de Madrid', 3223000, 'city'),
('M贸stoles', 'Comunidad de Madrid', 207095, 'city');ENUM to represent the type of each node here, but we could have also used a foreign key to another location_types lookup table.This approach is much closer to a tree data structure and allows for much more dynamic traversal operations. It also allows for some form of polymorphism because every location shares a common table.
It allows for recursive operations using Common Table Expression (CTE). With the right indexes in place, recursive queries perform very well even over large datasets.
Some complexity results from extracting a base table as it may require additional work to model subtype-specific fields (e.g. country may need a flag or national_anthem attribute for example).
While the ins and outs of modeling inheritance in a database is beyond the scope of this article, here are some references on how you could go about modeling subtype-specific attributes:
- Single table inheritance. Add the fields of all location types to
locationstable and set the fields that are not relevant to the specific node toNULL. - Concrete table inheritance. Create an additional table for each location type and possibly duplicate some data that is also present in
locations. - Class table inheritance. Common fields are stored in the supertype (
locations) while type-specific fields are stored in their type-specific table, mapped through the use of a shared primary key. An implementation variant of this is called exclusive arc. - JSONB column. This is just a column that contains binary-encoded
JSON. You serialize thelocationand store it in there. Not an ideal solution as it could severely restrict querying performance and flexibility on the properties stored in that column.
Query: Get ancestors
We use a recursive CTE query. For each level in the tree, we append the ancestor we find into an array and go up one level until we hit the root node.
WITH RECURSIVE locations_cte(name, parent, ancestors) AS (
SELECT
locations.name,
locations.parent,
ARRAY [locations.name]
FROM locations
WHERE locations.parent IS NULL
UNION ALL
SELECT
locations.name,
locations.parent,
array_append(locations_cte.ancestors, locations.name)
FROM locations_cte,
locations
WHERE locations.parent = locations_cte.name
)
SELECT name, ancestors
FROM locations_cte;This schema feels a lot more natural as it opens up the possibility for tree operations.
This modeling of the tree also naturally improves flexibility and dynamism. Adding a new layer or sibling node types only requires adding a new location_type.
Query: Get descendants
We use a recursive CTE query here too, which gives us rows that are then easy to query - if you want the descendants of say Australia, then just add WHERE root = 'Australia' and you are good to go.
WITH RECURSIVE locations_cte(root, name, parent) AS (
SELECT
name AS root,
name,
parent
FROM locations
WHERE parent IS NULL
UNION
SELECT
locations_cte.root,
locations.name,
locations.parent
FROM locations
JOIN locations_cte
ON locations.parent = locations_cte.name
)
SELECT *
FROM locations_cte;Approach #3: Label tree (ltree) materialized path
This approach makes use of a label tree, also known as ltree.
This ltree is an implementation of a materialized path which describes the path from a root to a node on a tree. This technique reduces the complexity of tree traversal queries to simple wildcard sort of expressions. Queries involving ltree operations are called lquery.
For example, the expression Texas.* matches any path that starts with the label Texas; whereas *.Texas.* will match any path that contains Texas. There are a ton of other operators that you can use - refer to the Postgres documentation on ltree.
The ltree also supports B-tree indexing - as well as GiST.
CREATE DATABASE world_tree WITH ENCODING 'UTF8';
CREATE TYPE admin_level AS ENUM ('country', 'state', 'city');
CREATE TABLE world (
name TEXT PRIMARY KEY,
label TEXT CHECK (label ~* '^[A-Za-z0-9_]{1,255}$'),
population BIGINT,
location_type admin_level,
path ltree
);
CREATE INDEX path_idx ON world USING btree(path);
INSERT INTO world(name, label, population, location_type, path) VALUES
('United States', 'United_States', 329500000, 'country', 'United_States'),
('Australia', 'Australia', 25690000, 'country', 'Australia'),
('Spain', 'Spain', 47350000, 'country', 'Spain'),
('Texas', 'Texas', 29000000, 'state', 'United_States.Texas'),
('California', 'California', 39510000, 'state', 'United_States.California'),
('New South Wales', 'New_South_Wales', 8166000, 'state', 'Australia.New_South_Wales'),
('Victoria', 'Victoria', 6681000, 'state', 'Australia.Victoria'),
('Comunidad de Madrid', 'Comunidad_de_Madrid', 6642000, 'state', 'Spain.Comunidad_de_Madrid'),
('Dallas', 'Dallas', 1331000, 'city', 'United_States.Texas.Dallas'),
('Austin', 'Austin', 950807, 'city', 'United_States.Texas.Austin'),
('Houston', 'Houston', 2310000, 'city', 'United_States.Texas.Houston'),
('Los Angeles', 'Los_Angeles', 3967000, 'city', 'United_States.California.Los_Angeles'),
('San Francisco', 'San_Francisco', 874961, 'city', 'United_States.California.San_Francisco'),
('San Diego', 'San_Diego', 1410000, 'city', 'United_States.California.San_Diego'),
('Sydney', 'Sydney', 5312000, 'city', 'Australia.New_South_Wales.Sydney'),
('Newcastle', 'Newcastle', 322278, 'city', 'Australia.New_South_Wales.Newcastle'),
('Melbourne', 'Melbourne', 5078000, 'city', 'Australia.Victoria.Melbourne'),
('Geelong', 'Geelong', 253269, 'city', 'Australia.Victoria.Geelong'),
('Madrid', 'Madrid', 3223000, 'city', 'Spain.Comunidad_de_Madrid.Madrid'),
('M贸stoles', 'Mostoles', 207095, 'city', 'Spain.Comunidad_de_Madrid.Mostoles');The main limitation of this approach is that the ltree assumes that there is only one path from the root to any given node. If your tree is more akin to a funky graph than to a tree.
Although the design can be adapted to an arbitrary graph by extracting the path field out of the table and into a closure table with columns PK(node), path such that each node can have multiple paths there. This would require to execute the lquery on the closure table and then JOIN the matching paths with the actual locations table.
Another (worse) option would be to use the path as a primary key for a node, such that each node would be duplicated as many times as it has paths. But that unnecessarily adds redundancy and risks getting nodes incorrectly updated.
Another thing to consider is that the application code must take care of building the path. Building the path requires having access to the entire tree, which limits the possibility of executing isolated operations on parts of the tree.
Finally, if paths are likely to change frequently in your dataset, the materialized path approach may not be the best for you as you will need to update every node in the subtree whenever one of its ancestors changes.
Query: Get ancestors
Query complexity using an ltree is significantly reduced. lquery semantics takes a bit of practice to get used to but is not too crazy.
In this query, we are simply finding the path for the node whose label = 'Madrid'. Then we just find all nodes whose path is an ancestor to that path (e.g. A.B: A is an ancestor to node with path A.B).
SELECT name, path
FROM world
WHERE path @> (
SELECT path
FROM world
WHERE label = 'Madrid' -- node whose ancestors to find
);Query: Get descendants
The query is almost the same as the last one, except for the <@ operator, which matches descendants rather than ancestors.
SELECT name, path
FROM world
WHERE path <@ (
SELECT path
FROM locations_with_ltree
WHERE label = 'Victoria'
);Summary
There are many ways to store a tree structure into a database, none is ideal and you will have to see what makes the most sense given your use case. Good database design is what will work for you.
(Another technique I have not covered here but that I might add in the future is the technique of the Nested Sets technique.)