Modeling Hierarchical Tree Data in PostgreSQL
March 4, 2022
(updated August 2, 2022)
You have solved 12,841 Medium questions on LeetCode and can write DFS and BFS blindfolded. You know all the tricks in the book. Your arborist skills seemingly know no bounds. But have you ever tried to store one of your trees in a database?
In this article, we go over some of the different options for storing a tree (or any kind of hierarchical data structure) in a database, and more specifically in PostgreSQL. If you are not using Postgres, note that the bulk of the concepts also applies to other SQL databases.
In this article, we use the example of a geographical dataset containing a hierarchy of countries, states and cities.
Our hierarchical dataset is one big tree: the root is the World, and its descendants are subtrees of countries, states and cities.
The article covers:
- Static tree - hardcoding each node type without much abstraction or generic logic. You need to know your tree structure ahead of time for this one.
- Dynamic tree with recursive Common Table Expression (CTE) queries - each node has a foreign key pointing to its parent and we use CTE queries to query subtrees, find nodes and whatnot.
- Label tree (more commonly referred to as `ltree`) - each node has a materialized path from the root to itself. E.g. `World.France.Paris` would be the `path` for node `Paris`.
Hierarchical data is common. Product categories nested into one another. Sneakers are footwear which are apparel. German movies are European movies and can further be split up into Comedy, Fiction, etc. Another example would be organizational charts with your boss at the top and you further down (😔) but hopefully not at the very bottom of the pyramid.
Let's take the example of a simplistic geographical administrative tree that represents a hierarchy of countries, states and cities. We could throw counties in as well, but the idea is the same, so for the sake of brevity we leave them out.
Approach #1: Static tree schema
If we are confident that our tree won't require the different node types to be treated generically, and that we won't need dynamic traversals or unbounded depth changes, we can create one table per node type.
Here, we can create one table for `country`, one table for `state`, and one table for `city`. If we had counties, we would need to make another table for that node type.
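A minimal sketch of what this schema could look like (the table and column names here are my own illustration, not a prescribed layout):

```sql
-- One table per node type; the hierarchy is hardcoded via foreign keys.
CREATE TABLE country (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE state (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL,
    country_id bigint NOT NULL REFERENCES country (id)
);

CREATE TABLE city (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name     text NOT NULL,
    state_id bigint NOT NULL REFERENCES state (id)
);
```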
One nice thing we get out of the box with this design is that integrity constraints are easy to enforce. A `city` must belong to a `state` and a `state` must belong to a `country`. Not much ambiguity there.
But this design has limitations. If you changed your mind tomorrow and reckoned that you actually want to support counties, or some other hierarchical connections between different types of nodes, then you would need to change your table schemas and / or add new tables.
If you are unlucky, you might find out that your problem domain is not as static as you thought it was. For example, with administrative geographical divisions, you might find out that there are disputed territories claimed by multiple countries. With our current design, we are unable to connect a `state` to multiple `country` rows, so we would be stuck or have to work around the issue with a bunch of little hacks.
You also have the potential downside of redundancy requiring duplicated application code. Here, the `name` field is duplicated across every table - in a real-world scenario, we could have dozens of overlapping columns. Shared attributes across different models that serve a similar purpose can be a hint that extracting a base table is a good idea.
Another major limitation is the lack of polymorphism. By that I mean that each table is treated as an entirely distinct entity and their common qualities are not taken advantage of. Imagine you are building a location picker that lets users pick whatever location granularity they want - for example, any node that is a `country` or a `state`. You would have to query every table and return the combined result to the client through your API.
If the user picks, say, a `city`, and sends that row's primary key to your application backend, your server will need to somehow figure out which table that primary key is from. Primary keys are not unique across tables, so the backend has to either try each table (`country`, `state`, `city`) or rely on client-side provided type information, which would still need to be checked by your backend.
Query: Get ancestors
It is relatively easy to get the entire hierarchy from root to node using a chain of `JOIN`s. However, you can already get a feel that query complexity (or at least its length) will grow proportionally with the number of tables, because there is nothing dynamic going on: you need to type out the structure you expect for each level explicitly.
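A sketch of such a query, assuming the hypothetical `country` / `state` / `city` tables and an illustrative city name:

```sql
-- Full root-to-node hierarchy for one city, one JOIN per level.
-- Each additional level in the tree means one more JOIN.
SELECT country.name AS country,
       state.name   AS state,
       city.name    AS city
FROM city
JOIN state   ON state.id   = city.state_id
JOIN country ON country.id = state.country_id
WHERE city.name = 'San Francisco';
```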
You can see how this could become an issue if you had a tree of product categories or organizational units with an exploding depth that is unbounded. That gets unwieldy really quickly.
Query: Get descendants
To get the descendant nodes of a given `country` or `state` root node, we need to build the hierarchical paths and then check whether they contain the root node. There are probably better ways of doing that. You could also create a view to make the result set easier to query (and more performant).
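One way to sketch this without building textual paths is a `UNION ALL` of one query per level, assuming the hypothetical `country` / `state` / `city` tables (the country name is illustrative):

```sql
-- All descendant rows of a given country: one SELECT branch per level.
SELECT state.id, state.name, 'state' AS node_type
FROM state
JOIN country ON country.id = state.country_id
WHERE country.name = 'France'
UNION ALL
SELECT city.id, city.name, 'city' AS node_type
FROM city
JOIN state   ON state.id   = city.state_id
JOIN country ON country.id = state.country_id
WHERE country.name = 'France';
```

Note how each extra level of depth adds another branch and another chain of `JOIN`s.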
But you get the point, the query becomes complex and this is all fairly headache-inducing. If you want to be able to use tree operations and traversal queries, then you are better off modeling the data as an actual tree.
Approach #2: Dynamic tree with Recursive Common Table Expression (CTE)
In this approach, we extract a base type (a `node` table) and model the data as a generic tree in which each node has a reference to its parent node (or `NULL` if it is a root node).
For example, a `city` has a foreign key to its `state`, which itself has a foreign key to its `country`, which has no parent node (`parent_id IS NULL`).
We are using an `enum` to represent the type of each node here, but we could also have used a foreign key to a separate lookup table.
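A sketch of such a schema, with a hypothetical `node` table and `node_type` enum (names are my own choosing):

```sql
-- Generic node table: every row points to its parent (or NULL for roots).
CREATE TYPE node_type AS ENUM ('country', 'state', 'city');

CREATE TABLE node (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name      text NOT NULL,
    type      node_type NOT NULL,
    parent_id bigint REFERENCES node (id)  -- NULL means root node
);

-- Index the self-referencing foreign key so recursive traversals stay fast.
CREATE INDEX node_parent_id_idx ON node (parent_id);
```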
This approach is much closer to a tree data structure and allows for much more dynamic traversal operations. It also allows for some form of polymorphism, because every node type shares a common table.
It allows for recursive operations using Common Table Expression (CTE) queries. With the right indexes in place, recursive queries perform very well, even over large datasets.
Some complexity results from extracting a base table, as it may require additional work to model subtype-specific fields (e.g. a `country` may need an ISO code or currency attribute that does not apply to other node types).
While the ins and outs of modeling inheritance in a database are beyond the scope of this article, here are some references on how you could go about modeling subtype-specific attributes:
- Single table inheritance. Add the fields of all location types to the `node` table and set the fields that are not relevant to a specific node type to `NULL`.
- Concrete table inheritance. Create an additional table for each location type and possibly duplicate some data that is also present in the base table.
- Class table inheritance. Common fields are stored in the supertype table (`node`) while type-specific fields are stored in their type-specific tables, mapped through the use of a shared primary key. An implementation variant of this is called an exclusive arc.
- JSONB column. This is just a column that contains binary-encoded JSON. You serialize the type-specific attributes and store them in there. Not an ideal solution, as it could severely restrict querying performance and flexibility on the properties stored in that column.
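As a sketch of the class table inheritance option, assuming the hypothetical generic `node` table from this approach already exists (the subtype table and its columns are illustrative):

```sql
-- Class table inheritance: shared fields live in node; type-specific
-- fields live in a per-type table sharing the same primary key.
CREATE TABLE country_detail (
    node_id  bigint PRIMARY KEY REFERENCES node (id),  -- shared PK
    iso_code char(2) NOT NULL,
    currency text NOT NULL
);

-- Fetching a country with its subtype fields is a simple join:
SELECT n.name, cd.iso_code, cd.currency
FROM node n
JOIN country_detail cd ON cd.node_id = n.id;
```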
Query: Get ancestors
We use a recursive CTE query. For each level in the tree, we append the ancestor we find to an array and go up one level until we hit the root node.
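A sketch of such a query, assuming the hypothetical `node(id, name, type, parent_id)` table and an illustrative node name:

```sql
-- Walk up from a node to the root, accumulating names into an array.
WITH RECURSIVE ancestors AS (
    -- Anchor: the node we start from.
    SELECT id, parent_id, ARRAY[name] AS path
    FROM node
    WHERE name = 'Paris'
    UNION ALL
    -- Recursive step: prepend the parent's name and move up one level.
    SELECT n.id, n.parent_id, n.name || a.path
    FROM node n
    JOIN ancestors a ON n.id = a.parent_id
)
-- The row with no parent holds the full root-to-node path.
SELECT path FROM ancestors WHERE parent_id IS NULL;
```

Recursion stops naturally once `parent_id` is `NULL`, since the join produces no further rows.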
This schema feels a lot more natural as it opens up the possibility for tree operations.
This modeling of the tree also naturally improves flexibility and dynamism: adding a new layer or sibling node type only requires adding a new `enum` value.
Query: Get descendants
We use a recursive CTE query here too, which gives us rows that are then easy to filter - if you only want descendants of a given type, say cities, just add a `WHERE` clause on the node type and you are good to go.
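A sketch of the descendants query, again assuming the hypothetical `node(id, name, type, parent_id)` table and an illustrative root name:

```sql
-- Walk down from a root node to all of its descendants.
WITH RECURSIVE descendants AS (
    -- Anchor: the subtree root.
    SELECT id, name, type, parent_id
    FROM node
    WHERE name = 'France'
    UNION ALL
    -- Recursive step: all children of rows found so far.
    SELECT n.id, n.name, n.type, n.parent_id
    FROM node n
    JOIN descendants d ON n.parent_id = d.id
)
SELECT id, name
FROM descendants
WHERE type = 'city';  -- optional filter on node type
```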
Approach #3: Label tree (ltree) materialized path
This approach makes use of a label tree, also known as `ltree`, a PostgreSQL extension.
This is an implementation of a materialized path, which describes the path from the root to a node in the tree. This technique reduces the complexity of tree traversal queries to simple wildcard-like expressions. Query patterns over `ltree` paths are expressed with the `lquery` type.
For example, the `lquery` expression `World.*` matches any path that starts with the label `World`, whereas `*.France.*` will match any path that contains `France`. There are a ton of other operators that you can use - refer to the Postgres documentation on `ltree`.
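A sketch of such pattern matches, assuming a hypothetical `node` table with an `ltree` column named `path`:

```sql
-- lquery pattern matching uses the ~ operator;
-- * matches zero or more labels at that position.
SELECT name, path
FROM node
WHERE path ~ 'World.*';      -- paths starting with the label World

SELECT name, path
FROM node
WHERE path ~ '*.France.*';   -- paths containing the label France
```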
The `ltree` type supports B-tree indexing as well as GiST indexes.
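A sketch of the schema and indexes, with table and index names of my own choosing:

```sql
-- Enable the ltree extension (ships with standard Postgres distributions).
CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE node (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL,
    path ltree NOT NULL  -- materialized path, e.g. 'World.France.Paris'
);

-- GiST index: speeds up ancestor/descendant (@>, <@) and lquery (~) operators.
CREATE INDEX node_path_gist_idx ON node USING GIST (path);

-- B-tree index: equality and ordering comparisons on path.
CREATE INDEX node_path_btree_idx ON node (path);
```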
The main limitation of this approach is that the `path` assumes there is only one route from the root to any given node. If your data is more akin to a funky graph than to a tree, this approach breaks down.
The design can be adapted to an arbitrary graph, though, by extracting the `path` field out of the node table and into a separate closure-style table with (`node_id`, `path`) columns, so that each node can have multiple paths. This would require executing the `lquery` on that table and then joining the matching paths back to the actual node table.
Another (worse) option would be to use the `path` as the primary key for a node, such that each node would be duplicated as many times as it has paths. But that adds unnecessary redundancy and risks nodes being updated inconsistently.
Another thing to consider is that the application code must take care of building the `path`. Building a path requires access to the entire tree, which limits the possibility of executing isolated operations on parts of the tree.
Finally, if node labels are likely to change frequently in your dataset, the materialized path approach may not be the best fit, as you will need to update every node in a subtree whenever one of its ancestors changes.
Query: Get ancestors
Query complexity using an `lquery` is significantly reduced. `ltree` semantics take a bit of practice to get used to but are not too crazy.
In this query, we are simply finding the `path` for the node whose `name` matches. Then we just find all nodes whose `path` is an ancestor of that path (e.g. `World.France` is an ancestor of the node with path `World.France.Paris`).
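A sketch of the ancestors query using the `@>` ("is ancestor of, or equal") operator, assuming the hypothetical `node` table with an `ltree` column named `path` and an illustrative node name:

```sql
-- All ancestors of the node named 'Paris': rows whose path
-- is an ancestor (@>) of that node's path.
SELECT a.id, a.name, a.path
FROM node a
JOIN node n ON a.path @> n.path
WHERE n.name = 'Paris'
  AND a.id <> n.id;  -- @> also matches equal paths; exclude the node itself
```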
Query: Get descendants
The query is almost the same as the previous one, except for the `<@` operator, which matches descendants rather than ancestors.
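The mirror-image sketch, under the same assumptions (hypothetical `node` table, `ltree` column named `path`, illustrative root name):

```sql
-- All descendants of 'France': rows whose path is a descendant (<@)
-- of that node's path.
SELECT d.id, d.name, d.path
FROM node d
JOIN node n ON d.path <@ n.path
WHERE n.name = 'France'
  AND d.id <> n.id;  -- exclude the root node itself
```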
There are many ways to store a tree structure in a database; none is ideal, and you will have to see which one makes the most sense for your use case. Good database design is whatever works for you.
(Another technique I have not covered here, but that I might add in the future, is the Nested Sets technique.)