Do's and Don'ts of Storing Large Trees in PostgreSQL
March 2, 2024
Context
If you are a big fan, you might have read my previous post about Modeling Hierarchical Tree Data in PostgreSQL - from two years ago. That article is one of my biggest hits to date. If you have not read it and are not up to speed on hierarchical data modeling, I recommend you read it before you continue reading this article.
Fast forward to today. I have now built a virtual filesystem that is used by thousands of users and stores tens of thousands of trees, which together add up to millions of (highly dynamic) nodes.
You could say I am somewhat of a refined tree connoisseur. The kind of tree connoisseur who corrects you when you say something is a "forest" but it is actually a "grove".
If you are currently trying to model hierarchical tree data in PostgreSQL (or some other relational database) and are wondering what the best way to do it is in your use case, you are in the right place.
In this new article, I am reporting for duty once more - arborist hat on - to bring you the lessons learned in the trenches of tree modeling. From avoiding long running complex transactions that fill up the connection pool and ruin your day (or night) whenever anything moves in your tree, to avoiding the complexity of maintaining a materialized path, I have got you covered.
Feel free to reach out if you need help with your tree(s) - I'll be happy to help you out.
1. The parent_id column
The most common way of storing a tree in a relational database is to use a foreign key that points from the node to its parent node in the same table. If you are dealing with more of a graph, you can have a separate table that stores the edges between the nodes.
In this approach, the table typically looks like this:
Always (almost).
The foreign key is a must in almost all cases. Even if you are using a materialized path, you should still have a foreign key to enforce the tree structure and be able to rebuild the materialized path from if it becomes inconsistent.
When your tree is coming from an external source and you are only reading from it.
In this scenario where you are passively replicating a tree into your system, you might not need to use a foreign key. The reason is that you will not need to enforce the tree structure and will only passively read from something assumed to be a valid tree.
In this case, not enforcing foreign key constraints will also let you write nodes without having to worry about the order in which writes are done. For example, you could write a child before you write its parent, knowing that the source of the tree will eventually pass on the parent node.
(+) Consistency is easy
The advantage of the approach of storing the on each node row is that it is simple to move subtrees around. You just need to update the of the node you want to move and you won't need to update any other nodes down the subtree you are moving.
Concurrent updates to the tree can be done fairly easily. It may even be safe enough not to need to use a transaction as the nodes are independent from each other. You could move some node to a new parent and move one of its descendants at the same time and there would not be a conflict or risk of inconsistency.
If you are building a filesystem or any other tree-based system where interactions are mostly done level by level and do not require traversing the tree, your queries will also be very simple.
If you do need to traverse the tree, you will have to do a little more work. You will probably want to use a recursive common table expression (CTE) query to get all the descendants of a node. While those queries can be a bit complex, they are efficient and not that hard to write either.
(-) Inserting by path is harder (but that's okay)
Chances are that sooner or later, you will need to create a node and all of its ancestors if they don't exist.
For example, you might have your table that contains nodes generated dynamically, but now you want to create a node with a specific path, say .
There are several reasons why you will likely want to do this over the lifetime of your application. The scenario would likely be along the lines of something like this:
- You have a dynamic tree where the clients create arbitrary nodes dynamically, but at some point you find yourself needing to insert a specific node or subtree into those trees for some ad-hoc reason. For example, you might want to insert a node that represents a backup of the system at a specific point in time.
To insert such nodes by path, using the column to model the relationships, you will have to do something like this (or some other implementation of the same logic):
Chances are you will probably want to wrap that in a transaction - not great for write performance.
Something else adding to the tediousness is that if your nodes are identified by a randomly generated UUID or ID, you will need to maintain an additional column to store the path components.
As an example, say your tree looks like this:
Now say you want to insert a new node with the path into this tree.
First you will realize that as soon as you are dealing with tree paths, you must have a unique constraint enforced on the column that you want to represent the path or path components.
If you don't have that and only enforce uniqueness on the of the node, then your nodes won't be addressable by a . Though that may be obvious, if you have two nodes with a set to be , then you cannot have a path based on the name, like because it would map to two different nodes rather than to a single node.
Here you have a couple of variants of the same option that you can go with:
Option 1: store the full path
Store the path from the root to the node as a string and update it on every insert, delete, or move. For example, here, each node would have a column named that would contain the full path from the root to : , , , . The column would need to have a constraint.
If you go with this option, you have denormalized your data and are dealing with the same complexity as with materialized paths. You will have to maintain the column on every move operation to prevent it from becoming inconsistent with the hierarchy modeled by relationships.
That being said, it could be an acceptable tradeoff if you are using that as a workaround to write few near-static hierarchies into an otherwise dynamic tree. You could treat the special path-based nodes as a separate entity that does not behave like the rest of the tree - and in doing so you could contain the denormalization impact to a small part of your system.
For example:
If you do this, you could have a separate module that only deals with special nodes and does not touch the normal nodes.
You could simply filter out the special nodes from the rest of the tree when you are querying the table, or prevent certain actions on special nodes. When dealing with special nodes, you would then use the column to traverse the tree instead of using the , but could still operate the dynamic tree without caring about the .
That would be a decent way of isolating the denormalization impact and abstracting it away from the mainstream tree operations.
Option 2: store the path component
Store only the path component of the node. For example, here, each node would have a column named that would contain the name of the node: , , , - the column would need to have a constraint.
If you go with this option, you will have to do some work to fetch each level one by one. For example, you would fetch the node, then fetch the node, and so on.
This is not a big deal if you are only dealing with a few levels, but it could be a problem if you are dealing with a lot of levels.
You will also need to have a constraint like .
2. Materialized paths
A materialized path is a string that contains the path from the root to the node - with a separating each node in the path. In PostgreSQL, you can use the extension to store materialized paths - but this is an implementation detail.
For example, the materialized path for the node in the tree below would be .
In this approach, the table typically looks like this:
There are a couple of scenarios where you might be better off using a materialized path:
- Your tree rarely changes and you need to query it in a way that requires traversing the tree across multiple levels (e.g. product taxonomy).
- The source of truth for the tree is external, and you are only reading from it and passively replicating it into your system.
In (almost) all other cases.
Here are some of the factors that will make you regret using a materialized path:
- You have a lot of writes to the tree.
- You have a lot of levels in the tree.
- You have a lot of nodes in the tree.
- You have a lot of concurrent writes to the tree.
- Nodes in the tree are volatile and change their position frequently.
(+) Traversal is easy
The main advantage of the materialized path approach is that you can query the tree without recursion. You can retrieve all the descendants of a node with a simple query like this:
Compared to a recursive common table expression (CTE) query, this is a lot simpler.
(+) Ancestry for free
Another advantage of the materialized path approach is that you get the ancestor of a node for free. If you have the path , you know that the parent of is , the parent of is , and so on.
(-) Consistency is hard
The main disadvantage of the materialized path approach is that it is hard to maintain consistency.
If you move a node, you will need to update the path of the node and all of its descendants. If you change the identifier for a node that you use in the path, you will also need to update the path of each node in the subtree.
What's more, you will need to do all of this in a transaction to prevent the tree from becoming inconsistent. If you do not do this in a transaction, you will risk the tree becoming inconsistent if something goes wrong in the middle of the operation.
For example, say you have a tree like this:
Now say the following two updates come in at the same time:
- Move the node with the path to be a child of the node with the path .
- Move the node with the path to be a child of the node with the path .
If you do not do these updates in a transaction, you will end up with an inconsistent tree where the two updates conflict with each other and overwrite each other's changes.
Worse yet, you may end up creating a cycle in the tree if you are not careful.
(-) Write performance
The materialized path approach is also not great for write performance. Since every move operation requires being done in a transaction and involves updating every node in the subtree, your transaction will be long, will lock a lot of rows, and will monopolize connections from the connection pool.
Conclusion
The foreign key is the way to go in most cases. If you need to retrieve the entire ancestry for a node and build a path out of it, in many cases, you might be better off computing it on the fly (and caching it) to save yourself the increased complexity that come with using a materialized path (or denormalization in general).