Brain dump on the recent Figma database infrastructure articles

Warning: this post is in a stream-of-consciousness form. Approach it accordingly.

Figma recently posted a couple of articles (one and two) on the challenges they face keeping their database infrastructure up with the massive growth they are going through, and how they're solving them.


The last article especially was re-shared in some of my circles, and my initial gut reaction to the articles can be summed up in the immortal words of Jeff Goldblum in Jurassic Park: “Your Scientists Were So Preoccupied With Whether Or Not They Could, They Didn’t Stop To Think If They Should”.

After I (admittedly) quickly read the article a couple of times, I still did not "get it". IMHO that is a red flag, as critical infra stuff like that should be simple: simple to understand, simple to operate, simple to troubleshoot, simple to restore.

Had I been in their position, what would I have done?

I could have left it at that, but I've made it a habit to never say something "negative" about other people's work or ideas without at least trying to constructively offer an alternative. So, let me try to live up to this habit.

After all, if I cannot offer an alternative, this might very well be the best solution. In other words: be effective, not right.

Understanding the context

Starting from the basics, I'll try to establish the general context in which this work was carried out. I find it absolutely critical to establish where you are and which constraints are present before you embark on a journey to a new position. Just like in the real world, when you're planning to get somewhere you first need to find where you are on the map.

In the articles, the authors refer to the feat as having been accomplished by a "database team" in the most recent article and by an "infrastructure team" in the previous one.

The database team is referred to as a "small team", and at the end of the most recent article we can gather a list of present and former members which yields a count of 20 people. Assuming a 5-10% turnover, that would give a current headcount of 18 to 19 people, to which we should add the author of the post, but we're most likely nitpicking.
At the end of the older article the database team lists 12 people instead. That's a growth of 8 people (66%) in one year. This growth seems justified, given the complexity of the challenge they needed to tackle.

First observation: whenever I hear "X team", where X is not a product or a feature, my mind wanders off to Conway's law. In this case, I would conclude that the solution that was ultimately implemented can be explained by the organizational structure.
Note that I'm not judging, just stating a fact without negative or positive connotation.

The article also mentions that the team had significant Postgres RDS experience, so this database team is probably not just an "any-database team", but in fact a "relational database team", and, to be more precise but maybe less accurate, one of the Postgres flavour of the relational kind.

The articles state that the database (or databases?) is in the terabytes in size, with tables in the billions of rows. Such numbers would make most relational databases tricky to work with (except maybe Oracle, but then that's a different can of worms). This can be explained by the amazing success Figma has enjoyed in its field since it obliterated its competition in recent years.

Last item: DB Proxy. DB Proxy reminded me of sprouter (the stored procedure router) that Etsy tried way back in 2008. It did not work out at Etsy, although it seems to have served a different purpose than DB Proxy does. Note that Etsy mentions Conway's law as well when discussing sprouter's design decisions.

What options?

Now that the Database team has bought themselves more runway, I think it would make sense for the organization to pursue a different solution with a lower cognitive load.

My understanding is that the current database sharding breaks referential integrity, and with that, we lose one of the major incentives for staying with a relational database (the others being that SQL is well understood by developers, supported by all programming languages, performs well, and benefits from excellent operational support).
My (limited) understanding of how DB Proxy works makes me conclude that it removes another couple of benefits from the ones I just listed. That leaves very little to work with, especially if we consider that performance is under scrutiny too.
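To make the "broken referential integrity" point concrete, here is a minimal sketch. The table names, shard layout and connection strings are made up for illustration, not anything Figma has published: once users and files live on different shards, no FOREIGN KEY can span them, so the application has to do the check itself.

```python
import psycopg2

# Connection strings are placeholders, not Figma's actual topology.
users_shard = psycopg2.connect("dbname=shard_users host=shard-a.example")
files_shard = psycopg2.connect("dbname=shard_files host=shard-b.example")

def insert_file(file_id: int, owner_id: int, name: str) -> None:
    # Step 1: verify the referenced user exists on its own shard.
    with users_shard.cursor() as cur:
        cur.execute("SELECT 1 FROM users WHERE id = %s", (owner_id,))
        if cur.fetchone() is None:
            raise ValueError(f"owner {owner_id} does not exist")

    # Step 2: insert on the other shard. Nothing stops the user row from
    # being deleted between step 1 and step 2 -- that is the referential
    # integrity the database can no longer enforce for us.
    with files_shard.cursor() as cur:
        cur.execute(
            "INSERT INTO files (id, owner_id, name) VALUES (%s, %s, %s)",
            (file_id, owner_id, name),
        )
    files_shard.commit()
```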

I would start by using Domain Driven Design to explore whether it might be sensible to break down the entire business domain into separate domains (e.g. users, files, collaboration). That could then lead to a re-architecture around these newly defined domains, most likely with microservices.
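To illustrate the shape of what I mean (purely a sketch: the domain names and boundaries are my guesses, not anything Figma has described), each bounded context would own its own data store and scale on its own:

```python
from dataclasses import dataclass

@dataclass
class UserService:
    dsn: str  # each bounded context owns its own database

    def get_user(self, user_id: int) -> dict:
        ...  # queries the users database only

@dataclass
class FileService:
    dsn: str

    def get_file(self, file_id: int) -> dict:
        ...  # queries the files database only

@dataclass
class CollaborationService:
    dsn: str

    def comments_for_file(self, file_id: int) -> list:
        ...  # queries the collaboration database only

# Each service can scale (or even shard, if ever needed) independently,
# instead of one monolithic database absorbing all the growth.
users = UserService(dsn="postgres://users-db.example/users")
files = FileService(dsn="postgres://files-db.example/files")
collab = CollaborationService(dsn="postgres://collab-db.example/collab")
```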

If any of the resulting databases were still very large (and still relational, at this point), the scalability issue could be addressed by exploring NoSQL or Postgres-native scaling options (TimescaleDB, Citus).
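As a hedged sketch of the Citus route (the table names and distribution key below are hypothetical, though create_distributed_table is the actual Citus primitive), the sharding would stay inside Postgres rather than in a bespoke proxy layer:

```python
import psycopg2

# Placeholder DSN: a Citus coordinator node, not a real host.
conn = psycopg2.connect("dbname=figma host=coordinator.example")
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS citus;")
    # Distribute the biggest tables by a tenant-like key; tables co-located
    # on the same key keep joins (and even foreign keys) local to a shard.
    cur.execute("SELECT create_distributed_table('files', 'team_id');")
    cur.execute("SELECT create_distributed_table('file_comments', 'team_id');")
conn.commit()
```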

Closing thoughts

Figma's Database team achieved an impressive feat with their sharding implementation. They should be proud of their achievement and they are damn right to share it with the world!

Hindsight is always 20/20, especially when you can write comfortably from your couch without the pressure of the rapid growth they've experienced :)
