Brain dump on the recent Figma database infrastructure articles
Warning: this post is in a stream-of-consciousness form. Approach it accordingly.
Figma recently posted a couple of articles (one and two) on the challenge of keeping their database infrastructure apace with the massive growth they are going through, and how they're solving it.
The latter article in particular was re-shared in some of my circles, and my initial gut reaction to the articles can be summed up in the immortal words of Jeff Goldblum in Jurassic Park: “Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should.”
After I (admittedly quickly) read the article a couple of times, I still did not "get it". IMHO that's a red flag, as critical infra stuff like that should be simple: simple to understand, simple to operate, simple to troubleshoot, simple to restore.
Had I been in their position, what would I have done?
I could have left it at that, but I've made it a habit to never say something "negative" about other people's work or ideas without at least trying to constructively offer an alternative. So, let me try to live up to this habit.
After all, if I cannot offer an alternative, this might very well be the best solution. In other words: be effective, not right.
Understanding the context
Starting from the basics, I'll try to establish the general context in which this work was carried out. I find that it's absolutely critical to establish where you are and which constraints are present before you embark on a journey to a new, future position. Just like in the real world, when you're planning to get somewhere you first need to find where you are on the map.
In the articles, the authors attribute the feat to a "database team" in the most recent article and to an "infrastructure team" in the previous one.
The database team is referred to as a "small team", and at the end of the most recent article we can gather a list of present and former members which yields a count of 20 people. Assuming a 5-10% turnover, that would give a current headcount of 18 or 19 people, to which we should add the author of the post, but we're most likely nitpicking.
At the end of the older article the database team lists 12 people instead. That's 8 more people, a 66% growth in one year. This growth seems justified, given the complexity of the challenge they needed to tackle.
First observation: whenever I hear "X team", where X is not a product or a feature, my mind wanders off to Conway's law. In this case, I would conclude that the solution that was ultimately implemented can be explained by the organizational structure.
Note that I'm not judging, just stating a fact without negative or positive connotation.
The article also mentions that the team had significant Postgres RDS experience, so this database team is probably not just "any-database team", but in fact a "relational database team", and, to be more precise but maybe less accurate, of the Postgres flavour of the relational kind.
The articles state that the database (or databases?) are in the terabytes of size, with tables in the billions of rows. Such numbers would make most relational databases tricky to work with (except maybe Oracle, but then it's a different can of worms). This can be explained by the amazing success that Figma has enjoyed in its field since obliterating its competition in recent years.
Last item, DB Proxy. DB Proxy reminded me of the sprouter (stored procedure router) that Etsy tried way back in 2008. It did not work out at Etsy, though it seems to have served a different purpose than DB Proxy does. Note that Etsy mentions Conway's law as well when referring to sprouter design decisions.
What options?
Now that the database team has bought themselves more runway, I think it would make sense for the organization to pursue a different solution with a lower cognitive load.
My understanding is that the current database sharding breaks referential integrity, and with that, we lose one of the major incentives for staying with a relational database (the others being that SQL is well understood by developers, supported by all programming languages, performs well, and benefits from excellent operational support).
My (limited) understanding of how DBProxy works makes me conclude that DBProxy removes a couple more of the benefits I just listed. That leaves very little to work with, if we consider that performance is under scrutiny too.
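To make the referential-integrity point concrete, here's a minimal sketch of what horizontal sharding gives up. Each shard is an independent database, so a foreign key can only be enforced within a shard, never across shards. The sketch uses in-memory SQLite as a stand-in for Postgres, and the table and column names (users, files, owner_id) are hypothetical, not taken from Figma's schema.

```python
import sqlite3

# Two independent "shards" -- each is its own database.
shard_a = sqlite3.connect(":memory:")
shard_b = sqlite3.connect(":memory:")

# Unsharded case: users and files live together, FK is enforced.
shard_a.execute("PRAGMA foreign_keys = ON")
shard_a.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
shard_a.execute("""CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    owner_id INTEGER REFERENCES users(id))""")
shard_a.execute("INSERT INTO users (id) VALUES (1)")

# Within a single shard, the database rejects a dangling reference.
try:
    shard_a.execute("INSERT INTO files (id, owner_id) VALUES (10, 999)")
except sqlite3.IntegrityError as e:
    print("within-shard FK enforced:", e)

# Sharded case: files now live on shard B, away from users on shard A.
# The REFERENCES clause cannot point at another database, so the
# column degrades to a plain integer.
shard_b.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, owner_id INTEGER)")

# Nothing stops shard B from storing an owner_id that does not exist
# on shard A -- integrity is now the application's problem.
shard_b.execute("INSERT INTO files (id, owner_id) VALUES (10, 999)")
print("cross-shard dangling reference accepted")
```

Once that guarantee moves into application code (or into a layer like DB Proxy), you're paying the operational cost of a relational database while doing a NoSQL database's consistency work yourself.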
I would start by using Domain Driven Design to explore whether it might be sensible to break down the entire business domain into separate subdomains (e.g. users, files, collaboration). That could then lead to a re-architecture around these newly defined domains, most likely with microservices.
If very large (still relational, at this point) databases remained, the scalability issue could be addressed by exploring NoSQL or Postgres-native scaling options (TimescaleDB, Citus).