Ok, I’ll bite.

For starters, zz9pzza did up a better tech breakdown than I ever could here: https://zz9pzza.tumblr.com/post/616408796841000960/insert-normal-disclaimer-about-personal-opinions. If you want more info on the technology, I’d poke around there.

But basically, the Christmas miracle part comes from two things:

  1. Everyone’s volunteering! Christmas spirit! Yay!
  2. AO3 is delivering a best-in-class website, on-premises (not in the cloud), using generic hardware (Supermicro is the server equivalent of a soda can that says COLA on it), and using free software. This is the most cost-effective way to deliver a service, and while it’s becoming more accepted, it is certainly not the norm.

Those two things by themselves aren’t particularly miraculous, but the devil’s in the details. Supermicro’s documentation is ass, which means setting stuff up can take a while. And they’ve set up quite a bit. The work they’ve done on these servers runs from the commonplace (nginx “doing full page caching, html optimisation, priority queuing and sending load to the back-end”), to the more advanced (SYS-5018D-FN4T generic servers configured as pfSense firewalls), to the kind of modern magic that makes tagging and complex searching work (Elasticsearch).

Elasticsearch does indexing. What indexing means in this context, very briefly, is tying related documents and bits of data together. For very simple use cases (like logging, Elasticsearch’s primary use case) this is pretty easy to maintain. For tagging in AO3, which is dealing with non-predictable items, categories, relations, loads, etc., you need to know what you’re doing or things can go sideways very quickly. Like they did last year with the page slowness. I’m going to highlight a section of zz9pzza’s post:

We spent quite a lot of time looking into it and made both code changes and other systems changes, and people from elastic reached out to us and gave us advice ( thanks for that 🙂 ).

We ended up working out that the main issue we had was that bookmark searching could eat all of our search capacity so we did some work behind the scenes to ensure that those requests went into a separate queue. That queue was limited to allow only a few of those searched to run at once. Once we did that the cpu load on the elasticsearch which had been hitting 100% started topping out at about 70%.
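
For anyone wondering what “a separate queue limited to only a few at once” looks like, here’s a toy Python sketch. This is not AO3’s actual code and I have no idea how they really wired it up; the function names, the pool sizes, and the limit of 2 are all made up for illustration. The idea is just that the expensive searches get their own small pool of workers, so they can pile up behind each other without eating the capacity everything else needs.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical limit: at most this many bookmark searches run at once.
MAX_BOOKMARK_SEARCHES = 2

# Two pools act as two queues. Bookmark searches wait in their own small
# pool, so they can never occupy the workers that serve everything else.
bookmark_pool = ThreadPoolExecutor(max_workers=MAX_BOOKMARK_SEARCHES)
general_pool = ThreadPoolExecutor(max_workers=8)

_lock = threading.Lock()
_running = 0
peak_running = 0  # highest number of bookmark searches seen running together


def bookmark_search(query):
    """Stand-in for an expensive bookmark search (made up for this sketch)."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    time.sleep(0.05)  # pretend to do heavy work
    with _lock:
        _running -= 1
    return f"bookmarks matching {query!r}"


def work_search(query):
    """Stand-in for a cheap, ordinary search (also made up)."""
    return f"works matching {query!r}"


def submit_search(kind, query):
    """Route each request to the queue that matches its cost."""
    if kind == "bookmark":
        return bookmark_pool.submit(bookmark_search, query)
    return general_pool.submit(work_search, query)


def demo():
    # Flood the system with bookmark searches, then send one ordinary search.
    futures = [submit_search("bookmark", f"q{i}") for i in range(10)]
    fast = submit_search("work", "coffee shop au")
    fast_result = fast.result()  # not stuck behind the bookmark backlog
    results = [f.result() for f in futures]
    return fast_result, results, peak_running
```

Calling `demo()` queues ten bookmark searches, but no more than `MAX_BOOKMARK_SEARCHES` of them ever run together, and the ordinary search goes through its own pool instead of waiting in line behind them.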

Two things stand out here to me. One, elastic.co, a notoriously money-grabbing corporation, volunteered help. Sure, they were probably individuals and not the corporation, but a) in the US at least a corp can absolutely say “don’t fucking do your job for free” (whether it’s legal is another matter but haha capitalist utopia) and b) the people at elastic get paid very well to figure this shit out. That’s high quality volunteering. And the second thing that stands out is that the AO3 team then managed to re-architect their app to mitigate this in approximately (someone fact check me here) two weeks on volunteer time. Those people have mentally exhausting jobs and came home to bang out fixes in their spare time, in a fraction of the time corporate dev teams take.

I just… Look, none of this is particularly magical. The hallmark of any good sysadmin or programmer when faced with new and unfamiliar technology is the ability to say “Gimme some time to figure it out” and then roll up their sleeves and get to work. The magic comes from loving an idea enough to want to do that well-paid work for free, at times of stress and for repeated abuse on this bluehell website. 

That, and being able to buy 5 servers for $60k. Like, actually fuck off.