Dropbox recently unplugged an entire data centre to test its disaster readiness. Heavily dependent on its San Jose data centre, Dropbox ran a multi-year project to eliminate this single point of failure in its metadata stack, culminating in a deliberate and successful switch-off of the San Jose data centre.
Dropbox had moved away from AWS for storing data but was still heavily centralised and dependent on its San Jose data centre. The recovery time from an outage at San Jose was considered far longer than desired, prompting a project to improve it in case of a significant disaster – such as an earthquake on the nearby San Andreas Fault. The improvement was measured as a Recovery Time Objective (RTO) – a standard measure from Disaster Recovery Planning (DRP) for the maximum time a system can tolerably be down after a failure or disaster occurs.
The overall architecture of Dropbox’s systems involves a system to store files (block storage) and another system to store the metadata about those files. The architecture for block storage – named Magic Pocket – allows block data to be served from multiple data centres in an active/active configuration, and part of this new resilience work involved making the metadata service more resilient, and eventually active/active as well. Making the metadata stack resilient proved to be a challenging goal. Some earlier design tradeoffs – such as using asynchronous replication of MySQL data between regions and using caches to scale databases – forced a rethink of the disaster readiness plan.
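To see why asynchronous replication complicates failover, consider that a replica in another region may lag behind the primary, so promoting it risks losing recent writes. The sketch below illustrates the idea only; the names, thresholds, and `Replica` type are hypothetical and are not Dropbox's actual tooling.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    region: str
    lag_seconds: float  # how far this replica trails the primary's writes

def choose_promotion_candidate(replicas, max_lag_seconds=5.0):
    """Pick the least-lagged replica; refuse any whose promotion would
    lose more than max_lag_seconds of recent writes."""
    viable = [r for r in replicas if r.lag_seconds <= max_lag_seconds]
    if not viable:
        raise RuntimeError("no replica is fresh enough to promote safely")
    return min(viable, key=lambda r: r.lag_seconds)

# A replica lagging by 7.5s would be rejected; the 1.2s one is promoted.
replicas = [Replica("dfw", 1.2), Replica("iad", 7.5)]
print(choose_promotion_candidate(replicas).region)  # -> dfw
```

With synchronous replication this check would be unnecessary, which is one reason the earlier asynchronous design forced a rethink of the failover plan.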
Dropbox implemented other changes to help reduce risk, with a customer-first strategy:
- Routine testing of key failover procedures – regular automated small-scale tests of failover tasks
- Improved operational procedures – a formalised go/no-go decision point, checks leading up to a failover “countdown”, and clearly defined roles for people during a failover
- Abort criteria and procedures – clearly defining when and how a failover would be aborted
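The go/no-go decision point and abort criteria above might be sketched, purely illustratively, as a countdown of pre-flight checks followed by steps that are halted if external availability drops. All function names and the availability threshold here are assumptions for illustration, not Dropbox's actual procedures.

```python
def go_no_go(preflight_checks):
    """Run every named check in the countdown; any failure means no-go."""
    failures = [name for name, check in preflight_checks if not check()]
    return len(failures) == 0, failures

def run_failover(preflight_checks, steps, availability, abort_threshold=0.999):
    """Execute failover steps, aborting if external availability drops
    below the agreed threshold at any point."""
    ok, failures = go_no_go(preflight_checks)
    if not ok:
        return "no-go: " + ", ".join(failures)
    for step in steps:
        if availability() < abort_threshold:
            return "aborted: availability below threshold"
        step()
    return "success"

result = run_failover(
    preflight_checks=[("replicas healthy", lambda: True)],
    steps=[lambda: None],
    availability=lambda: 0.9999,  # hypothetical live availability metric
)
print(result)  # -> success
```

Encoding the abort criteria in tooling, rather than leaving them to in-the-moment judgement, is what allows a failover to be called off quickly and consistently.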
The final steps in preparation for unplugging San Jose involved writing a detailed Method of Procedure (MoP) for the failover, which was first tested on a lower-risk data centre in Dallas Fort Worth (DFW). After disconnecting one of the DFW data centres, engineers performing validations realised that external availability was dropping, and the failover was aborted four minutes later. This test had revealed a previously hidden single point of failure in an S3 proxy service.
When the San Jose data centre was eventually unplugged, the new tooling and procedures ensured there was no impact on global availability, and the anti-climactic event was declared a success. This significantly reduced the RTO and proved that Dropbox could run indefinitely from another region without San Jose.
The key takeaway from this multi-year project is that it takes training and practice to get stronger at disaster readiness. Dropbox can now conduct blackhole exercises regularly, ensuring that its DR capabilities continue to improve, with users never noticing when something goes wrong.