UniSuper, a $135 billion pension fund, details its cloud compute nightmare.

Buried under the news from Google I/O this week is one of Google Cloud’s biggest blunders ever: Google’s Amazon Web Services competitor accidentally deleted a giant customer account for no reason. UniSuper, an Australian pension fund that manages $135 billion worth of funds and has 647,000 members, had its entire account wiped out at Google Cloud, including all its backups that were stored on the service. UniSuper thankfully had some backups with a different provider and was able to recover its data, but according to UniSuper’s incident log, downtime started May 2, and a full restoration of services didn’t happen until May 15.

UniSuper’s website is now full of must-read admin nightmare fuel about how this all happened. First is a wild page posted on May 8 titled “A joint statement from UniSuper CEO Peter Chun, and Google Cloud CEO, Thomas Kurian.” This statement reads, “Google Cloud CEO, Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s Private Cloud subscription. This is an isolated, ‘one-of-a-kind occurrence’ that has never before occurred with any of Google Cloud’s clients globally. This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again.”

In the next section, titled “Why did the outage last so long?” the joint statement says, “UniSuper had duplication in two geographies as a protection against outages and loss. However, when the deletion of UniSuper’s Private Cloud subscription occurred, it caused deletion across both of these geographies.” Every cloud service keeps full backups, which you would presume are meant for worst-case scenarios. Imagine a hacker takes over your server, or the building that houses your data collapses. But no, the actual worst-case scenario is “Google deletes your account,” which means all those backups are gone, too. Google Cloud is supposed to have safeguards that don’t allow account deletion, but apparently none of them worked, and the only option was a restore from a separate cloud provider (shoutout to the hero at UniSuper who chose a multi-cloud solution).
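The lesson in that paragraph is that two geographies under one subscription still share a single failure domain. As a rough illustration only, here is a minimal Python sketch of how an admin might audit a backup inventory for that weakness. The `BackupCopy` type, the provider and account labels, and the function name are all hypothetical; none of this reflects UniSuper's or Google Cloud's actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One backup copy. All field values used below are made-up labels."""
    provider: str  # e.g. "cloud-a" (hypothetical)
    account: str   # the subscription/account that owns the copy
    region: str    # geography where the copy is stored


def survives_single_account_loss(copies):
    """Return True if deleting any one account still leaves a backup copy.

    Copies owned by the same (provider, account) pair disappear together
    when that account is deleted, no matter how many regions they span,
    so at least two distinct pairs are needed.
    """
    owners = {(c.provider, c.account) for c in copies}
    return len(owners) >= 2


# Two geographies, one subscription: a single failure domain.
same_account = [
    BackupCopy("cloud-a", "subscription-1", "region-1"),
    BackupCopy("cloud-a", "subscription-1", "region-2"),
]

# Adding a copy with a second provider creates an independent domain.
multi_cloud = same_account + [
    BackupCopy("cloud-b", "account-9", "region-1"),
]

print(survives_single_account_loss(same_account))
print(survives_single_account_loss(multi_cloud))
```

The check deliberately ignores regions: geographic duplication protects against a data center failure, not against the owning account being deleted.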

UniSuper is an Australian “superannuation fund”; the US equivalent would be a 401(k). It’s a retirement fund that employers pay into as part of an employee’s paycheck; in Australia, some amount of superfund payment is required by law for all employed people. Managing $135 billion worth of funds makes UniSuper a big enough company that, if something goes wrong, it gets the Google Cloud CEO on the phone instead of customer service.

A June 2023 press release touted UniSuper’s big cloud migration to Google, with Sam Cooper, UniSuper’s Head of Architecture, saying, “With Google Cloud VMware Engine, migrating to the cloud is streamlined and extremely easy. It’s all about efficiencies that help us deliver highly competitive fees for our members.”

The many stakeholders in the service meant service restoration wasn’t just about restoring backups but also processing all the requests and payments that still needed to happen during the two weeks of downtime.

Highlights from the outage timeline

The second must-read document in this whole saga is the outage update page, which contains 12 statements as the cloud devs worked through this catastrophe. The first update is May 2 with the ominous statement, “You may be aware of a service disruption affecting UniSuper’s systems.” UniSuper immediately seemed to have the problem nailed down, saying, “The issue originated from one of our third-party service providers, and we’re actively partnering with them to resolve this.” On May 3, Google Cloud publicly entered the picture with a joint statement from UniSuper and Google Cloud saying that the outage was not the result of a cyberattack.

Monday, May 6, is when things started to heat up. First was the morning statement saying both teams worked through the weekend to try to fix this, but then the next two outage page updates were lengthy statements/apologies signed by Chun. The UniSuper CEO assured members that “member accounts are safe,” “no data was exposed to unauthorized third parties,” and that “pension payments have not been disrupted.” When your service is close to being a bank, there’s going to be a lot of panic out there when there are several days of unexplained downtime.

The CEO’s update also stated that “While a full root cause analysis is ongoing, Google Cloud has confirmed this is an isolated one-of-a-kind issue that has not previously arisen elsewhere. Google Cloud has confirmed that they are taking measures to ensure this issue does not happen again.” Chun also mentioned that UniSuper had a second cloud provider, and it would work to “minimize” data loss. On May 7, the CEO added, “Google Cloud has issued a statement today which confirms again that the fault originated within their service as a ‘one of its kind,’ unprecedented occurrence” and that “Google Cloud sincerely apologizes for the inconvenience this has caused.”

Seven days after the outage, on May 9, we saw the first signs of life again for UniSuper. Logins started working for “online UniSuper accounts” (I think that only means the website), but the outage page noted that “account balances shown may not reflect transactions which have not yet been processed due to the outage.” An earlier update pegged “April 29” as the planned data rollback for balances. The next seven days of updates log progressive restorations of various features of the website and app. May 13 is the first mention of the mobile app beginning to work again. This update noted that balances still weren’t up to date and that “We are processing transactions as quickly as we can.” The last update, on May 15, states, “UniSuper can confirm that all member-facing services have been fully restored, with our retirement calculators now available again.”

The joint statement and the outage updates are still not a technical post-mortem of what happened, and it’s unclear if we’ll get one. Google PR confirmed in multiple places it signed off on the statement, but a great breakdown from software developer Daniel Compton points out that the statement is not just vague, it’s also full of terminology that doesn’t align with Google Cloud products. The imprecise language makes it seem like the statement was written entirely by UniSuper. It would be nice to see a real breakdown of what happened from Google Cloud’s perspective, especially when other current or potential customers are going to keep a watchful eye on how Google handles the fallout from this.

Anyway, don’t put all your eggs in one cloud basket.
