We’ve recently experienced two fairly large system problems that have affected approximately 35% of our clients.
The first issue took 50 minutes to resolve and the second approximately 10 hours. The root cause in both cases was the way we handled the provisioning of ad hoc sandboxes on top of our SnowflakeDB (a few words about "how we started with them").
We managed to find a workaround for the first problem, but the second one was out of our hands. All we could do was file a support ticket with Snowflake and wait. Our communication channels were flooded with questions from our clients and there was nothing we could do. Pretty close to what you would call a worst-case scenario. Fire! Panic in Keboola!
My first thoughts were something like: “Sh..t! If we ran the whole system on our own infrastructure, we could do something now. We could try to solve the issue and not have to just wait…”
But we were forced to wait and rely on Snowflake. This is an account of what has happened since: