More than two years after we announced support for Amazon Redshift in Keboola Connection, it’s about friggin’ time to bring something new to the table. Something that will propel us further along. Voilà, welcome Snowflake.
About 10 months ago, we presented Snowflake for the first time at a meetup hosted at the GoodData office.
Today, we use Snowflake both behind the Storage API (it is now the standard backend for our data storage) and the Transformations Engine (you can utilize the power of Snowflake for your ETL-type processes). Snowflake’s SQL documentation can be found here.
What on Earth is Snowflake?
It’s a new database, built from scratch to run in the cloud. It’s not a legacy vendor taking an old DB and hosting it for you (MSSQL on Azure, Oracle at Rackspace, or PostgreSQL on AWS).
Snowflake is different. Perfectly elastic, frighteningly fast, with no limits on data storage (you can’t physically fill up a disk array and run out of space - there’s no “storage provisioning”, you get whatever you need) and, most importantly, you can’t “kill it” or overload it with a dumb query. In Snowflake, you choose the power that is available for a particular query. You can even have two workers running two different queries concurrently over the same data with absolutely no impact on each other, and on top of that you can speed them up or slow them down AS THEY’RE RUNNING. (Check out this post and the video underneath for more info.)
As long as any database can be killed by a stupid query, it will remain IT’s job to block your access to any production Teradata/Oracle/MSSQL… In Snowflake, the worst you can do is bog down a single worker.
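For the curious, the “choose the power per query” bit maps to Snowflake’s virtual warehouses. A minimal sketch of the DDL involved (the warehouse names and sizes here are made up for illustration):

```sql
-- Two independent warehouses can work over the same data with no impact on each other.
CREATE WAREHOUSE etl_wh       WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 300;
CREATE WAREHOUSE analytics_wh WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300;

-- Point your session at one of them...
USE WAREHOUSE etl_wh;

-- ...and scale it up or down on the fly; queued and subsequent statements
-- pick up the new size.
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE';
```

Storage and compute are billed separately, which is why running a second warehouse over the same tables doesn’t slow the first one down.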
What does “no limits” mean?
During our beta testing we loaded some data from Rockaway into a Snowflake transformation - three tables, a few million rows each (data volumes post-compression):
and ran this query, which took 40 minutes (on an x-small warehouse) to create nearly 0.5TB of data. It could’ve made 5TB. Or 50TB. Or ...whatever. Simply unlimited. There are no limits in Snowflake, just in your wallet #cloud (and, if needed, we could’ve used a more powerful worker for that query and waited a mere 5 minutes).
What does it mean for Keboola?
Four important points:
- Snowflake is now the default storage backend. Thanks to its elasticity, we have no real boundaries - adding 50GB of data no longer means adding another Redshift node whose cost we’d have to account for in the customer pricing
- We can run various processes over the data without affecting the client’s performance. So while our users run their SQL/R/Python/Docker components over their data, our “black boxes” distilling the secret sauce (data profiling, data health, descriptive statistics, predictive models etc.) can run at the same time with no negative effects. This allows us to make Keboola the smart partner it is meant to be - one that can help and recommend without breaking the bank
- Brutal, raw performance. Let’s say you have a clickstream from a small customer generating 100GB/week, and you need to apply an attribution model written in Python. You could easily find that the model’s output was worth $Y while the computing power to produce it cost four times as much. Historically, for example, doing anything serious with an RTB system’s output made no sense economically. Until now.
- Great friends at Snowflake Computing Inc. :)
Why not Google BigQuery?
While writing this, I asked myself: “Why not dump such ‘big data’ into Google BigQuery?”
Simple answer, actually:
- In BigQuery, you practically can’t edit data once it’s written
- BigQuery charges $5.00 per TB scanned by queries (not per TB written into BigQuery), so 10 simple queries every hour can cost you nearly $4k before you know it
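Where that “nearly $4k” can come from - a back-of-the-envelope sketch, assuming each of those 10 hourly queries scans a ~0.1TB table and they run for a whole month (the table size and the monthly window are our assumptions, not stated BigQuery facts):

```sql
-- 10 queries/hour * 24 h * 30 days, each scanning 0.1TB at $5.00/TB
SELECT 10 * 24 * 30               AS queries_per_month,  -- 7200
       0.1 * 5.00                 AS usd_per_query,      -- $0.50
       10 * 24 * 30 * 0.1 * 5.00  AS usd_per_month;      -- $3600, i.e. "nearly $4k"
```

The same queries on Snowflake bill for warehouse running time rather than bytes scanned, so ten cheap queries an hour don’t snowball the same way.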
How do I get my hands on it?
All KBC projects that are not currently powered by a dedicated Redshift (if you have one, you know about it) will be migrated to Snowflake automatically. Those with Redshift can “opt in”. Ping your Keboola people to help out and answer any questions.
Innovation doesn’t stop. In the summer of 2013 we started playing with Redshift; in February 2014 we rolled out Redshift-powered Transformations - only to start talking to Thomas just 8 months later:
It took about another year to get to a contract with Snowflake.
Just so we could, another year later, bring this technology (a completely new thing to many of you) to your fingertips. What will be the next leap?
Last fall, Larry Ellison said in his Oracle OpenWorld keynote that Oracle no longer runs into the traditional players at its customers, and described how cloud computing is changing the Enterprise business:
“In this new world of cloud computing, everything has changed. And almost all of our competitors are new, CX (customer experience) application specialist Salesforce.com and ERP/HR application specialist Workday are the SaaS competitors Oracle sees most frequently”.
“We virtually never, ever see SAP. This is a stunning change. The largest [enterprise] application company in the world is still SAP, but we never see them in the cloud.”
“So this is how much our world has changed. Our two biggest competitors, the two companies we watched most closely over the last two decades, have been IBM and SAP, and we no longer pay any attention to either one of them. It is quite a shock.… I can make a case that IBM was the greatest company in the history of companies, but they’re just nowhere in the cloud. SAP was certainly the largest [enterprise] application company that has ever existed. They are nowhere in the cloud.”
That’s easy to agree with! Oracle itself, of course, is faking it a bit, as they missed the train just the same. If Larry is right about the absence of traditionals such as IBM and SAP, we can discount IBM and their “quantum cloud” and focus on players like D-Wave, TensorFlow from Google, H2O.ai, and the direction taken by Apache Zeppelin or Apache Spark.
Check out the comparison of the new Snowflake backend against Redshift in Martin's blog here.