Keboola #YearInReview: Customer & Partner Highlights

It’s been quite an exciting year for us here at Keboola and the biggest reason for that is our fantastic network of partners and customers -- and of course a huge thanks to our team!  In the spirit of the season, we wanted to take a quick stroll down memory lane and give thanks for some of the big things we were able to be a part of and the people that helped us make them happen!


snowflakepng

Probably the biggest news from a platform perspective this year came about two years after we first announced support for the “nextt” data warehouse called Amazon Redshift.  At the time, it was a huge step in the right direction.  We still use Redshift for some of our projects (typically due to data residency or tool choice) but this year we were thrilled to announce a partnership born in the cloud when we officially made the lightning fast and flexible Snowflake the database of choice behind our storage API and the primary option for our transformation engine. Not to get too far into the technical weeds (you can read the full post here,) but it has helped us deliver a ton of value to our clients (better elasticity and scale, huge performance improvement for concurrent data flows, better “raw” performance by our platform, more competitive pricing for our customers and best of all, some great friends!)  Since our initial announcement, Snowflake joined us in better supporting our European customers by offering a cloud deployment hosted in the EU (Frankfurt!)  We’re very excited to see how this relationship will continue to grow over the next year and beyond!


tableaujpg

One of our favorite things to do as a team is participate in field events so we can get out in the data world and learn about the types of projects people work on, challenges they run into, and find out what’s new and exciting.  It’s also a great chance for our team to spend some time together as we span the globe - sometimes Slack and Goto Meeting isn’t enough!

SeaTug in May

We had the privilege of teaming up with Slalom Consulting to co-host the Seattle Tableau User Group back in May.  Anthony Gould was a gracious host, Frank Blau provided some great perspective on IoT data and of course Keboola’s own Milan Veverka dazzled the crowd with his demonstration focused on NLP and text analysis.  Afterwards, we had the chance to grab a few cocktails, chat with some very interesting people and make a lot of new friends.  This event spawned quite a few conversations around analytics projects; one of the coolest came from a group of University of Washington students who analyzed the sentiment of popular music using Keboola + Tableau Public (check it out.)

                                               seatugJPG

Why Avast Hasn’t Migrated into Full Cloud DWH

How breaking up with Snowflake.net is like breaking up with the girl you love


At the beginning of May, I got a WhatsApp message from Eda Kucera:

“Cheers, how much would it cost to have Peta in Snowflake? Eda”

There are companies that rely only on “slide ware” presentations. Other companies are afraid to open the door to the unknown and not have their results guaranteed. Avast is not one of them. I am glad I can share with you this authentic description of Avast’s effort to balance a low-level benchmark, fundamental shift in their employees’ thinking and the no-nonsense financial aspect of all that.

Let’s get back to May. Just minutes after receiving Eda’s WhatsApp message, almost 6 months of deep testing began in our own Keboola instance of Snowflake. Avast tested Snowflake with their own data. (At this point I handed it to them, the rest was entirely in their hands.)

They dumped approximately 1.5TB a day from Apache Kafka into Keboola’s Snowflake environment, and assessed the speed of the whole process, along with its other uses and its costs.


With a heavy heart, I deleted Avast’s environment within our Snowflake on October 13. Eda and Pavel Chocholous then prepared the following “post mortem”:

Pavel’s Feedback on Their Snowflake Testing

“It’s like breaking up with your girl…”

This sentence is summing it all up. And our last phone call was not a happy one. It did not work out. Avast will not migrate into Snowflake. We would love it, but it can’t be done at this very moment. But I’m jumping ahead. Let’s go back to the beginning.

The first time we saw Snowflake was probably just before DataHackathon in Hradec Kralove. It surely looked like a wet dream for anyone managing any BI infrastructure. Completely unique features like cloning the whole DWH within minutes, linear scalability while running, “undrop table”,”select…at point in time” etc.


How well did it work for us? The beginning was not so rosy, but after some time, it was obvious that the problem was on our side. Data was filled with things like “null” as text value, and as the dataset was fairly “thin”, it had a crushing impact. See my mental notes after the first couple of days:

“Long story short — overpromised. 4 billion are too much for it. The query has been parsing that json for hours, saving the data as a copy. I’ve already increased the size twice. I’m wondering if it will ever finish. The goal was to measure how much space it will take while flattened, jsoin is several times larger than avro/parquet…."


Let me add that not only our data was bad. I also didn’t know that scaling while running a query would affect only the consequently run queries and not the currently running one. So I was massively overcharging my credit card having the megainstance of Snowflake ready in the background, while my query was still running on the smallest possible node. Well, you learn from your mistakes :). This might be the one thing I expected to “be there”, but I’m a spoiled brat. It really was too good to be true.

Okay, after ironing out all the rookie errors and bugs, here is a comparison of one month of real JSON format data:

(Data size within SNFLK vs. Cloudera parquet file 3.7TB (hadoop) vs. 4.2TB (Snowflake))

Tabulated results…You know, it is hard to understand what our datasets look like and how complicated queries they hold. The main thing is that benchmarks really suck :). I personally find  the overall results much more interesting:
  • Data in Snowflake are roughly the same size as in our Hadoop, yet I would suggest to expect 10% — 20% difference.

  • Performance: we didn’t find any blockers.

  • Security/roles/privileges: SNFLK is much more mature than Hadoop platform, yet it cannot be integrated with on-premise LDAP.

  • Stability: SNFLK is far more stable than Hadoop. We didn’t encounter a single error/warning/outage so far. Working with Snowflake is nearly the opposite to hive/impala where errors and cryptical and misleading error messages are part of the ecosystem culture ;).

  • Concept of caching in SNFLK cannot be fully tested, but we have proved that it affects performance in a pleasant yet a bit unpredictable way.

  • Resource governance in SNFLK is a mature feature, beast type of queries are queued behind the active queries while small ones sneak through etc.

  • Architecture of separated 'computing nodes' can stop inter-team collisions easily. Sounds like marketing bullshit, but yes, not all teams do love each other and are willing to share resources.

  • SNFLK can consume data from various sources from most of cloud-on/on-premise services (Kafka, RabbitMQ, flat files, ODBC, JSBC, practically any source can be pushed there). Its DWH as a service architecture is unique and compelling (Redshift/Google BigQuery/GreenPlum could possibly reach this state in the near future).

  • Migration of 500+ TB data? Another story  —  one of the points that undermine our willingness to adopt Snowflake.

  • SNFLK provides limited partitioning abilities; it can bring even more performance, once enabled at full scale.

  • SNFLK would allow platform abuse with all of its 'create database as a copy', 'create warehouse as a copy', 'pay more, perform more'. And costs can grow through the roof. Hadoop is a bit harder to scale which somehow guarantees only reasonable upgrades ;).

  • SNFLK can be easily integrated into any scheduler. Its command line client is the best one I’ve seen in last couple of years.

Notes from Eda

“If we did not have Jumpshot in the house, I would throw everything into Snowflake…”

If I was to build a Hadoop cluster of the size 100TB-200TB from scratch, I would definitely start with Snowflake…Today, however, we would have to pour everything in it, and that is really hard to do while you’re fully on-premise… It would be a huge step forward for us. We would become a full-scale cloud company. That would be amazing!

If I had to pay the people in charge of Hadoop US wages instead of Czech wages, I would get Snowflake right away. That’s a no brainer #ROI.

Unfortunately, we will not go for it right now. Migrating everything is just too expensive for us at the moment and using Snowflake only partially just doesn’t make sense.


Our decision was also affected by our strong integration with Spark; we’ve been using our Hadoop cluster as compute nodes for it. In SNFLK’s case, this setup would mean pushing our data out of SNFLK into the EC2 instance where the Spark jobs would be running. That would cost additional 20-30% (the data would be running inside AWS, but the EC2s cost something as well). I know Snowflake is currently working on a solution for this setup, but I haven’t found out what it is.

In our last phone call with SNFLK, we learned that storage prices were going down again. So, I assume that we will meet within a reasonable time frame, and reopen our discussion. (In November, Snowflake has started privately testing their EU datacenter and will open it publicly in January 2017.) In the meantime, we’ll have an on-demand account for practicing :).

Please hold, your call is important to us

We’ve recently experienced two fairly large system problems that have affected approximately 35% of our clients.

The first issue took 50 minutes to resolve and the other approximately 10 hours. The root cause in both cases was the way we handled the provisioning of adhoc sandboxes on top of our SnowflakeDB (a few words about "how we started w/ them").

We managed to find a workaround for the first problem, but the second one was out of our hands.  All we could do was fill in a support ticket with Snowflake and wait. Our communication channels were flooded with questions from our clients and there was nothing we could do. Pretty close to what you would call a worst-case scenario.! Fire! Panic in Keboola!

My first thoughts were like: “Sh..t! What if we run the whole system on our own infrastructure, we could do something now. We could try to solve the issue and not have to just wait…”

But, we were forced to just wait and rely on Snowflake. This is the account of what happened since:

New dose of steroids in the Keboola backend

More than two years after we announced support for Amazon Redshift in Keboola Connection, it’s about the friggin’ time to bring something new to the table. Something that will propel us further along. Voila, welcome Snowflake.

About 10 months ago we presented Snowflake at a meetup hosted at the GoodData office for the first time.

Today, we use Snowflake both behind the Storage API (it is now the standard backend for our data storage) and the Transformations Engine (you can utilize the power of Snowflake for your ETL-type processes). Snowflake’s SQL documentation can be found here.

What on Earth is Snowflake?

It’s a new database, built from scratch to run in the cloud. Something different that when a legacy vendor took an old DB and hosts it for you (MSSQL on Azure, Oracle in Rackspace or PostgreSQL in AWS).