Agency - get rid of pivot tables!

During my midnight-oil hours, while rummaging through our internal systems, I came across the Zendesk tickets that our data analysts are closing for one of our clients - the H1 agency (part of GroupM).

At H1.cz they have created a report in GoodData called “non-active campaigns”. It contains one metric, five attributes (date, client, etc.) and four filters (time, client’s agency, and so on). It sounds super simple, but let’s take a closer look.



What it does is give you back a table that is the wet dream of any and all agencies out there. You can see “anything” across all of the advertising channels. I mean “anything.” In this particular case, they’ve created a report of non-active campaigns. It is a very good example of an output that is very hard or impossible to achieve in tools like Tableau, SAP, Chartio, periscope.io or RJMetrics. Rock & roll of a multi-dimensional BI system! You need to live it to believe it and to actually understand it.

Below is the data model (unreadable on purpose); the yellow ovals are the things on top of which you count, and you can see them in the context of the green ones:


Karel Semerák from H1.cz prepared this report. I bet he has no clue what a mega machine he set in motion to actually produce it. Based on the physical data model, the metric definitions and the report context, GoodData generated 460 rows of SQL in the data warehouse that propels the system.

Just imagine a real person trying to do this report by hand (totally ignoring the incomprehensible amount of data). He has to do lots of small tasks (look inside AdWords, find the active clients, count the number of their campaigns, compare them against CRM data for paid invoices, create a temporary pivot table, etc.), and every little task could be represented by a rectangle inside this picture:


It all comes to almost 90 different tasks, each taking from a minute to three days when done by hand. Try to explain this workflow to a Teradata consultant and you will spend a week just explaining what you want; try it with an IBM Cognos expert and… well, you get the picture.
And one more thing: with GoodData you do it yourself, without waiting another month for the expert or paying a five-digit sum for one report.

Well played, GoodData & multi-dimensional BI! 

But for a moment let’s forget about this one report. H1.cz has already prepared over 400 such reports. Try to produce that in Excel and you had better have a horde of MS Excel devotees who work as hard as robots and are as precise as robots. The last time I saw something like that was years back at OMD: a mega office full of people producing pivot tables.

Speaking of robots: if you are interested in the probability with which “data AI robots” will replace your job, take a look here.


Karel Semerák from H1.cz can stay cool, though. He thinks about the data and the context instead of spending time on tasks where robots will always be better, and that is the one area where robots will take some time to improve: cognitive skills and context.

So next time your P&L comes knocking on your door, think about giving your people the chance to use their heads creatively and leave all the heavy lifting to robots. People aren’t the best at copy & paste or sorting through AdWords reports, but they are great at creative thinking, and that is what you need in order to beat your competition.

Why aren’t there more nerds in marketing?

My arrival at Business Intelligence (BI), and eventually consulting for Keboola, was not through your standard statistics or programming route. After completing a Bachelor’s degree in Business & Communications, I had several stints coordinating corporate marketing efforts in industries ranging from automotive to gaming. I found that no matter what capacity I worked in - from tracking calls-to-action to analyzing performance reports from suppliers and conducting market research - I could not hide from data.

So, in my never-ending quest for efficiency, I also started to ask myself why I am doing this and how meaningful it is. Or simply put: am I wasting my time coming up with tweets no one reads? #HashtagAllTheThings

                                              

Traditionally, data analysis was never a core competency of marketing; someone else from purchasing, finance, IT, etc. would tell you whether your campaign was successful. But with the shift towards digital marketing there has been an increase in data availability, and marketers now have more control over how they measure their KPIs.

It was this trend that made me first curious about making the switch from Marketing to Analytics. The organizational gap was so apparent to me, but I had no idea what that translated to in terms of a job description. I was stuck between working in marketing where (for obvious reasons) the primary focus is on campaign implementation before measurement vs. a highly technical position (which I didn’t even have the qualifications for) that would stifle my creative side.

Caught in the middle, I came across a job posting at Keboola for a “Data Analyst”. At the time I had no idea what I was applying for, but through my experience in the past year I now see that the job description couldn’t have been clearer. Keboola like me is somewhere in the middle. With a pragmatic approach, we provide real solutions to our clients’ very real business issues.  

What I love about working here is that we help companies integrate both Market Intelligence (MI) and Business Intelligence (BI) for data-driven decision making.


In my role, I provide the (BI) tools to answer the “why” behind my clients’ marketing decisions and then visualize those findings so they can make more informed decisions (MI). What I’ve found through this experience is that there are common problems afflicting marketers for which there is actually a solution already under your nose. My job at Keboola is to translate these observations into something actionable, so marketers can be empowered to work with their data and spend their time creating something meaningful… fewer hashtags in the next tweet, perhaps?


Zig Zagging your way through the E-learning journey

We bank online, buy groceries online, watch movies online, we even date online … so why not become educated online?

The emphasis on eLearning has continued to grow in recent years. Training has become streamlined to the point where anybody and everybody can learn without actually setting foot in a classroom. The experience allows knowledge to be compressed and consistently delivered to every learner. At the same time, online training gives learners a new-found freedom. They can choose their physical environment (bunny slippers and a cup of tea, perhaps?), how they reference materials, and in some cases the pace at which they learn.

Traditional linear learning is based on the idea that you must first build a foundation through a set of carefully predefined segments or lessons delivered in a specific order before tackling more complicated topics. And although some people may respond well to working through an established series of concepts, the linear approach often leaves many of us unenthused and unmotivated.

Non-linear learning, in comparison, offers learners the freedom to actively construct a personalized educational journey. What does this freedom look like? It’s having the ability to make individualized choices based on background, expertise, and one’s own unique learning style.

Interest in and retention of the subject matter increase when learners are given the opportunity to select material relevant to their own lives, careers and projects. Choosing the sequence of learning materials allows learners to tailor the content to their individual needs, weaknesses and strengths.

Adopting the non-linear philosophy, eLearning programs like Zendesk Insights Advanced Learning offer users the best route to the information that is meaningful to them. There are several support tools available, including videos, guides, tasks, and challenge problems, but the combination and timing of these tools makes each user experience unique. Leveraging the advantages of non-linear learning in an online environment lends flexibility for time management, so users can engage with the tool at their own pace... (because it doesn’t matter if you’re the tortoise or the hare, as long as you cross the finish line ☺).


* We love to hear what our partners think, so for their thoughts on Zendesk Insights Advanced Learning check out http://g3t.ca/gvLQRm

Another year passed by

Yesterday my account in GoodData turned 5 years old!

It is one in the morning and the delivery service calls my phone: surprise! “We’ve got champagne for you.” It was from my colleagues. I’m alone in bed, sick with the flu, and I almost shed a tear.

Every single day for the last 12 months I looked forward to going to work. And the main reason is the people at Keboola. Thank you guys, without you I would probably just sit at the cash register at TESCO and… well, whatever.

This seems like a good moment to look back at the last 12 months. This is not in any particular order - and not a complete list, either:

  • We adjusted our positioning. From “just selling and implementing” GoodData, we moved to data enablement and started talking with everyone who might need to analyse data. You have the data; we will help you integrate it and get it “consumption ready”. If anyone wants to use highcharts.com, they are more than welcome - we are here for our clients, and “the tool” is there for our clients’ internal analytics people to use. We also built several non-analytics applications whose purpose is to deliver quality data into other companies’ products and platforms.
  • Our Keboola Connection ecosystem is growing rapidly. We are adding more and more new ways to push your data out for analytics and data discovery. Along with GoodData, we today support Tableau and Chartio, and we are planning support for Birst, RJMetrics and Anaplan. I would love to have support for SAS soon as well. If you have a tool that can read data from a DB, from a CSV on your hard drive or from a URL, you can get data from us today.
  • In total silence we have launched our “Apps Store”. Its main part is our still very juvenile app “LuckyGuess”, plus transformation templates which automate many daily routine tasks. Our goal is to support any app that really helps our users/analysts by providing added value - automatically analyzing data or automating processes around the data. If you can deliver such an app in Docker, we offer the best place to monetise it: we have the computing power, and clients already have their data with us… LuckyGuess is written primarily in R and does very basic, yet fundamental things: it detects relations between tables, tells you the data types, detects dependencies (regressions) between columns (“tell me which expenses bring the most customers”), and it can detect purchasing patterns and let you know when to go talk to a particular customer because he is most likely to buy. We are working on other apps, our own as well as partner-driven ones!

  • Marc Raiser is back. A few years ago he received an offer that he couldn’t refuse - working with Fujitsu Mission Critical Systems Ltd. (managing data from large machines and building AI on top of that). At the time we joked that Japan would be just an apprenticeship program for him and that he would be back soon. And voila, he is back working with us again, for now on development of LuckyGuess!
  • The rebuilding of our platform into async mode is nearly finished. It will give us unlimited horizontal scalability.
  • Martin Karásek developed a new design of our UI. No longer bare Bootstrap! While implementing the new design we also redesigned our whole approach to technical implementation of UI, and today everything will be an SPA application, on top of our APIs. Any partner of ours can skin it as he wishes and run it from his own servers if he has a need for that. Sneak peek of the Transformations UI:

  • We organised our first Enterprise Data Hackathon
  • We reorganized the company into two verticals - product and services. Our Keboola Connection team actually doesn’t have any direct clients any more. Everything is done through partners. Currently we have 7 partners. Our definition of a partner is any other entity which has a data business of their own and uses a tech stack from us to support their business. In just the last month we were approached by 4 more companies.
  • We now have a third partner in Keboola. So it’s me, Milan and Pavel Doležal. Pavel spends most of his time with our partners making sure they get all the right tools and support they need for their work and is leading the development of our partner network.
  • Vojta Roček left us and went his own "BI” way. Today he is in a new ecommerce holding Rockaway and he is leading people down the data-driven business development path. Keboola Connection adoption within Rockaway is growing every day.
  • Our extractor-framework - the environment where third parties can write their own extractors - is done and ready to use. Today it takes us ½ a day to connect to a new API.
  • We are finishing the app that can read Apiary Blueprint and by doing so we shall be able to read data from any API that has its documentation in Apiary.io with minimum development.
  • Working on “schemas” - the possibility to use standardized nomenclature for naming and describing the data. Think of it as a "data ontology". It will allow creation of smarter Apps, as they will be able to understand the meaning of the data.
  • We just launched TAGs - a form of dialogue between you and us about the data. Just tag a column as “location” and we will promptly serve you weather data for every address in that column; label a column as “currency” and right away you have up-to-date exchange rates, etc.
  • We are still 25 people and growing without a need to add too many more.
  • Zendesk launched online courses for Zendesk Insights within our own Keboola Academy. We trained hundreds of people how to use GoodData.
  • Our “Team Canada” has moved into new offices.
  • We publish many components as open source. If it makes sense, we want to provide it for you for free. Our JSON2CSV convertor is a first sign of this trend. The dream would be to run the most used extractors for free as well.

So that's where we are, what we've been doing and where we're going, exciting times! 

Now, to take my medicine...

    Gorila Mobil - Data-Driven Business

    In July 2014, O2 announced that it had acquired Gorila Mobil (a virtual mobile operator). Gorila had approached us at Keboola a couple of months before their official launch. They had one goal: “We need to be a data-driven company and we want you to help us set it up.”

    The brain behind Gorila is Roman Novacek, a brain that works a bit differently than yours or mine.

    18 months ago

    It was the morning of April 23, 2013, and I was on my way to one of the largest tech hubs, TechSquare, to see Roman. Back then Roman had started a company called Tarifomat, which offered its customers a way to find the best mobile operator deal and helped them switch. Honestly, Tarifomat was a great idea with a very difficult execution path. Their sales funnel is verrrry long: basically, they get paid only half a year after the client switches, and only if he doesn’t cancel beforehand. The path to getting paid is paved with unexpected traps like “the courier delivering the SIM card couldn’t find the address”, etc. A perfect fit for us - if only the client could increase their margins 30x. We rolled up our sleeves and proceeded with the project. It was a success. It had to be; our VP of Propaganda had written that companies can count on us, and we have to honor that promise.

    Tarifomat got a perfect overview of their whole funnel (up to 1500 requests a day). Roman says that it was the first time that he actually understood what was going on inside the company.

    Seven months later I got a call; it was Roman. He was being as secretive as James Bond, mysteriously speaking about some new virtual operator, but he couldn’t tell me more - only that the plan was to design the company from the ground up as a data-driven company. Once somebody starts sending these kinds of signals, I can’t help it: I lose my ability to focus on anything other than “new data-driven company”. Well, it looked like just a lot of talk, but shortly after that came the walk. Roman sent us the first payment, and right after the brief we began meetings and planning what exactly we would be solving together.

    Gorila was another virtual operator (inside the O2 network) and they tried to be very cool (check out the YouTube channel). But being cool is not the only ingredient for success….you need more….

    With our help, Roman put together daily dashboards which mapped out the full acquisition channel down to each campaign/media/type/position/brand message/product. The O2 team was taken aback when they saw that. The number of activated SIM cards was growing, expenses were falling - everybody was happy, champagne flowing everywhere… Only Roman’s team wasn’t celebrating. People weren’t actually using the SIM cards the way Gorila Mobil had envisioned. Now what???

    Friday afternoon:  "Let’s dig into the GoodData dashboards and solve this!"

    Sunday evening: The claim “Gorila mobil - the most of internet, FUP you!” was changed to “5 Kč/hr for calls and all the data you need”.

    A complete switch in brand positioning, based solely on data! I get shivers running up my spine just thinking about it. While we were chilling in our offices playing ping pong, Roman’s team was rocking it! The 5 Kč offer worked - they got enough data to support their further visions; they were bold and full of energy.

    Roman has a vision that data describes the now as it is and that we should use that knowledge to validate strategic directions and decisions.

    And we can see exactly the same pattern at DameJidlo. Gorila Mobil is soaked deep down in data and it works. “There was no time to hide anything; everybody from the O2 call centre to investors and partners had full access to all the data” - daily orders, comparisons, order values, SIM activations, the number of customers and their behaviour, how frequently and where they top up their SIM cards, etc.

    Unfortunately, this great ride lasted only three months. The Gorila project was over - so successful that Telefonica decided to buy it and incorporate it into the O2 structure.

    Roman moved on to a new project, and Keboola was ready. Anyhow, who else can say they have an iPhone cover made from real cherry wood? We are very interested to see how they will handle tricky things like the lifetime warranty on the bamboo iPad cover, and what role data will play in that business.

    I caught Roman during his trip through China where he was stuffing himself with chicken feet - so here’s a quick Google Docs style interview :)

    PS: Roman, what was the hardest thing at the beginning of Gorila?

    RN: Convincing O2 that we had to be agile and had NO time for endless meetings. We wanted to focus 100% of our time and energy on marketing, and we knew we needed the company to be 100% based on data. In the beginning, no one inside O2 believed this vision. Today (after the Gorila acquisition), O2 wants to have the same system as we had (using yesterday’s sales/activation/channel data). Dusan Simonovic and Jiri Caudr are the stakeholders, and I hope they will be successful with it. When you know what is going on inside the company, you don’t have to speculate. That gives you real power to make decisions and work hard to achieve your goals, because you know exactly WHAT you’re doing and WHY you’re doing it. No stumbling in the dark. That’s how you separate the wheat from the chaff…

    PS: What do you mean by “separating the wheat from the chaff”?

    RN: Well, you can have lots of excuses when you don’t have the data. You can come up with explanations for things that went wrong, and investors have no way to prove you wrong - at least in the short term. You can argue that the market has changed, that some externalities worked against you, etc. Once most of your decisions are firmly based on data and anybody can see the results on an ongoing basis, you literally put your skin in the game. If you f..ck something up, anybody can see in the data what happened - what the circumstances were before the decision and how it looks right now. I love this. I got addicted to data-driven business and now I can’t do it any other way. My head just works this way :)

    PS: This was your second project with Keboola. How was it for you working with us again?

    RN: Once I convinced my partners to build a “metrics-driven company”, the hardest part was getting all the data sources: network information, info from marketing tables, Google Analytics, and distribution and logistics data from the post office, couriers, the CMS, etc. We got lots of help from Martin Hakl and his company Breezy. They built our website and all the data extractors, so the data could flow into Keboola.

    PS: Could you show us something?

    RN: Yep, though I have fudged the axes a bit… Now you will ask what it is and how we worked with it, right? :) What you can see in the report is SIM card activations over time. The dotted lines are linear extrapolations - trends - so you can see the general direction, growth or stagnation. We have this report on everybody’s dashboard and we can filter it by the channel where the SIM card was activated (for example the large newsstand network, post offices, lottery terminals, etc.). In exactly the same way, you can see activations by marketing channel. If you click on any point inside the report, another dashboard opens and you can see how much money we get from that particular set of activations and how those people behave. It’s hard to describe; it is much better to show it live :)

    PS: My favourite question is to ask people if they had an “aha moment” - the point at which you suddenly realize that everything you’ve done up to that point was wrong and you have to turn 180 degrees in a different direction. Do you have a moment like this?

    RN: We were spending around 1.5M a month on advertising. Marketing-wise, our ads performed super well! But when we drilled through the data, we noticed that some campaigns were totally bad and dragged down the overall average. The interesting thing was that you couldn’t see this when you looked at the campaign totals, because we had some campaigns that were super extra good and they masked the deviation. If we didn’t have the data, we couldn’t have discovered this at all. We had some extra-great campaigns and some shitty ones, but the average was OK. We could dig into the details, find the bad performers, turn them off and start all over again the next day. Every day we sent out over 500 SIM cards, and we knew exactly how much they cost us, how long it would take them to connect to the network, how much they were going to spend and how long they would stay with us. We could sleep peacefully, because we had data :)

    PS: So now you’re a data guy forever?

    RN: You got it, buddy! :-) In any company I start, data analytics will be the first thing I take care of. Once we can see inside the data and know what’s going on inside the company, we are better able to take risks and test new things.



    GoodData XAE: The BI Game-Changer (3rd part)

    For previous part (2/3), continue here


    Discovering the value you can’t see.

    Creating a query language is the most complicated task to be solved in BI. It’s not about storing big data, nor about processing it, nor about drawing graphs and building APIs for smooth cooperation with clients. You cannot buy a query language, nor can you program one in a month.

    If a query language is too complicated, the customer won’t manage to work with it. If it is too simplistic, the customer won’t be able to work with it the way he needs to. GoodData has a simple language for expressing arbitrarily complicated questions about the data. At the same time, it has a mechanism that lets it apply that language to any complicated BI project (or logical data model). In GoodData’s case, I have already mentioned that MAQL/AQE is, in my view, the irreplaceable piece. Furthermore, the guys from Prague and Brno - Tomáš Janoušek, David Kubečka and Tomáš Jirotka - have extended AQE with a set of mathematical proofs (complicated algebra) that allow quick tests of whether new AQE functions work for any type of logical model. That’s how GoodData makes sure the translations between (MAQL) metrics and the SQL in the underlying databases are correct. AQE thus helps the ordinary user bridge the chasm that separates him from low-level scripting.

    UPDATE 17. 11. 2013: MAQL is a query language that the MAQL interpreter (previously known as QT, the “Query Tree” engine) translates into a tree of queries based on the logical data model (LDM). These queries are in fact definitions of “star joins”, from which the “Star Generator” (SJG) creates the actual SQL queries in the DB backend according to the physical data model (PDM, which lies below the LDM). The whole thing was originally created by Michal Dovrtěl and Hynek Vychodil. The new implementation, AQE, has further laid all of this onto a solid mathematical basis of ROLAP algebra (similar to relational algebra).

    After weeks of persuasion and, yes, bribery, I managed to obtain lightly censored examples of the queries that AQE creates out of metrics I wrote for this purpose. I guess this is the first time anyone has actually published this...

    For comparison, I used the data model from the Report Master course in Keboola Academy and made this report from it:

    The right Y-axis of the graph shows how many contracts I have closed in Afghanistan, Albania, Algeria and American Samoa in the last 13 months. On the left Y-axis, the blue line shows the average revenue my salespeople have brought me, and the green line indicates the median sales in a given month (the inputs are de facto identical to the table from Part 1 of this series).

    The graph then shows me three metrics (as per the legend below the graph):

    • “# IDs” = SELECT COUNT(ID (Orders)) – counts the number of orders (contracts).

    • “Avg Employee” = SELECT AVG(revenue per employee) – computes the mean of the (auxiliary) metric that sums turnover per salesperson.

    • “Median Employee” = SELECT MEDIAN(revenue per employee) – computes the median of the (auxiliary) metric that sums turnover per salesperson.

    and the auxiliary metric:

    • “revenue per employee” = SELECT SUM(totalPrice (Orders)) BY employeeName (Employees) – sums the order values (totalPrice) at the level of the individual salesperson.

    For the most part, everything explains itself – except maybe “BY”, which states that the money “totalPrice (Orders)” is aggregated per salesperson rather than summed chaotically on its own. I dare say that anyone who is willing and tries MAQL even a little is going to learn it (or, for that matter, we can teach it to you in Keboola Academy any time ☺).
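    Before looking at the real thing, here is a rough, hand-written SQL analogue of those metrics. This is emphatically not the SQL that AQE generates (that follows below); it is just a sketch over hypothetical tables Orders(order_id, total_price, employee_id, order_month) and Employees(employee_id, employee_name), to show roughly what the BY construct spares you from writing:

    -- Hand-written analogue of the metrics above; table and column names are hypothetical.
    WITH revenue_per_employee AS (        -- the auxiliary "BY employeeName" metric
        SELECT o.order_month,
               e.employee_name,
               SUM(o.total_price) AS revenue
        FROM Orders o
        JOIN Employees e ON e.employee_id = o.employee_id
        GROUP BY o.order_month, e.employee_name
    )
    SELECT order_month,
           AVG(revenue)                                         AS avg_employee,
           -- PERCENTILE_CONT ... WITHIN GROUP is the Postgres way to get a median
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) AS median_employee
    FROM revenue_per_employee
    GROUP BY order_month;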

    And now the most important thing... see below how AQE translates all of this into SQL:

    With a little bit of exaggeration, we can say that computing my report is actually quite difficult - but thanks to AQE, that doesn’t bother me at all.


    If these three hypotheses are valid:

    1. If GoodData doesn’t earn me a bunch of money, I won’t use it.
    2. I will earn a bunch of money only with a BI project built to suit MY exact needs.
    3. A BI project built to suit MY exact needs is a complex matter that can only be managed with AQE.

    … then the basis of the success of GoodData is AQE.

    A footnote: the aforementioned MAQL metrics are simple examples. Sometimes it is necessary to build metrics so complicated that it is almost impossible to imagine what must happen to the underlying data. Below is an example of a metric from a project where the analytics is built on unstructured texts; the metric counts conversation topics over the current period, broken down by moderator:

    Lukáš Křečan once blogged (CZ) that people are the greatest competitive advantage of GoodData.

    Translation: “Our biggest competitive advantage is not a unique technology that no one else has. The main thing is people. ”

    People are the foundation. We cannot do this without them; it is they who create the one-and-only atmosphere in which unique things are born. However, both are replaceable. The biggest competitive advantage of GoodData (as well as its intellectual property) is AQE. Without it, users would have to click reports together in a closed UI, which would take away the essential flexibility. Without AQE, GoodData would rank alongside Tableau, Bime, Birst and the others. It would become basically uninteresting and would have to compete head-on with firms that build their own UI on top of “Redshifts”.

    AQE is an unrepeatable opportunity to get ahead of the competitors, who can then only fall behind. No one else is able to implement a new function into a product working with arbitrary data in arbitrary dimensions while analytically proving and testing the validity of the implementation.

    The line between the false impression that “this cool dashboard is very beneficial for me” and the real potential you can dig out of the data is very thin… its name is customization: an arbitrary model over arbitrary data, with arbitrary calculations on top of it. You could call it extreme. However, without the ability to compute, say, the natural logarithm of the ratio of figures from two time periods across many dimensions, you cannot become a star in the world of analytics. AQE is a game-changer in the field of BI, and only thanks to it does GoodData redefine the rules of the game. Today an arbitrary root, tomorrow K-means… ☺

    Howgh!

    GoodData XAE: The BI Game-Changer (2nd part)

    For previous part (1/3), continue here


    An honest look at your data

    Moving forward with our previous example: uploading all of the data sources we use internally (from one side of the pond to the other) into an LDM makes each piece of information easily accessible in GoodData - that’s 18 datasets and 4 date dimensions.

    Over this model, we can now build dashboards in which we watch how effective we are, compare the months with one another, compare people, different kinds of jobs, look at the costs, profits and so on.

    Therefore, everything in our dashboard suits our needs exactly. No one dictated to us how the program would work... this freedom is crucial for us. Thanks to it, we can build anything we want in GoodData – only our own abilities determine whether we succeed and make the customer satisfied.

    What’s a little bit tricky is that a dashboard like this can be built in almost anything. For now, let’s focus on dashboards from KlipFolio. They are good; however, they have one substantial “but” – all the visual components are objects that load information from rigid, predefined datasets. Someone designed those datasets exactly for the needs of the dashboard, and they cannot be tweaked - try taking two numbers out of two base tables… and watching their quotient over time. A month-to-date of that quotient can be forgotten immediately… and don’t even think about situations with “many-to-many” linkages. The great advantage of these BI products (they call themselves BI, but we know the truth) is that they are attractive and pandering. However, one should not assume at the outset that he has bought a diamond when in actuality it cannot do much more than his Excel. (Just ask any woman her thoughts on cubic zirconia and you’ll see the same result.)
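    To make the “quotient of two base tables over time” example concrete, this is roughly what you have to hand-roll once the numbers live in two different tables - a sketch in Postgres-flavoured SQL over two hypothetical tables, costs and revenues:

    -- Hypothetical tables: costs(cost_ts, amount), revenues(rev_ts, amount).
    -- The "quotient in time" = monthly revenue divided by monthly cost.
    WITH monthly_costs AS (
        SELECT DATE_TRUNC('month', cost_ts) AS month, SUM(amount) AS cost
        FROM costs GROUP BY 1
    ),
    monthly_revenues AS (
        SELECT DATE_TRUNC('month', rev_ts) AS month, SUM(amount) AS revenue
        FROM revenues GROUP BY 1
    )
    SELECT r.month,
           r.revenue / NULLIF(c.cost, 0) AS revenue_per_cost
    FROM monthly_revenues r
    JOIN monthly_costs c USING (month)
    ORDER BY r.month;

    In a tool tied to rigid, predefined datasets there is simply no place to express this join; in a model-driven tool it is one derived metric.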

    Why is the world flooded with products that play in a little sandbox whose walls are plastered with cool visuals? I don’t know. What I do know is that people are riding the “Cool, BigData analytics!” wave and are hungry for anything that looks even a little like a report. Themed analytics can be done in a few days – transforming transactions and calculating “customer lifetime value” is easy, until everyone starts telling you their individual demands.

    No one in the world except GoodData has the ability to run analytics projects that are 100% free in their foundation (the data model) and to let people do anything they want in these projects without having to be “low-level” data analysts and/or programmers. Bang!

    So how does GoodData manage to do it?

    Everyone is used to adding up column “A” by entering the formula “=SUM(A:A)”. In GoodData, you add up the “A” column by entering the formula “SELECT SUM(A)”. The language used to write all these formulas in GoodData is called MAQL – Multi-Dimensional Analytical Query Language. It sounds terrifying, but everyone has been able to manage it – even Pavel Hacker has a Report Master diploma from Keboola Academy!

    If you look back at my data model from our internal projects, you might say that you want the average number of hours from one side of the data model, but filtered by the type of task, grouped by project description and client name, and only for work that took place on weekend afternoons. The whole metric will look something like “SELECT AVG(hours_entries) WHERE name(task) = cleaning”. The multi-dimensionality of the language is hidden in the fact that you don’t have to deal with questions such as: which dimension is the task name in? What relation does it hold to the number of worked hours? And furthermore, what relation does it hold to the client’s name? GoodData (or rather the relations in the logical model that we design for our client) solves all of that for you.

    So, getting straight to the point: if I design a (denormalized) Excel table in which everything is comfortably put together, no one reading this will have trouble computing over it. If instead we give you data split into dimensions (and the dimensions will often come from different sources – such as the outputs of our Czech and Canadian accounting systems), it becomes much more complicated to process (most likely you will start piling up SQL like a mad person). Since the world cannot be described in one table (or maybe it can – as key-value pairs... but you cannot do much with that), the ability to look across many dimensions is essential. Without it, you are just doing a little home arithmetic ☺.
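    To illustrate the “piling up SQL like a mad person” part, here is what the weekend-afternoon question from above might look like once the data is split into dimensions. The table and column names are hypothetical, invented purely for this sketch (Postgres-flavoured SQL); in MAQL the joins disappear because the logical data model already knows the relations:

    -- Hypothetical tables: hours_entries(entry_id, task_id, client_id, hours, entry_ts),
    -- tasks(task_id, task_name), clients(client_id, client_name).
    SELECT c.client_name,
           AVG(h.hours) AS avg_hours
    FROM hours_entries h
    JOIN tasks   t ON t.task_id   = h.task_id
    JOIN clients c ON c.client_id = h.client_id
    WHERE t.task_name = 'cleaning'
      AND EXTRACT(DOW  FROM h.entry_ts) IN (0, 6)   -- Sunday or Saturday
      AND EXTRACT(HOUR FROM h.entry_ts) >= 12       -- afternoons only
    GROUP BY c.client_name;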

    Do I still have your attention? Now is almost the time to say “wow” because if you like to dig around in data, you are probably over the moon about our described situation by now ☺.

    And to the Finale...

    Creating a query language is the most complicated task to be solved in BI. GoodData, on the other hand, uses a simple yet effective language to mitigate these “complications” and express the questions you have about your data. Part 3 of our series will dive deeper into this language, known as MAQL, and its ability to easily derive the insights hidden in your data.


    For final part (3/3), continue here

    GoodData XAE: The BI Game-Changer (1st part)

    Putting your data into the right context

    At the beginning of last summer, GoodData launched its new analytical engine, AQE (Algebraic Query Engine). Its official product name is GoodData XAE. However, since I believe that XAE is Chinese for “underfed chicken”, I will stick with AQE ☺. From the first moment I saw it, I have considered it the concept with the biggest added value. When Michael showed me AQE, I immediately fell in love.

    However, before we can truly reveal AQE and the benefits that can be derived from it, we need to begin with an understanding of its position in the market - starting from the foundation on which GoodData’s platform rests. In a three-part series, we’ll cover AQE’s impact on contextual data, delivering meaningful insights and, finally, digging for those hidden gems.

    First, a bit more comprehensive introduction...

    Any system with ambitions to visualize data needs some kind of mathematical engine. For instance, if my input is a list of sold items with the names of salespeople, and my goal is to find the median of the salespeople’s turnover, then somewhere in the background a summation of the sold items per month (and per salesperson) must take place. Only after getting that result can we compute the requested median. Notice the graphic below - the left table is the raw input, while the right table is derived in the course of the process; most of the time we don’t even realize that these intermediate outputs keep arising. From the right table, we can quickly calculate the best salesperson of the month, the average salesperson, the median, and so on…

    And how does this stack up against the competition?

    If we don’t have a robust analytical backend, we don’t have the freedom to do whatever we want. We have to tie our users to some pre-built “vertical analysis” (churn analysis of an e-shop’s customers, RFM segmentation, subscription cohorts, etc.). There are many ways to fiddle with data: besides GoodData, you can find tools such as Birst, Domo, Klipfolio, RJMetrics, Jaspersoft, Pentaho and many, many others. They look really cool, and I have worked with some of them before! A lone data analyst can also reach for R, SPSS, RapidMiner, Weka and other tools. However, those are not BI tools.

    Most of the aforementioned BI tools do not have a sophisticated mathematical engine. They will simply let you count records, calculate the frequency of items, and find the maximum, minimum and mean. The RJMetrics promo video is a great example.

    Can I just use a calculator instead?

    Systems such as Domo.com or KlipFolio.com solve the problem of the missing mathematical engine in a somewhat bluffing way. They offer their users a number of mathematical functions – just as Excel does. The crucial difference is that these work only on separate tables, not on the whole data model. You might think that this doesn’t matter, but quite the contrary – it is the pillar of everything connected to data analytics. I will try to explain why...

    The boundary of our sandbox is set by the law of conservation of “business energy”:

    “If we don’t manage to earn our customer more money than our services (and GoodData license) cost him, he won’t collaborate with us.“

    If, for example, we just take the list of invoices from SAP and draw a growth chart from it, our customers will throw us out of their offices. We need a little bit more. We need to put each data dimension into context (a dimension = a thematic data package, usually represented by a data table). A dimension does not need to have any strictly defined linkages on its own; in our analytics project, such a table is called a dataset.

    But how is it all connected?

    The moment we give each dimension its linkages (parents, children… siblings?), we get a logical data model. A logical data model describes the “business” linkages, and most of the time it is not identical to the technical model in which a given system stores its data. For example, if Mironet has its own e-shop, the e-shop’s database is optimized for the needs of the e-shop – not for financial, sales and/or subscription analytics. The more complicated the environment whose data we analyze, the fewer similarities the technical and analytical data models share. Bridging the low structural similarity between the source data and the data needed for analytics is what separates GoodData from the other companies.

    A good example of this is our internal project. I chose it because it contains only the logical model we need for ourselves; it is not artificially extended just because we know “the customer will pay for it anyway”.

    We upload different kinds of tables into GoodData. These tables are connected through linkages. The linkages define the logical model; the logical model then defines what we can do with the data. Our internal project serves to measure our own activity and connects data from the Czech accounting system (Pohoda), the Canadian accounting system (QuickBooks), the cloud application Paymo.biz and some Google Drive documents. In total, our internal project has 18 datasets and 4 date dimensions.

    The first image (below) is a general model, select the arrow in the left corner to see what a more detailed model looks like.

    In the detailed view (2 of 2), note that the name of the client is marked with red, the name of our analyst is marked with black and the worked hours are marked with blue. What I want to show here is that each individual piece of information is widely spread throughout the project. Thanks to the linkages, GoodData knows what makes sense altogether.

    UP NEXT

    Using business-driven thinking to force your data to comply with your business model (rather than the other way around) will allow you to report on meaningful and actionable insights. Part 2 of this series on AQE (...or, more formally, XAE) will uncover the translation of the logical data model into the GoodData environment.


    For next part (2/3), continue here

    Aggregation in MongoDB, Oracle, Redshift, BigQuery, VoltDB, Vertica, Elasticsearch, GoodData, Postgres and MySQL

    "Executive Summary"

    It kind of got out of hand. It exploded...

    I’ve been trying to describe how to do the same procedure within different systems. In the end, I tried to do the same using GoodData and Keboola Connection, and I have attached a screenshot (see below). I know it’s more high-level than just the database, but I believe it shows the beauty and speed of the tool.

    Summary table:

    Intro

    At the end of last year I found this blogpost: "MongoDB 'Lightning Fast Aggregation' Challenged with Oracle”.  Lukas Eder did the same aggregation, using Oracle, as Vlad Mihalcea had done a week earlier with MongoDB.

    Lukas Eder gave me the source data: 50,000,000 lines of events with a time and a value:
    created_on,value
    2012-05-02T06:08:47Z,0.9270193106494844
    2012-09-06T22:40:25Z,0.005334891844540834
    2012-06-15T05:58:22Z,0.05611344985663891
    2012-01-05T20:47:19Z,0.2171613881364465
    2012-02-10T00:35:17Z,0.4581454689614475
    2012-06-19T17:46:41Z,0.9841521594207734
    2012-08-20T21:42:19Z,0.3296361684333533
    2012-02-24T20:29:17Z,0.9760254160501063

    Below, you’ll find 10,000 lines to get you started, as well as the whole data set. Both have headers, use no enclosures, and use commas as delimiters (until mid-September you can download them from my S3; then they will go to Glacier):

    Two tests are being done:

    • Test A - perform the aggregation by year and by day within the year, recording the number of entries and the daily average, minimum and maximum values
    • Test B - exactly the same as Test A, but with an additional hour-level grouping and a filter restricting the data to a single hour
    I’ve tried to replicate the conditions as much as possible, using Amazon Redshift, Google BigQuery, VoltDB, HP Vertica, Elasticsearch, GoodData, Postgres and MySQL. The purpose is not really to find out who is the fastest; that’s why I don’t insist on having exactly the same conditions. To be exact, Google BigQuery runs on “unknown hardware”, so it wouldn’t be possible anyway. I’m more interested in how difficult - or how easy - it is to get the same result when using these different platforms. I also tasked Redshift with 10x the volume - 500,000,000 lines - but those are just 10 repetitions of the same data set. In the GoodData example I’ve added some complications, so you can see how easy it is to work with.

    And here are the results: 

    MongoDB

    (for details see Vlad’s blog - link above)

    Test A: 129s
    Test B: 0.2s

    Oracle

    (for details see Lukas’s blog — link above)

    Test A: 32s
    Test B: 20s first run, 0.5s second run

    Redshift

    I’ve uploaded the data to Redshift from S3. I had to do these steps before running the test:
    1. create a table
    2. import the data
    3. the system couldn’t recognise the ISO8601 time format, so I had to alter the table:
      1. add a timestamp column
      2. populate it from the timestamp in the original data
      3. delete the original date column
    4. add a SORTKEY and run the ANALYZE and VACUUM commands

    (here you’ll find the exact list of SQL queries and times)
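    For orientation, here is a minimal sketch of what that preparation might look like; the table name and S3 path are illustrative, and the linked list has the exact queries that were actually used:

    -- Hypothetical sketch of the preparation steps.
    CREATE TABLE randomdata_raw (
        created_on VARCHAR(25),        -- ISO8601 string straight from the CSV
        value      DECIMAL(22,20)
    );

    COPY randomdata_raw FROM 's3://my-bucket/randomData.csv'   -- illustrative path
    CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
    CSV IGNOREHEADER 1;

    -- Redshift would not parse the ISO8601 strings directly, so convert them and
    -- build the final table with a SORTKEY (which must be declared at creation time):
    CREATE TABLE randomdata SORTKEY(created_at) AS
    SELECT CAST(REPLACE(REPLACE(created_on, 'T', ' '), 'Z', '') AS TIMESTAMP) AS created_at,
           value
    FROM randomdata_raw;

    ANALYZE randomdata;
    VACUUM randomdata;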

    Test A

    Exact query used:
    SELECT
         EXTRACT(YEAR FROM created_at),
         EXTRACT(dayofyear FROM created_at),
         COUNT(*),
         AVG(value),
         MIN(value),
         MAX(value)
    FROM RandomData
    GROUP BY
         EXTRACT(YEAR FROM created_at),
         EXTRACT(dayofyear FROM created_at)
    ORDER BY
         EXTRACT(YEAR FROM created_at),
         EXTRACT(dayofyear FROM created_at);
    Redshift dw1.xlarge (15s)
    Redshift dw2.large (7s)

    500,000,000 lines version:

    Redshift dw1.xlarge (182s)
    Redshift dw2.large (53s)

    Output

    Test B

    Used query:
    SELECT
         EXTRACT(YEAR FROM created_at),
         EXTRACT(DAYOFYEAR FROM created_at),
         EXTRACT(HOUR FROM created_at),
         COUNT(*),
         AVG(value),
         MIN(value),
         MAX(value)
    FROM RandomData
    WHERE
         created_at BETWEEN
         TIMESTAMP '2012-07-16 00:00:00'
         AND
         TIMESTAMP '2012-07-16 01:00:00'
    GROUP BY
         EXTRACT(YEAR FROM created_at),
         EXTRACT(dayofyear FROM created_at),
         EXTRACT(HOUR FROM created_at)
    ORDER BY
         EXTRACT(YEAR FROM created_at),
         EXTRACT(dayofyear FROM created_at),
         EXTRACT(HOUR FROM created_at);
    Redshift dw1.xlarge (1.3s)(first run)
    Redshift dw1.xlarge (0.27s)(second run)

    500,000,000 lines version:

    Redshift dw1.xlarge (46s)(first run)
    Redshift dw1.xlarge (0.61s)(second run)

    Output:

    And the query plan, for those who love it :)

    Google BigQuery:

    You can communicate with BigQuery using the REST API or the web interface. I used the command-line client in a console running on the server. I had to download the sample data onto its drive first.

    Necessary steps before the actual query:

    1. Create a project via https://console.developers.google.com/ and under API&auth authorize BigQuery (credit card has to be inserted)
    2. Create a “project” in BigQuery
    3. Import the data (this took a while, although I didn’t record the exact time)

    You can’t really tweak BigQuery too much. There are no indexes, keys, etc.

    Project set up from console:
    ./bq mk rad.randomData
    --0s

    Data import:
    ./bq load --noallow_quoted_newlines --max_bad_records 500 --skip_leading_rows=1 rad.randomData ./randomData.csv created_on:timestamp,value
    -- ~1500s (a pity I don't have the exact time)

    Test A

    The query:

    SELECT
      YEAR(created_on) AS Year,
      DAYOFYEAR(created_on) AS DayOfYear,
      COUNT(*) AS Count,
      AVG(value) AS Avg,
      MIN(value) AS Min,
      MAX(value) AS Max
    FROM [rad.randomData]
    GROUP BY
      Year, DayOfYear
    ORDER BY
      Year, DayOfYear;

    -- 7s (1.3GB)

    Test B

    The query:

    SELECT
      YEAR(created_on) AS Year,
      DAYOFYEAR(created_on) AS DayOfYear,
      HOUR(created_on) AS Hour,
      COUNT(*) AS Count,
      AVG(value) AS Avg,
      MIN(value) AS Min,
      MAX(value) AS Max
    FROM [rad.randomData]
    WHERE created_on >= '2012-07-16 00:00:00' AND created_on <= '2012-07-16 01:00:00'
    GROUP BY
      Year, DayOfYear, Hour
    ORDER BY
      Year, DayOfYear, Hour;

    -- 2.9s / 1.6s (cached)

    VoltDB

    I stumbled upon this database by chance (Q4/2013). They make lots of claims, but I was surprised to find that you couldn’t even extract the day of the year from a timestamp; you had to pre-process the data and prepare it for that.

    My “tour” of VoltDB ended with me calling their support, where a VoltDB Solution Engineer named Dheeraj Remella tried to help me (he was excellent!) and promised he would do the test for me. The subsequent email exchange took quite a while.

    Meanwhile, they managed to release version 4.0, which includes the EXTRACT() function. The results follow:

    Data import:
    Read 50000001 rows from file and successfully inserted 50.000.000 rows (final)
    Elapsed time: 1735.586 seconds

    Test A:

    SELECT
         EXTRACT(YEAR FROM create_on_ts) AS Year,
         EXTRACT(DAY_OF_YEAR FROM create_on_ts) AS DayOfYear,
         COUNT(*) as groupCount,
         SUM(value) as totalValue,
         MIN(value) as minimumValue,
         MAX(value) as maximumValue
    FROM RandomData
    GROUP BY
         EXTRACT(YEAR FROM create_on_ts),
         EXTRACT(DAY_OF_YEAR FROM create_on_ts);

    -- 70ms

    Test B:

    -- 330 ms

    The times are great! So I started to find out whether he had pre-computed Year, Day and Hour...

    UPDATE: Yes, in his test DB, he pre-computed date attributes. Here is his DDL: https://s3.amazonaws.com/padak-share/blog/voltdb-ddl.sql

    The hardware used: MacBook Pro (Intel i5 2.5 GHz processor - 2 cores, Memory 16 GB).

    VoltDB looks very interesting. I was just a little puzzled that, for instance, you set up a database by using a binary client in the terminal, and it runs some things as Java code:
    voltdb compile -o random.jar random.sql
    voltdb create catalog random.jar
    csvloader randomdata -f randomData10.csv --skip 1

    It’s not my cup of tea for now, but I will be watching — in time it might be cool!

    HP Vertica

    When Jan Císař ran this test for me, Vertica was quite a challenge. It’s changed dramatically since then, and is now quite a nice tool.

    Preparations:
    SET TIMEZONE 'UTC';
    CREATE TABLE RandomData_T (
         created_on TIMESTAMP,
         value DECIMAL(22,20)
    );

    COPY RandomData_T from '/tmp/randomData.csv' delimiter ',' null as '' enclosed by '"' exceptions '/tmp/load.err';
    Time: First fetch (1 row): 121s. All rows formatted: 121s

    #warming up ... :)
    SELECT * FROM RandomData_T LIMIT 10;

    Test A

    SELECT
         EXTRACT(YEAR FROM created_on),
         EXTRACT(doy FROM created_on),
         COUNT(*),
         AVG(value),
         MIN(value),
         MAX(value)
    FROM RandomData_T
    GROUP BY
         EXTRACT(YEAR FROM created_on),
         EXTRACT(doy FROM created_on)
    ORDER BY
         EXTRACT(YEAR FROM created_on),
         EXTRACT(doy FROM created_on);

    --  Time: First fetch (366 rows): 2068ms

    Test B

    SELECT
         EXTRACT(YEAR FROM created_on),
         EXTRACT(doy FROM created_on),
         EXTRACT(HOUR FROM created_on),
         COUNT(*),
         AVG(value),
         MIN(value),
         MAX(value)
    FROM RandomData_T
    WHERE
         created_on BETWEEN
         TIMESTAMP '2012-07-16 00:00:00'
         AND
         TIMESTAMP '2012-07-16 01:00:00'
    GROUP BY
         EXTRACT(YEAR FROM created_on),
         EXTRACT(doy FROM created_on),
         EXTRACT(HOUR FROM created_on)
    ORDER BY
         EXTRACT(YEAR FROM created_on),
         EXTRACT(doy FROM created_on),
         EXTRACT(HOUR FROM created_on);

    --   Time: First fetch (2 rows): 26ms

    Jan used AWS instance type 4xlarge, approximately 30GB of RAM, SSD disks and 4x Intel Xeon E5-2680

    Elasticsearch

    Around New Year I put a teaser about these tests on my Facebook, and Karel Minarik called me to say he would do the test in Elasticsearch. I was very excited, and here is the result. Executive summary: it’s fast as hell! It’s quite complicated to get to the point where you can actually query, and the import takes a while. For me it would actually be too difficult to reproduce, because it relies on a lot of Ruby code.

    Karmi's results are here.

    Pure aggregation: 16.3s
    Aggregation + filter: 10ms

    GoodData

    Sure, it does not compete on millisecond differences, but there is no match for its ease of use when producing results.

    I uploaded the data from S3 into Keboola Connection (about as difficult as sending an email with an attachment) and told Keboola Connection how to import it into GoodData. The preparation inside the GoodData project is very simple: I marked the first column as a date (with time) and the second as a number.

    Click “Upload Table” and Keboola Connection prepares everything else. It creates the physical and logical data models inside GoodData, and it parses and exports the data into a format that GoodData imports. To get you excited, I have included a link to the communication log with the GoodData API. All of this is hidden behind one button for the end user, or one API call.

    We deliver the description and GoodData walks through it, preparing the data accordingly and importing it into the BI project. I advise everyone to think of it as the compact table you see on your end - the input - although all of it is actually stored in columns that link to each other and look like “snowflakes”. Loading the data (after parsing inside Keboola Connection, the table is approximately 4 GB) takes around one hour.

    Tests

    For the first test — aggregation over 50 million lines — you need to create four metrics:
    1. a count of the records [ COUNT(Records of randomData) ]
    2. the average value [ AVG(value) ]
    3. the minimum value [ MIN(value) ]
    4. the maximum value [ MAX(value) ]

    and you “look at them” by year and by day within the year.

    To complete the second test, you just add the filters.

    Both of the tests are in the screencast below. The footage is unedited on YouTube, so you can see how intuitive and fast it is. Yes, it’s possible to measure the time it takes to calculate the report, but that’s not the point. The point is to show how EASY it is compared to the other approaches.


    I thought I could make it a bit more complicated and show you how to create “By how many % did the aggregated count of records change day over day?”. Again, here you can find the unedited video:
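    (For readers who think in SQL rather than MAQL: that day-over-day metric corresponds roughly to a window-function query like the sketch below - Postgres-flavoured, over a hypothetical randomdata table.)

    -- Daily record counts plus the day-over-day percentage change (illustrative only).
    WITH daily AS (
        SELECT created_on::date AS day,
               COUNT(*)         AS cnt
        FROM randomdata
        GROUP BY 1
    )
    SELECT day,
           cnt,
           ROUND(100.0 * (cnt - LAG(cnt) OVER (ORDER BY day))
                       / NULLIF(LAG(cnt) OVER (ORDER BY day), 0), 2) AS pct_change_day_over_day
    FROM daily
    ORDER BY day;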

    With every report, GoodData offers an “explain in DB” link which shows what needs to be done to compute the actual query. When you chart it, it looks like this:


    Postgres

    Jan Winkler ran Tests A and B on PostgreSQL (9.3.4). Details are here: http://goo.gl/xt3qZM (a rough sketch of the queries follows the list below)
    • Import took about 2 minutes
    • Test A (no indexes, no vacuum): 33.55 s
    • Test B (no indexes): 4.5 s
    • Test B (with an index on the created_on column, first run): 0.06 s
    • Test B (with an index on the created_on column, second run): 0.015 s
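    The exact queries are in the linked gist; for orientation, here is a sketch of what they presumably look like (the same shape as the Redshift queries above, plus the index used for Test B):

    -- Sketch only; Jan's exact queries are behind the link above. PostgreSQL syntax.
    -- Test A: aggregate by year and day of year.
    SELECT EXTRACT(YEAR FROM created_on) AS year,
           EXTRACT(DOY  FROM created_on) AS day_of_year,
           COUNT(*), AVG(value), MIN(value), MAX(value)
    FROM randomdata
    GROUP BY 1, 2
    ORDER BY 1, 2;

    -- The index that makes the filtered Test B fast:
    CREATE INDEX idx_randomdata_created_on ON randomdata (created_on);

    -- Test B: the same aggregation restricted to a single hour.
    SELECT EXTRACT(YEAR FROM created_on) AS year,
           EXTRACT(DOY  FROM created_on) AS day_of_year,
           EXTRACT(HOUR FROM created_on) AS hour,
           COUNT(*), AVG(value), MIN(value), MAX(value)
    FROM randomdata
    WHERE created_on BETWEEN TIMESTAMP '2012-07-16 00:00:00'
                         AND TIMESTAMP '2012-07-16 01:00:00'
    GROUP BY 1, 2, 3
    ORDER BY 1, 2, 3;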

    MySQL

    Finally, I did the same tests in MySQL. The server has 64 GB of RAM, SSD discs...

    Test A

    SELECT
         YEAR(`created_on`),
         DAYOFYEAR(`created_on`),
         COUNT(*),
         AVG(value),
         MIN(value),
         MAX(value)
    FROM `randomData`
    GROUP BY 1,2
    ORDER BY 1,2;

    -- 46sec

    Most of the time was used for creating the tmp table:

    Test B

    SELECT
         YEAR(`created_on`),
         DAYOFYEAR(`created_on`),
         HOUR(`created_on`),
         COUNT(*),
         AVG(value),
         MIN(value),
         MAX(value)
    FROM `randomData`
    WHERE
         `created_on` BETWEEN '2012-07-16 00:00:00' AND '2012-07-16 01:00:00'
    GROUP BY 1,2,3
    ORDER BY 1,2,3;

    -- 0.022sec
    Summary for MySQL: its optimizer is crap :) Hynek Vychodil did get Test A down to 33 sec (http://goo.gl/7qKsP4).

    A final few words ….

    Importing the data poses a different degree of difficulty on each platform, so I don’t include it in my evaluation. But I can say that the final aggregation of data for the end user has always been very fast.

    If you know how to use SQL and you just need a few queries, BigQuery should be your choice.  

    If you want to ask 'zillions' of business questions and not take care of your own DB cluster, then Amazon Redshift is great.

    If your data doesn’t have a very firm structure, Elasticsearch is perfect (Karmi told me there is a guy from Germany who pours more than 1TB of data into Elastic every day and has no problem with speed).

    If you want to process the data and/or you have a more complicated query structure, and somebody will be asking lots of business questions, then I believe you should go with Keboola Connection and GoodData.


    Why is GoodData special?

    Today's world is oversaturated with data. Telling stories through data is becoming so sexy that many people are building their careers on it. A few semi-experts in the Czech Republic have even changed their colours and started talking about BigData (in the worst case, they even hold conferences on the topic). However, I'll save that for a future blog post, in which I'll ground their Hadoop enthusiasm a bit for you.

    Data...

    People want to know more about the environment in which they operate. It helps them make better decisions, which usually leads to a competitive advantage. Generally, good decision making needs a combination of three things: proper input parameters (information / data), common sense / experience, and a modicum of luck. However, an idiot will still be stupid, and although luck can occasionally be bought in the Czech Republic, there's the threat of being arrested for bribery. That's why information remains the most influenceable component of success. In my playground, the correct information means answers to the most penetrating questions you can think of.

    I assume that each of you knows how much money you have in your personal bank account. Most of us also know how much we spend per month. Fewer of you will know exactly what it was for. An even smaller group will know the structure of all the pleasant cups of coffee, ice cream, wine, lunches and so on (we call it the long tail). I would bet that almost no one knows their personal annual trend in the cost structure of that long tail. You'll probably argue that you don't care. But if you're a company that wants to succeed, you can't do without such information. As for one's personal life, the biggest nutcase, as I see it, is Stephen Wolfram, who has been measuring almost everything since 1990. He wrote about almost everything except the lint from his bellybutton (unlike Graham Barker :)

    Because nobody broadcasts the executive summary of your accounting, CRM, Google Analytics or social networks on TV after the evening news, you're forced to build the various reports and dashboards yourself.

    I'll try to summarize the tools I know are available; but in the end I'll tell you that it's all just a toy gun, and whoever wants a proper data gun must reach for GoodData. To be fair, I'll do my best to argue the point a bit :)

    Excel

    Today, Excel is on every corner. It's a good helper, but quite a lot of people have a strange tendency to turn themselves into Excel Engineers, which is the most dangerous expertise you can come across. The Excel Engineer's skills often end with a pivot table and a SUMIF() formula. At the same time, the company's data processing becomes tied to him and, perhaps unwittingly, he becomes a brake on progress. The biggest risks of reporting in Excel, in my opinion, are as follows:

    1. The primary data, from which reports are made, are stored in Excel; someone imported those data into Excel at some point, and updating them is either painful or expensive
    2. Excel sheets tend to travel around corporate Outlooks, leading to different versions. It often comes in handy to tweak YTD% a little, or another department may easily end up with the same Excel but different numbers - it undermines confidence in the reports and can easily allow for a distortion of reality
    3. Complicated reports must be created by the reporting department (only they know how to update the data - see point 1), where Excel experts provide answers to business questions they don't always understand. So it often happens that the ad-hoc responses to your ad-hoc hypotheses take days to put together (and submitter burnout sets in)
    4. The combination of manual operations and macros made by someone who doesn't work here anymore introduces errors into Excel, thanks to which the cosmoverse then collapses!

    It's probably obvious that Excel reporting should end at the level of a sole trader. Nothing reliable can be created with it efficiently. You can be sure that the Excel files on the Z: drive (a network drive, of course!) contain errors, are not up to date, and were made by people who were assigned to the job by someone else, so they knew damn all about the nature of the data they fed into the VLOOKUP! Excel Engineers usually don't have data discovery in their genes, and even if they came across something interesting, they probably wouldn't notice. You know best what the correct information is at any given moment (and Excel at the level of VBA macros and dirty hacks isn't really what you should be mastering in 2013)!

    Visualization

    Today's market is oversaturated with tools that aim to help you visualize some kind of business information. Imagine business information as the number of orders per today, the net margin for the last hour, the average profit per user, etc. In the majority of cases it works in the following way: you calculate this information on your side and send it automatically via an interface to a service that ensures the given metric is presented. Examples of such services are Mixpanel, KissMetrics, StatHat, GeckoBoard and even KlipFolio. The advantage compared to Excel lies especially in the fact that the reports and dashboards can be easily automated and then shared. Information sharing is quite underrated! An example of such information could be the number of data transformations that are executed in minute granularity at our staging layer:
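    (Just to illustrate, that per-minute number can come from something as simple as the query below; the transformations table and its started_at column are hypothetical names, not our actual schema. The result set is what you would push to the dashboard service.)

    SELECT
         DATE_FORMAT(started_at, '%Y-%m-%d %H:%i:00') AS minute,
         COUNT(*) AS transformations_executed
    FROM transformations
    GROUP BY 1
    ORDER BY 1;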

    You can build dashboards from these reports, and for a while you will feel good. The problem occurs when you find out that any extension of such a dashboard requires intervention from your programmers, and the more complex your questions, the more complicated the intervention. If you operate in B2C and have transactional data, you can be sure that the clinical death of this form of reporting will be, for example, a query for the number of customers who over time spent at least 20% more than the average order for the previous quarter, and who at the same time bought an ABC product this month for the first time. If your programmers, by some luck, manage to implement it, they'll blow their brains out once you add that you want daily numbers of the TOP 10 customers from each city who meet the previous rule. If you have just a few more transactions, it will mean remaking the existing DB on your side, which will eventually lead to a 100% collapse. Even if you try to keep it alive at all costs, you can be sure that such zero flexibility won't let you slay the competition - you won't even be able to gently take the analytical helm, because the market will pivot around you.
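    (To show why programmers hate that kind of request, here is one possible hand-written reading of it in SQL; the orders table and its columns are hypothetical, and a real implementation would still have to pin down what "spent more" and "previous quarter" exactly mean. The daily TOP-10-per-city variant would be yet another, even hairier query on top of this one.)

    SELECT COUNT(DISTINCT cur.customer_id)
    FROM orders AS cur
    JOIN (
         -- average order value over (roughly) the previous quarter
         SELECT AVG(amount) AS avg_amount
         FROM orders
         WHERE created_on >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH)
           AND created_on <  CURDATE()
    ) AS prev ON cur.amount >= 1.2 * prev.avg_amount
    WHERE cur.product_code = 'ABC'
      AND cur.created_on >= DATE_FORMAT(CURDATE(), '%Y-%m-01')    -- this month
      AND NOT EXISTS (                                            -- first ABC purchase ever
           SELECT 1
           FROM orders AS prior
           WHERE prior.customer_id = cur.customer_id
             AND prior.product_code = 'ABC'
             AND prior.created_on < DATE_FORMAT(CURDATE(), '%Y-%m-01')
      );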

    It's possible you don't have similar questions about your business, and it doesn't bother you. The cruel truth is that your competition is asking them right now, and you will have to respond to that somehow...

    Pseudo BI

    Neither Excel nor visualization tools usually have any sophisticated back-end, and the same applies to services like Domo or Jolicharts. They look super sexy at first glance, but inside is a disguised set of visualization tools, sometimes coated with a few statistical features that you mostly won't use. The common denominator is the absence of a language with which you could step outside the predefined dashboards and start bending these services to your own benefit.

    Their only advantage is that they can be implemented quickly. Unfortunately, that's it, and after a short intoxication period, sobriety sets in. If you happen to be even a little more demanding, you don't stand a chance of a very happy life.

    Low Level Approach

    There are services that let you upload data and run queries. As I see it, the hottest one nowadays is Google BigQuery. For us at Keboola, it's a tremendous help with data transformation, denormalization and JOINs of huge tables. It can serve you well if writing the following seems like a good idea to you:

    ...to get this...:

    It's evident that if you don't make a living as an SQL consultant and have no ambition to create your own analytical service, you'd better leave this approach to nerds (like us!) and attend to your own business :)

    Cloud BI

    If you google cloud BI, Google will return names like Birst, GoodData, Indicee, Jaspersoft, Microstrategy, Pentaho, etc. (if Zoho Reports is among the results, the universe has gone crazy, because that one should have stayed in Asia :).

    From many trends it's obvious that the Cloud is what moves today's world. In the Czech Republic, the most common concern about the concept is worry about the data, and the feeling that "my IT can do it better than the vendor's". If you share those concerns, you should know that when any trouble arises in the Cloud, the best people available on this planet start working on it immediately, so that everything runs like clockwork again. Dave Girouard (coincidentally also a board member of GoodData) summed it up nicely in this article.

    Except for Microstrategy, which probably discovered the Cloud this morning, the above-mentioned brands are relatively established in the Cloud. However, different surprises hide under the lid. Pentaho requires highly technical knowledge to get the most out of it. Jaspersoft is Excel on the web that, in short, failed. Indicee would like to play in the major leagues, but I know at least one large customer from Vancouver who, after a year of trying to implement their solution, moved to GoodData. When I tried Birst, it was all in Flash, and despite my enormous effort I really didn't understand it :(

    As I said in the beginning, everything except GoodData sucks. There are several reasons for this:

    1. GoodData has a powerful language for defining metrics. With this language, anyone can build reports, no matter how complicated. The fact that these reports are created not only by clicking is more than essential - it gives you the flexibility you'll need to fight for first place against your competition. If GoodData satisfies Tomáš Čupr (ex-Slevomat, DámeJídlo.cz), you can be sure it will suit you as well. Even constructs that look complex at first glance can be learned quickly at the Keboola Academy.
    2. GoodData, unlike its competitors, has an API designed from the ground up to enable companies such as Keboola to bend the whole analytical platform so that it plays first violin in your environment. Seamless integration with other information systems, white-labeling, single sign-on and a framework for data extraction and transformation mean that there are no compromises during implementation.
    3. GoodData isn't just reports in a web browser but an entire set of abstractly separated functional layers (from a physical model representing the data up to a logical model representing the business relationships), thanks to which the implementation doesn't include things like a feasibility study or a technical specification. Compared with the competition, GoodData can be implemented with tremendous speed (no months-long projects).
    4. GoodData has a secret lab in Brno where R&D is taking place, and its output is innovations I'm not sure I can make public today. Nevertheless, I can honestly say that the others will soon shit their pants over them. I'll definitely add more here in time!

    All in all, the quality of GoodData shows, among other things, in who connects to it - for example Zendesk.com (the biggest customer-support service in the world). Such flexibility is, from my point of view, absolutely essential for future success. Any one of you can rent high-performance servers, design a super-cool UI or program specific statistical functions (or perhaps borrow them from Google BigQuery), but in the foreseeable future no one else will come out with a comprehensive concept that makes sense and scales from small dashboards (we have a client who uses GoodData just to look at some data from Facebook Insights) all the way to gigantic projects with a six-digit $ budget for the first implementation phase alone.

    GoodData Rocks! 

    Howgh!