The Economist Intelligence Unit report Big data evolution: forging new corporate capabilities for the long term, published earlier this year, provided insight into big data projects from 550 executives across the globe. When asked about their company's most significant challenges related to big data initiatives, executives put maintaining data quality, collecting and managing vast amounts of data, and ensuring good data governance in three of the top four spots (data security and privacy ranked third). Data availability and extracting value were actually near the bottom. This is a bit surprising, as ensuring good data quality and governance is critical to getting the most value from your data project.
Maintaining data quality
Having the right data, and accurate data, is instrumental to the success of a big data project. Depending on the focus, data doesn't always have to be 100% accurate to provide business benefit; numbers you can be 98% confident in are often enough to give you insight into your business. That being said, with the sheer volume of data and number of sources available for a big data project, maintaining quality is a big challenge. The first issue is ensuring that the original system of record is accurate (the sales rep updated Salesforce correctly, the person filled out the webform accurately, and so forth), as the data needs to be cleaned before integration. I've personally worked through CRM data projects; doing cleanup and de-duping can take a lot of resources. Once this is completed, procedures for regularly auditing the data should be put in place. With the ultimate goal of creating a single source of truth, understanding where the data came from and what happened to it is also a top priority. Tracking and understanding data lineage will help identify issues or anomalies within the project.
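To make the cleanup and lineage ideas a little more concrete, here is a minimal sketch in Python. It assumes contact records keyed by email address, and the field names (email, name, source) and the sample data are hypothetical, not taken from any particular CRM. Real de-duping usually needs fuzzier matching than this.

```python
def dedupe(records):
    """Merge records that share an email, ignoring case and stray whitespace.

    The first record seen for each email is kept, and every contributing
    source is appended to a 'lineage' list so we can trace where it came from.
    """
    merged = {}
    for rec in records:
        key = rec["email"].strip().lower()
        if key not in merged:
            # Copy the record, store the normalized email, and start its lineage.
            merged[key] = dict(rec, email=key, lineage=[rec["source"]])
        else:
            # Duplicate: keep the original values, but record the extra source.
            merged[key]["lineage"].append(rec["source"])
    return list(merged.values())


# Hypothetical sample data from two systems of record.
contacts = [
    {"email": "Ann@Example.com ", "name": "Ann", "source": "crm"},
    {"email": "ann@example.com", "name": "Ann L.", "source": "webform"},
    {"email": "bob@example.com", "name": "Bob", "source": "crm"},
]
clean = dedupe(contacts)
```

Keeping the lineage list alongside each record is one simple way to answer "where did this come from?" when an anomaly shows up later.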
Collecting and managing vast amounts of data
Before the results of a big data project can be realized, processes and systems need to be put into place to bring the data together. With data living in databases, cloud sources, spreadsheets and the like, combining these disparate sources into a single database or trying to fuse incompatible sources can be complex. Typically, this process consists of using a data warehouse plus an ETL tool, or a custom solution, to cobble everything together. Another option is to create a networked database that pulls in all the data directly, although this route also requires a lot of resources. One of the challenges with these methods is the amount of expertise, development and resources required, spanning from database administration to expertise in using an ETL tool. Unfortunately, it doesn't end there; this is an ongoing process that will require regular attention.
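As a rough illustration of what the extract, transform and load steps involve, the sketch below pulls rows from two made-up sources (a spreadsheet export and a cloud API response), converts them to one schema, and loads them into an in-memory SQLite table. The column names, the sample values and the sales table are all assumptions for the example; a production pipeline would add error handling, scheduling and incremental loads.

```python
import csv
import io
import sqlite3

# Hypothetical inputs: a spreadsheet export and rows from a cloud source.
SPREADSHEET_CSV = "customer,amount_usd\nAcme,120.50\nGlobex,80\n"
CLOUD_ROWS = [{"account": "Initech", "total_cents": 4500}]


def extract_transform():
    """Yield (customer, amount_usd, source) tuples in one common schema."""
    for row in csv.DictReader(io.StringIO(SPREADSHEET_CSV)):
        yield (row["customer"], float(row["amount_usd"]), "spreadsheet")
    for row in CLOUD_ROWS:
        # The cloud source reports cents, so convert to dollars.
        yield (row["account"], row["total_cents"] / 100, "cloud")


def load(rows):
    """Create the target table and insert the unified rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (customer TEXT, amount_usd REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load(extract_transform())
total = conn.execute("SELECT SUM(amount_usd) FROM sales").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Even in this toy version, most of the work is in the transform step, where each source's quirks (different column names, different units) have to be reconciled.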
Ensuring good data governance
In a nutshell, data governance is the set of policies, procedures and standards an organization applies to its data assets. Ensuring good data governance requires an organization to have cross-functional agreement, documentation and execution. This needs to be a collaborative effort between executives, line-of-business managers and IT. These programs will vary based on their focus but will all involve creating rules, resolving conflicts and providing ongoing services. Verifications should be put into place that confirm the standards are being met across the organization.
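One lightweight way to verify that standards are being met is to express each rule as a small check and run it against incoming records. The sketch below is only an illustration: the two rules (a loosely validated email and a two-letter uppercase country code) are hypothetical standards, and an actual governance program would agree on its own.

```python
import re

# Hypothetical governance rules: field name -> check returning True when valid.
RULES = {
    "email": lambda v: isinstance(v, str)
    and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "country": lambda v: isinstance(v, str) and len(v) == 2 and v.isupper(),
}


def violations(record):
    """Return the sorted names of fields that break a rule (missing fields count)."""
    return sorted(
        field for field, is_valid in RULES.items() if not is_valid(record.get(field))
    )


# Hypothetical sample records to audit.
records = [
    {"email": "ann@example.com", "country": "US"},
    {"email": "not-an-email", "country": "usa"},
    {"country": "CZ"},
]

# Map record index -> broken rules, keeping only records with problems.
report = {}
for index, record in enumerate(records):
    broken = violations(record)
    if broken:
        report[index] = broken
```

A report like this can feed the regular audits, so that conflicts between teams about what "valid" means surface as rule changes rather than as silent data problems.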
Having a successful big data project requires a combination of planning, people, collaboration, technology and focus to realize maximum business value. At Keboola, we focus on optimizing data quality and integration, with the goal of providing organizations with a platform to truly collaborate on their data assets. If you're interested in learning more, you can check out a few of our customer stories.