Click to learn more about author Pete Aven.
Data Wrangling is bad. Yet we all do it, every single day. In a nutshell, Data Wrangling, also known by the more technical term “data munging,” is the process of transforming data from one shape into another to prepare it for analysis and deliver unified results.
The practice has been normalized over time and is becoming more deeply entrenched in our behaviors and within our organizations. We must remember it is nothing to be embraced or accepted; it is indicative of a time when the volume and variety of data are greater than our ability as humans to keep up with them. It’s a horrible practice, and we need to find ways to stop. Better Data Management solutions can help.
The Problem with Data Wrangling
Why do we wrangle data? Because we need to unify information from disparate sources, be they databases, spreadsheets, applications, or filesystems, into some single view or report. Walk into any retail store and make a purchase. Chances are data about you and your visit will be stored across multiple systems. Each of these sources has some information about the entities and relationships a business cares about, such as customers, addresses, transactions, products, brands, etc. There is a lot of overlap in the data, but each source structures and labels it differently. Each source also has context-specific attributes the business cares about and wants to unify across sources, so it can create a single view of a customer or product.
For example, the “purchase date” column name variations across sources may include:
- purchaseDate
- transaction_date
- txDate
- prchsdt
And the values themselves are likely not rationalized:
- 6-20-2018
- 06/20/2018
- 20-JUN-2018 08:03
- 20/06/18
And so onward we wrangle. To bring the data into a cohesive structure with normalized values, we press on to varying degrees in Excel sheets, BI tools, code, and ETL (Extract, Transform, Load) pipelines so we can ask a simple question for a report such as, “What purchases were made on June 20, 2018?”
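To make the wrangling concrete, here is a minimal Python sketch of what that unification typically looks like in code. It assumes each source uses exactly one of the column names and date formats listed above; the alias table, the format table, and the `normalize_record` helper are all hypothetical names invented for illustration, and the month-name format assumes an English locale.

```python
from datetime import datetime

# Hypothetical mapping of source-specific column names to one canonical name.
COLUMN_ALIASES = {
    "purchaseDate": "purchase_date",
    "transaction_date": "purchase_date",
    "txDate": "purchase_date",
    "prchsdt": "purchase_date",
}

# Assumed date format for each source column. A value like 20/06/18 is
# ambiguous on its own, so the format has to be declared per source.
DATE_FORMATS = {
    "purchaseDate": "%m-%d-%Y",      # 6-20-2018
    "transaction_date": "%m/%d/%Y",  # 06/20/2018
    "txDate": "%d-%b-%Y %H:%M",      # 20-JUN-2018 08:03
    "prchsdt": "%d/%m/%y",           # 20/06/18
}


def normalize_record(record):
    """Rename known date columns and rewrite their values as ISO 8601 dates."""
    normalized = {}
    for column, value in record.items():
        if column in COLUMN_ALIASES:
            parsed = datetime.strptime(value, DATE_FORMATS[column])
            normalized[COLUMN_ALIASES[column]] = parsed.date().isoformat()
        else:
            normalized[column] = value
    return normalized


# One illustrative record per source system.
source_records = [
    {"purchaseDate": "6-20-2018", "store": "A"},
    {"transaction_date": "06/20/2018", "store": "B"},
    {"txDate": "20-JUN-2018 08:03", "store": "C"},
    {"prchsdt": "20/06/18", "store": "D"},
]

unified = [normalize_record(r) for r in source_records]

# "What purchases were made on June 20, 2018?"
june_20 = [r for r in unified if r["purchase_date"] == "2018-06-20"]
```

Even this toy version carries hand-maintained lookup tables that break the moment a fifth source or a new format appears, which is exactly the tax described next.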
But Data Wrangling demands payment! The penalty is exacted as a very real tax on all of us: mental stress, hours of effort, errors, and non-repeatable reporting. The percentage of hours each employee spends on Data Wrangling is a leading indicator of corporate inefficiency, traceable to its impact on the bottom line.
The Problem is Getting Worse
Contributing to this problem is the application-centric focus of businesses. Each application has its own requirements for communicating with data and really becomes a database unto itself. Cloud has made it simpler, less expensive, and quicker than ever to deploy new applications. This is the good news and the bad news.
Businesses generally require data from across applications to be shared for reporting, but in an application-centric approach, it is not. A data-centric approach to Data Management, one that provides an organizational view of data, is required, but getting there is a process, and a topic unto itself.
Towards Simpler Solutions
To start connecting data, a business view of data is required. However, there is a fundamental disconnect between how those on the business side think of data and how those on the technical side think of data.
To slow the proliferation of silos, and rapidly connect data from disparate sources, flexibly and incrementally, a graph database of some sort is absolutely required. While the choice of database is important, that’s not where the data discussion should begin. It’s an implementation detail.
If you start a meeting to gather business requirements with a blinking cursor, an ERD, or a database, you get a technical discussion that disenfranchises the business from further participation. If you only talk to IT (or any one group) about the solution, your perspective will be skewed. You need the right people in the room, and that includes non-technical leadership.
The right application can help bridge the gap so that a conceptual Organizational View of Data can be drawn by business and IT together in a visual whiteboard interface.
This conceptual model can capture the requirements of the business and then be made physical, and even connected to a larger graph of information, by mapping source data to the model. Because the conceptual model is itself made physical, requirements are not misunderstood, and nothing is lost in translation between the conceptual, logical, and physical layers of abstraction.
Data can be stored in a format that is understandable to humans, together with its metadata, information about the systems it resides on, and any other relevant context, in a single, cohesive Data Fabric. Done right, this fabric doesn’t become a silo unto itself, but the connective tissue between disparate sources.
A Data Fabric doesn’t replace silos. Many silos exist for very good reasons, be they technical, functional, geographic, or security-related. What’s missing between them is the bridge to provide a unified view and context. A Data Fabric augments data silos using a graph that is amenable to update and enrichment, without the toll that traditional Data Wrangling across tabular sources exacts.
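As a rough illustration of the idea, the sketch below links records from two silos through a shared customer node. It is a toy in-memory stand-in for a graph database, not any particular product; the `Graph` class, the node identifiers, the edge labels, and the sample records are all hypothetical.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory property graph, used only to illustrate the idea."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self.edges = defaultdict(set)   # node id -> {(label, target id)}

    def add_node(self, node_id, **props):
        # Adding to an existing node enriches it; no schema change is needed.
        self.nodes.setdefault(node_id, {}).update(props)

    def add_edge(self, source, label, target):
        self.edges[source].add((label, target))

    def neighbors(self, node_id, label):
        return sorted(t for (l, t) in self.edges[node_id] if l == label)


# Hypothetical rows from two silos that both know the same customer.
pos_rows = [{"id": "pos-1001", "email": "ann@example.com", "total": 42.5}]
loyalty_rows = [{"email": "ann@example.com", "tier": "gold"}]

fabric = Graph()
fabric.add_node("system:pos", kind="source")
fabric.add_node("system:loyalty", kind="source")

for row in pos_rows:
    customer = "customer:" + row["email"]
    purchase = "purchase:" + row["id"]
    fabric.add_node(customer, email=row["email"])
    fabric.add_node(purchase, total=row["total"])
    fabric.add_edge(customer, "made", purchase)
    fabric.add_edge(purchase, "stored_in", "system:pos")  # provenance

for row in loyalty_rows:
    customer = "customer:" + row["email"]
    # The same customer node is enriched; the silo itself is left in place.
    fabric.add_node(customer, loyalty_tier=row["tier"])
    fabric.add_edge(customer, "stored_in", "system:loyalty")
```

The silos keep their own records here; the graph only holds the connections, the context, and pointers back to the systems the data came from.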
Despite graph’s growing success, there still isn’t a lot of graph expertise out there. And when you start with a database to connect data, you essentially begin with a blinking cursor. But there are well-known patterns to the silos that are proliferating, and patterns emerging from experience for how we can best connect data from across these disparate sources. Emerging end-user data prep solutions and visual modeling applications are taking advantage of these patterns to simplify how we interact with data and graphs, and to invite business and non-technical users to participate in the data discussion.