There’s a general consensus throughout the data ecosystem that Data Preparation is the most substantial barrier to capitalizing on data-driven processes. Whether organizations are embarking on Data Science initiatives or simply feeding an assortment of enterprise applications, the cleansing, classifying, mapping, modeling, transforming, and integrating of data is the most time-honored (and time-consuming) aspect of the process.
Approximately 80 percent of data scientists’ work is mired in Data Preparation, leaving roughly 20 percent of their time for actually exploiting data. Moreover, the contemporary focus on external sources, Big Data, and social and mobile technologies has swelled the volume of semi-structured and unstructured data, which accounts for nearly 80 percent of today’s data and further slows preparation.
Although the Data Preparation problem is readily apparent, the solution is much less so. Several self-service and automated options populate the marketplace, with varying degrees of effectiveness. The sustainable answer, however, is more straightforward: standardization. By standardizing their data at the point of ingestion with uniform data models, vocabularies, and terminology, organizations minimize Data Preparation work by doing it upfront, instead of redoing it every time new sources are added or requirements change.
The Knowledge Graph approach harmonizes data in this fashion, delivering an impressive range of benefits: organizations maximize the value of their data scientists and data engineers, increase those workers’ job satisfaction, and induce an organization-wide shift in focus from preparing data to consuming it for informed action.
Simplified Modeling
The fundamental way Knowledge Graphs accelerate Data Preparation is by harmonizing data according to standard data models. Conventional preparation processes devote an inordinate amount of attention to schema, both that of the original source data and that of the downstream applications or analytics tools. Variation in schema results in lengthy modeling periods, complex mapping procedures, and slow transformation and integration. Knowledge Graphs simplify these measures by harmonizing all data formats and schema to standardized models applicable to structured, semi-structured, and unstructured data. These models expand to accommodate additional source data or business needs while preserving schema uniformity across the enterprise.
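To make this concrete, consider a minimal sketch in Python using the open-source rdflib library. The model namespace and its Customer, name, and orderTotal terms are hypothetical stand-ins for a governed ontology, and the two source records stand in for a relational export and a JSON feed with divergent schema.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import XSD

# Hypothetical shared-model namespace; a real deployment would publish
# its own ontology IRIs.
MODEL = Namespace("http://example.org/model#")

g = Graph()
g.bind("model", MODEL)

# Source 1: a relational export with its own column names.
crm_row = {"cust_id": "C-101", "full_name": "Ada Lovelace", "spend": "1200.50"}

# Source 2: a JSON feed describing the same concept under a different schema.
web_event = {"userId": "C-101", "displayName": "Ada Lovelace", "lastOrderTotal": 1200.50}

def map_customer(source_id, name, total):
    """Map any source's customer fields onto the single shared model."""
    customer = URIRef(f"http://example.org/customer/{source_id}")
    g.add((customer, RDF.type, MODEL.Customer))
    g.add((customer, MODEL.name, Literal(name)))
    g.add((customer, MODEL.orderTotal, Literal(float(total), datatype=XSD.decimal)))
    return customer

# Both sources converge on identical triples, so the graph deduplicates
# them and integration happens implicitly.
map_customer(crm_row["cust_id"], crm_row["full_name"], crm_row["spend"])
map_customer(web_event["userId"], web_event["displayName"], web_event["lastOrderTotal"])

print(g.serialize(format="turtle"))
```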
Consequently, transformation and mapping become more holistic, quicker, and automatable; Data Integration follows implicitly from aligning this information on Knowledge Graphs via these models. Organizations can therefore integrate more data sources faster than they previously could, enabling more comprehensive Data Discovery relevant to business use cases. Social media sentiment is readily modeled alongside data from financial reports or spreadsheets, for example, to influence trading decisions. These standardized ontologies reduce the complexity of what is perhaps the most exacting facet of Data Preparation, harmonizing data for application or analytics consumption, by facilitating uniform procedures for mapping, transforming, and integrating data.
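Continuing the sketch above, integration then reduces to querying one graph. The sentimentScore property below is again illustrative; the point is that a single SPARQL query spans what were separate sources.

```python
# Continues the sketch above: add an illustrative sentiment triple,
# then retrieve customer data and sentiment together in one pass.
g.add((URIRef("http://example.org/customer/C-101"),
       MODEL.sentimentScore, Literal(0.72, datatype=XSD.float)))

for row in g.query("""
    PREFIX model: <http://example.org/model#>
    SELECT ?name ?total ?sentiment WHERE {
        ?c a model:Customer ;
           model:name ?name ;
           model:orderTotal ?total .
        OPTIONAL { ?c model:sentimentScore ?sentiment }
    }
"""):
    print(row.name, row.total, row.sentiment)
```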
Classifications and Terminology
The other way Knowledge Graphs expedite Data Preparation is by standardizing data classifications according to departmental or enterprise-spanning vocabularies and taxonomies. Organizations thus ensure a consistent meaning for the various business concepts associated with data, which in turn is reinforced by the deployment of common data models. Moreover, that meaning is grounded in business terminology and in an understanding of data in relation to defined business purposes.
In healthcare, for example, these uniform taxonomies and vocabularies standardize the various billing codes and the medical procedures to which they relate. In e-commerce or retail, they apply to products, services, and the outcomes of customer actions such as purchases, returns, and helpdesk interactions.
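Here is a hedged sketch of how such a vocabulary might be encoded, using the W3C SKOS vocabulary via rdflib; the concept IRIs and the billing code shown are illustrative stand-ins for a governed code system.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# Hypothetical vocabulary namespace; actual deployments would reference
# their own IRIs or a published code system.
VOCAB = Namespace("http://example.org/vocab#")

tax = Graph()
tax.bind("skos", SKOS)

# Each concept carries one preferred label; codes and synonym spellings
# are folded in as alternate labels so every variant resolves to the
# same concept.
tax.add((VOCAB.Appendectomy, RDF.type, SKOS.Concept))
tax.add((VOCAB.Appendectomy, SKOS.prefLabel, Literal("Appendectomy", lang="en")))
tax.add((VOCAB.Appendectomy, SKOS.altLabel, Literal("44950")))  # illustrative billing code
tax.add((VOCAB.Appendectomy, SKOS.broader, VOCAB.AbdominalSurgery))
tax.add((VOCAB.AbdominalSurgery, RDF.type, SKOS.Concept))
tax.add((VOCAB.AbdominalSurgery, SKOS.prefLabel, Literal("Abdominal surgery", lang="en")))
```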
All of these types of data are semantically tagged so they’re easily classified and understood by users. Traditional preparation measures are considerably slowed by the classification process, in which data scientists or engineers must frequently consult business users to understand data’s meaning. By standardizing the words associated with business concepts via departmental or organizational taxonomies, organizations hasten classification. Data scientists and engineers therefore spend less time clarifying data’s meaning, which both improves and accelerates Data Quality efforts, one of the desired aims of effectual Data Preparation.
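Under that arrangement, resolving a raw label found in source data becomes a lookup against the taxonomy rather than a conversation with a business user. A minimal sketch, continuing the taxonomy graph above (the resolve helper is hypothetical):

```python
# Continues the `tax` graph built above; Literal comes from rdflib.
def resolve(label: str):
    """Return taxonomy concepts whose preferred or alternate label matches."""
    query = """
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT ?concept WHERE {
            ?concept skos:prefLabel|skos:altLabel ?l .
            FILTER (str(?l) = str(?label))
        }
    """
    return [row.concept for row in tax.query(query, initBindings={"label": Literal(label)})]

# A bare billing code resolves to the governed concept it denotes.
print(resolve("44950"))  # -> [URIRef('http://example.org/vocab#Appendectomy')]
```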
Liberating Data Science
More than any other aspect of Big Data, the time-consuming nature of Data Preparation correlates directly with the variety of data in today’s largely unstructured, external sources. Semantic standards effectively nullify that variation for the IT systems storing and deriving action from that data. More accurately, they harmonize those points of distinction with singular data models and terminology for a consistent meaning across the enterprise.
The Data Preparation ramifications of these standards are significant, particularly for Data Science, cognitive capabilities, and Machine Learning. Standardizing data not only decreases the time spent engineering data for such use cases but also increases Data Quality, yielding more accurate predictions. Semantic standards shift data science’s focus from preparing data to the actual science of using data, positively impacting the discipline as a whole.
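As a final hedged sketch: once sources conform to one model, assembling a Machine Learning feature table is a single query against the graph rather than per-source wrangling. This continues the customer graph built earlier; the chosen features are arbitrary examples.

```python
# Continues the customer graph `g`: one SPARQL query yields a feature
# matrix, with no per-source cleansing or mapping left at this stage.
rows = g.query("""
    PREFIX model: <http://example.org/model#>
    SELECT ?total ?sentiment WHERE {
        ?c a model:Customer ;
           model:orderTotal ?total ;
           model:sentimentScore ?sentiment .
    }
""")
features = [[float(r.total), float(r.sentiment)] for r in rows]
# `features` can now feed a scikit-learn style estimator directly.
```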