The goal of digital transformation remains the same as ever – to become more data-driven. We have learned how to gain a competitive advantage by capturing business events in data. Events are data snapshots of complex activity sourced from the web, customer systems, ERP transactions, social media, IoT, streaming, and even machine-generated data. By collecting and processing event data in real time, managers gain the situational awareness to make better decisions.
Data-driven applications enrich our understanding of business events because they leverage more data. To accomplish this, next-generation apps that incorporate machine learning (ML) and artificial intelligence (AI) require schema flexibility and the ability to process very large amounts of data affordably. The goal is to raise the bar on a ‘single version of the truth’ and create advanced processes that improve business outcomes.
Event Data Capture
Event data improves visibility. Enterprise data warehouses that rely on canonical, top-down schemas often fail to describe business events adequately. For example, customer order transactions are sorted and analyzed, but what else do we know about these events? Was the customer referred, and by whom? Did the order come from a mobile app, the web, or a retail store? What else might the customer want to buy? Event data capture builds context and enables better decisions with more predictable results.
Event data capture may also involve large-scale data collection. Structured data from transaction systems provides only a partial picture. Today, up to 80% of enterprise data is unstructured or semi-structured and includes images, email, social media, audio, and video. To establish a ‘single version of the truth’ for a particular business event, data collection must span everything known about that event: structured records, files, streaming data, machine logs, and raw files.
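As a minimal sketch of what such an event record could look like (the class, field names, and URIs below are hypothetical, not any specific product's schema), a schema-flexible structure might pair the structured transaction with its surrounding context and pointers to raw sources:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class BusinessEvent:
    """A schema-flexible record describing a single business event."""
    event_id: str
    event_type: str                                           # e.g. "customer_order"
    occurred_at: datetime
    structured: dict[str, Any] = field(default_factory=dict)  # ERP/OLTP fields
    context: dict[str, Any] = field(default_factory=dict)     # referral, channel, etc.
    attachments: list[str] = field(default_factory=list)      # URIs to raw files, logs, media

order_event = BusinessEvent(
    event_id="evt-0001",
    event_type="customer_order",
    occurred_at=datetime.now(timezone.utc),
    structured={"order_id": "SO-1042", "amount": 129.95, "currency": "USD"},
    context={"channel": "mobile_app", "referrer": "spring_campaign"},
    attachments=["s3://events/raw/evt-0001/clickstream.json.gz"],
)
```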
You might be thinking, “that sounds like a lot of data!” Scalability is usually discussed in simple terms, such as how many petabytes we can support. Yet simple bulk scaling to petabytes often produces massive systems that become so big they are less usable. Petabyte file stores become inefficient when you are seeking fine-grained results. But when we scale logically into more discrete and specific namespaces, we can describe data better and optimize processing more effectively. The scalability challenge has therefore evolved from how many petabytes we can support to how many namespaces we can manage.
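A minimal sketch of logical namespacing, assuming object storage with key prefixes (the domain names and key layout are illustrative): partitioning keys by domain, event type, and date lets downstream jobs list or query one narrow namespace rather than scan an entire petabyte-scale store.

```python
from datetime import datetime

def object_key(domain: str, event_type: str, occurred_at: datetime, event_id: str) -> str:
    """Build a narrow, predictable key prefix: domain/event_type/yyyy/mm/dd/id."""
    return f"{domain}/{event_type}/{occurred_at:%Y/%m/%d}/{event_id}.json"

# A job that needs one day of orders lists the prefix "sales/customer_order/2024/05/17/"
# instead of scanning every object in the store.
print(object_key("sales", "customer_order", datetime(2024, 5, 17), "evt-0001"))
```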
Cloud Information Architectures
With so many infrastructure requirements changing, the data-driven enterprise requires a new information architecture to achieve digital transformation. This new information architecture ingests any data, uses object storage to hold bulk data at the lowest cost, and scales horizontally on clusters of commodity infrastructure. And of course, the architecture must be real-time, since data loses value quickly as it ages.
The rise of multi-cloud, data-first architecture, and of the broad portfolio of advanced data-driven applications it has enabled, requires cloud data management systems to collect, manage, and govern enterprise data and to build the pipelines that channel it. Cloud Data Architectures span private, multi-cloud, and hybrid cloud environments, connecting with transaction systems, file servers, the Internet, and multi-cloud repositories.
Cloud data platforms are the centerpiece of cloud data management programs, managing uniform data collection and storing data at the lowest cost. Archives, data lakes, and content services enable cloud migration projects to connect, ingest, and manage any type of data from any source, including legacy systems, mainframes, ERP, and even SaaS environments like Salesforce or Workday, which have become the new systems of record.
Data migrated to the cloud is often stored “as-is” in buckets to avoid heavy-lift ETL processes. The goal is to establish real-time data pipelines that support data-driven applications. When “as-is” data will not meet application requirements, enterprise data lakes are used to cleanse and transform raw data in preparation for further processing. Data preparation provides critical data quality measures, including data profiling, data cleansing, data transformation, data enrichment, and data modeling.
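A minimal sketch of those preparation steps, assuming simple order records pulled “as-is” from a bucket (field names and rules are hypothetical): profile for missing values, cleanse and transform what can be repaired, and enrich with sensible defaults.

```python
raw_records = [
    {"order_id": "SO-1042", "amount": "129.95", "currency": "usd", "channel": None},
    {"order_id": "SO-1043", "amount": None,     "currency": "USD", "channel": "web"},
]

def profile(records):
    """Count missing values per field (a simple data-profiling measure)."""
    missing = {}
    for rec in records:
        for key, value in rec.items():
            if value is None:
                missing[key] = missing.get(key, 0) + 1
    return missing

def prepare(rec):
    """Cleanse, transform, and enrich a record; drop it if it cannot be repaired."""
    if rec["amount"] is None:
        return None                              # cleansing: reject incomplete rows
    return {
        **rec,
        "amount": float(rec["amount"]),          # transformation: normalize types
        "currency": rec["currency"].upper(),     # cleansing: standardize codes
        "channel": rec["channel"] or "unknown",  # enrichment: fill a default
    }

print(profile(raw_records))                      # {'channel': 1, 'amount': 1}
prepared = [r for r in (prepare(rec) for rec in raw_records) if r is not None]
```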
Data Pipelines and Metadata Management
A data pipeline is a series of data flows in which the output of one element is the input of the next. Data lakes serve as the collection and access points in a data pipeline and are responsible for access control. As data pipelines emerge across the enterprise, enterprise data lakes become data distribution hubs with centralized controls to federate data across networks of data lakes. Data federation centralizes Metadata Management, Data Governance, and compliance control while enabling decentralized data lake operations.
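A minimal sketch of that chaining idea in code (the stages and records are illustrative, not any particular pipeline framework): each stage consumes the previous stage's output and hands its own output to the next.

```python
from functools import reduce
from typing import Callable, Iterable

Stage = Callable[[Iterable[dict]], Iterable[dict]]

def ingest(_) -> Iterable[dict]:
    # First stage: produce raw records (here, hard-coded for illustration).
    yield {"order_id": "SO-1042", "amount": 129.95, "country": "US"}
    yield {"order_id": "SO-1044", "amount": -5.00, "country": "DE"}

def validate(records: Iterable[dict]) -> Iterable[dict]:
    # Second stage: filter out records that fail a basic rule.
    return (r for r in records if r["amount"] > 0)

def enrich(records: Iterable[dict]) -> Iterable[dict]:
    # Third stage: add a derived attribute to each surviving record.
    return ({**r, "region": "EMEA" if r["country"] == "DE" else "AMER"} for r in records)

def run_pipeline(stages: list[Stage]):
    # Feed each stage's output into the next stage.
    return reduce(lambda data, stage: stage(data), stages, None)

for record in run_pipeline([ingest, validate, enrich]):
    print(record)
```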
Metadata Management provides a view of the entire data landscape (including structured, semi-structured, and unstructured data) and helps users understand their data better. Analysts classify, profile, and establish consistent descriptions and business context for the data. Metadata Management enables users to explore their data landscape in three ways, sketched briefly after the list below:
Data lineage helps users understand the data lifecycle, including a history of data movement and transformation. Data lineage simplifies root cause analysis by tracing data errors and improves confidence in processing by downstream systems.
Data catalog is a portfolio view of data inventory and data assets. Users browse for the data they need and can evaluate it for intended uses.
Business glossary is a list of business terms with their definitions. Data Governance programs require that business concepts for an organization be defined and used consistently.
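As a minimal, hypothetical sketch of the three views above (the dataset names, owners, and structures are assumptions, not a specific catalog product's metadata model):

```python
# Catalog entry: what the dataset is and who owns it.
catalog_entry = {
    "dataset": "sales.orders_clean",
    "description": "Cleansed customer order events, refreshed hourly",
    "owner": "sales-data-team",
    "classification": "internal",
    "columns": ["order_id", "amount", "currency", "channel"],
}

# Lineage: where the data came from, what happened to it, and who consumes it.
lineage = {
    "dataset": "sales.orders_clean",
    "upstream": ["erp.sales_orders", "web.clickstream_raw"],
    "transformations": ["deduplicate", "normalize_currency"],
    "downstream": ["analytics.revenue_dashboard"],
}

# Business glossary: shared definitions of the terms the data describes.
business_glossary = {
    "Order": "A confirmed customer request to purchase goods or services.",
    "Channel": "The customer touchpoint that originated an event (web, mobile app, retail).",
}
```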
Cloud Data Management for Compliance
Cloud Data Management also provides consumer data privacy and Data Governance controls that are essential to reduce the risks involved in handling bulk data. Information Lifecycle Management (ILM) manages data throughout its lifecycle and establishes a system of controls and business rules, including data retention policies and legal holds. Security and privacy tools such as data classification, data masking, and sensitive data discovery help achieve compliance with Data Governance frameworks and regulations such as NIST 800-53, PCI DSS, HIPAA, and GDPR. Consumer data privacy and Data Governance are not only essential for legal compliance; they improve data quality as well.
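A minimal sketch of one such control, rule-based data masking (the field names and masking rules are assumptions, not any specific tool's behavior):

```python
import re

SENSITIVE_FIELDS = {"email", "ssn", "card_number"}

def mask_value(field: str, value: str) -> str:
    if field == "email":
        user, _, domain = value.partition("@")
        return f"{user[0]}***@{domain}"
    # Mask all but the last four characters of identifiers like SSNs or card numbers.
    return re.sub(r".(?=.{4})", "*", value)

def mask_record(record: dict) -> dict:
    """Return a copy of the record with sensitive string fields masked."""
    return {
        k: mask_value(k, v) if k in SENSITIVE_FIELDS and isinstance(v, str) else v
        for k, v in record.items()
    }

print(mask_record({"order_id": "SO-1042", "email": "jane@example.com",
                   "card_number": "4111111111111111"}))
# {'order_id': 'SO-1042', 'email': 'j***@example.com', 'card_number': '************1111'}
```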
CIOs committed to digital transformation should start with a data-first architecture to successfully interoperate with the cloud and its vast network of data and web services. The goal is to describe business events better using data pipelined from OLTP systems, file stores, databases and mail servers. Whether hosting data lakes, enterprise archives or running NoSQL applications, data-first architecture requires Cloud Data Management to deliver the essential services for successful data-driven applications.