Gone are the days when customers would place an order and patiently wait hours or even days for goods to be delivered, or when letters traveled by snail mail to reach their recipients. Today, businesses and individuals expect instant access to information and swift delivery of services. The same expectation applies to data, which has become a critical asset for making informed business decisions. Organizations must therefore ensure that information is not only available to users when needed, but also reliable and trustworthy. To meet that expectation, many are turning to data pipelines: series of steps that prepare enterprise data for analysis. Composed of various technologies, data pipelines validate, summarize, and find patterns in data so the business can make better decisions.
Unfortunately, the emphasis on technology has led data professionals to lose sight of the original goal: meeting business needs. Many discussions about modern data stacks revolve around comprehensive architectures comprising a multitude of products that supposedly cater to business users’ requirements. However, this technology-first approach often results in suboptimal and expensive solutions that take a significant amount of time to build. Moreover, such approaches may lack sustainability in the long run.
Consequently, organizations are shifting toward a decentralized approach for developing data outcomes where the responsibility is shared with the business domains that possess a deep understanding of their data. This approach not only removes bottlenecks for central IT teams, but also increases accountability. However, becoming business-outcome-first requires a thorough understanding of what the business truly needs. At the very least, organizations need to meet certain minimum standards and expectations to enable effective decision-making, including:
- Creating high-quality and accurate data that can be trusted by business users
- Enabling personalized user experiences with self-service access to data
- Providing reliable data infrastructure and subsystems that operate seamlessly
- Maintaining data privacy and security policies to comply with regulatory requirements
- Supporting high-performance data analysis for current and future use cases
- Adhering to cost estimates and providing transparency into the value created
While these requirements may seem straightforward, they pose significant challenges in practice. The current approach typically involves IT teams cobbling together complex architectures by integrating multiple software products. This becomes even more problematic when dealing with diverse data sources, processing tools, and consumption platforms spread across on-premises environments and multiple clouds.
The IT-centric approach frustrates business users who are now leading efforts to modernize their data infrastructure. While IT professionals debate the pros and cons of bundled versus unbundled approaches, business teams question the value, time, cost, and effort required to meet their needs. The lack of clear guidance on how to modernize exacerbates the confusion. However, recent developments are helping businesses establish strong data pipelines to address these challenges:
Time-to-value: Building data pipelines involves significant integration overhead because the products involved lack common industry standards. This complexity and cost increase further as new Software-as-a-Service (SaaS) data sources emerge. To mitigate these challenges, organizations are adopting cohesive platforms that pre-integrate the basic building blocks, reducing integration effort and accelerating time-to-value.
Reliability: Pipelines composed of disparate products often lack transparency into data health as data moves from sources to targets, resulting in brittle pipelines and a lack of accountability. To address this, the data observability category has seen a surge in product offerings. Data observability introduces proactive monitoring and alerting to identify anomalies, such as unexpected drops in volume or stale data, and to ensure reliable data flows (a simple sketch of such checks follows these points).
Quality: Inefficiencies in data infrastructure have led organizations to build data silos, perpetuating poor data quality. Manually fixing data quality issues downstream is no longer viable. Consequently, data mesh and data product approaches are gaining popularity, promoting domain ownership and shifting development responsibilities to business teams. This decentralization eliminates bottlenecks that typically occur within overtaxed data engineering teams.
Skills: Modern data infrastructures demand a diverse set of expertise, but the focus should always remain on business outcomes. Automating non-value-added tasks while keeping humans in the loop to preserve context is crucial. In addition, new skills such as product management are becoming increasingly important within data teams.
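To make the observability and quality points above concrete, here is a minimal sketch of the kind of automated checks such tools run on every pipeline batch. It is illustrative only: the check_batch function, the customer_id and updated_at fields, and the thresholds are hypothetical, and a real deployment would use a dedicated observability or data quality product rather than hand-rolled code.

```python
# Minimal, illustrative observability checks for one pipeline batch.
# Table fields, thresholds, and alerting behavior are hypothetical.
from datetime import datetime, timedelta, timezone

def check_batch(rows, expected_min_rows, max_null_rate, max_staleness):
    """Return a list of human-readable issues found in a batch of records."""
    issues = []

    # Volume: a sudden drop in row count often signals an upstream failure.
    if len(rows) < expected_min_rows:
        issues.append(f"row count {len(rows)} is below the expected minimum {expected_min_rows}")

    # Completeness: key business fields should rarely be null.
    null_count = sum(1 for r in rows if r.get("customer_id") is None)
    null_rate = null_count / len(rows) if rows else 1.0
    if null_rate > max_null_rate:
        issues.append(f"customer_id null rate {null_rate:.0%} exceeds {max_null_rate:.0%}")

    # Freshness: stale data erodes trust even when the values are correct.
    newest = max((r["updated_at"] for r in rows), default=None)
    if newest is None or datetime.now(timezone.utc) - newest > max_staleness:
        issues.append("latest record is older than the allowed staleness window")

    return issues

if __name__ == "__main__":
    batch = [
        {"customer_id": 1, "updated_at": datetime.now(timezone.utc) - timedelta(minutes=5)},
        {"customer_id": None, "updated_at": datetime.now(timezone.utc) - timedelta(minutes=7)},
    ]
    problems = check_batch(batch, expected_min_rows=100,
                           max_null_rate=0.01, max_staleness=timedelta(hours=1))
    for p in problems:
        # In practice this would notify the data product's owner or open an
        # incident rather than simply printing to the console.
        print("ALERT:", p)
```

Even a lightweight check like this, run automatically as data lands in each domain, surfaces problems at the source instead of leaving business users to discover them downstream in a dashboard.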
Failure to address these challenges results in reactive data teams, poor developer experiences, and unnecessary risks and costs for organizations. Therefore, a proactive approach is needed to overcome these hurdles effectively.
Will the Best Approach Please Stand Up?
Determining the best approach is not a straightforward task due to the multitude of standards and approaches available. Some key considerations include:
- Best-of-breed vs. integrated: The debate between a centralized (bundled or integrated) approach and a decentralized (unbundled or decoupled) approach is ongoing. An integrated approach has been prevalent in recent years but may lead to IT bottlenecks. On the other hand, the best-of-breed method offers specialized products but comes with higher integration overhead. Organizations need to align with their corporate standards and guidelines to determine the most suitable approach.
- Proprietary vs. open platform: Proprietary solutions provide peace of mind and superior user experiences but often come at a higher cost. Open-source products offer lower license costs and benefit from community contributions; however, they may introduce unforeseen risks. The decision between proprietary and open platforms depends on an organization’s IT skills maturity and risk tolerance.
- Control vs. managed: Some organizations, especially heavily regulated ones, prioritize control over their IT assets and have skilled staff to manage advanced technologies. Others, particularly medium to small-sized companies, prefer managed services to reduce operational burdens. Modern architectures with numerous moving parts often require managed services for effective operation and debugging.
- No-/low-code vs. programmatic: Different roles within an organization require varying levels of coding capabilities. Data scientists often prefer programmatic access to raw data using specific technical languages, while data analysts may rely on curated data. Non-technical roles may opt for no/low-code tools to interact with data through a semantic layer. A hybrid approach that supports these varying needs is crucial for enabling different personas within an organization.
In light of these considerations, a hybrid approach that combines the best aspects of different options proves to be the preferred choice. Organizations can create a business-led intelligent data architecture platform that unifies data and metadata, facilitating faster development of data products.
This option allows for centralized data infrastructure and metadata discovery while enabling decentralized development, and metadata use cases such as data quality and observability receive due attention from the outset. Ultimately, these intelligent data architecture platforms empower business users with timely, trustworthy information while keeping data secure.
To leverage data to its fullest and create solid, trusted data pipelines, organizations must recognize the importance of delivering data at the speed today’s fast-paced world expects. By embracing a business-outcome-first approach and adopting intelligent data architecture platforms, organizations can overcome these challenges, accelerate time-to-value, improve reliability and data quality, and put their data assets to work when needed to achieve a competitive advantage.