How can an organization thrive in the 2020s, a changing and confusing time with significant Data Management demands and platform options such as data warehouses, Hadoop, and the cloud? Trying to save money by patching the same old Data Architecture ends up pushing data uphill, making it harder to use. Rethinking how data is used, stored, and computed is a necessary step to get data back under control and into the technical environments best suited to moving business and data strategies forward.
William McKnight, President of the Data Strategy firm McKnight Consulting Group, offered his advice about the best data platforms and architectures in his presentation, Databases vs. Hadoop vs. Cloud Storage, at the DATAVERSITY® Enterprise Analytics Online Conference. McKnight explained that today’s Data Management needs call for leveling up to technology better suited to handling all data quickly and effectively. He said:
“Getting all data under control is the thing that I say frequently. It means making data manageable, well-performing, available to our user base, believable, advantageous for the company to become data-driven.”
Handling data well has become especially crucial for the future, a future where artificial intelligence (AI) augments business analysis and permeates operations. To work successfully, AI must have good-quality data for training, testing, and everyday use. Furthermore, this data needs to cover all types, not just the typical static tables and reports generated from Microsoft Excel. Dynamic data from call center recordings, chat logs, streaming sensors, and other sources plays a fundamental role in supporting AI initiatives and business needs.
Leveraging AI and data involves looking beyond what business reports exist now to why they exist and how different data types – including semi-structured and unstructured data – can enhance results. Companies take this next step by assessing how well their Data Architecture and technical programs utilize data. McKnight stresses, “I’ve seen this time and time again: firms overpaying for data because it is in the wrong platform.” Moving data into the right environments for better manipulation entails understanding a variety of technical solutions and how to fit the right ones into an enterprise’s Data Architecture.
Three Major Decisions
McKnight recommends making three significant decisions when considering a data platform for a Data Architecture:
- Data Store Type: Enterprises choose between two data storage options: databases and file-based scale-out systems. Databases, especially relational ones, thrive with organized data; relational database architecture makes up over 90% of business data solution purchases. File-based systems, like Hadoop, do better at storing big data, which includes unstructured and semi-structured data.
- Data Store Placement: Once a company chooses its data storage platforms, it needs to find a place to put them: on-premise or in the cloud, where third-party vendors host company information in their data centers. In the past, most enterprise data lived on site, but as data quantities keep growing exponentially, the cloud – especially the public cloud – can scale business data better off-site at less expense.
- Workload Architecture: Data requests vary. Firms need real-time data for business operations and for short, frequent transactions like sales and inventory. Companies also require post-operational data to analyze opportunities, forecast trends, and guide executive decision-making. Analytical workloads often produce longer, more complex queries requiring a very different kind of Data Architecture than operational tasks (the sketch after this list contrasts the two query shapes).
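To make the contrast concrete, here is a minimal Python sketch using SQLite; the sales table and its columns are hypothetical, chosen only to show a short operational transaction next to a scan-and-aggregate analytical query.

```python
# A minimal sketch contrasting the two workload shapes, using SQLite
# purely for illustration; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY,
    store_id INTEGER, product_id INTEGER,
    quantity INTEGER, amount REAL, sold_at TEXT)""")

# Operational workload: a short, frequent, single-row transaction
# that keeps the business running (recording one sale).
conn.execute(
    "INSERT INTO sales (store_id, product_id, quantity, amount, sold_at) "
    "VALUES (?, ?, ?, ?, ?)",
    (42, 1001, 2, 19.98, "2020-06-01T10:15:00"),
)
conn.commit()

# Analytical workload: a longer scan-and-aggregate query over history,
# the shape that favors a very different, warehouse-style architecture.
rows = conn.execute("""
    SELECT store_id, product_id,
           SUM(quantity) AS units, SUM(amount) AS revenue
    FROM sales
    WHERE sold_at >= '2020-01-01'
    GROUP BY store_id, product_id
    ORDER BY revenue DESC
    LIMIT 10
""").fetchall()
```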
Controlling Data with Both Data Warehouses and Big Data Technologies (Hadoop)
McKnight argues that both data warehouses and Hadoop need to factor into a company’s Data Architecture. Many firms understand the value of organizing data using relational database technologies. Data warehouses represent a must-have for a mid-size or large company because they provide a shared platform standardizing enterprise-wide data. Furthermore, warehouse data can be searched, reused, and summarized in addition to saving the cost of reconstructing the same schema time and again. But firms also need to consider new unstructured and semi-structured data types, which require big data architectures like Hadoop.
Businesses will want big data platforms for their data science and artificial intelligence projects, among others. Data lakes and Hadoop handle large volumes of broad enterprise data faster and more cheaply. Businesses may discount some of these newer data types, but some use cases demand them, including marketing campaigns, fraud analysis, road traffic analysis, and manufacturing optimization. Unstructured and semi-structured data has become a necessity, making Hadoop (and other data lake constructions) a business requirement alongside data warehouses.
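As one illustration of how a lake absorbs semi-structured data, the PySpark sketch below lands JSON chat transcripts in a Hadoop file system and curates them into columnar Parquet files; the paths and field names are hypothetical.

```python
# A minimal sketch of landing semi-structured data in a Hadoop-style
# data lake with PySpark; the paths and fields are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chat-log-ingest").getOrCreate()

# Semi-structured source: newline-delimited JSON chat transcripts.
# Spark infers a schema rather than requiring one up front, which is
# why lakes absorb these data types more cheaply than a warehouse.
chats = spark.read.json("hdfs:///landing/chat_logs/2020/06/*.json")

# Keep only the fields an AI or marketing use case needs, then store
# them as columnar Parquet files for later analysis.
(chats
    .select("session_id", "customer_id", "started_at", "messages")
    .write.mode("append")
    .parquet("hdfs:///lake/curated/chat_logs/"))
```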
Analytic Databases and Data Lake Storage in the Cloud
After choosing a data store type, businesses need to figure out a place to keep the data. McKnight sees full data life cycles in the cloud as a business necessity for leveling up Data Management, mostly through analytic databases and data lake storage.
McKnight has found, from twelve benchmark studies published in the last year, that analytical databases perform better in the cloud. He explained other cloud analytical database benefits, too:
“The cloud now offers attractive options: SQL robustness, better economics (pay-as-you-go), logistics (streamlined administration and management), and scalability (elasticity and the ability for cluster expansion in minutes).”
Cloud analytical databases have a more straightforward and flexible architecture that keeps up better with dynamic data at a lower cost.
In addition to putting analytical databases in the cloud, businesses benefit from keeping data lakes in cloud object storage. Cloud object storage groups discrete data units in a flat, non-hierarchical environment. This technology scales continuously and compresses data better than an on-premise data center, reducing data lake storage costs. Furthermore, data lakes that leverage cloud object storage separate ‘compute’ and ‘storage’ better, improving performance and the ability to tune, scale, or interchange compute resources.
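As a minimal sketch of that separation, the Python snippet below reads a Parquet-based data lake straight from object storage with pyarrow; the bucket name, path, and columns are hypothetical. The point is that the compute (this script) runs and scales independently of where the data sits.

```python
# A minimal sketch of querying a data lake kept in cloud object
# storage; the bucket and columns are hypothetical, and the compute
# (this script) can run and scale independently of the storage.
import pyarrow.dataset as ds

# Storage layer: Parquet files in an S3 bucket, with no cluster attached.
lake = ds.dataset("s3://example-corp-lake/events/", format="parquet")

# Compute layer: read only the columns and rows this job needs.
table = lake.to_table(
    columns=["event_type", "occurred_at", "amount"],
    filter=ds.field("event_type") == "purchase",
)
print(table.num_rows)
```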
Not all data belongs in the cloud. For example, some data queries and certain types of databases work better onsite. While data lakes and Hadoop work well as cloud storage, they retrieve data better on location through the Hadoop Distributed File System (HDFS). In McKnight’s experience, HDFS delivers two to three times better query performance than the cloud. Furthermore, Hadoop requires some workarounds that can be better addressed on-premise. So placement onsite has some value, depending on business needs.
Balancing Operational and Analytical Workloads
While data store types and placements play significant roles in choosing a platform, different workloads also require different architecture. Operational activities tend to happen dynamically, in real time, to keep the business running, and they require very high performance. Analytics, on the other hand, needs complex, intricate queries to retrieve high-quality information, helping business leaders make better decisions; these searches must run both quickly and thoroughly.
In both cases, data warehouses make operations and analysis more efficient and capable. McKnight says, “Matter of fact, one of the most important places you can put in a dollar, in terms of data management, is the data warehouse.” But one data warehouse architecture no longer fits all.
Data warehouses specialize in particular areas, like customer experience transformation, risk management, or product innovation. Even then, independent data marts – subject-oriented repositories for specific business functions like finance or sales operations – may be necessary to augment data warehouse workloads. Analytical workloads need data warehouses with substantial in-database analytics, in-memory capabilities, columnar orientation, and modern programming languages. To have the best of many worlds, companies combine a few different data warehouses to best serve their business needs.
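Columnar orientation is worth a brief illustration. The toy Python sketch below, using fabricated data, shows why an analytical aggregate favors a column-store layout: the scan touches one contiguous array rather than walking every record.

```python
# A toy sketch of why columnar orientation suits analytical scans;
# the orders data here is fabricated purely for illustration.
import numpy as np

n = 100_000

# Row orientation: each record stored together, as an operational
# database would lay it out.
rows = [{"order_id": i, "region": i % 50, "amount": float(i % 97)}
        for i in range(n)]

# Columnar orientation: the amount column stored contiguously, as an
# analytical warehouse would lay it out.
amounts = np.array([r["amount"] for r in rows])

# An aggregate touches only one column, so the columnar layout scans a
# single contiguous array instead of visiting every full record.
total_from_rows = sum(r["amount"] for r in rows)  # row-store style scan
total_from_column = amounts.sum()                 # column-store style scan
assert abs(total_from_rows - total_from_column) < 1e-6
```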
Not all operational and analytical workloads can be addressed by niche data warehouses, and big data technologies may be necessary for faster operational and analytical real-time performance. This can mean pairing a data lake with an analytical engine or looking toward a hybrid database that “processes both business orders and machine learning models simultaneously with fast performance and reduced complexity,” as McKnight says. So big data technologies like Hadoop also play a significant role in spanning operational and analytical workloads, a role graph databases play as well.
Graph databases leverage a NoSQL environment to bridge entities and their properties through a network or a tree. A quick peek at a graph database can save time and energy otherwise spent on complex SQL querying and can reveal, as McKnight says, “non-obvious patterns in the data.” The advantage of graph databases, to McKnight, is that they display some information with more accuracy and better performance than a report generated by a data warehouse.
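To illustrate the idea without a full graph database, the sketch below uses the networkx library as a stand-in; the entities and relationships are hypothetical, modeled loosely on a fraud-analysis use case.

```python
# A minimal sketch of the graph idea using networkx; the entities and
# relationships are hypothetical, standing in for a NoSQL graph store.
import networkx as nx

g = nx.Graph()
# Entities (nodes) and relationships (edges) for a fraud-style question.
g.add_edge("customer:alice", "phone:555-0100")
g.add_edge("customer:bob", "phone:555-0100")       # shared phone number
g.add_edge("customer:bob", "address:12 Elm St")
g.add_edge("customer:carol", "address:12 Elm St")  # shared address

# A traversal surfaces the non-obvious link between alice and carol,
# a pattern that would take several self-joins to express in SQL.
path = nx.shortest_path(g, "customer:alice", "customer:carol")
print(" -> ".join(path))
```

A real graph store such as Neo4j would express the same traversal in a query language like Cypher, but the principle of following edges instead of writing multi-table self-joins is the same.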
Organizations need to understand which data platforms manage different data workloads, placements, and types the best. McKnight emphasizes that businesses will survive and thrive when they figure out how to construct data warehouses, Hadoop, and cloud computing together, meeting their data and business strategy needs. Whether companies plan to purchase new technologies or use what’s on hand, finding an appropriate way to use these three tools together makes getting data under control more likely.