In a world where IoT, AI, blockchain, and Cloud-connected devices are redefining everything from energy and finance to supply chains and services, this is the new reality: companies across all industries are becoming data companies. As a result, tools for managing data workloads, like modern Data Lakes, have become a means to additional revenue streams. But monetizing data isn’t as simple as adding the information to a Data Lake and putting up a Craigslist ad.
Despite the excitement and promise of new data-related revenue streams, companies are having a difficult time adopting new technologies: They are swimming in a growing pool of options as they work to modernize their organizations with the cloud and evolve their data infrastructures.
There is a litany of questions companies will ask about how to get newly acquired data into a useful state and location. If an insurance company buys data from a car manufacturer, is that information structured in a way that easily collates with the data the insurance company already has? Is it clean, timely, and correct?
While companies ask plenty of questions about new tools, it’s the questions they are not asking that may provide the greatest return on data investments. Here are some pragmatic questions business leaders should ask the next time they have to make a technology decision for a data monetization project.
How Can I Prove the Data is Getting There Correctly?
When moving data between systems, it won’t always fit when it arrives in the new system. This can result in data loss, unintended data transformations, and costly extended project timelines. Specifically, we’re talking about datatypes and character sets. While seasoned Data Lake implementors know about this, senior management often does not, because the subject can be both technical and subtle. Here is an example:
Developers regularly use extreme dates and timestamps as defaults because database designers force the app to have a value for important fields such as ‘date processed.’ They pick some year like 1 (one) or -2525 so that they don’t conflict with real dates once something is actually processed. However, extreme values on one system might be out of range (they simply might not fit) on another. When those values don’t fit, one of two things happens: the system comes to a screeching halt and there’s a mad scramble to find the root cause; or, worse and more likely with today’s tools, the data is silently coerced by the data movement technology into some value that does fit. Year one (1) just became 1970, and the two systems are now out of sync. The implications range from nominal to catastrophic.
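To make the failure mode concrete, here is a minimal sketch in Python using pandas, whose default nanosecond timestamps can only represent roughly the years 1677 through 2262; the column values and the year-1 sentinel are hypothetical. A strict load rejects the out-of-range default loudly, while a lenient load quietly replaces it.

```python
import pandas as pd

# Source rows use a year-1 sentinel for "not yet processed" (hypothetical example).
date_processed = ["0001-01-01", "2023-06-15"]

# Strict load: the out-of-range sentinel fails loudly, so at least you know.
try:
    pd.to_datetime(date_processed)
except pd.errors.OutOfBoundsDatetime as exc:
    print("strict load failed:", exc)

# Lenient load: the sentinel is silently coerced to NaT (null); other tools
# clamp or wrap to values such as 1970-01-01 instead. Either way, the copy
# no longer matches the source.
print(pd.to_datetime(date_processed, errors="coerce"))
```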
It’s much more convenient to know about data loss and unanticipated transformations before you become dependent upon those data. Asking whether the data arrived correctly — bit for bit — after they moved between systems is crucial to a data monetization strategy.
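One way to turn that question into a routine check is to reconcile row counts and content fingerprints after every load. Below is a minimal, illustrative sketch (the table contents are hypothetical); a production pipeline would push the hashing down into the source and target systems rather than pull rows back to compare.

```python
import hashlib
from collections import Counter

def row_fingerprint(row):
    """Digest of a row's canonical string form: identical bits give identical hashes."""
    canonical = "|".join("" if value is None else str(value) for value in row)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def reconcile(source_rows, target_rows):
    """Compare row counts and the multiset of row fingerprints between two systems."""
    if len(source_rows) != len(target_rows):
        return f"row count mismatch: {len(source_rows)} vs {len(target_rows)}"
    source_hashes = Counter(row_fingerprint(r) for r in source_rows)
    target_hashes = Counter(row_fingerprint(r) for r in target_rows)
    missing = sum((source_hashes - target_hashes).values())
    return "match" if missing == 0 else f"{missing} row(s) altered or missing in target"

# Hypothetical sample in which the year-1 sentinel was coerced in flight.
source = [(101, "0001-01-01"), (102, "2023-06-15")]
target = [(101, "1970-01-01"), (102, "2023-06-15")]
print(reconcile(source, target))  # -> 1 row(s) altered or missing in target
```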
How Can I Test Multiple Technologies?
I like to ask people at conferences about projects they’re working on and have planned. A common response lately has been along the lines of, “We have a Data Lake/Analytics project, but I really have no idea which technologies we’re going to use yet. We’re finally down to five contenders but two dueling camps are starting to emerge.”
Will it be Apache Hadoop and Hive, Amazon S3 and Redshift, Google BigQuery, Azure Data Lake Store or Data Warehouse, Kafka or Kinesis, Snowflake, PostgreSQL, JSON, Avro, Parquet, ORC or any combination of a seemingly limitless supply of quality choices? Odds are that whatever centralized storage and processing technology you choose today will, at least in part, change in the next year or two as you start gaining practical experience and adding new sources to feed it.
Vetting these new technologies not only means learning how to organize and analyze the data but also how to get meaningful data into those technologies in the first place. The data movement tools need to be easy to set up and use, have wide platform and datatype support, and move high volumes of data accurately. Often with current open source and other free data movement offerings you can have any two of these capabilities but not all three, making apples-to-apples comparisons difficult. The handful of vendors that do bulk data loads and stream production-volume data changes easily, comprehensively, and at scale are frequently costly and tend to avoid allowing you to kick the tires on production-scale workloads.
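One way to keep such a bake-off apples-to-apples is to drive every candidate through an identical harness: the same input, wall-clock timing, and a post-load row-count check. The sketch below is deliberately generic; the loader and counter callables are hypothetical stand-ins for whatever load path each candidate destination actually exposes.

```python
import time

def evaluate(candidates, dataset):
    """Load the same dataset into each candidate and record comparable metrics."""
    results = []
    for name, load, count_rows in candidates:
        start = time.perf_counter()
        load(dataset)                      # bulk load into this candidate's store
        elapsed = time.perf_counter() - start
        results.append({
            "candidate": name,
            "seconds": round(elapsed, 3),
            "rows_loaded": count_rows(),
            "rows_expected": len(dataset),
        })
    return results

# Hypothetical stand-in; real runs would register loaders for Redshift, BigQuery,
# Snowflake, S3 + Hive, and so on, each fed the exact same dataset.
staged = []
candidates = [("in_memory_stub", staged.extend, lambda: len(staged))]
for result in evaluate(candidates, dataset=[{"id": i} for i in range(100_000)]):
    print(result)
```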
If you have narrowed your options to one or two destination technologies and data formats, and you trust in the elasticity of the Cloud, then you may be able to get away with some more rudimentary testing. If you’re just starting out and have a large, seasoned team of motivated individuals, you can probably mesh together a workable collection of testing technologies. But if time is short, teams are small, and several options are still on the table, then don’t be afraid to hand your biggest data monster to vendors who claim to have the ideal tool set and tell them they have two or three days to prove it. If they start by trying to scale back the test, move on to the next vendor. Those who don’t flinch and go on to produce are those who are most experienced. From them you will gain not only a tool that will expedite your evaluation of storage and processing technologies but also valuable, hard-learned best practices and insights on each.
Am I Locked In?
Once you’ve finished all your reading, completed your testing, tallied up the results, gathered everyone’s input, and made the best decision with the data at hand, someone will ask, “Did you look at X?” Of course not, because it didn’t exist when the project started, but now it’s all everyone is talking about. If you went with Apache Hadoop, Amazon S3, or Azure Data Lake Store (ADLS) as your storage option because it was better at supporting your analytic tool set, have you just limited your ability to adapt and employ parallel technologies without incurring heavy costs?
Companies must adapt to new technologies and strategies, but it is imperative that they understand how disruptive those changes will be when they occur.
The answer here is to plan for both known and unknown change by employing technologies that enable multiple strategies to coexist. In particular, I’m talking about feeding and moving the data between these systems. When a company evolves its Data Strategy, it will, for example, run both Hadoop and S3 in parallel for some time before completing the transition from one to the other. It may run them together indefinitely because each serves a unique purpose. In practice, there is usually no sudden flip in Data Strategy to the latest and greatest technology; monetized data is going to have more than one final destination. Apache Hive, for example, will work on HDFS, S3, and ADLS; Amazon DMS (Data Migration Service) will not. If the data movement strategy can’t handle this solution evolution (sources and volumes increase, destinations morph, latency requirements tighten), then the freedom to adapt will be far from free.
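One way to preserve that freedom is to make the destination a configuration detail rather than a code change. The sketch below assumes pandas with a Parquet engine installed, plus the relevant fsspec backends (s3fs, adlfs, or an HDFS client) for the cloud targets; the bucket, container, and host names are hypothetical.

```python
import pandas as pd

def land_events(df: pd.DataFrame, base_uri: str) -> str:
    """Write the same Parquet output to whichever store base_uri points at."""
    path = f"{base_uri}/events_2024-01-01.parquet"
    df.to_parquet(path, index=False)  # the URI scheme selects the storage backend
    return path

df = pd.DataFrame({"device_id": [1, 2], "reading": [0.7, 0.9]})

# The write logic stays put; only the destination changes.
land_events(df, "/tmp")                                            # local smoke test
# land_events(df, "s3://my-bucket/lake")                           # Amazon S3 (s3fs)
# land_events(df, "abfss://lake@myaccount.dfs.core.windows.net")   # ADLS Gen2 (adlfs)
# land_events(df, "hdfs://namenode:8020/lake")                     # HDFS
```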
As in most healthy ecosystems, there is diversity and change. The target for where you want to be in a year may not change, but the execution path toward it may. The impetus behind finding a new data technology is likely a problem that needs solving today. But if companies can also plan for inevitable changes in the future, they will be much better off with the technology decisions they must make now. Implementing technologies that minimize how many new skills must be learned along the way is helpful. “Learn once, use many” is an oft-cited mantra.
Ask the Questions Others Aren’t Asking to Gain Competitive Traction
Decision makers must contend with a symphony of components that work together to generate value from the data in question, knowing the decisions they make today will change a year from now. If companies ask these questions, the ones their competitors aren’t asking, to understand their choices, they will be much better equipped to turn data into something useful today, tomorrow, and beyond.