Walk around any large organization and you will hear people groan about finding the right data to do their work. In the typical organization, data sits in multiple places, lost behind technical and functional boundaries. These isolated systems, referred to as “data silos,” often exist for good reasons, such as helping each business function do its job well and meet legal requirements.
However, without a cohesive and unified data view, informed decision-making across an organization becomes difficult, and inefficiencies arise. Increases in data volume and velocity intensify this headache.
Companies tend to solve this in three ways: maintaining a distributed network of specialized databases, shifting to a centralized database, or transitioning gradually to a federated system. Many managers then jump straight to a technical solution or hire a data scientist to deal with the messy data.
Upgrading to a centralized database seems tempting. Eric Little, the CEO of LeapAnalysis and the Chief Data Officer of the OSTHUS Group, summed up this traditional mindset in a recent DATAVERSITY® interview:
“I need to take all my data laying all around my company and somehow put it in one big master system that I will build. That means getting my data across the entire enterprise, even those from 30-year-old systems which are not in use, and somehow connect with hundreds and thousands of employees across the world, who may have data scattered in a collection of text files or Excel sheets. On top of which, the person who knows what the columns mean in that 30-year-old system may be dead or retired, cruising in a catamaran along Costa Rica.”
For companies with tons of file stores, perhaps some in a data warehouse or spread across a variety of relational systems, reorganizing data looks daunting, especially as it often involves a heavy dose of extract, transform, and load (ETL). Nowadays, organizations want to store raw data in a centralized data lake. However, the extensive cost and year-plus project of merging data into the latest shiny technology can be problematic.
Torsten Osthus, CEO of the OSTHUS Group and a co-founder of LeapAnalysis, reflected in the same interview, “in the mid-2000s, the software industry focused on system integration and capabilities instead of data integration and managing data as a corporate asset.” But that approach is running into a brick wall with AI and machine learning. Furthermore, as Osthus said, organizations miss bringing contextual knowledge from people’s heads into the systems.
Machine learning is voraciously data hungry, needing data on the order of petabytes to be successful and to “learn.” For example, Little said, life science workers and researchers see “massive image files from high throughput screening, or have to search for data on proteomics and genomics,” e.g., to better understand biomarkers for diseases, or they must sift through the variety of “MRI’s and scans” from doctors’ offices. Machine learning may be used to do some augmented analytics, but, as Little said, “you are not going to be able to database all that in a central location where everyone has access.”
Even if all the information could be stored in one place, there are legal ramifications. Little remarked, “certain data at one of our customer sites can’t leave Germany for legal reasons. How do you port it over to the U.S.? It can’t leave.” Furthermore, employees, like the IT guru (i.e., the master of the machines), can be quite protective of the data sources they use and control. The idea that everyone is going to form a circle in an Enterprise Information Management system, “hold hands and sing Kumbaya is a fallacy,” explained Little.
Data silos are a reality: they were designed for a business purpose, and they are here to stay. So how can organizations deal with them? Helping organizations figure this out is a central piece of the LeapAnalysis puzzle.
How to Make Data Silos Work
Achieving success with data silos requires a different approach “than thinking about what we can do with code now or even solely computer science,” said Little. “It is about making computers better with searching and working in a new way.” Little’s background in philosophy and cognitive neuroscience provides this new context. He stressed the importance of the “semantic component, the controlled vocabularies and taxonomies. All the logical stuff that organizes information” so that computations (e.g., machine learning techniques) actually work.
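To make that idea concrete, a controlled vocabulary can be as small as a few related terms expressed in a machine-readable form. The sketch below, written in Python with the open-source rdflib library and an invented biomarker example, shows what such a miniature taxonomy might look like; it is an illustration only, not LeapAnalysis code.

```python
# A minimal controlled-vocabulary sketch, assuming rdflib and a made-up
# "biomarker" taxonomy purely for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/vocab/")
g = Graph()

# Two concepts, with the narrower term linked to the broader one.
g.add((EX.Biomarker, RDF.type, SKOS.Concept))
g.add((EX.Biomarker, SKOS.prefLabel, Literal("Biomarker")))
g.add((EX.ProteinBiomarker, RDF.type, SKOS.Concept))
g.add((EX.ProteinBiomarker, SKOS.prefLabel, Literal("Protein biomarker")))
g.add((EX.ProteinBiomarker, SKOS.broader, EX.Biomarker))

print(g.serialize(format="turtle"))
```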
Torsten Osthus added to Little’s ideas:
“Let us do machine learning. But, we need to leverage the data, information and knowledge as contextual assets of digitization. Especially, we need to bring people’s knowledge, data, and business process know-how together. Brains are silos in organizations as well, those with data assets to tap. Disrupted data comes from a bottom-up approach. Create a knowledge graph, a semantic engine under the hood based on a top-down approach and bring all the data and knowledge together. It is a true federated approach where data can stay in its original source.”
Our brains thrive as pattern and association machines. So can a computer with a knowledge graph behind a search and analytics engine. Connect metadata from each silo to the knowledge graph and make the data FAIR: Findable, Accessible, Interoperable, and Reusable, said Osthus. The user then sees the schema of the relevant data sources and can explore further.
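As a rough illustration of what connecting silo metadata to a knowledge graph can mean in practice, the Python sketch below (again using rdflib, with entirely hypothetical source names and endpoints) registers a silo’s metadata in a small graph and then asks which sources are relevant to a topic, without touching the underlying data itself.

```python
# A minimal sketch, assuming rdflib and purely hypothetical silo names and
# connection strings, of describing a silo in a metadata graph for findability.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, DCTERMS

EX = Namespace("http://example.org/catalog/")
kg = Graph()

# Describe one silo: what it is, what it holds, and how to reach it.
kg.add((EX.lims_db, RDF.type, EX.DataSource))
kg.add((EX.lims_db, RDFS.label, Literal("Lab LIMS (Oracle)")))
kg.add((EX.lims_db, DCTERMS.subject, Literal("proteomics assays")))
kg.add((EX.lims_db, EX.endpoint, Literal("jdbc:oracle:thin:@lims.example.org:1521/LIMS")))

# A findability check: which registered sources say anything about proteomics?
hits = kg.query("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?src WHERE {
        ?src dcterms:subject ?topic .
        FILTER(CONTAINS(LCASE(STR(?topic)), "proteomics"))
    }
""")
for row in hits:
    print(row.src)
```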
How does one get from knowledge graph to results? Little commented, “we find a very clever way to do machine learning on the data source. Pull the schema, read, and align it. If we get weird columns, go to the subject matter experts to extract meaning.” Everything stays where it is in the silo, including the Data Governance, Data Stewardship, and security. Little described how the different search engine components work:
“Put a virtual layer between the silos and the user interface. A knowledge graph lies within this middleware with semantic models, connected to a data connector & translator using API’s, REST connectors, or whatever. We make the data sources locally intelligent to self-report what they are, where they are and how to get to them. User queries from the top interface pass through the middleware via SPARQL, a language that talks with this knowledge graph. A mechanism in the knowledge graph talks directly to the data sources, filters data elements and brings the best matches as search results. Those results can then have deeper analytics run against them, be visualized, etc.”
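None of LeapAnalysis’s internals are shown here, but a minimal sketch of what a query against such a virtual layer could look like, assuming a hypothetical middleware SPARQL endpoint and the open-source SPARQLWrapper library, runs along these lines.

```python
# A sketch of querying a virtual layer over SPARQL; the endpoint URL and the
# model terms are hypothetical, used only to illustrate the flow.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://middleware.example.org/sparql")  # hypothetical endpoint
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX ex: <http://example.org/model/>
    SELECT ?study ?source WHERE {
        ?study a ex:HighThroughputScreen ;
               ex:storedIn ?source .
    } LIMIT 25
""")

# The middleware resolves which silos can answer and returns the best matches.
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["study"]["value"], "->", row["source"]["value"])
```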
With a single click, the search engine returns high-level data from multiple sources across the data ecosystem. From the results, a person can identify pockets of data resources: sets of patterns that answer their query more quickly (and the engine learns and improves its performance over time). They can further narrow the query or explore in detail, as permissions allow.
This tool can expunge results, cache them, or export them in a different format, e.g., CSV. The user interrogates the knowledge engine via a query or analytic, forming “a semantic to everything translator, through SPARQL, while leaving the data in place and making it easier to fetch the detailed information.” This model depicts true data federation: data stays in place with no intensive ETL, and search and analysis can happen on the fly.
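Exporting matches in a tabular format can be equally lightweight. The short follow-on sketch below, again pointed at the same hypothetical endpoint, asks for results as CSV and writes them straight to a file, while the source data never moves.

```python
# A small follow-on sketch: request the matched rows as CSV from the same
# hypothetical endpoint and save them locally.
from SPARQLWrapper import SPARQLWrapper, CSV

sparql = SPARQLWrapper("http://middleware.example.org/sparql")  # hypothetical endpoint
sparql.setReturnFormat(CSV)
sparql.setQuery(
    "SELECT ?study ?source WHERE { ?study <http://example.org/model/storedIn> ?source } LIMIT 25"
)

data = sparql.query().convert()
with open("matches.csv", "wb") as fh:
    # The endpoint returns raw CSV; handle bytes or text defensively.
    fh.write(data if isinstance(data, bytes) else data.encode("utf-8"))
```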
Speed and Knowledge
LeapAnalysis puts Little’s ideas into practice with the philosophy, “Fast as hell, no ETL.” Now customers can integrate data in minutes to hours rather than months to years, bringing the right data together. As Little explained:
“We solve the problem of speed to knowledge to solve actual business problems. Can a person get a quick way to go to that knowledge? Not just building technology for the sake of building technology. Pull concepts in queries through semantics and do it in an intelligent way, through a knowledge graph. Attributes of the items inside of the algorithms, the classifiers, become clearer because the algorithms are now connected to the concepts in the knowledge graph.”
Little and Osthus highlighted four other features:
- A core engine that builds out a customer’s knowledge model with a usable panel consisting of a query pane and a results pane, side by side. You can see instantly what is coming back from which data sources and judge the value and quality of the data.
- A toggle that sets a user’s favorite data schema as a reference model to which everything maps. You can use a semantic model or your favorite relational schema.
- A “sophisticated set of connectors that directly talk to data resources,” as Eric mentioned. Customers can purchase a variety of different connectors for different data sources (a rough sketch of the connector idea appears after this list).
- Data virtualization that allows the knowledge engine to query against formats such as RDF and non-RDF graphs (e.g., Neo4j or Titan), any form of relational database (Oracle, SQL, etc.), NoSQL databases (MongoDB, Cassandra, etc.), and a variety of media extensions, including video and image files.
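The sketch below is plain Python with invented source names and is not LeapAnalysis code; it simply illustrates the connector idea in miniature. Each silo self-reports its location and schema and maps its local column names onto a shared reference model, so a virtual layer can decide which silos are able to answer a query.

```python
# A rough, purely illustrative sketch of self-describing connectors that map
# local schemas onto a shared reference model.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Connector:
    """Wraps one silo: knows where it lives, what it holds, and how to map it."""
    name: str
    location: str
    local_schema: List[str]
    to_reference: Dict[str, str]  # local column -> reference-model term

    def describe(self) -> Dict[str, object]:
        # The "self-reporting" step: expose identity, location, and mapped schema.
        return {
            "name": self.name,
            "location": self.location,
            "columns": [self.to_reference.get(c, c) for c in self.local_schema],
        }


# Two hypothetical silos with different local vocabularies.
lims = Connector("lims", "oracle://lims.example.org", ["SMPL_ID", "PROT_CONC"],
                 {"SMPL_ID": "sampleId", "PROT_CONC": "proteinConcentration"})
docs = Connector("docs", "mongodb://docs.example.org", ["sample", "concentration"],
                 {"sample": "sampleId", "concentration": "proteinConcentration"})

# The virtual layer only needs the reference term; each connector resolves it locally.
for source in (lims, docs):
    info = source.describe()
    if "proteinConcentration" in info["columns"]:
        print(f'{info["name"]} at {info["location"]} can answer the query')
```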
“Using a search engine to map a question semantically has been horrible for years,” Little said. Partly because of this negative experience, businesses have tackled information disorganization either by combining everything from disparate sources in one place or by hiring a data scientist, or a similar expert with domain knowledge, to squeeze information out of data scattered all over the place, a very manual effort. Such a person needs to know the ins and outs of searching, like an auto mechanic tuning up an engine.
Little and Osthus are “making the alignment between different meanings simpler through a truly federated system.” A chemist, biologist, or bioinformatician can jump into their research without needing to learn a new centralized data system or hand the request off to someone in IT.
Osthus provided a parting thought:
“In the past, data integration was driven by costly programming and writing complex SQL Statements. Now it’s a business perspective, that can be done by the users. Embrace your Data Silos.”