Having herself held senior roles in IT at Wall Street companies including Deutsche Bank and Morgan Stanley Smith Barney, Oksana Sokolovsky is quite familiar with the challenge of Data Management and data discovery. As co-founder and CEO of ROKITT, her goal was “to build a product that solves that challenge,” she says.
The challenge exists across large enterprises in multiple industries, but is often especially acute in those dealing with regulatory pressures and compliance requirements – healthcare, for instance, and of course, the financial sector. Basel Committee on Banking Supervision (BCBS) 239 compliance for effective risk data aggregation and reporting, for example, is a big driver of improved Data Management for global systemically important banks.
In fact, a McKinsey & Company and Institute of International Finance survey showed that more than half of the world’s biggest banks faced significant challenges meeting the January 1, 2016 deadline for compliance, with the Global Association of Risk Professionals commenting that “many institutions continue to struggle to fully implement the requirements across the business under the most demanding interpretation of those requirements.”
ROKITT’s Astra solution, Sokolovsky believes, can help banks adhere to both internal and external regulations and policies, like BCBS 239, across complex data landscapes. It can also support other use cases for a variety of enterprises, such as data asset governance that makes better use of data to enhance business value.
Machine Learning Drives Data and Relationship Discovery
Astra debuted in March after a year of development. What sets the technology apart, according to Sokolovsky, is its ability “to let customers discover data and information about data and its relationships, in order to manage data better or meet regulations more efficiently, using our custom-built machine learning concepts and other advanced algorithms.” Astra’s algorithms automatically discover and self-learn data relationships with up to 90% accuracy, the company says.
It can recognize, for example, connections or dependencies between values within a database that may not be obvious – perhaps that a column in one table contains data that refers to a column in another table, and why that relationship exists. It will learn that two columns in different tables both carry customer information, even though one is called “Customer” and the other “Company.” It will then establish the relationship that ‘customer’ is the primary key and ‘company’ is the secondary key, she says, and apply that knowledge from that point on across databases, XML documents, and flat files. Columns don’t even need to have meaningful names like these for the solution to figure out the relationship – if one were called ‘xyz’ and another ‘qwerty17,’ it could still figure out that both hold customer information.
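To make the idea of finding relationships from values rather than names concrete, here is a minimal Python sketch. It is not ROKITT’s algorithm, which the article does not describe; it assumes a simple value-overlap heuristic in which a column is flagged as a likely reference to another table’s column when nearly all of its values appear in that column and the referenced column is unique. The function names, the `min_coverage` threshold, and the sample tables are all hypothetical.

```python
from itertools import permutations


def column_values(rows, column):
    """Return the set of non-null values found in one column."""
    return {row[column] for row in rows if row.get(column) is not None}


def find_candidate_references(tables, min_coverage=0.9):
    """Suggest (child, parent, coverage) column pairs based on data alone.

    `tables` maps a table name to a list of dict rows. Column names are
    never compared; only the values they hold are.
    """
    columns = {}
    for table_name, rows in tables.items():
        for column in {name for row in rows for name in row}:
            columns[(table_name, column)] = column_values(rows, column)

    candidates = []
    for child, parent in permutations(columns, 2):
        if child[0] == parent[0]:
            continue  # only look for relationships across tables
        child_vals, parent_vals = columns[child], columns[parent]
        if not child_vals or not parent_vals:
            continue
        # Assumption: a referenced column behaves like a key, so it must
        # not contain duplicate values.
        non_null = [r[parent[1]] for r in tables[parent[0]]
                    if r.get(parent[1]) is not None]
        if len(non_null) != len(parent_vals):
            continue
        coverage = len(child_vals & parent_vals) / len(child_vals)
        if coverage >= min_coverage:
            candidates.append((child, parent, coverage))
    return sorted(candidates, key=lambda c: -c[2])


# Hypothetical tables with deliberately meaningless names.
tables = {
    "xyz": [
        {"a": "ACME", "b": "NY"},
        {"a": "GLOBEX", "b": "CA"},
        {"a": "INITECH", "b": "TX"},
    ],
    "qwerty17": [
        {"k": 1, "v": "ACME"},
        {"k": 2, "v": "ACME"},
        {"k": 3, "v": "GLOBEX"},
    ],
}

for child, parent, coverage in find_candidate_references(tables):
    print(f"{child} likely refers to {parent} (coverage {coverage:.0%})")
```

On this sample data the only pair reported is `qwerty17.v` referring to `xyz.a`, even though nothing in the names hints at it. A production tool would need to cope with dirty values, composite keys, and very large tables, which this sketch ignores.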
The system reads data in repositories and learns its true Metadata. In fact, applying its Machine Learning algorithms to the data itself, rather than just to Metadata, is critical: As companies and systems grow, Metadata only holds so much information about data, Sokolovsky says, limiting possibilities. Applying Machine Learning algorithms to the data itself expands the ability to discover data relationships beyond the 10 to 20 percent possible when such algorithms are applied to Metadata only. The data discovery process, she says, is fast, too, thanks to Astra’s next-generation asynchronous processing architecture. “It’s measured in hours and minutes,” says Sokolovsky.
Using Astra, some of the company’s customers have discovered database schemas they didn’t know existed, she says. That’s not surprising when you consider that some databases may be over a decade old, and documentation about how the data is organized and how its elements relate to one another may no longer exist, if it ever did. Machine Learning helps overcome that lack of documentation and speeds understanding of these databases’ complexity, which is helpful not only for compliance and governance, but also for enabling meaningful analysis for business growth or revenue purposes.
Another customer of the company sees the opportunity to use the tool’s data discovery capabilities as a way to smooth its path to the data lake, Sokolovsky says. “They want to understand their legacy data before they migrate into the lake,” she explains. “If they don’t understand that data, they might experience hiccups that could delay the project cycle, and that could cause a lot more dollars to be spent.”
Currently, Astra’s Machine Learning and other advanced algorithms can be applied only to structured data in relational databases, but Sokolovsky says there are plans to leverage Astra’s capabilities with unstructured data as well within the next year or so. ROKITT has focused the tool on structured data to date, because “most large enterprises that struggle with Data Management are sitting on a lot of relational databases and haven’t yet made a big transition to the unstructured world,” she explains.
What’s the Data Flow?
As data and its relationships are discovered, even complex environments can become more understandable, and how information flows between and within relational databases from origin until end – aka, data lineage – can become more apparent. That permits users to understand where information originates – one of the requirements for fulfilling BCBS 239 – and helps in rationalizing current data, as well as in realizing what data is exposed to whom so that compliance violations and risks can be minimized. Among other things, data lineage also supports understanding changes and their impacts and data consumption, and enables organizations to simplify data flows.
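Once data flows are known, tracing lineage amounts to walking a graph. The following Python sketch assumes flows are available as simple (source, destination) pairs; the dataset names and both function names are hypothetical, and the article does not say how Astra represents lineage internally.

```python
from collections import deque


def trace_upstream(flows, target):
    """Return every dataset that feeds `target`, directly or indirectly.

    `flows` is a list of (source, destination) pairs.
    """
    parents = {}
    for src, dst in flows:
        parents.setdefault(dst, set()).add(src)

    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                queue.append(parent)
    return seen


def origins(flows, target):
    """Return the upstream datasets that nothing else feeds into."""
    destinations = {dst for _, dst in flows}
    return {n for n in trace_upstream(flows, target) if n not in destinations}


# Hypothetical flows ending in a risk report.
flows = [
    ("crm.customers", "staging.customers"),
    ("staging.customers", "warehouse.exposures"),
    ("trades.positions", "warehouse.exposures"),
    ("warehouse.exposures", "reports.risk_summary"),
]

print(sorted(trace_upstream(flows, "reports.risk_summary")))
print(sorted(origins(flows, "reports.risk_summary")))
```

Here the report traces back to two origins, `crm.customers` and `trades.positions`. Reversing the edge direction in the same traversal would give an impact analysis: everything downstream that a change to one table could affect.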
Basically, “the enterprise can manage the data because it has knowledge about it,” she says. For many customers, it’s eye-opening to see how data moves through the system and how they can leverage the knowledge to support Data Governance. “For regulatory purposes, for instance, they can see where sensitive customer information is maintained, where the personally identifiable information (PII) is, and who has access to that data.”
Data discovery and an understanding of how data cycles across environments are also helpful when it comes to test Data Management for software development and enabling DevOps. DevOps pros, for example, can leverage this to get data that covers all test conditions, while testers can ensure test data availability by creating synthetic data to avoid exposing real data. Synthetic data generation, the company explains, leverages ROKITT Astra’s discovery-beyond-metadata capabilities: Through data discovery, the technology understands the structure of production data, including relationships and flows. “This helps us to generate synthetic data that retains referential integrity, which has been a common problem in the industry,” she says.
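Retaining referential integrity means that every reference in generated child rows points at a parent row that actually exists. The Python sketch below illustrates that property only; it assumes a hypothetical two-table customers/orders schema and is not a description of how Astra generates data.

```python
import random


def generate_synthetic(n_customers, n_orders, seed=0):
    """Generate fake customers and orders whose references always resolve."""
    if n_orders > 0 and n_customers < 1:
        raise ValueError("orders need at least one customer to refer to")
    rng = random.Random(seed)  # seeded so test runs are repeatable

    customers = [
        {"customer_id": i, "name": f"Customer-{rng.randrange(10**6):06d}"}
        for i in range(1, n_customers + 1)
    ]
    valid_ids = [c["customer_id"] for c in customers]

    # Each order draws its customer_id from the generated parent keys,
    # which is what keeps the relationship intact.
    orders = [
        {
            "order_id": j,
            "customer_id": rng.choice(valid_ids),
            "amount": round(rng.uniform(10, 500), 2),
        }
        for j in range(1, n_orders + 1)
    ]
    return customers, orders


def orphaned_rows(child_rows, fk, parent_rows, pk):
    """Return child rows whose foreign key matches no parent key."""
    keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in keys]


customers, orders = generate_synthetic(n_customers=5, n_orders=20)
print(len(orphaned_rows(orders, "customer_id", customers, "customer_id")))
```

The `orphaned_rows` check doubles as a validation step: run against any generated data set, an empty result confirms that no real customer data was needed to keep the relationships consistent.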
ROKITT is currently working with three financial institutions in New York as well as several health-related and retail companies on compliance and governance scenarios, she says.