Bringing Data Preparation to Everyone in the Enterprise

“Whether you’re working with Big Data, or whether you’re working with very small data sets, you always need to prepare data for anything that you’re doing with it,” said Paige Schaefer, Product Marketing Manager for Trifacta. Schaefer and colleague Matt Derda, Senior Customer Success Manager, recently talked with DATAVERSITY® about Trifacta’s Data Wrangling technologies. “Data Preparation is the process of cleaning, structuring, and enriching raw data into a desired format for better decision making in less time.”

Data Preparation takes up to 80 percent of the time and resources of any data project, according to Trifacta’s founders, who literally wrote the book on such technologies. While the demand for near-instantaneous answers is increasing, this necessary process has become more cumbersome and time-consuming as data sources have become more diverse and unstructured. Wrangler was designed to solve this problem.

About Trifacta

Trifacta was founded in 2012 by Joe Hellerstein, Jeffrey Heer, and Sean Kandel, who had careers in academia. Heer brought Data Visualization experience to the table, creating a visual interface for the product, and Hellerstein and Kandel brought expertise in Machine Learning, said Schaefer. Based on leading research, Trifacta’s Wrangler “accelerates the process from raw data to refined,” she said.

“Think about all the work you need to do to get your data in a state so that you can actually work with it.” Before it can be used, data needs to be standardized, cleaned, and organized. “Trifacta allows you to accelerate that process.” Along with speeding up the preparation process, Wrangler provides data access to people outside of the IT department. “They’re now given this tool that allows them to prepare data in a way that’s much more powerful, to give them a much better sense of their data before they take any next steps,” she said.
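To make the three steps concrete, here is a minimal sketch in plain Python of what cleaning, structuring, and enriching a few records might look like. It is not Trifacta's method or API; the sample rows, the `prepare` function, and the `REGION_NAMES` lookup are all hypothetical names made up for illustration.

```python
# Hypothetical raw rows, as they might arrive from different source systems.
raw_rows = [
    {"customer": " Acme Foods ", "units": "1,200", "region": "ne"},
    {"customer": "acme foods", "units": "950", "region": "NE"},
    {"customer": "Bolt Grocers", "units": "", "region": "sw"},
]

# Hypothetical lookup table used for the enrichment step.
REGION_NAMES = {"NE": "Northeast", "SW": "Southwest"}


def prepare(row):
    """Clean, structure, and enrich one raw record."""
    # Clean: trim stray whitespace and normalize capitalization.
    customer = row["customer"].strip().title()
    region = row["region"].strip().upper()
    # Structure: turn the text quantity into an integer, or None if missing.
    units_text = row["units"].replace(",", "").strip()
    units = int(units_text) if units_text else None
    # Enrich: add a readable region name from the lookup table.
    return {
        "customer": customer,
        "units": units,
        "region": region,
        "region_name": REGION_NAMES.get(region, "Unknown"),
    }


prepared = [prepare(row) for row in raw_rows]
for record in prepared:
    print(record)
```

After this pass, the two spellings of the same customer match, quantities are numbers rather than text, and each row carries a readable region name. A visual tool aims to let a non-programmer reach the same result without writing the code.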

Customer Experience

Derda was a customer before joining the team at Trifacta and shared what he considered a typical experience. “I was at Pepsi, and Data Wrangling was exactly my pain point.” The company was paying for customer data that was never used, he said. Although they had access to portals from major retailers and could download reports, “Everything was reactive, so we didn’t look at that data until there was a problem.”

The company’s data was located in silos in multiple places, accessible only to the IT department, which exacerbated the problem. “When I first started at Pepsi, I would get these phone calls and emails about customers being over forecast, so I was asking, ‘Where is this data?’” Derda was charged with letting the customer know they were over forecast, but had no way to see why. “We had all this data that we didn’t have access to, and so there was a lot of IT interaction with that.” While IT was talking about models and Data Governance, Derda was looking at reports IT was building that simply didn’t have the information they needed. He knew the data was there, “Yet we couldn’t really do anything with it. And that’s a pretty common problem in that industry.”

Solving the Problem

Determined to get a handle on their data, Derda said they started with the basics. Pulling in small samples of data and organizing them in Excel spreadsheets, they explored “how the data all connected together from these different points, either internally or from the customer.” Finding the spreadsheets inadequate, they tried Access, which, as it turned out, was also not up to the task.

Then they tried Trifacta’s Wrangler “and it just made total sense.” No longer needing to rely on IT, they finally had access to all their data, he said. Because it was presented visually, “We could explore it, figure out how it needed to be put together, analyze it, and structure it.” Instead of wondering what it meant when a customer was over forecast, they were able to be proactive. “That was a total shift.” He shared an example of a customer who unwittingly ordered significantly more product than they could sell. “You see this spike in the data pop up and you can call the customer and ask, ‘Are you sure about this order?’” As a result, Derda was able to work with the customer and adjust the order before it was filled. “That was $7 million we saved from just that one interaction.”
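The kind of proactive check Derda describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Wrangler or Pepsi detected the spike: the `flag_spikes` function, the sample figures, and the rule of flagging any order above three times a customer's median past order are all hypothetical.

```python
from statistics import median


def flag_spikes(history, new_orders, factor=3.0):
    """Return new orders larger than `factor` times the customer's median past order."""
    flagged = []
    for customer, units in new_orders:
        past = history.get(customer, [])
        if not past:
            continue  # No baseline to compare against, so skip this customer.
        baseline = median(past)
        if units > factor * baseline:
            flagged.append((customer, units, baseline))
    return flagged


# Hypothetical order history (units per past order) and incoming orders.
history = {"Store A": [100, 120, 110, 90], "Store B": [40, 55, 50]}
new_orders = [("Store A", 115), ("Store B", 400), ("Store C", 75)]

spikes = flag_spikes(history, new_orders)
for customer, units, baseline in spikes:
    print(f"Check with {customer}: ordered {units}, typical order is {baseline}")
```

Here only the Store B order is flagged, which is the cue to call the customer before the order is filled. The threshold is a judgment call; a real analysis would tune it to the product and season.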

Another Wrangler user, Charlotte Yarkoni, Vice President of C+E Growth and Ecosystems, said that information-based organizations rely on clean data:

“But the process of cleaning and preparing data for use is time consuming and challenging. Trifacta, by leveraging Microsoft Azure Big Data and Advanced Analytics services, arms our shared customers with the ability to simplify those processes in order to more efficiently analyze the data and seek out meaningful insights.”

GlaxoSmithKline (GSK), one of the world’s largest pharmaceutical companies, has conducted thousands of clinical trials worldwide in a variety of different formats. The process of attempting to consolidate, re-use, and share that siloed clinical trial data with non-technical users had become inefficient and time-consuming, delaying drug production and wasting valuable research and development funds. GSK scientists had limited access to raw data and had to wait weeks or months to receive results, which led to missed opportunities for future clinical trials.

Since bringing Trifacta onboard, clinical researchers have gained access to needed data, which is presented in a format that accelerates the team’s understanding of the data and how it can be used. GSK now uses Wrangler to better predict how to run future trials. “With Trifacta, we’re granting broader data access to our team of clinical researchers and analysts for increased innovation in drug development, which is at the very core of GSK’s mission,” said Chuck Smith, Vice President of Data Strategy.

Donny Momchilov, IT director at Donnelley Financial Solutions, runs Wrangler on Microsoft Azure. Donnelley provides data solutions that help clients meet complex regulatory requirements, and it formerly spent months onboarding new customers because of the varied nature of client data in a rigid ETL environment. After deploying Trifacta, he said, “We’re seeing dramatically shorter development times and have been able to grant our business users more ownership over the data they need to wrangle.”

Unique Selling Proposition

Schaefer said Wrangler’s user-friendly interface is a unique selling point.

“You have a histogram at the top of the product to see the relevant distribution of your data. You have predictive transformation that suggests how you might want to transform your data. You don’t really have to think up the next move; the machine does it for you.”

Because Machine Learning and AI are integrated into the data transformation process, “It’s constantly learning, not just from your moves, but from the broader collective as well,” and that learning gets included in each suggested transformation, she said. Users can also easily edit or change any process they’ve initiated, and the interface uses accessible human language, “So it’s really easy for someone who doesn’t have technical expertise. You get up and running right away.”

Derda agrees: “Everybody always comments on how easy it is to use, that it’s ‘point-and-click’ and uses natural language.” Customers also like the architecture for its scalability, since it performs well across both small and large stores of data. Another useful feature is that nontechnical users can generate samples across data sets and connect to data without moving or copying it, he said. Users are free to use data as needed, and IT is not tied up dealing with desktops and user requests. “It’s the best of both worlds,” he said.

Recent Developments

This year the company started a user community and held its first user community event in New York City.

“Data Wrangling has become so popular that we now have user groups around the world.” These groups have become a way for users to meet in person and share ideas and best practices. At the New York event, Derda said, “We were all expecting to have to answer questions and help people out,” yet the users had a surprising amount of knowledge about how the product works. “There would be a question from a user and other users would answer it. That was really cool to see.” They also didn’t expect such a large crowd to show up to a wrangling event. “Spending the night talking about data — that doesn’t sound like the most fun night ever, yet people were willing to do it,” he said.

Trifacta has gained Microsoft Co-Sell Partner status and also recently announced the availability of Wrangler Enterprise in the Azure Marketplace, allowing organizations to deploy Trifacta in less than 30 minutes.

Schaefer said that when Data Wrangling was first emerging as a term, it was seen as something confined to Data Science or IT. Then the term started being used in a Big Data context, but now its use is becoming more common. “I think it will trend in the future because Data Wrangling is everyone’s problem.”

Photo Credit: sdecoret/Shutterstock.com
