The term “machine learning” dates back to a 1959 article by Arthur Samuel, in which he posited: “Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.” The hope was that data alone could be used to develop models, rather than relying on fixed rules or theory.
Aurélien Géron highlighted this in 2017, stating “Machine learning is the science (and art) of programming computers so they can learn from data.” Both definitions frame machine learning as a data-driven discipline, training computers to learn through observation and experience, instead of rigid pre-programming. Machine learning’s reliance on data reveals one of its most pressing challenges: sensitive data, inaccurate data, biased data, noisy data, and irrelevant data lead to poor and risky outcomes.
Dr. Joseph Regensburger, the VP of Research at Immuta, the Automated Data Governance company, recently spoke with DATAVERSITY® about the difficulties and challenges faced by organizations trying to come to terms with these technologies. He originally came to Data Science by way of his work in experimental particle physics, and he joined Immuta because of the solutions it offered. During the interview, he remarked:
“The industry is moving in the direction of productionizing machine learning. Organizations are going to have to address the areas that have been short-changed somewhat: around privacy, around whether or not correlations can be generalized, and around fairness. Those are the sorts of challenges I see people struggling with.”
Basic Types of Machine Learning
There are four basic kinds of ML algorithms: reinforcement, unsupervised, semi-supervised, and supervised.
- Reinforcement learning focuses on tightly controlled learning parameters: the algorithm receives descriptions of actions, constraints, and end values. With the rules clearly defined, it then explores different options through trial and error, learning from previous experiences and adjusting its approach to achieve the best results.
- In unsupervised learning, the algorithm searches through data to find and identify patterns, with no human operator providing instructions. The algorithm establishes correlations and relationships as it analyzes the available data and then organizes that data into a structure. As more data is assessed, its ability to make decisions based on the data gradually improves. Unsupervised learning suits problems where there is no clear answer about what the results should look like.
- Semi-supervised learning uses both labeled and unlabeled data. Labeled data means information that uses meaningful tags, so the algorithm can understand the data. Unlabeled data doesn’t have those meaningful tags. With this combination, ML algorithms learn how to label, or identify, unlabeled data.
- With supervised learning, the algorithm is taught by way of example. The operator provides a known dataset containing the desired inputs and outputs, which serve as the standard for what correct results should look like. The algorithm must then work out a method for mapping those inputs to the correct outputs. Though the operator already knows the answers, the algorithm has to learn to identify the patterns in the data that produce them: it makes predictions, is corrected by the operator, and this cycle continues until it achieves a high level of accuracy (a minimal sketch of this pattern follows the list below).
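The supervised pattern is the easiest of the four to see in code. Below is a minimal sketch using scikit-learn (a library chosen here purely for illustration; it is not mentioned in the interview): the labeled dataset plays the role of the operator’s known inputs and outputs, and accuracy on held-out examples measures how well the learned patterns generalize.

```python
# Minimal supervised-learning sketch (illustrative only; uses scikit-learn).
# The labeled dataset supplies the "desired inputs and outputs" described above;
# the model is trained against them and then scored on examples it has not seen.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # known inputs (X) and known outputs (y)

# Hold some labeled examples back so performance is judged on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)  # one of many possible supervised algorithms
model.fit(X_train, y_train)                # "taught by way of example"

predictions = model.predict(X_test)
print(f"Accuracy on held-out data: {accuracy_score(y_test, predictions):.2f}")
```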
Optimizing algorithms requires weighing a variety of factors, including data size, data quality, and the goals of the analysis. This optimization is challenging even for the most experienced data scientists, said Regensburger. It is difficult to predict in general how an algorithm will perform, so careful experimentation and analysis are required. Experimenting with many different approaches, while maintaining a common “test base” for comparing and evaluating performance, can be quite useful; a simple sketch of that comparison workflow follows.
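As a rough illustration of that workflow, the sketch below (again using scikit-learn purely as an example of tooling) fits several candidate algorithms on the same training data and scores them against a single shared test set, so that the choice between approaches rests on measured performance rather than guesswork.

```python
# Sketch: compare several candidate algorithms against one shared held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for a real problem; any labeled dataset would do.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Every candidate is trained on the same data and scored on the same test set,
# so differences in the numbers reflect the algorithms, not the data split.
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(f"{name}: accuracy = {model.score(X_test, y_test):.2f}")
```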
Regensburger stated that:
“Machine learning is advancing at full tilt, but approaches to managing data have held it back, failing the promise of the algorithm-driven enterprise. Until now. You can now make your data discoverable without physically moving or copying it. Data scientists can connect any tool directly and governance professionals can write condition-based policies that are dynamically applied to the data. The result? Less regulatory burden falls on the data scientist. Their access to data is streamlined. Meaning better, more accurate models are deployed faster, with greater durability, less business risk, and more powerful insights.”
Algorithms and Fairness
An algorithm is a series of specific steps designed to accomplish a task or goal. Food recipes are a good example of algorithms for humans. As with computer algorithms, a good recipe describes the specific steps needed to achieve the goal. A computer reads an algorithm, and then follows it exactly, providing results called outputs.
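As a trivial, made-up illustration (not taken from the interview), the function below spells this out: a fixed series of steps that turns an input into an output, followed the same way every time.

```python
# A tiny algorithm written as a Python function: a fixed series of steps
# that turns an input (a list of numbers) into an output (their average).
def average(numbers: list[float]) -> float:
    if not numbers:                  # step 1: refuse empty input
        raise ValueError("need at least one number")
    total = 0.0
    for n in numbers:                # step 2: add up every value
        total += n
    return total / len(numbers)      # step 3: divide the sum by the count

print(average([3.0, 4.0, 5.0]))      # output: 4.0
```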
Computer algorithms are often used as functions. These functions act as smaller programs that are referenced by larger programs. An image-viewing application, for example, would have a library of functions, with each using a specific algorithm to display different image formats. Spell-checking and search engines also use algorithms. As a general rule, most tasks performed by a computer use algorithms. In discussing algorithms, Regensburger told a story of a large internet retailer:
“They had a prototype system that they were working on to identify good candidates in their HR system. And what they found was that their recommendation system was biased against women. So, women’s colleges, traditionally women’s sports, or extracurriculars actually were down-weighted in their system and it was very hard for them to remove that baked-in bias. So that’s one of the challenges.”
He then mentioned the book Weapons of Math Destruction, which touches on many of these issues around identifying fairness in algorithms. It is very hard to remove that bias from processes, “and so, I think there’s this siren song that people are getting seduced by, that algorithms are going to solve all of our bias issues, and they’re not,” he remarked. He does believe, though, that Data Science, as it moves forward, will help people realize the potential for removing bias:
“There’s this whole other field of algorithmic fairness, and being able to make sure that the algorithms you’re using and developing are making decisions that are more fair, that reduce, or assess implicit biases, or a bias that comes from this very large historical record.”
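One concrete check from that field is demographic parity: comparing how often a model produces a favorable outcome for different groups. The sketch below is a hypothetical example (the predictions and group labels are invented, and this is only one of many fairness measures), but it shows how such an assessment can be reduced to a simple, auditable calculation.

```python
# Sketch: demographic parity difference, one simple measure of algorithmic fairness.
# Hypothetical model predictions (1 = favorable outcome) and group labels.
predictions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
groups      = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

def favorable_rate(preds, grps, group):
    """Fraction of members of `group` that received the favorable outcome."""
    outcomes = [p for p, g in zip(preds, grps) if g == group]
    return sum(outcomes) / len(outcomes)

rate_a = favorable_rate(predictions, groups, "A")
rate_b = favorable_rate(predictions, groups, "B")

# A large gap between the rates is one signal that the model's decisions may be
# systematically favoring one group over the other.
print(f"Group A: {rate_a:.2f}  Group B: {rate_b:.2f}  gap: {abs(rate_a - rate_b):.2f}")
```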
Data Governance Steps In
Digital data has been going through a Wild West phase for the last couple of decades. That is changing. Russia and China, for protectionist/national security reasons, are building barriers to global communications and commerce online. Europe, on the other hand, recently enacted the General Data Protection Regulation (GDPR) to protect the privacy of its citizens. There are dozens of countries developing their own laws as well.
Data Governance is becoming increasingly important, both for complying with such regulations and for gaining better control over all of an enterprise’s data assets. In the United States, numerous states are enacting their own laws, most notably California with the CCPA, which took effect in January 2020, and more are coming. On the issue of Data Governance, Regensburger remarked:
“You have data governors, whose responsibility is making sure that data is being used properly, that policies are coherently enforced across an organization, that people are following all pertinent regulations, and best practices. In an old system, what those data governors would have to do is essentially make sure policies were being enforced on every potential data silo within an organization.”
Such practices have now become a huge problem, he said. Someone now asks, “I have a hundred different data silos, and how do I make sure things are being enforced consistently across all those?” An organization has to have centralized access and control over all those disparate systems; machine learning, algorithms, and an integrated, automated data governance platform allow that to happen.
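To make the contrast concrete, here is a purely hypothetical sketch (not Immuta’s actual API) of the centralized pattern: one condition-based policy is defined once and applied uniformly to every data source, instead of being re-implemented, and slowly drifting, inside each silo.

```python
# Hypothetical sketch of centralized, condition-based policy enforcement.
# One policy definition is applied to every silo; this is illustrative only
# and does not represent any real product's API.

RESTRICTED_COLUMNS = {"ssn", "email", "date_of_birth"}

def apply_policy(record: dict, user_role: str) -> dict:
    """Mask restricted fields unless the requesting user holds a privileged role."""
    if user_role in {"governor", "privacy_officer"}:
        return record
    return {key: ("***MASKED***" if key in RESTRICTED_COLUMNS else value)
            for key, value in record.items()}

# Three "silos" represented as simple record lists; in practice these would be
# separate databases or warehouses sitting behind a unified access layer.
silos = {
    "hr_db":     [{"name": "Ana",  "ssn": "111-22-3333"}],
    "sales_db":  [{"name": "Ben",  "email": "ben@example.com"}],
    "claims_db": [{"name": "Cara", "date_of_birth": "1990-01-01"}],
}

for silo_name, records in silos.items():
    masked = [apply_policy(r, user_role="analyst") for r in records]
    print(silo_name, masked)
```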
What Immuta Does
The Immuta Automated Data Governance Platform creates trust across security, legal, compliance, and business teams so they can work together to ensure timely access to critical business data with minimal risk. Its automated, scalable, no-code approach makes it easy for users across an organization to access the data they need on demand, while protecting privacy and enforcing regulatory policies on all data. Describing Immuta, Regensburger said:
“One of the big problems within any analytics-driven enterprise is getting to the data. It becomes a real nightmare. You have to go through a stratified level of authorizations to get to it. But Immuta has streamlined all of that.”
At the same time, they have worked through some of the real challenges of how to automate Data Governance, enabling Data Science to go from a proof of concept to production in a streamlined and responsible manner. Organizations have to deal with a lot of the ethical and regulatory challenges around privacy to be able to practice Data Science in a more robust way. Immuta is helping solve that, he noted:
“I’ve been running our research team for about a year now, mostly looking at developing new privacy-enhancing technology for the platform. There are a lot of tools that help you find patterns, a lot of tools that allow you to store data. But they don’t necessarily give you insight or aid in enhancing privacy and utilizing data.”
Immuta has the necessary components to deal with privacy and risk, along with effective governance, on a platform that helps data scientists and analysts do their work reliably in a unified and coordinated manner.