In my previous blog post, I described some concrete techniques, surveyed some early approaches to artificial intelligence (AI), and found that they still offer attractive opportunities for improving the user experience. In this post, we’ll look at some more mathematical and algorithmic approaches to creating usable business intelligence from big piles of data.
Regression Analysis
Regression analysis is a technique that predates machine learning but can often be used to perform many of the same kinds of tasks and answer many of the same kinds of questions. It can be viewed as an early approach to machine learning, in that it reduces the process of determining whether meaningful relationships exist in data to a mechanical calculation.
The basic idea of regression analysis is that you start with a bunch of data points and want to predict one attribute of those data points based on the other attributes. For instance, we might want to predict for a given customer the amount of a loan they might like to request at a particular time, or whether some marketing strategy may or may not be effective, or other quantifiable aspects of the customer’s potential future behavior.
Next, you choose a parameterized class of functions that relate the dependent variable to the independent variables. A common and useful class of functions, and one that can be used in the absence of more specific knowledge about underlying relationships in the data, is the class of linear functions of the form f(x) = a + bx. Here, f is a function with parameters a and b, which takes the vector x representing the independent variables belonging to a data point and maps that vector to the corresponding predicted value of the dependent variable.
Once a parameterized class of functions has been chosen, the last step before performing the regression is to identify an appropriate distance metric to measure the error between values predicted by the curve of best fit and the data on which that curve is trained. If we choose linear functions and, as our error measure, the squared vertical differences between the line and the sample points, we get the ubiquitous least-squares linear regression technique. Other classes of functions – polynomial, logistic, sinusoidal, exponential – may be appropriate in some contexts, just as other distance metrics – such as absolute value rather than squared value – may give results that represent a better fit in some applications.
Once the hyperparameters (selection of dependent variable, class of functions, and distance metric) for the regression problem have been chosen, the optimal parameter values can be solved for using a combination of manual analysis and computer calculation. These optimal parameters identify a particular function belonging to the parameterized class that fits the available data points more closely than any other function in the class, according to the chosen distance metric. Measures of goodness of fit – such as the correlation coefficient and the chi-squared statistic – can help us answer not only how closely our curve matches the training data, but also whether we have “overfit” that data – that is, whether our curve is capturing noise in the training data, so that a simpler curve would generalize to new data nearly as well or better.
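To make the least-squares case concrete, here is a minimal sketch in Python using NumPy; the tenure and loan-amount figures are invented purely for illustration.

```python
import numpy as np

# Hypothetical data: customer tenure in months (independent variable)
# and requested loan amount in thousands (dependent variable).
tenure = np.array([3, 6, 12, 24, 36, 48], dtype=float)
loan_amount = np.array([5.0, 7.5, 11.0, 19.0, 26.5, 35.0])

# Fit f(x) = a + b*x by minimizing the sum of squared vertical differences.
b, a = np.polyfit(tenure, loan_amount, deg=1)  # coefficients, highest degree first

# One measure of goodness of fit: the correlation coefficient.
r = np.corrcoef(tenure, loan_amount)[0, 1]

print(f"f(x) = {a:.2f} + {b:.2f}x, r = {r:.3f}")
print("predicted loan amount at 18 months:", round(a + b * 18, 2))
```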
Classification

Often, the dependent variables we care about do not vary over a continuous range of values. For instance, we might be interested only in whether we should expect that some new data point will or won’t have some characteristic. In other cases, we might want to label new data points with what we expect to be accurate labels drawn from some relatively small, fixed set of labels. For example, we might want to assign a customer to one of several processing queues depending on what we expect that customer’s needs to be.
While regression analysis can still be used in these scenarios – by fitting some curves and assigning ranges of values of the dependent variable to fixed labels – so-called classification techniques can also be used. One benefit of using classification approaches, where possible, is that these techniques can find relationships that may not be analytically tractable – that is, relationships that could be hard to describe using parameterized classes of analytic functions.
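As a rough sketch of that curve-plus-thresholds idea, the snippet below fits a logistic curve with scikit-learn and maps ranges of its output to two labels; the usage and renewal figures are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours of product usage per week, and whether the
# customer renewed their subscription (1 = yes, 0 = no).
hours = np.array([[0.5], [1.0], [2.0], [3.0], [6.0], [8.0], [10.0], [12.0]])
renewed = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# Fit a logistic curve, then assign ranges of its output to fixed labels:
# predicted probability >= 0.5 becomes "will renew", otherwise "won't renew".
model = LogisticRegression().fit(hours, renewed)
prob = model.predict_proba([[7.0]])[0, 1]
label = "will renew" if prob >= 0.5 else "won't renew"
print(f"P(renew | 7 hours/week) = {prob:.2f} -> {label}")
```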
One popular approach to classification involves constructing decision trees based on the training data that, at each stage of branching, seek to maximize the achieved information gain, in the information-theoretic sense.
As a very simple example, suppose the training data set consists of data points that give a person’s name, whether they graduated from high school, and whether they are currently employed. Our training data set might look like (John, yes, yes), (Jane, yes, yes), (John, no, no). If we want to construct a decision tree to aid in determining whether new individuals are likely to be employed based on their name and high-school graduation status, we should choose to split first on graduation status, because doing so splits the sample space into two groups that are most distinct with respect to the dependent variable: one group has 100% yes and the other has 100% no. Had we branched on names first, we would have had one group with 50% yes and 50% no, and another with 100% yes – these groups are less distinct.
In more complicated scenarios, branching would continue at each level for as long as groups could still be meaningfully split into increasingly distinct subgroups, and would stop once no such split remained. The resulting decision tree gives a method by which new samples can be classified: simply find where they fit in the tree according to their characteristics.
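To show how the split on graduation status wins out, here is a small information-gain calculation in Python over the toy data set above; the entropy and gain helpers are written out for illustration rather than taken from any particular library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr_index, labels):
    """Reduction in entropy from splitting the rows on one attribute."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# The toy data set from above: (name, graduated), with employment as the label.
rows = [("John", "yes"), ("Jane", "yes"), ("John", "no")]
employed = ["yes", "yes", "no"]

print("gain from splitting on name:      ", information_gain(rows, 0, employed))  # ~0.25 bits
print("gain from splitting on graduation:", information_gain(rows, 1, employed))  # ~0.92 bits
```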
Another approach to classification involves attempting to split the training dataset in two by finding a hyperplane that best separates samples with different labels. When there are only two independent variables, the data lives in a plane and the separating hyperplane is just an ordinary line.
For instance, suppose our training dataset consists of types of trees and coordinates in a large field where those trees grow. The data points might be (1, 1, apple), (2, 1, apple), (1, 2, apple), (4, 1, pear), (1, 4, pear) and (4, 4, pear). A line with equation y = 4 – x separates all the apple trees from all the pear trees, and we could use that line to predict whether a new tree is more likely to be an apple or a pear tree by checking which side of the line it is on. Finding the best such hyperplane can be reduced to a quadratic programming problem and solved numerically.
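That quadratic program is exactly what a support vector machine solves; a minimal sketch using scikit-learn’s linear SVC on the toy orchard data might look like the following (the large C value approximates a hard margin, and the coordinates are the ones from the example above).

```python
from sklearn.svm import SVC

# The toy orchard data from above: (x, y) coordinates and the tree type.
X = [[1, 1], [2, 1], [1, 2], [4, 1], [1, 4], [4, 4]]
labels = ["apple", "apple", "apple", "pear", "pear", "pear"]

# A linear SVM solves the underlying quadratic programming problem for us.
clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
clf.fit(X, labels)

w = clf.coef_[0]
b = clf.intercept_[0]
print(f"separating line: {w[0]:.2f}*x + {w[1]:.2f}*y + {b:.2f} = 0")
print("prediction for a tree at (3, 3):", clf.predict([[3, 3]])[0])
```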
Clustering
The approaches to data analysis and data mining we’ve looked at so far can be considered examples of supervised machine learning: they are supervised in the sense that we (humans) label the training data set for the computer, and the computer can learn the relationships by trusting our labels. You may be wondering what kinds of problems and approaches are suited to unsupervised machine learning, for cases where we don’t know how to meaningfully label the data ourselves. Clustering is one way to uncover potentially useful relationships in data that we might not even have known to look for.
Given a bunch of data points, clustering seeks to divide the sample space into groups – or clusters – where members of each cluster are more similar to each other than they are to members of other clusters, based on their characteristics. A bottom-up approach to clustering is to make every data element a cluster initially, and then iteratively combine the two closest clusters into a single cluster, until you end up with just one cluster. This creates a tree that defines sets of increasingly fine-grained clusters at lower levels of the hierarchy. A top-down approach might start with a single cluster and iteratively split it by separating out the data element that is most different from the average element in the cluster and moving the data points close to that element into the new cluster. Other approaches, such as k-means, instead fix the number of clusters up front and iteratively refine cluster assignments, using heuristics to improve the performance of the clustering process.
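As a minimal sketch of the bottom-up approach, SciPy’s hierarchical clustering routines build exactly this kind of merge tree; the two-feature data points below are invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: each row describes a customer by two numeric features,
# say average monthly spend and number of visits (invented for illustration).
points = np.array([
    [1.0, 1.2], [1.1, 0.9], [0.9, 1.0],   # one natural group
    [5.0, 5.1], [5.2, 4.8], [4.9, 5.3],   # another natural group
])

# Bottom-up (agglomerative) clustering: repeatedly merge the two closest
# clusters until a single cluster remains, recording the merge tree.
tree = linkage(points, method="single", metric="euclidean")

# Cut the tree to recover two clusters.
cluster_ids = fcluster(tree, t=2, criterion="maxclust")
print(cluster_ids)  # e.g. [1 1 1 2 2 2]
```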
We’ve seen how traditional mathematical, statistical, and algorithmic techniques can be used to analyze data and derive useful information about the relationships in that data. All of these techniques, and many like them, are easily automated and take the human more or less out of the loop of figuring out the relationships of interest.
These techniques, however, are still inherently constrained by the imagination and intelligence of the humans employing them: Performing a linear regression will always give you the equation of a line, even if the underlying relationships are non-linear; clustering will only cluster by the chosen distance metric, not by one that may be more natural for the given dataset; and so on. Even so, the advances being made in machine learning and artificial intelligence are incredibly exciting, and I look forward to the next developments our industry will make.