Supervised machine learning (ML) models, such as classification models trained to predict the outcome of an instance (for example, approve or deny a loan), are required to maintain high accuracy in a production environment – maximizing true positives and true negatives while minimizing false positives and false negatives. To maintain optimal performance, these models need to be retrained on a regular basis to avoid performance degradation.
One common method to retrain ML models is to have new datasets labeled by human investigators, but the amount of data that can be labeled this way is constrained by the cost and time labeling takes. The common questions that arise are: how much will it cost to get a new dataset labeled, and is there a faster, cheaper, and better way to label data? The answer is active learning, a subset of machine learning that selectively picks data points to be labeled for optimal model improvement.
What Is Active Learning?
Active learning is a subset of machine learning in which a learning algorithm can query a user interactively to label data with the desired outputs. In active learning, the algorithm selects a subset of examples to be labeled by human annotators instead of labeling an entire dataset. This subset is often the data points near the decision boundary, where the model typically struggles to decide which class a specific data point belongs to, thereby degrading its performance. By selecting only a subset of data for labeling, the algorithm reduces the overall cost and time to label new data points for model retraining without compromising on performance improvements.
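The boundary-focused selection described above can be sketched as uncertainty sampling: rank unlabeled points by how close the model's predicted probability is to 0.5 and send the closest ones to annotators. The `predict_proba` function below is a hypothetical stand-in for any binary classifier, not a specific library API.

```python
import math

def predict_proba(x):
    # Hypothetical binary classifier: probability of the positive class
    # for a single numeric feature x (a logistic curve centered at x = 2.5).
    return 1 / (1 + math.exp(-(0.8 * x - 2.0)))

def select_uncertain(unlabeled, k):
    """Return the k points whose predicted probability is nearest the 0.5 boundary."""
    return sorted(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))[:k]

pool = [0.0, 1.0, 2.0, 2.5, 3.0, 5.0, 8.0]
print(select_uncertain(pool, 3))  # the three points closest to the boundary
```

In practice the scoring could use entropy or margin instead of distance from 0.5; the idea is the same — label where the model is least sure.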
Below are categories of active learning:
- Pool-based sampling – The algorithm is trained on a labeled dataset and then used to pick a subset of unlabeled data to be labeled by human annotators.
- Membership query synthesis – The algorithm generates new data points for labeling. These data points are synthesized from an underlying natural distribution and then sent to human annotators.
- Stream-based selective sampling – Every unlabeled data point is investigated one at a time by measuring the information gain by each data point. Based on the information gain, the algorithm decides on getting a human-annotated label or not.
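Of the three categories above, stream-based selective sampling is the easiest to sketch: each point arrives once, and the algorithm requests a human label only when its uncertainty exceeds a threshold. The entropy-based score and the 0.9 threshold below are illustrative assumptions, not a fixed recipe.

```python
import math

def uncertainty(prob):
    """Binary entropy of a predicted probability, in bits (0 = certain, 1 = maximally uncertain)."""
    if prob in (0.0, 1.0):
        return 0.0
    return -(prob * math.log2(prob) + (1 - prob) * math.log2(1 - prob))

def stream_select(prob_stream, threshold=0.9):
    """Yield indices of incoming points uncertain enough to warrant a human label."""
    for i, p in enumerate(prob_stream):
        if uncertainty(p) > threshold:
            yield i

# Predicted positive-class probabilities for five points arriving in a stream.
probs = [0.05, 0.48, 0.92, 0.60, 0.99]
print(list(stream_select(probs)))  # → [1, 3]
```

Only the two points whose probabilities sit near 0.5 cross the threshold; confident predictions pass through unlabeled, which is what keeps the labeling budget small.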
How Can I Evaluate Results from Active Learning?
An A/B test can be conducted with a control and a test arm for model retraining. The control arm retrains the model using the traditional method of labeling the entire dataset, while the test arm retrains it using active learning with x data points labeled in each of n iterations. After every iteration of gathering labels from active learning, the model is retrained with these new labels and its performance is compared with the control arm's. The test arm experiment is iterated until it achieves similar or better performance than the control arm. Based on the number of retraining iterations and the number of data points labeled, we can evaluate whether active learning is more effective in terms of cost, time, and quality than traditional labeling of the full dataset.
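The cost side of that comparison is simple arithmetic. The sketch below contrasts the two arms' labeling bills; the dataset size, per-label cost, batch size, and iteration count are all illustrative assumptions, chosen only to show the shape of the calculation.

```python
# Assumed figures for a back-of-the-envelope comparison of the two arms.
COST_PER_LABEL = 0.50   # dollars per human annotation (assumption)
DATASET_SIZE = 20_000   # control arm labels the entire dataset
BATCH_SIZE = 500        # test arm: x data points labeled per iteration
ITERATIONS = 8          # test arm: n retraining iterations

control_cost = DATASET_SIZE * COST_PER_LABEL
test_cost = BATCH_SIZE * ITERATIONS * COST_PER_LABEL

print(f"control arm: ${control_cost:,.2f}")                # $10,000.00
print(f"test arm:    ${test_cost:,.2f}")                   # $2,000.00
print(f"savings:     {1 - test_cost / control_cost:.0%}")  # 80%
```

If the test arm reaches parity with the control arm within those n iterations, the savings are realized; if it needs many more iterations, the advantage shrinks, which is exactly what the A/B test is designed to reveal.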
Conclusion
While there is ongoing research in this space, such as multi-armed bandit-based active learning, it is essential to do due diligence when experimenting with various active learning methods for model performance improvement versus traditional approaches, in order to weigh the costs and benefits of each.