Welcome to the seventh episode of our Guided Labeling Blog Series. In the last six episodes, we covered active learning and weak supervision theory. Today, we would like to present a practical example of implementing weak supervision via a guided analytics workflow.
The other episodes are here:
- Guided Labeling Episode 1: An Introduction to Active Learning
- Guided Labeling Episode 2: Label Density
- Guided Labeling Episode 3: Model Uncertainty
- Guided Labeling Episode 4: From Exploration to Exploitation
- Guided Labeling Episode 5: Blending Knowledge with Weak Supervision
- Guided Labeling Episode 6: Comparing Active Learning with Weak Supervision
A Document Classification Problem
Let’s assume you want to train a document classifier: a supervised machine learning model that predicts a category for each of your unlabeled documents. Such a model is needed, for example, when dealing with large collections of unlabeled medical records, legal documents, or spam emails, a recurring problem across several industries.
In our example, we will:
- Build an application able to digest any kind of document
- Transform the documents into bags of words (sketched in the example after this list)
- Train a weak supervision model using a labeling function provided by the user
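A minimal sketch of the bag-of-words step in Python might look like the following, assuming scikit-learn is available. The toy corpus is hypothetical; in the actual application this transformation happens inside the workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the uploaded documents.
documents = [
    "The movie was a masterpiece, truly great acting.",
    "Terrible plot, I want my two hours back.",
    "Great soundtrack but a boring story overall.",
]

# Turn each document into a bag of words: a vector of term counts.
vectorizer = CountVectorizer(stop_words="english")
bow_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the vocabulary (one column per term)
print(bow_matrix.toarray())                # one row of term counts per document
```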
We would not need weak supervision if we had labels for each document in our training set. Since our document corpus is unlabeled, however, we will use weak supervision and create a web-based application that asks the document expert to provide heuristics (labeling functions).
Labeling Function in Document Classification
What kind of labeling function should we use for this weak supervision problem?
Well, we need a heuristic, a rule, that looks for something in the text of a document and, based on that, applies a label to the document. If the rule does not find any matching text, it can leave the label missing.
As a quick example, let’s imagine we want to perform sentiment analysis on movie reviews and label each review as either “positive (P)” or “negative (N).” Each movie review is then a document, and we need to build a somewhat accurate labeling function to label certain documents as “positive (P).” A practical example is pictured in Figure 2 below.
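To make this concrete, here is a minimal sketch of such a labeling function in Python. It is a hypothetical example, not the exact rule shown in Figure 2: it labels a review “P” when it finds a positive keyword and abstains (returns a missing label) otherwise.

```python
import re
from typing import Optional

# Hypothetical keyword list; a real labeling function would come from
# the document expert's domain knowledge.
POSITIVE_KEYWORDS = {"great", "masterpiece", "wonderful", "loved"}

def lf_positive_keywords(review: str) -> Optional[str]:
    """Label a review as positive ('P') if it contains a positive keyword.

    Returns None (a missing label) when the rule does not match, so that
    the label model can simply ignore this function for that document.
    """
    words = set(re.findall(r"[a-z]+", review.lower()))
    return "P" if words & POSITIVE_KEYWORDS else None
```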
By providing many labeling functions like the one in Figure 2, it is possible to train a weak supervision model that detects sentiment in movie reviews. The input of the label model (Figure 1) would be similar to the table shown in Figure 3 (below). As you can see, no feature data is attached to this table: it contains only the outputs of the labeling functions on all available training data.
Once the labeling functions are provided, it only takes a few moments to apply them to thousands of documents and feed the resulting table to the label model (Figure 4 below).
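Reusing the toy corpus and the labeling function from the sketches above, building a Figure 3-style table amounts to running every labeling function over every document. The second labeling function below is, again, hypothetical.

```python
import pandas as pd

NEGATIVE_KEYWORDS = {"terrible", "boring", "awful", "waste"}

def lf_negative_keywords(review):
    """A second hypothetical rule: label 'N' on a negative keyword, else abstain."""
    words = set(re.findall(r"[a-z]+", review.lower()))
    return "N" if words & NEGATIVE_KEYWORDS else None

labeling_functions = [lf_positive_keywords, lf_negative_keywords]

# One row per document, one column per labeling function (as in Figure 3);
# None marks the documents where a function abstained.
weak_labels = pd.DataFrame(
    {lf.__name__: [lf(doc) for doc in documents] for lf in labeling_functions}
)
print(weak_labels)
```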
Guided Analytics with Weak Supervision on the WebPortal
In order to enable the document expert to create a weak supervision model, we can use Guided Analytics. Using a web-based application that offers a sequence of interactive views, the user can:
- Upload the documents
- Define the possible labels the final document classifier should predict
- Input the labeling functions
- Train the label model
- Train the discriminative model
- Assess the performance
We created a blueprint for this kind of application in a sequence of three interactive views, as shown in Figure 5 (below). The generated web-based application can be accessed via any web browser in the WebPortal.
This application is implemented as a workflow (Figure 6 below), currently available on the Hub. The workflow uses the Weak Supervision extension to train the label model with a Weak Label Model Learner node, and a Gradient Boosted Trees Learner node to train the discriminative model. Besides the Gradient Boosted Trees algorithm, other algorithms are also available and can be used in conjunction with the Weak Label Model nodes (Figure 6).
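Outside the workflow, the same two-step pattern can be sketched in a few lines of Python. Everything below is illustrative: the label model is deliberately simplified to a majority vote over non-abstaining sources (the Weak Label Model Learner node uses a more sophisticated generative model that also estimates source accuracies), and the features and weak labels are random stand-ins.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical inputs: bag-of-words features X and a numeric weak label
# matrix L (one column per labeling function; -1 marks an abstention).
rng = np.random.default_rng(0)
X = rng.random((100, 20))                  # stand-in for the bag-of-words matrix
L = rng.choice([-1, 0, 1], size=(100, 3))  # stand-in for the weak label matrix

# Step 1 - label model, simplified here to a majority vote over the
# labeling functions that did not abstain for each document.
def majority_vote(row):
    votes = row[row != -1]
    return -1 if votes.size == 0 else np.bincount(votes).argmax()

y_weak = np.apply_along_axis(majority_vote, 1, L)

# Step 2 - discriminative model, trained only on the documents that
# actually received a label from the label model.
mask = y_weak != -1
clf = GradientBoostingClassifier().fit(X[mask], y_weak[mask])
```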
When Does Weak Supervision Work?
In this episode of our Guided Labeling Blog Series, we have shown how to use weak supervision for document classification. We have described a single use case here, but the same approach can be applied to images, tabular data, multiclass classification, and many other scenarios. As long as your domain expert can provide the labeling functions, the open source Analytics Platform can provide a workflow to be deployed on the Server and make it accessible via the WebPortal.
What are the requirements for the labeling functions/sources in order to train a good weak supervision model?
- Moderate Number of Label Sources: The label sources need to be sufficient in number; depending on the use case, as many as 100 may be needed.
- Label Sources Are Uncorrelated: Currently, the implementation of the label model does not account for strong correlations between sources, so it is best if your domain expert does not provide labeling functions that depend on one another.
- Sources Overlap: The labeling functions/sources need to overlap in order for the algorithm to detect patterns of agreement and conflict. If the labeling sources provide labels for sets of samples that do not intersect, the weak supervision approach will not be able to estimate which source should be trusted.
- Sources Are Not Too Sparse: If all labeling functions label only a small percentage of the total number of samples, model performance will suffer.
- Sources Are Better Than Random Guessing: This is an easy requirement to satisfy. It should be possible to create labeling functions simply by writing down, as rules, the logic a human would apply when labeling manually.
- No Adversarial Sources Allowed: Weak supervision is considerably more flexible than other machine learning strategies when dealing with noisy labels, as long as the weak label sources are better than random guessing. It is not, however, flexible enough to deal with weak sources that are consistently wrong, which can happen when one of the labeling functions is faulty and therefore performs worse than random guessing. When collecting weak label sources, it is more important to focus on spotting these “bad apples” than on decreasing the overall noise in the Weak Label Sources Matrix.
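Several of these requirements can be checked empirically before training the label model. The sketch below is a hypothetical diagnostic for a numeric weak label matrix like the one in the earlier example (with -1 marking abstentions): it reports per-source coverage (sparsity), pairwise overlap, and the conflict rate on the overlapping samples.

```python
import numpy as np

def source_diagnostics(L: np.ndarray) -> None:
    """Quick checks on a weak label matrix L (n_samples x n_sources, -1 = abstain)."""
    n_samples, n_sources = L.shape
    voted = L != -1

    # "Sources are not too sparse": fraction of samples each source labels.
    print("coverage per source:", voted.mean(axis=0))

    # "Sources overlap" and agreement/conflict between each pair of sources.
    for i in range(n_sources):
        for j in range(i + 1, n_sources):
            both = voted[:, i] & voted[:, j]
            overlap = both.mean()
            conflict = (L[both, i] != L[both, j]).mean() if both.any() else float("nan")
            print(f"sources {i} and {j}: overlap={overlap:.2f}, conflict={conflict:.2f}")
```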
Looking Ahead
In the upcoming final episode of the Guided Labeling Blog Series, we will look at how to combine active learning and weak supervision in a single, interactive Guided Analytics application.