A Complete Guide to Weak Supervision in Machine Learning
A machine learning model performs well when its training data covers the domain the model is designed for and is structured in the way the model expects. Most available data, however, is unstructured or only loosely structured, and annotating it by hand is costly. Weak supervision addresses this problem, and it is especially useful when the data is annotated but the annotations are of poor quality. In this article, we look at weak supervision in detail, along with the approaches and strategies for applying it. The main points covered in this article are listed below.
- What is weak supervision?
- Evolution of weak supervision
- Problems with labeled training data
- How do I get more labeled training data?
- Weak label types
- Basic system features to support weak supervision
Let’s start by understanding weak supervision.
What is weak supervision?
Weak supervision is a branch of machine learning in which noisy or imprecise sources are used to provide hints for labeling large amounts of unlabeled data, so that the data can then be used for supervised learning. More formally, these hints act as a supervision signal for labeling unlabeled data. Because hand-labeled datasets are expensive and time-consuming to obtain, this approach tries to reduce manual labeling effort: labels are provided for some of the data, and that data is then used to assign labels to the rest.
This matters especially in natural language processing, where much of the data is domain-specific and a pre-trained model often does not work well on a specific domain. In such cases, weak supervision helps improve model performance on that domain, since making the data suitable for modeling otherwise takes a great deal of effort, time, and money. When structuring a dataset, we can distinguish three levels of annotation. If the data is strongly annotated, we can proceed directly to modeling, using supervised learning if the dataset is large, or unsupervised learning and transfer learning if it is small. If the data is not annotated, we follow unsupervised procedures such as clustering or PCA. Between these two levels sits weakly annotated data, which is where weak supervision applies.
Evolution of weak supervision
At first, AI focused mainly on expert systems, which combined a knowledge base built by subject matter experts (SMEs) with an inference engine. In the middle period of AI, models began to complete tasks using labeled data, in a more powerful and flexible way. This is when the classic ML approaches were introduced, which brought in the knowledge of domain experts mainly in two ways. The first is to provide the model with a small amount of data hand-labeled by domain experts, and the second is to provide hand-engineered features that turn the raw data into a representation the model can process.
In the modern era, deep learning projects are booming thanks to their ability to learn representations across many domains and tasks. These models not only reduce the need for feature engineering, but many systems have also been built to label data automatically. Snorkel, for example, is a system that supports and explores this kind of interaction with machine learning. The system asks the user only for labeling functions, which are black-box snippets of code that label subsets of the unlabeled data. Weak supervision has thus evolved from a basic form to an advanced one, and people are still finding new ways to improve it.
Problems with labeled training data
Here are the main issues with labeled training data:
Insufficient amount of labeled data
In the early stages of training, machine learning models depend on labeled data, and the problem is that most of the available data is unlabeled or the labeled portion is too small to train the model well. Obtaining enough training data can be impractical, expensive, or time-consuming.
Insufficient subject matter expertise to label data
Labeling unlabeled data requires a person or team of subject matter experts. Even when such experts are available, manual labeling takes a lot of time, and the cost of the SMEs has to be included as well, which can make the process impractical.
Insufficient time to label and prepare data
Before a machine learning model is applied to any data, the data must be preprocessed for best performance. In real-world settings we often have a lot of data, but not all of it is ready to be fed to a model. Preparing accurate data quickly, in the form the model needs, is almost impossible.
To overcome all of these issues, we need robust and reliable approaches to data labeling, which is a significant part of data preprocessing.
How do I get more labeled training data?
The most traditional way to get labeled data is to hire subject matter experts (SMEs) to label it. With large unlabeled datasets, however, this becomes too expensive and difficult for one person or a group of people. To reduce the effort in such a scenario, there are three main approaches:
- Active learning – the main objective of active learning is to have experts label only the data points that are most valuable to the model. In other words, we select which new data points need to be labeled. For example, in sentiment analysis, sentences expressing anger may lie close to the model's decision boundary, and in that case we ask the SMEs to label only those sentences. Alternatively, we can apply weaker supervision to just these data points, so that active learning and weak supervision complement each other.
- Semi-supervised learning – the main objective of this approach is to use a small set of labeled data together with a large set of unlabeled data, relying on high-level assumptions such as smoothness, low-dimensional structure, or distance metrics to leverage the unlabeled data. These assumptions help reduce the effort required from SMEs. We use these approaches when unlabeled data is available cheaply and in large quantities. Generative approaches, such as generative adversarial networks and heuristic transformation models, help regularize decision boundaries.
- Transfer learning – the main objective of this approach is to take an already trained model and apply it to the data we have. A model trained on a different dataset can be applied to ours if the two datasets are similar. A common approach in deep learning today is to train a model on a large dataset, then fine-tune it and use it for the task of interest.
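The active learning idea in the list above can be sketched in a few lines. This is a minimal illustration of uncertainty sampling, not a full active learning loop: the probabilities are made-up model outputs, and the names `uncertainty` and `select_for_labeling` are hypothetical helpers introduced only for this example.

```python
def uncertainty(prob_positive):
    """Score how close a prediction is to the decision boundary.

    Returns 1.0 at p = 0.5 (most uncertain) and 0.0 at p = 0 or p = 1.
    """
    return 1.0 - 2.0 * abs(prob_positive - 0.5)


def select_for_labeling(probabilities, budget):
    """Return the indices of the `budget` most uncertain examples."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: uncertainty(probabilities[i]),
        reverse=True,
    )
    return ranked[:budget]


# Assumed model outputs: P(positive sentiment) for five unlabeled sentences.
probs = [0.97, 0.52, 0.10, 0.45, 0.80]

# Only these sentences would be sent to the SMEs (or to weaker supervision).
to_label = select_for_labeling(probs, budget=2)
print(to_label)
```

With a labeling budget of two, only the sentences whose predictions sit nearest 0.5 are selected, and the confidently classified ones are left alone.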
The three approaches just listed certainly help reduce data labeling effort, and, as the figure accompanying this article shows, weak supervision helps cover the downsides of each of them. Weak labels can be categorized by type as follows.
Weak label types
There are three main types of weak labels:
Imprecise or inexact labels: this type of label can be obtained through an active learning approach, where subject matter experts give developers less precise labels for the data. Developers can then use these weak labels to create rules, define distributions, or apply other constraints on the training data.
Inaccurate labels: this type of label can be obtained through semi-supervised learning, where the labels on a dataset are lower-quality labels obtained by inexpensive means such as crowdsourcing. Such labels are numerous but not perfectly accurate, and developers can use them to regularize the decision boundaries of the model.
Existing labels: this type of label can be obtained from existing resources such as knowledge bases, alternative training data, or the data used to train a pre-trained model. Developers can use these labels, but they are not perfectly applicable to the task at hand. In such a scenario, using a pre-trained model is beneficial.
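To make the "existing labels" case concrete, here is a rough sketch of labeling text from a knowledge base. Everything in it is an assumption made for illustration: the tiny `KNOWN_HQ` knowledge base, the company and city names, and the `weak_label` helper are hypothetical and not taken from any real resource.

```python
# Hypothetical knowledge base of (company, headquarters city) pairs.
KNOWN_HQ = {("Acme", "Berlin"), ("Globex", "Austin")}


def weak_label(sentence, company, city):
    """Label a sentence as expressing 'company is headquartered in city'.

    Returns 1 when both names appear in the sentence and the pair is in
    the knowledge base, otherwise None (abstain).
    """
    mentioned = company in sentence and city in sentence
    if mentioned and (company, city) in KNOWN_HQ:
        return 1
    return None


candidates = [
    ("Acme opened its Berlin headquarters in 2010.", "Acme", "Berlin"),
    ("Acme staff visited Austin last week.", "Acme", "Austin"),
    ("Globex is hiring in Austin.", "Globex", "Austin"),
]

labels = [weak_label(*c) for c in candidates]
print(labels)
```

The third sentence receives a positive label even though it says nothing about a headquarters. That is the kind of mismatch described above: labels borrowed from an existing resource are cheap, but they do not fit the task perfectly.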
Basic system features to support weak supervision
So far we have seen what weak supervision involves, which makes it easy to identify the features that a system designed to support it should have. Labeling data with a function can give noisy output, so we need functions to label the data and models to determine how accurate that labeling is. Such a system can have three features:
- A labeling function to provide labels to unlabeled data.
- A model for learning the accuracy of the labeling.
- A model that can generate the set of training labels.
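The three features can be sketched as follows. This is a simplified illustration under stated assumptions, not how any particular system implements them: the keyword-based labeling functions are invented for the example, a plain majority vote stands in for a learned model that generates training labels, and agreement with that vote is used as a crude proxy for each function's accuracy.

```python
from collections import Counter

ABSTAIN, NEG, POS = -1, 0, 1


# 1. Labeling functions (hypothetical keyword heuristics for sentiment).
def lf_contains_great(text):
    return POS if "great" in text.lower() else ABSTAIN


def lf_contains_terrible(text):
    return NEG if "terrible" in text.lower() else ABSTAIN


def lf_exclamation(text):
    return POS if text.endswith("!") else ABSTAIN


LFS = [lf_contains_great, lf_contains_terrible, lf_exclamation]


def apply_lfs(texts):
    """Build the label matrix: one row per text, one column per function."""
    return [[lf(t) for lf in LFS] for t in texts]


# 3. Training-label generator (majority vote as a stand-in).
def majority_vote(votes):
    """Return the most common non-abstain vote, or ABSTAIN on none or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# 2. Accuracy estimate: agreement of each function with the consensus.
def estimate_accuracies(matrix, consensus):
    accuracies = []
    for j in range(len(LFS)):
        pairs = [
            (row[j], c)
            for row, c in zip(matrix, consensus)
            if row[j] != ABSTAIN and c != ABSTAIN
        ]
        agree = sum(v == c for v, c in pairs)
        accuracies.append(agree / len(pairs) if pairs else None)
    return accuracies


texts = [
    "Great phone, great battery",
    "Terrible service!",
    "A great day!",
    "It arrived on Tuesday",
]
matrix = apply_lfs(texts)
consensus = [majority_vote(row) for row in matrix]
accuracies = estimate_accuracies(matrix, consensus)
print(consensus)
print(accuracies)
```

The second text shows why the output is noisy: two functions disagree, so the vote abstains. A real system would learn how much to trust each function instead of counting votes equally.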
In this article, we saw what weak supervision is and how it has evolved in three stages. We also looked at the problems with obtaining labeled data for modeling, the approaches for getting data labeled, and the main features a system should have to support weak supervision.