Active Learning

Active Learning is a technique to reduce the amount of data you have to label manually. Instead of labeling all data up front, we label only a small number of samples and immediately train a model. This model then selects the most relevant data to label next, and a new model is trained on the expanded labeled set.
This process is repeated until a satisfactory result is reached.
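The selection step usually relies on model uncertainty. As a library-independent illustration (the function `least_confidence_sample` and the probabilities below are invented for this sketch, not part of the autotransformers API), least-confidence sampling picks the samples whose most likely class has the lowest predicted probability:

```python
# Minimal sketch of least-confidence acquisition (illustrative only;
# the names below are not part of the autotransformers API).

def least_confidence_sample(probs, acquisition_size):
    """Pick the indices whose top predicted probability is lowest.

    probs: one list of per-class probabilities per unlabeled sample.
    """
    # Confidence of a sample = probability of its most likely class.
    confidences = [max(p) for p in probs]
    # Rank samples from least to most confident.
    ranked = sorted(range(len(probs)), key=lambda i: confidences[i])
    return ranked[:acquisition_size]

# Example: model predictions for four unlabeled samples.
probs = [
    [0.98, 0.01, 0.01],  # very confident -> not informative
    [0.40, 0.35, 0.25],  # uncertain -> worth labeling
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],  # most uncertain
]
print(least_confidence_sample(probs, acquisition_size=2))  # [3, 1]
```

The two least confident samples are selected; labeling these tends to be more informative than labeling samples the model already classifies with high confidence.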
[ ]:
from copy import deepcopy
from autotransformers import AutoTransformer, DatasetLoader
from autotransformers.utils.misc import merge_subsets, random_split

Loading the googleplay dataset:

[ ]:
# The text snippets in this dataset are from "googleplay", a public dataset of app reviews on Google's Play Store.
dataset = {
    "meta": {
        "name": "example_singlelabel",
        "version": "1.0.0",
        "created_with": "wizard"
    },
    "config": [
        {
            "domain": "text",
            "type": "IText"
        },
        {
            "task_id": "task1",
            "classes": ["positive", "neutral", "negative"],
            "type": "TSingleClassification"
        }
    ],
    "train": [
        [
            {"value": "None of the notifications work. other people in the forums teport similar problems bht no fix. the app is nice but it isnt nearly as functional without notifications"},
            {"value": "negative"},
        ],
        [
            {"value": "It's great"},
            {"value": "positive"},
        ],
        [
            {"value": "Not allowing me to delete my account"},
            {"value": "negative"},
        ],
        [
            {"value": "So impressed that I bought premium on very first day"},
            {"value": "positive"},
        ],
    ],
    "test": [
        [
            {"value": "Can't set more than 7 tasks without paying an absurdly expensive weekly subscription"},
            {"value": "negative"},
        ]
    ],
}

For Active Learning, we first have to reconfigure the dataset loader, since in a real scenario most samples would be unlabeled and therefore not usable for training. We will walk through this process, even though in this example all samples are already labeled.

At the beginning, we randomly choose a small subset to be labeled first (in this minimal example, a single sample). The rest is held back in the rest_ds.

Note: It is important to choose the first batch of samples randomly - one could instead let a newly initialized model choose the samples, but this usually hurts model performance.

[ ]:
dl = DatasetLoader(dataset)

# Initially, treat all training samples as unlabeled.
rest_ds = deepcopy(dl.train_ds)

# Randomly split off the first sample to label; the rest stays unlabeled.
train_dss = []
next_ds, rest_ds = random_split(rest_ds, lengths=[1, len(rest_ds) - 1])

Now the samples in next_ds would be labeled in your preferred application. You can export the IDs (= positions in the original dataset) so that you can easily identify which samples to label.

[ ]:
to_label_ids = next_ds.ids
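If your labeling happens outside this notebook, the IDs can be handed to the labeling tool in any convenient format, e.g. as JSON (the filename and the hard-coded ID list below are just placeholders; in the notebook you would use next_ds.ids):

```python
import json

# Illustrative export: write the positions of the samples to label
# to a file so an external labeling tool can pick them up.
to_label_ids = [0]  # placeholder for next_ds.ids
with open("to_label.json", "w") as f:
    json.dump({"ids": to_label_ids}, f)
```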

After labeling, we update the dataset loader’s training set with these samples.

[ ]:
# Add the newly labeled subset and rebuild the training set
# from all labeled subsets collected so far.
train_dss.append(next_ds)
dl.train_ds = merge_subsets(*train_dss)

Then, we train an AutoTransformer as usual, using only the small amount of initially labeled data.

[ ]:
at = AutoTransformer()
at.init(dataset_loader=dl, path=".models/example09")
at.train(dl)

We can now use at.sample() to let the model decide which data from the rest_ds to label next. This chooses new samples from the still-unlabeled set, yielding a new subset to label (of size acquisition_size) and another remainder. Again, we update the dataset loader with these new samples.

[ ]:
next_ds, rest_ds = at.sample(dataset=rest_ds, acquisition_size=2)

# Now we would label the new samples in next_ds

train_dss.append(next_ds)
dl.train_ds = merge_subsets(*train_dss)

After labeling the identified samples, we can train a new AutoTransformer.

[ ]:
at.finish() # Finish the old AutoTransformer to free up memory

at = AutoTransformer()
at.init(dataset_loader=dl, path=".models/example09")
at.train(dl)

We can now repeat this process (the last two cells) to obtain more labeled data and train progressively better models. For this example, though, we will stop here.
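The repeated rounds can be sketched as a self-contained toy loop. Everything below (ToyModel, the pool of fake reviews, and the length-based selection heuristic) is an invented stand-in, not the autotransformers API; it is included only to show the control flow of repeated acquisition and retraining:

```python
import random

# Toy stand-in for a trainable model. Its sample() method mimics the role
# of at.sample(): pick the most "informative" unlabeled samples.
class ToyModel:
    def __init__(self, labeled):
        self.labeled = list(labeled)  # "training" = memorizing labeled data

    def sample(self, unlabeled, acquisition_size):
        # Invented heuristic: pretend the model is least confident
        # about the longest texts and picks those first.
        ranked = sorted(unlabeled, key=len, reverse=True)
        picked = ranked[:acquisition_size]
        rest = [s for s in unlabeled if s not in picked]
        return picked, rest

random.seed(0)
pool = [f"review {i}" * (i + 1) for i in range(10)]

# Round 0: the first batch is chosen randomly.
shuffled = pool[:]
random.shuffle(shuffled)
labeled, unlabeled = shuffled[:1], shuffled[1:]
model = ToyModel(labeled)

# Subsequent rounds: the model picks, we "label", then retrain from scratch.
for _ in range(3):
    picked, unlabeled = model.sample(unlabeled, acquisition_size=2)
    labeled.extend(picked)      # here the samples would be labeled by hand
    model = ToyModel(labeled)   # train a fresh model on the grown set

print(len(labeled), len(unlabeled))  # 7 3
```

The structure mirrors the cells above: one random initial split, then alternating sample/label/retrain rounds until the labeled set is large enough.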