Active Learning
[ ]:
from copy import deepcopy
from autotransformers import AutoTransformer, DatasetLoader
from autotransformers.utils.misc import merge_subsets, random_split
Loading the googleplay dataset:
[ ]:
# The text snippets in this dataset are from "googleplay", a public dataset of app reviews on Google's Play Store.
dataset = {
"meta": {
"name": "example_singlelabel",
"version": "1.0.0",
"created_with": "wizard"
},
"config": [
{
"domain": "text",
"type": "IText"
},
{
"task_id": "task1",
"classes": ["positive", "neutral", "negative"],
"type": "TSingleClassification"
}
],
"train": [
[
{"value": "None of the notifications work. other people in the forums teport similar problems bht no fix. the app is nice but it isnt nearly as functional without notifications"},
{"value": "negative"},
],
[
{"value": "It's great"},
{"value": "positive"},
],
[
{"value": "Not allowing me to delete my account"},
{"value": "negative"},
],
[
{"value": "So impressed that I bought premium on very first day"},
{"value": "positive"},
],
],
"test": [
[
{"value": "Can't set more than 7 tasks without paying an absurdly expensive weekly subscription"},
{"value": "negative"},
]
],
}
For active learning, we first have to reconfigure the dataset loader, since in a real setting most samples would be unlabeled and thus unsuitable for training. We will walk through this process, even though in this example all samples are already labeled.
At the beginning, we randomly choose a small part to be labeled first (in this minimal example, just a single sample). The rest is kept back in rest_ds.
Note: It is important to choose the first batch of samples randomly. One could let a newly initialized model choose the samples instead, but this usually degrades model performance.
[ ]:
dl = DatasetLoader(dataset)
rest_ds = deepcopy(dl.train_ds)
train_dss = []
next_ds, rest_ds = random_split(rest_ds, lengths=[1, len(rest_ds) - 1])
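Conceptually, the random split partitions the pool into disjoint random subsets of the requested sizes. The following plain-Python sketch illustrates the idea; `random_split_sketch` is a hypothetical stand-in, not the library's `random_split`, which operates on dataset objects rather than lists:

```python
import random

def random_split_sketch(samples, lengths, seed=0):
    """Partition `samples` into disjoint random subsets of the given sizes.

    Hypothetical stand-in for autotransformers' random_split, shown on
    plain lists for illustration only.
    """
    assert sum(lengths) == len(samples)
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)  # random assignment of positions to subsets
    splits, start = [], 0
    for n in lengths:
        chosen = sorted(indices[start:start + n])  # keep original order
        splits.append([samples[i] for i in chosen])
        start += n
    return splits

pool = ["review0", "review1", "review2", "review3"]
next_batch, rest = random_split_sketch(pool, lengths=[1, 3])
print(len(next_batch), len(rest))  # 1 3
```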
Now the samples in next_ds would be labeled in your preferred labeling application. You can export the IDs (the positions in the original dataset) to easily identify which samples to label.
[ ]:
to_label_ids = next_ds.ids
After labeling, we update the dataset loader’s training set with these samples.
[ ]:
train_dss.append(next_ds)
dl.train_ds = merge_subsets(*train_dss)
Then, we train an AutoTransformer as usual, using only the small amount of initially labeled data.
[ ]:
at = AutoTransformer()
at.init(dataset_loader=dl, path=".models/example09")
at.train(dl)
We can now use at.sample() to let the model decide which data from rest_ds to label next. It chooses new samples from the so-far unlabeled set, yielding a new subset to label (of size acquisition_size) and another remainder. Again, we update the dataset loader with these new samples.
[ ]:
next_ds, rest_ds = at.sample(dataset=rest_ds, acquisition_size=2)
# Now we would label the new samples in next_ds
train_dss.append(next_ds)
dl.train_ds = merge_subsets(*train_dss)
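A common acquisition strategy is uncertainty sampling: rank unlabeled samples by the entropy of the model's predicted class probabilities and pick the most uncertain ones. The sketch below illustrates this idea on mock probabilities; it is not necessarily what at.sample() does internally:

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sample_by_uncertainty(probs_per_sample, acquisition_size):
    """Return indices of the most uncertain samples (highest entropy).

    Illustrative sketch of uncertainty sampling; the library's actual
    acquisition strategy may differ.
    """
    ranked = sorted(range(len(probs_per_sample)),
                    key=lambda i: entropy(probs_per_sample[i]),
                    reverse=True)
    return ranked[:acquisition_size]

# Mock predicted probabilities (positive / neutral / negative) for 3 reviews
probs = [
    [0.98, 0.01, 0.01],  # confident prediction -> low entropy
    [0.34, 0.33, 0.33],  # very uncertain -> high entropy
    [0.70, 0.20, 0.10],  # somewhat uncertain
]
print(sample_by_uncertainty(probs, acquisition_size=2))  # [1, 2]
```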
After labeling the identified samples, we can train a new AutoTransformer.
[ ]:
at.finish() # Finish the old AutoTransformer to free up memory
at = AutoTransformer()
at.init(dataset_loader=dl, path=".models/example09")
at.train(dl)
We can now repeat this process (the last two cells) to obtain more labeled data and train progressively better models. For this example, though, we will stop here.
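Putting the steps together, the overall control flow can be outlined as a generic pool-based active-learning loop. The `acquire`, `label`, and `train` callables below are hypothetical stand-ins for at.sample(), your labeling application, and at.init()/at.train(); the trivial stubs in the usage example only demonstrate the control flow:

```python
def active_learning_loop(pool, initial_size, acquisition_size, rounds,
                         acquire, label, train):
    """Generic pool-based active-learning loop (control flow only).

    `acquire`, `label`, and `train` are hypothetical stand-ins for
    at.sample(), your labeling app, and at.train(). Assumes `pool` is
    already shuffled, so taking its head gives a random seed batch.
    """
    labeled, model = [], None
    chosen, pool = pool[:initial_size], pool[initial_size:]  # random seed batch
    for r in range(rounds):
        if r > 0:  # after the first round, let the model pick the next batch
            if not pool:
                break
            chosen, pool = acquire(model, pool, acquisition_size)
        labeled.extend(label(chosen))  # label the chosen samples
        model = train(labeled)         # retrain on all labeled data so far
    return model, labeled, pool

# Trivial stubs to show the control flow only
pool = [f"review{i}" for i in range(6)]
model, labeled, rest = active_learning_loop(
    pool, initial_size=1, acquisition_size=2, rounds=3,
    acquire=lambda m, p, k: (p[:k], p[k:]),     # stand-in for at.sample()
    label=lambda batch: [(x, "positive") for x in batch],
    train=lambda data: {"n_train": len(data)},  # stand-in for at.train()
)
print(model, len(labeled), len(rest))  # {'n_train': 5} 5 1
```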