{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Active Learning\n", "\n", "Active Learning is a technique for reducing the amount of data you have to label manually. Instead of labeling all data up front, we label only a small number of samples and immediately train a model. This model then selects the most relevant data to label next, and a new model is trained on the expanded labeled set.\n", "This process is repeated until a satisfactory result is reached." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from copy import deepcopy\n", "from autotransformers import AutoTransformer, DatasetLoader\n", "from autotransformers.utils.misc import merge_subsets, random_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loading the googleplay dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The text snippets in this dataset are from \"googleplay\", a public dataset of app reviews on Google's Play Store.\n", "dataset = {\n", " \"meta\": {\n", " \"name\": \"example_singlelabel\",\n", " \"version\": \"1.0.0\",\n", " \"created_with\": \"wizard\"\n", " },\n", " \"config\": [\n", " {\n", " \"domain\": \"text\",\n", " \"type\": \"IText\"\n", " },\n", " {\n", " \"task_id\": \"task1\",\n", " \"classes\": [\"positive\", \"neutral\", \"negative\"],\n", " \"type\": \"TSingleClassification\"\n", " }\n", " ],\n", " \"train\": [\n", " [\n", " {\"value\": \"None of the notifications work. other people in the forums teport similar problems bht no fix. 
the app is nice but it isnt nearly as functional without notifications\"},\n", " {\"value\": \"negative\"},\n", " ],\n", " [\n", " {\"value\": \"It's great\"},\n", " {\"value\": \"positive\"},\n", " ],\n", " [\n", " {\"value\": \"Not allowing me to delete my account\"},\n", " {\"value\": \"negative\"},\n", " ],\n", " [\n", " {\"value\": \"So impressed that I bought premium on very first day\"},\n", " {\"value\": \"positive\"},\n", " ],\n", " ],\n", " \"test\": [\n", " [\n", " {\"value\": \"Can't set more than 7 tasks without paying an absurdly expensive weekly subscription\"},\n", " {\"value\": \"negative\"},\n", " ]\n", " ],\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For Active Learning, we first have to reconfigure the dataset loader, since the samples would usually be unlabeled and thus unsuitable for training a model. We will walk through this process even though all samples in this example are already labeled.\n", "\n", "At the beginning, we **randomly** choose a small subset that must be labeled first (in this minimal example, just a single sample). The rest is held back in `rest_ds`.\n", "\n", "*Note: It is important to choose the first batch of samples randomly. One could let a newly initialized model choose the samples, but this usually leads to decreased model performance.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dl = DatasetLoader(dataset)\n", "\n", "rest_ds = deepcopy(dl.train_ds)  # pool of samples treated as unlabeled\n", "\n", "train_dss = []  # labeled subsets collected so far\n", "next_ds, rest_ds = random_split(rest_ds, lengths=[1, len(rest_ds) - 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The samples in `next_ds` would now be labeled in your preferred application. You can export their IDs (their positions in the original dataset) to easily identify which samples to label." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "to_label_ids = next_ds.ids" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After labeling, we update the dataset loader's training set with these samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_dss.append(next_ds)\n", "dl.train_ds = merge_subsets(*train_dss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we train an AutoTransformer as usual, using only the small amount of initially labeled data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "at = AutoTransformer()\n", "at.init(dataset_loader=dl, path=\".models/example09\")\n", "at.train(dl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use `at.sample()` to let the model decide which samples from `rest_ds` to label next. It selects new samples from the still-unlabeled set and returns a new subset to label (of size `acquisition_size`) together with the remaining unlabeled samples. Again, we update the dataset loader with these new samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "next_ds, rest_ds = at.sample(dataset=rest_ds, acquisition_size=2)\n", "\n", "# Now we would label the new samples in next_ds\n", "\n", "train_dss.append(next_ds)\n", "dl.train_ds = merge_subsets(*train_dss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After labeling the identified samples, we can train a new AutoTransformer. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "at.finish() # Finish the old AutoTransformer to free up memory\n", "\n", "at = AutoTransformer()\n", "at.init(dataset_loader=dl, path=\".models/example09\")\n", "at.train(dl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now repeat this process (the last 2 cells) to obtain more data to label, and train progressively better models. For this example though, we will stop here. " ] } ], "metadata": { "kernelspec": { "display_name": "env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "vscode": { "interpreter": { "hash": "83ee97f1e4ad98a710574577955c6720418d3d8f987616cd4f238f891737d017" } } }, "nbformat": 4, "nbformat_minor": 2 }