{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Active Learning\n",
    "\n",
    "Active Learning is a technique for reducing the amount of data you have to label manually. Instead of labeling all data up front, we label only a small number of samples and immediately train a model. This model then selects the most relevant samples to label next, and a new model is trained on the expanded labeled set.  \n",
    "This process is repeated until the results are satisfactory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from copy import deepcopy\n",
    "from autotransformers import AutoTransformer, DatasetLoader\n",
    "from autotransformers.utils.misc import merge_subsets, random_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Loading the googleplay dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The text snippets in this dataset are from \"googleplay\", a public dataset of app reviews on Google's Play Store.\n",
    "dataset = {\n",
    "    \"meta\": {\n",
    "        \"name\": \"example_singlelabel\",\n",
    "        \"version\": \"1.0.0\",\n",
    "        \"created_with\": \"wizard\"\n",
    "    },\n",
    "    \"config\": [\n",
    "        {\n",
    "            \"domain\": \"text\",\n",
    "            \"type\": \"IText\"\n",
    "        },\n",
    "        {\n",
    "            \"task_id\": \"task1\",\n",
    "            \"classes\": [\"positive\", \"neutral\", \"negative\"],\n",
    "            \"type\": \"TSingleClassification\"\n",
    "        }\n",
    "    ],\n",
    "    \"train\": [\n",
    "        [\n",
    "            {\"value\": \"None of the notifications work. other people in the forums teport similar problems bht no fix. the app is nice but it isnt nearly as functional without notifications\"},\n",
    "            {\"value\": \"negative\"},\n",
    "        ],\n",
    "        [\n",
    "            {\"value\": \"It's great\"},\n",
    "            {\"value\": \"positive\"},\n",
    "        ],\n",
    "        [\n",
    "            {\"value\": \"Not allowing me to delete my account\"},\n",
    "            {\"value\": \"negative\"},\n",
    "        ],\n",
    "        [\n",
    "            {\"value\": \"So impressed that I bought premium on very first day\"},\n",
    "            {\"value\": \"positive\"},\n",
    "        ],\n",
    "    ],\n",
    "    \"test\": [\n",
    "        [\n",
    "            {\"value\": \"Can't set more than 7 tasks without paying an absurdly expensive weekly subscription\"},\n",
    "            {\"value\": \"negative\"},\n",
    "        ]\n",
    "    ],\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For Active Learning, we first have to reconfigure the dataset loader: in a real project, most samples would start out unlabeled and therefore could not be used for training. We will walk through this process here, even though all samples in this example are already labeled.\n",
    "\n",
    "At the beginning, we **randomly** choose a small subset that must be labeled first (in this minimal example, just a single sample). The remaining samples are held back in `rest_ds`.\n",
    "\n",
    "*Note: It is important to choose the first batch of samples randomly. One could let a newly initialized model choose them, but this usually leads to decreased model performance.*"
   ]
  },
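  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dl = DatasetLoader(dataset)\n",
    "\n",
    "# Pool of samples that are treated as not yet labeled\n",
    "rest_ds = deepcopy(dl.train_ds)\n",
    "\n",
    "# Collects every batch of samples that has been labeled so far\n",
    "train_dss = []\n",
    "# Randomly pick one sample to label first; the others stay in the pool\n",
    "next_ds, rest_ds = random_split(rest_ds, lengths=[1, len(rest_ds) - 1])"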
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The samples in `next_ds` would now be labeled in your preferred application. To identify them easily, you can export their IDs (their positions in the original dataset)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_label_ids = next_ds.ids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After labeling, we add these samples to the dataset loader's training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_dss.append(next_ds)\n",
    "dl.train_ds = merge_subsets(*train_dss)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we train an AutoTransformer as usual, using only the small amount of initially labeled data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "at = AutoTransformer()\n",
    "at.init(dataset_loader=dl, path=\".models/example09\")\n",
    "at.train(dl)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now use `at.sample()` to let the model decide which samples from `rest_ds` to label next. It selects `acquisition_size` samples from the still-unlabeled set and returns them as a new subset to label, together with the remaining unlabeled samples. Again, we update the dataset loader with the new samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "next_ds, rest_ds = at.sample(dataset=rest_ds, acquisition_size=2)\n",
    "\n",
    "# Now we would label the new samples in next_ds\n",
    "\n",
    "train_dss.append(next_ds)\n",
    "dl.train_ds = merge_subsets(*train_dss)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After labeling the selected samples, we can train a new AutoTransformer on the expanded training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "at.finish()  # Finish the old AutoTransformer to free up memory\n",
    "\n",
    "at = AutoTransformer()\n",
    "at.init(dataset_loader=dl, path=\".models/example09\")\n",
    "at.train(dl)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now repeat this process (the last two cells) to obtain more data to label and train progressively better models. For this example, though, we will stop here."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "vscode": {
   "interpreter": {
    "hash": "83ee97f1e4ad98a710574577955c6720418d3d8f987616cd4f238f891737d017"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}