{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Simple Training\n",
    "\n",
    "This example shows the general procedure to create a dataset, train an AutoTransformer on it and then make predictions.  \n",
    "We will train an AutoTransformer for *single-label classification*, i.e. it will learn to assign exactly one label (of a number of known labels) to each sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from autotransformers import AutoTransformer, DatasetLoader"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below, we have prepared a minimal example dataset. AutoTransformers' datasets are dicts with the following keys:\n",
    "\n",
    "* `meta`: Meta-information about the dataset (name, origin, version, ...).\n",
    "* `config`: Configuration of the data. This defines what input data to expect, what tasks the AutoTransformer has, and other information about the problem you are trying to solve.\n",
    "* `train`: A list of samples used for training.\n",
    "* `eval` (optional): A list of samples used for evaluating training progress, e.g. for early stopping. If this is omitted, a part of the training samples is automatically split off.\n",
    "* `test` (optional): A list of samples used for one test run at the end, using data never used for training.\n",
    "\n",
    "**Important**: Training datasets usually require hundreds or thousands of samples in order to achieve good performance on real tasks. This example, with only 4 training samples, only serves to show the format, so the trained AutoTransformer is unlikely to make accurate predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The text snippets in this dataset are from \"googleplay\", a public dataset of app reviews on Google's Play Store.\n",
    "dataset = {\n",
    "    \"meta\": {\n",
    "        \"name\": \"example_singlelabel\",\n",
    "        \"version\": \"1.0.0\",\n",
    "        \"created_with\": \"wizard\"\n",
    "    },\n",
    "    \"config\": [\n",
    "        {\n",
    "            \"domain\": \"text\",\n",
    "            \"type\": \"IText\"\n",
    "        },\n",
    "        {\n",
    "            \"task_id\": \"task1\",\n",
    "            \"classes\": [\"positive\", \"neutral\", \"negative\"],\n",
    "            \"type\": \"TSingleClassification\"\n",
    "        }\n",
    "    ],\n",
    "    \"train\": [\n",
    "        [\n",
    "            {\"value\": \"None of the notifications work. other people in the forums teport similar problems bht no fix. the app is nice but it isnt nearly as functional without notifications\"},\n",
    "            {\"value\": \"negative\"},\n",
    "        ],\n",
    "        [\n",
    "            {\"value\": \"It's great\"},\n",
    "            {\"value\": \"positive\"},\n",
    "        ],\n",
    "        [\n",
    "            {\"value\": \"Not allowing me to delete my account\"},\n",
    "            {\"value\": \"negative\"},\n",
    "        ],\n",
    "        [\n",
    "            {\"value\": \"So impressed that I bought premium on very first day\"},\n",
    "            {\"value\": \"positive\"},\n",
    "        ],\n",
    "    ],\n",
    "    \"test\": [\n",
    "        [\n",
    "            {\"value\": \"Can't set more than 7 tasks without paying an absurdly expensive weekly subscription\"},\n",
    "            {\"value\": \"negative\"},\n",
    "        ]\n",
    "    ],\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each sample is a pair consisting of an input value and a target value (the label). To automatically create a skeleton dataset file for your task, you can use the `at wizard` command-line tool.\n",
    "\n",
    "Now we create a DatasetLoader with our dataset, which manages data handling while training an AutoTransformer.  \n",
    "You can instantiate a DatasetLoader from a path to a JSON file containing a dataset, or directly from a Python dictionary (as we do here)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dl = DatasetLoader(dataset)\n",
    "\n",
    "# Or create a DatasetLoader from a file\n",
    "# dl = DatasetLoader(\"path/to/my-dataset.json\")"
   ]
  },
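  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before training on your own data, it can help to sanity-check the dataset dictionary. The cell below is a minimal sketch in plain Python and is not part of AutoTransformers: `check_labels` is a hypothetical helper that assumes the layout used in this example (the second `config` entry lists the `classes`, and each sample is an `[input, target]` pair). It also writes the dictionary to a JSON file, using the made-up file name `my-dataset.json`, so that it could be passed to a DatasetLoader as a path."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "\n",
    "def check_labels(dataset):\n",
    "    # Hypothetical helper: collect (split, index, label) for every unknown label.\n",
    "    # Assumption: config[1] is the task entry, as in the example dataset above.\n",
    "    classes = set(dataset[\"config\"][1][\"classes\"])\n",
    "    problems = []\n",
    "    for split in (\"train\", \"eval\", \"test\"):\n",
    "        # \"eval\" and \"test\" are optional, so missing splits are skipped.\n",
    "        for i, (_, target) in enumerate(dataset.get(split, [])):\n",
    "            if target[\"value\"] not in classes:\n",
    "                problems.append((split, i, target[\"value\"]))\n",
    "    return problems\n",
    "\n",
    "\n",
    "# `dataset` is the dictionary defined earlier in this notebook.\n",
    "# An empty list means every label is one of the configured classes.\n",
    "print(check_labels(dataset))\n",
    "\n",
    "# Save the dictionary as JSON; the file name is only an illustration.\n",
    "with open(\"my-dataset.json\", \"w\", encoding=\"utf-8\") as f:\n",
    "    json.dump(dataset, f, indent=2, ensure_ascii=False)"
   ]
  },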
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can create an AutoTransformer. The `config` can be used to change various settings, such as training time, learning rate or experiment tracking. Here, we only set a very short training time by stopping after a single epoch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# In this example, we only train for one epoch to finish fast. \n",
    "# In reality, you want to set this to a higher value for better results.\n",
    "config = [\n",
    "    (\"engine/stop_condition/type\", \"MaxEpochs\"),\n",
    "    (\"engine/stop_condition/value\", 1),\n",
    "]\n",
    "at = AutoTransformer(config)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We initialize the AutoTransformer with our dataset configuration (so it knows what data to expect), the base model (e.g. BERT or RoBERTa), and optionally a path to save checkpoints to. Then, we are ready to train. For this, we simply call `at.train()` with the dataset loader as an argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "at.init(dataset_loader=dl, model_name_or_path=\"Roberta\", path=\".models/example01\")\n",
    "at.train(dl)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Try out the result!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "res = at([\"This app is amazing!\", \"While I do like the ease of use, I think it's too expensive.\"])\n",
    "\n",
    "# Formatting the result for nicer output\n",
    "[(text.value, label.value) for text, label in res]"
   ]
  }
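  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same call can also be used to compare predictions with the labels in the `test` split. The cell below is a small sketch that assumes, as in the cell above, that calling the AutoTransformer on a list of texts yields (input, output) pairs whose `value` holds the text and the predicted label. With only four training samples, mismatches are to be expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pull the texts and expected labels out of the test split of `dataset`.\n",
    "test_texts = [sample[0][\"value\"] for sample in dataset[\"test\"]]\n",
    "expected = [sample[1][\"value\"] for sample in dataset[\"test\"]]\n",
    "\n",
    "# Assumption: the result iterates as (input, output) pairs, as shown above.\n",
    "predicted = [output.value for _, output in at(test_texts)]\n",
    "\n",
    "for text, exp, pred in zip(test_texts, expected, predicted):\n",
    "    status = \"match\" if exp == pred else \"mismatch\"\n",
    "    print(f\"{status}: expected={exp!r}, predicted={pred!r}, text={text!r}\")"
   ]
  }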
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "83ee97f1e4ad98a710574577955c6720418d3d8f987616cd4f238f891737d017"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}