# Dataset and DatasetLoader

In general, you first create a {term}`dataset` containing the samples together with their labels that can be used to train an {term}`AutoTransformer` model. Then you use a {term}`DatasetLoader` to feed this data into an AutoTransformer during training. The general idea of this concept is shown below:

You have two different options to load data:

1. Use the DeepOpinion dataset format (as generated by the {term}`AutoTransformers wizard`) and use the `autotransformers.DatasetLoader` to load your dataset.
2. Keep your custom dataset format and implement your own, custom {term}`DatasetLoader`.

In the following subsections, we describe both methods in more detail.

## autotransformers.DatasetLoader

To create the correct dataset format, we suggest using the {term}`AutoTransformers wizard`. Simply call `at wizard` in a command line and follow the guided project setup. AutoTransformers will then generate the correct `dataset.json` file for you. Also check out [this](/getting_started/getting_started_wizard) tutorial.

Still, you can create the dataset manually. The following code example shows how the `dataset.json` file is structured. It defines a dataset with text inputs and two tasks, one `TSingleClassification` and one `TMultiClassification`:

```json
{
  "meta": {
    "name": "MyDataset",
    "version": "1.0.0",
    "created_with": "wizard"
  },
  "config": [
    {
      "domain": "text",
      "type": "IText"
    },
    [
      {
        "task_id": "YOUR_UNIQUE_TASK_ID",
        "classes": ["class-0", "class-1", "class-2"],
        "type": "TSingleClassification"
      },
      {
        "task_id": "YOUR_OTHER_TASK_ID",
        "classes": ["x", "y", "z"],
        "type": "TMultiClassification"
      }
    ]
  ],
  "train": [
    [
      {"value": "My name is John Doe and I live in Innsbruck."},
      [{"value": "class-0"}, {"value": ["x", "z"]}]
    ],
    ["..."]
  ],
  "eval": ["..."],
  "test": ["..."]
}
```

The `meta` object contains additional information such as the name, a version (e.g. if you add new samples you can increase the version) as well as how this dataset was created (e.g. wizard, manual, etc.). This meta information is stored together with the trained model so that you can later reproduce how the model was created.

The `config` object contains everything that is required to train a model successfully: the input type as well as the tasks and their classes. Based on this information, the model can be generated with the correct number of heads.

The `train`, `eval` and `test` lists contain the samples that are used during training and evaluation. Each sample is a nested list `[input, [task_0, task_1, ...]]`, where `task_0`, `task_1`, etc. are the labels for the different tasks (e.g. single-label and multi-label classification). The `config` object uses the same structure to define configs for the input and tasks respectively. When you generate the `dataset.json` file using the `at wizard` command, examples are included to show how to set inputs and tasks correctly.

```{note}
When your dataset only defines a single task, you can omit the nested list and instead use a plain list `[input, task]` to define each sample. The `config` object must also omit the list around the task config in this case.
```

If you represent your data in this format, you can simply use the following code to load it:

```python
from autotransformers import DatasetLoader

dl = DatasetLoader("path/to/dataset.json")
...
```
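If you build the `dataset.json` from your own data pipeline instead of the wizard, the standard `json` module is all you need. The following sketch writes a minimal single-task dataset (using the simplified `[input, task]` sample format described in the note above) and loads it with `DatasetLoader`; the task id, the sample texts, and the file path are placeholders you would replace with your own data.

```python
import json
from autotransformers import DatasetLoader

# Minimal single-task dataset: one IText input and one TSingleClassification task.
dataset = {
    "meta": {"name": "MyDataset", "version": "1.0.0", "created_with": "manual"},
    "config": [
        {"domain": "text", "type": "IText"},
        # Single task, so no surrounding list around the task config (see note above).
        {"task_id": "sentiment", "classes": ["neg", "pos"], "type": "TSingleClassification"},
    ],
    "train": [
        [{"value": "I really enjoyed this movie."}, {"value": "pos"}],
        [{"value": "A complete waste of time."}, {"value": "neg"}],
    ],
    "eval": [
        [{"value": "Solid acting and a good plot."}, {"value": "pos"}],
    ],
    "test": [
        [{"value": "Not my cup of tea."}, {"value": "neg"}],
    ],
}

# Write the file and load it exactly like a wizard-generated dataset.
with open("dataset.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=2)

dl = DatasetLoader("dataset.json")
```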
## Custom DatasetLoader Implementations

You can also keep your current dataset format and create a custom dataset loader. The following example shows how a custom dataset loader for the IMDB dataset can be implemented. The format of the dataset is shown below:

```text
aclImdb
|--train
|  |--neg
|  |  |--0_3.txt
|  |  |--x_y.txt
|  |  ...
|  |--pos
|  |  |--0_9.txt
|  |  |--x_y.txt
|  |  ...
|--test
|  ...
```

The `train` folder contains all samples that are used to train your {term}`AutoTransformer`, while the `test` folder contains samples that are not used during training, so that the performance of your model can later be evaluated on unseen data. Both folders (`train`, `test`) contain one subfolder for each class (`neg`, `pos`) that should be predicted later on.

To implement a DatasetLoader correctly and ensure that the correct data is used for training, a new `ImdbDatasetLoader` class is implemented with the following requirements:

1. `config`: The config property defines the dataset configuration. More precisely, it defines (1) the input type (in this case `IText`) as well as (2) all tasks that should be solved (in this case only one task, namely `TSingleClassification`). The following code snippet shows how this property can be implemented in `ImdbDatasetLoader`:

   ```{eval-rst}
   .. literalinclude:: imdb.py
      :pyobject: ImdbDatasetLoader.config
   ```

   ```{eval-rst}
   .. note::
      The return value of `config` is a tuple of `(inputs, tasks)`. As we support multiple tasks in parallel (e.g. InformationExtraction + Classification), the tasks output is, again, a tuple as shown above.
   ```

2. `train_ds`, `test_ds`, `eval_ds`: We must provide three datasets in order to train models correctly.

   - `train_ds`: Used to train the model
   - `eval_ds`: Used internally to tune hyperparameters, stop early, etc.
   - `test_ds`: Needed to finally evaluate the model on unseen data

   The `ATDataset` class can be used to create these different datasets and to load samples dynamically. In the following code example, this is shown for the ImdbDataset, and a detailed tutorial on how this DatasetLoader can be used is available [here](/getting_started/getting_started_manual.html).

3. `meta` (Optional): You can also provide meta information such as the version of your dataset, its name, and how you created it. Simply add a property `meta` to the dataset loader and create it with `meta = DatasetMeta(...)`, which can be imported via `from autotransformers.dataset_loader import DatasetMeta`. Example:

   ```{eval-rst}
   .. literalinclude:: imdb.py
      :pyobject: ImdbDatasetLoader.meta
   ```

### Full source code

```{eval-rst}
.. collapse:: Source Code
   :open:

   .. literalinclude:: imdb.py
      :pyobject: ImdbDatasetLoader
```
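As a complement to the full `ImdbDatasetLoader` source above, the snippet below sketches only the file-walking step for the folder layout shown earlier: it collects raw `(input, label)` pairs using nothing but the standard library. Wrapping these samples into `ATDataset` instances is part of the full source; the helper name `read_split` and the relative `aclImdb` path are purely illustrative assumptions.

```python
from pathlib import Path


def read_split(root: str, split: str):
    """Collect (input, label) samples from <root>/<split>/{neg,pos}/*.txt."""
    samples = []
    for label in ("neg", "pos"):
        # Each class subfolder contains one plain-text review per file.
        for txt_file in sorted(Path(root, split, label).glob("*.txt")):
            text = txt_file.read_text(encoding="utf-8")
            # Same shape as a single-task dataset.json sample: input dict + label dict.
            samples.append(({"value": text}, {"value": label}))
    return samples


train_samples = read_split("aclImdb", "train")
test_samples = read_split("aclImdb", "test")
```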