Multilabel text classification¶
Note: This notebook assumes that you’re familiar with the basic usage of AutoTransformers. Go to the “Getting started” notebook to learn the basics.
In this example, we will train an AutoTransformer for multi-label classification. This means that each sample can have multiple labels (or none at all), in contrast to single-label, where the labels are mutually exclusive.
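A common way to represent such targets is a multi-hot vector: one 0/1 entry per class, where several entries may be 1 at once (or all may be 0). A minimal pure-Python sketch, where the helper name is illustrative and the class list mirrors the dataset below:

```python
# Illustrative helper, not part of AutoTransformers: encode a list of
# label strings as a multi-hot 0/1 vector over a fixed class list.
CLASSES = ["c", "java", ".net", "php"]

def to_multi_hot(labels, classes=CLASSES):
    """Turn a (possibly empty) list of label strings into a 0/1 vector."""
    return [1 if c in labels else 0 for c in classes]

print(to_multi_hot(["c", "java"]))  # [1, 1, 0, 0]
print(to_multi_hot([]))             # [0, 0, 0, 0] (an empty label set is valid)
```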
[ ]:
from autotransformers import AutoTransformer, DatasetLoader
As before, we prepare a minimal dataset for training the AutoTransformer. Pay attention to the label values: they are now lists instead of the single items used in the single-label example. It is still possible (and common) for a sample to have only one label, represented as a single-item list.
[ ]:
# The text snippets in this dataset are from "stackoverflow", a dataset of StackOverflow posts and their tags.
dataset = {
    "meta": {
        "name": "example_multilabel",
        "version": "1.0.0",
        "created_with": "wizard"
    },
    "config": [
        {
            "domain": "text",
            "type": "IText"
        },
        {
            "task_id": "task1",
            "classes": ["c", "java", ".net", "php"],
            "type": "TMultiClassification"
        }
    ],
    "train": [
        [
            {"value": "is it possible to compile net il code to machine code i would like to thistribute my net programs without the net framework is it possible to compile a net program to machine code"},
            {"value": [".net"]}
        ],
        [
            {"value": "how do i show what fields a struct has in gdb i came upon a struct called ngx http variable value t in my gdb session and i would like to print what fields it has in the console is that possible"},
            {"value": ["c"]}
        ],
        [
            {"value": "character size in java vs c why does a character in java take twice as much space to store as a character in c"},
            {"value": ["c", "java"]}
        ],
        [
            {"value": "net equivalent of javas listsublist is there a net equivalent of javas listsublist that works on ilistt"},
            {"value": ["java", ".net"]}
        ],
    ],
    "test": [
        [
            {"value": "is there any general rule to decide if a type casting is done by javac compiler or by the vm"},
            {"value": ["java"]}
        ],
    ]
}
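Before training, it can be useful to check how often each tag occurs in the training split, since multi-label datasets are often imbalanced. A pure-Python sketch; the `train` list below mirrors the structure of `dataset["train"]` above (text values shortened):

```python
from collections import Counter

# `train` mirrors dataset["train"]: pairs of (input, target),
# where the target value is a list of tags.
train = [
    [{"value": "..."}, {"value": [".net"]}],
    [{"value": "..."}, {"value": ["c"]}],
    [{"value": "..."}, {"value": ["c", "java"]}],
    [{"value": "..."}, {"value": ["java", ".net"]}],
]

# Tally how often each tag appears across all samples.
label_counts = Counter(tag for _, target in train for tag in target["value"])
print(label_counts)  # Counter({'.net': 2, 'c': 2, 'java': 2})
```

Note that `php` does not appear at all in this toy split; with real data you would want every class to be represented.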
Even though the data is of a different type this time, we can use the same training code as in the previous example: the DatasetLoader automatically recognizes the type of data and task, and a suitable AutoTransformer is trained.
[ ]:
dl = DatasetLoader(dataset)
# Or create a DatasetLoader from a file
# dl = DatasetLoader("path/to/my-dataset.json")
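To use the file-based variant, the dataset dict can simply be written to disk as JSON first. A sketch using only the standard library; the tiny dict and the temporary path here are illustrative stand-ins for the full `dataset` defined above:

```python
import json
import os
import tempfile

# Stand-in for the full `dataset` dict defined above.
dataset = {"meta": {"name": "example_multilabel", "version": "1.0.0"}}

# Write the dataset to a JSON file that could later be passed to
# DatasetLoader("path/to/my-dataset.json") by path.
path = os.path.join(tempfile.mkdtemp(), "my-dataset.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=2)

# Round-trip check: the file parses back to the same dict.
with open(path, encoding="utf-8") as f:
    assert json.load(f) == dataset
```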
[ ]:
# In this example, we only train for one epoch to finish fast.
# In reality, you want to set this to a higher value for better results.
config = [
("engine/stop_condition/type", "MaxEpochs"),
("engine/stop_condition/value", 1),
]
at = AutoTransformer(config)
at.init(dataset_loader=dl, path=".models/example02")
at.train(dl)
Again, we check the resulting model on a couple of example inputs:
[ ]:
res = at(["what size does an integer have in c", "aaaaa"])
# Format the result for nicer output (inp/out avoid shadowing the built-in `input`)
[{"input": inp.value, "prediction": out.value} for inp, out in res]
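Since each prediction may carry zero or more tags, a small helper can make the output easier to read. A sketch, assuming each predicted value is a list of tag strings (the helper name is ours, not part of the library):

```python
def format_tags(tags):
    """Join a (possibly empty) list of predicted tags for display."""
    return ", ".join(tags) if tags else "<no label>"

print(format_tags(["c", "java"]))  # c, java
print(format_tags([]))             # <no label>
```

The second case matters for multi-label models: an input like "aaaaa" may legitimately receive no label at all.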