Text Information Extraction (TIE)

In a previous example, we demonstrated how to extract information from documents. AutoTransformers also supports the same information extraction task for plain text. Additionally, we provide a minimal markup language for creating samples for this IE task.

This example uses samples from the CoNLL-2003 dataset. We will show two ways of loading a TIE dataset: the “generic” way using the AutoTransformers DatasetLoader, and an alternative method using a custom loader for markup text.

[ ]:
import json
from autotransformers import AutoTransformer, DatasetLoader
from autotransformers.domains.text.data.markup_loader import MarkupTextDatasetLoader

Generic DatasetLoader

We have already prepared the training dataset in both formats. In the first part of the example, we will use the standard AutoTransformers format.
For text information extraction, a label is a list of annotations of the form [[start, end], label]. Each annotation marks a span in the text, given by its start and end character indices, together with its class label. We verify this against the example dataset below.
[ ]:
# The samples in this dataset are taken from the CoNLL-2003 shared task.
dataset = {
    "meta": {
        "name": "unknown",
        "version": "1.0.0",
        "created_with": "dataset_converter",
    },
    "config": [
        {
            "domain": "text",
            "type": "IText",
        },
        {
            "task_id": "tie",
            "classes": ["None", "organization", "miscellaneous", "person", "location"],
            "none_class_id": 0,
            "special_token_id": -100,
            "type": "TInformationExtraction",
        },
    ],
    "train": [
        [
            {"value": " EU rejects German call to boycott British lamb."},
            {"value": [[[0, 4], "organization"], [[12, 19], "miscellaneous"], [[35, 43], "miscellaneous"]]},
        ],
        [
            {"value": " Peter Blackburn "},
            {"value": [[[0, 17], "person"]]},
        ],
        [
            {"value": " BRUSSELS 1996-08-22"},
            {"value": [[[0, 10], "location"]]},
        ],
        [
            {"value": "The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep."},
            {"value": [[[4, 24], "organization"], [[59, 66], "miscellaneous"], [[94, 102], "miscellaneous"]]},
        ],
    ],
    "eval": [
        [
            {"value": " LONDON 1996-08-30"},
            {"value": [[[0, 8], "location"]]},
        ],
    ],
    "test": [
        [
            {"value": "SOCCER - JAPAN GET LUCKY WIN, CHINA IN SURPRISE DEFEAT."},
            {"value": [[[9, 15], "location"], [[30, 36], "location"]]},
        ],
    ],
}
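
To make the annotation format concrete, we can slice the labeled spans out of the training texts. This is plain Python for illustration only, not part of the AutoTransformers API; note that the spans in this dataset may include surrounding whitespace.

[ ]:
# Print each labeled span of the training samples to check the [[start, end], label] format.
for text_sample, label_sample in dataset["train"]:
    for (start, end), label in label_sample["value"]:
        print(f"{text_sample['value'][start:end]!r} -> {label}")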
[ ]:
dl = DatasetLoader(dataset)

Training an AutoTransformer in the usual way:

[ ]:
# In this example, we only train for one epoch to finish fast.
# It's likely that this isn't sufficient to get a good result...
config = [
    ("engine/stop_condition/type", "MaxEpochs"),
    ("engine/stop_condition/value", 1),
]
at = AutoTransformer(config)

at.init(dataset_loader=dl, path=".models/example10")
at.train(dl)

Let’s test the model on an example sentence:

[ ]:
text, prediction = at("His statement was interpreted as a significant victory for Clarke and fellow pro-European Michael Heseltine, deputy prime minister.")
prediction.value

The predicted value is a list of spans in the same [[start, end], class] format as the training labels. While this format is convenient for downstream processing, it is hard to interpret by hand. We can convert the prediction to more human-readable markup text with the MarkupTextDatasetLoader.to_markdown function.

[ ]:
MarkupTextDatasetLoader.to_markdown(text.value, prediction.value)
[ ]:
at.finish()  # free up memory

Markup Text

The MarkupTextDatasetLoader enables us to use text annotated with HTML-style tags for training. While the markup format is more readable, it can only be used for the TIE task, and markup datasets cannot be used to train multi-task AutoTransformers.

We show an example of the markup text format below. This dataset is equivalent to the dataset in generic format above.

[ ]:
dataset2 = {
    "train": {
        "X": [
            "<organization> EU </organization> rejects <miscellaneous> German </miscellaneous> call to boycott <miscellaneous> British </miscellaneous> lamb.",
            "<person> Peter Blackburn </person>",
            "<location> BRUSSELS </location> 1996-08-22",
            "The <organization> European Commission </organization> said on Thursday it disagreed with <miscellaneous> German </miscellaneous> advice to consumers to shun <miscellaneous> British </miscellaneous> lamb until scientists determine whether mad cow disease can be transmitted to sheep.",
        ],
    },
    "eval": {
        "X": [
            "<location> LONDON </location> 1996-08-30",
        ],
    },
    "test": {
        "X": [
            "SOCCER - <location> JAPAN </location> GET LUCKY WIN, <location> CHINA </location> IN SURPRISE DEFEAT.",
        ],
    },
}
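
To check that the two datasets really correspond, we can render the first generic-format training sample as markup with the same to_markdown helper used above. This assumes to_markdown accepts training labels in the same span format as predictions; the exact whitespace may differ slightly from the hand-written markup.

[ ]:
# Render a generic-format sample as markup text (for comparison only).
sample_text, sample_labels = dataset["train"][0]
MarkupTextDatasetLoader.to_markdown(sample_text["value"], sample_labels["value"])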

We save the markup dataset to a JSON file, since the MarkupTextDatasetLoader can only load from a file.

[ ]:
with open("markup_dataset.json", "w") as f:
    json.dump(dataset2, f)

dl2 = MarkupTextDatasetLoader("markup_dataset.json")

Now we can train the same model as before, but this time using the markup dataset loader.

[ ]:
# In this example, we only train for one epoch to finish fast.
# In reality, you want to set this to a higher value for better results.
config = [
    ("engine/stop_condition/type", "MaxEpochs"),
    ("engine/stop_condition/value", 1),
]
at2 = AutoTransformer(config)

at2.init(dataset_loader=dl2, path=".models/example10")
at2.train(dl2)

Again, we test on an example from the CoNLL test set, and convert the prediction to markup text.

[ ]:
text, prediction = at2("His statement was interpreted as a significant victory for Clarke and fellow pro-European Michael Heseltine, deputy prime minister.")
MarkupTextDatasetLoader.to_markdown(text.value, prediction.value)
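
Finally, as in the first part of the example, we free the model's memory with finish. Deleting the temporary dataset file afterwards is optional housekeeping, not something the library requires.

[ ]:
import os

at2.finish()  # free up memory
os.remove("markup_dataset.json")  # optional: clean up the temporary markup dataset file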