Document Information Extraction (DIE)

Note: This notebook assumes that you’re familiar with the basic usage of AutoTransformers. See the “Getting started” notebook to learn the basics.

This example demonstrates how to extract information from documents with AutoTransformers. A “document” is an image of text with layout, such as a scanned page, screenshot, or PDF. We will use an AutoTransformer to extract specific pieces of information from a document, for example the name and address of the sender of a letter.

[ ]:
# Install the Hugging Face datasets package
%pip install datasets
[ ]:
from datasets import load_dataset

from autotransformers import AutoTransformer, DatasetLoader
from autotransformers.utils.bbox_utils import draw_bboxes

We will use images from the FUNSD dataset on Hugging Face, which contains scans of various types of business documents. First, we download the dataset and collect a few image paths.

[ ]:
funsd = load_dataset("nielsr/funsd", split="train")
filepaths = [sample["image_path"] for sample in funsd][:6]
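
To get a quick impression of the data, you can optionally open one of the downloaded scans with PIL. This step is purely illustrative and not required for the rest of the notebook:

[ ]:
from PIL import Image
from IPython.display import display

# Open and display the first downloaded document scan
display(Image.open(filepaths[0]))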

As in the previous tutorials, we show a minimal dataset example below. For documents, each input consists of an image path and a list of bounding boxes, one around each word detected by OCR. The bounding boxes are given in the dict format used by ocr_wrapper.

In document information extraction, each bounding box gets a label. The labels in the dataset below are therefore lists of classes, with each label corresponding to one box. Do not confuse these lists with the label lists in multi-label classification: while the labels in the multi-label example apply to the whole sample, each label in information extraction applies to a single bounding box.

[ ]:
# In this dataset, we only give a single bounding box per document in order to
# keep the example small. In practice, a document usually has hundreds of bounding boxes.
dataset = {
    "meta": {
        "name": "example_die",
        "version": "1.0.0",
        "created_with": "wizard",
    },
    "config": [
        {
            "ocr": "google",
            "domain": "document",
            "type": "IDocument",
        },
        {
            "task_id": "task1",
            "classes": ["O", "HEADER", "QUESTION", "ANSWER"],
            "none_class_id": 0,
            "special_token_id": -100,
            "type": "TInformationExtraction",
        },
    ],
    "train": [
        [
            {
                "image": filepaths[0],
                "bboxes": [
                    {
                        "TLx": 0.287,
                        "TLy": 0.316,
                        "TRx": 0.295,
                        "TRy": 0.316,
                        "BRx": 0.295,
                        "BRy": 0.327,
                        "BLx": 0.287,
                        "BLy": 0.327,
                        "original_width": 762,
                        "original_height": 1000,
                    },
                ],
                "texts": [
                    ":"
                ]
            },
            {"value": ["QUESTION"]},
        ],
        [
            {
                "image": filepaths[1],
                "bboxes": [
                    {
                        "TLx": 0.099,
                        "TLy": 0.129,
                        "TRx": 0.154,
                        "TRy": 0.129,
                        "BRx": 0.154,
                        "BRy": 0.139,
                        "BLx": 0.099,
                        "BLy": 0.139,
                        "original_width": 762,
                        "original_height": 1000,
                    },
                ],
                "texts": [
                    "Brand:"
                ]
            },
            {"value": ["QUESTION"]},
        ],
        [
            {
                "image": filepaths[2],
                "bboxes": [
                    {
                        "TLx": 0.423,
                        "TLy": 0.497,
                        "TRx": 0.501,
                        "TRy": 0.497,
                        "BRx": 0.501,
                        "BRy": 0.521,
                        "BLx": 0.423,
                        "BLy": 0.521,
                        "original_width": 762,
                        "original_height": 1000,
                    },
                ],
                "texts": [
                    "29Mar"
                ]
            },
            {"value": ["ANSWER"]},
        ],
        [
            {
                "image": filepaths[3],
                "bboxes": [
                    {
                        "TLx": 0.078,
                        "TLy": 0.121,
                        "TRx": 0.166,
                        "TRy": 0.121,
                        "BRx": 0.166,
                        "BRy": 0.135,
                        "BLx": 0.078,
                        "BLy": 0.135,
                        "original_width": 762,
                        "original_height": 1000,
                    },
                ],
                "texts": [
                    "SUBJECT:"
                ]
            },
            {"value": ["QUESTION"]},
        ],
    ],
    "test": [
        [
            {
                "image": filepaths[4],
                "bboxes": [
                    {
                        "TLx": 0.779,
                        "TLy": 0.084,
                        "TRx": 0.84,
                        "TRy": 0.084,
                        "BRx": 0.84,
                        "BRy": 0.095,
                        "BLx": 0.779,
                        "BLy": 0.095,
                        "original_width": 762,
                        "original_height": 1000,
                    },
                ],
                "texts": [
                    "Revision"
                ]
            },
            {"value": ["QUESTION"]},
        ]
    ],
}
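
Note that the bounding-box coordinates above are normalized: each corner coordinate is a fraction of the original page width or height. If your OCR results come as pixel coordinates, a small helper like the hypothetical one below (not part of AutoTransformers) can convert them into the corner-point dict format shown above:

[ ]:
# Hypothetical helper (not part of AutoTransformers): convert an axis-aligned
# pixel-space box into the normalized corner-point dict used in the dataset above.
def to_normalized_bbox(x0, y0, x1, y1, width, height):
    return {
        "TLx": x0 / width, "TLy": y0 / height,
        "TRx": x1 / width, "TRy": y0 / height,
        "BRx": x1 / width, "BRy": y1 / height,
        "BLx": x0 / width, "BLy": y1 / height,
        "original_width": width,
        "original_height": height,
    }

# Example: a word spanning pixels (219, 316) to (225, 327) on a 762x1000 page
to_normalized_bbox(219, 316, 225, 327, 762, 1000)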

From here on, training a document model is as easy as training a text model in the previous examples: just create a DatasetLoader and start training.

[ ]:
dl = DatasetLoader(dataset)

# Or create a DatasetLoader from a file
# dl = DatasetLoader("path/to/my-dataset.json")
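
If you prefer the file-based variant, you can first write the dataset dict from above to disk with the standard json module. The file name here is just an example:

[ ]:
import json

# Persist the in-memory dataset so it can later be loaded via
# DatasetLoader("example_die_dataset.json")
with open("example_die_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)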

Training on documents takes significantly longer than training on text-only data, since the AutoTransformer has to learn from both the text and the image. Therefore, don’t be surprised if the AutoTransformer trained in this short example does not perform very well.

For this example we use the DocumentModelV4 base model (a DocumentModel trained by DeepOpinion). It is a solid choice for document information extraction and, thanks to its internal architecture, requires only a moderate amount of resources.

[ ]:
# In this example, we only train for one epoch to finish fast.
# It's likely that this isn't sufficient to get a good result...
config = [
    ("engine/stop_condition/type", "MaxEpochs"),
    ("engine/stop_condition/value", 1),
]
at = AutoTransformer(config)

at.init(dataset_loader=dl, model_name_or_path="DocumentModelV4", path=".models/example04")
at.train(dl)

We can let the AutoTransformer extract information by simply passing the path of an image. Note that inference uses Google OCR by default to detect and recognize text, so make sure you have the necessary credentials and packages set up.

[ ]:
# Select a document that was not used during training or testing
test_document = filepaths[5]
document, prediction = at(test_document)
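
Before visualizing anything, you can inspect the raw prediction. Mirroring the label format of the training dataset, prediction.value holds one predicted class per bounding box detected in the document:

[ ]:
# One predicted class (e.g. "QUESTION", "ANSWER", "O", ...) per detected bounding box
print(len(document.bboxes), "bounding boxes detected")
print(prediction.value)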

Let’s view the model’s prediction by visualizing the bounding boxes:

[ ]:
from IPython.display import display

with_bboxes = draw_bboxes(document.image, bboxes=document.bboxes, texts=prediction.value)
display(with_bboxes)