Document Classification (DC)¶
Note: This notebook assumes that you’re familiar with the basic usage of AutoTransformers. Go to the “Getting started” notebook to learn the basics.
This example demonstrates how to classify pages of a document using AutoTransformers.
[ ]:
# Install the Huggingface datasets package
%pip install datasets
[3]:
from datasets import load_dataset
from autotransformers import AutoTransformer, DatasetLoader
We will use images from the funsd dataset on Huggingface, which contains scans of various types of business documents. First, we download the dataset and collect some image paths. Please note that we assign random classes to the pages purely to demonstrate how DC works with AutoTransformers.
[4]:
funsd = load_dataset("nielsr/funsd", split="train")
filepaths = [sample["image_path"] for sample in funsd][:6]
Downloading builder script: 100%|██████████| 4.54k/4.54k [00:00<00:00, 10.6MB/s]
Downloading data: 100%|██████████| 16.8M/16.8M [00:02<00:00, 6.51MB/s]
Generating train split: 149 examples [00:00, 158.81 examples/s]
Generating test split: 50 examples [00:00, 161.39 examples/s]
As in the previous tutorials, we show a minimal dataset example below. For documents, each input consists of an image path and a list of bounding boxes around each word detected by an OCR scanner. The bounding boxes are given in the dict format used by ocr_wrapper.
In document classification, each page gets a single label. Thus, each label in the dataset below is simply a string (in contrast to DIE, where every bounding box must be labeled).
[8]:
# In this dataset, we only give a single bounding box per document in order to
# keep the example small. In practice, a document usually has 100s of bounding boxes.
dataset = {
"meta": {
"name": "example_dc",
"version": "1.0.0",
"created_with": "wizard",
},
"config": [
{
"ocr": "google",
"domain": "document",
"type": "IDocument",
},
{
"task_id": "dc_single",
"classes": ["class_0", "class_1"],
"type": "TSingleClassification"
},
],
"train": [
[
{
"image": filepaths[0],
"bboxes": [
{
"TLx": 0.287,
"TLy": 0.316,
"TRx": 0.295,
"TRy": 0.316,
"BRx": 0.295,
"BRy": 0.327,
"BLx": 0.287,
"BLy": 0.327,
"original_width": 762,
"original_height": 1000,
},
],
"texts": [
":"
]
},
{ "value": "class_0"},
],
[
{
"image": filepaths[1],
"bboxes": [
{
"TLx": 0.099,
"TLy": 0.129,
"TRx": 0.154,
"TRy": 0.129,
"BRx": 0.154,
"BRy": 0.139,
"BLx": 0.099,
"BLy": 0.139,
"original_width": 762,
"original_height": 1000,
},
],
"texts": [
"Brand:"
]
},
{"value": "class_1"},
],
[
{
"image": filepaths[2],
"bboxes": [
{
"TLx": 0.423,
"TLy": 0.497,
"TRx": 0.501,
"TRy": 0.497,
"BRx": 0.501,
"BRy": 0.521,
"BLx": 0.423,
"BLy": 0.521,
"original_width": 762,
"original_height": 1000,
},
],
"texts": [
"29Mar"
]
},
{"value": "class_0"},
],
[
{
"image": filepaths[3],
"bboxes": [
{
"TLx": 0.078,
"TLy": 0.121,
"TRx": 0.166,
"TRy": 0.121,
"BRx": 0.166,
"BRy": 0.135,
"BLx": 0.078,
"BLy": 0.135,
"original_width": 762,
"original_height": 1000,
},
],
"texts": [
"SUBJECT:"
]
},
{"value": "class_1"},
],
],
"test": [
[
{
"image": filepaths[4],
"bboxes": [
{
"TLx": 0.779,
"TLy": 0.084,
"TRx": 0.84,
"TRy": 0.084,
"BRx": 0.84,
"BRy": 0.095,
"BLx": 0.779,
"BLy": 0.095,
"original_width": 762,
"original_height": 1000,
},
],
"texts": [
"Revision"
]
},
{"value": "class_1"},
]
],
}
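Before creating a DatasetLoader, it can be worth sanity-checking the dataset dict, e.g. that every input provides one text per bounding box and that all labels are among the configured classes. The following is a minimal sketch in plain Python (no AutoTransformers API involved):
[ ]:
# Minimal sanity check of the dataset dict defined above (plain Python).
classes = dataset["config"][1]["classes"]
for split in ("train", "test"):
    for inputs, label in dataset[split]:
        # Each bounding box needs a corresponding text.
        assert len(inputs["bboxes"]) == len(inputs["texts"])
        # Each page label must be one of the configured classes.
        assert label["value"] in classes
print("Dataset looks consistent.")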
From here on, training a document model is as easy as training a text model in the previous examples: Just create a DatasetLoader and start training.
[9]:
dl = DatasetLoader(dataset)
# Or create a DatasetLoader from a file
# dl = DatasetLoader("path/to/my-dataset.json")
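To use the file-based variant, you can first write the dict above to disk as JSON; the file name below is just an example:
[ ]:
import json

# Write the dataset dict to disk so it can be loaded by path
# (the file name here is illustrative).
with open("my-dataset.json", "w") as f:
    json.dump(dataset, f)

dl = DatasetLoader("my-dataset.json")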
Training with documents takes significantly longer than for text-only data, since the AutoTransformer has to learn from both the text and the image. Therefore, don’t be surprised if the AutoTransformer trained in this short example does not perform very well.
We use the DocumentModelV4 base model for this example (a DocumentModel trained by DeepOpinion). It is a decent choice for document tasks such as classification and information extraction, and it requires a moderate amount of resources due to its internal architecture.
[15]:
# In this example, we only train for ten epochs to finish quickly.
# It's likely that this isn't sufficient to get a good result on real data...
config = [
("engine/stop_condition/type", "MaxEpochs"),
("engine/stop_condition/value", 10),
]
at = AutoTransformer(config)
at.init(dataset_loader=dl, model_name_or_path="DocumentModelV4", path=".models/example11")
at.train(dl)
Train: 100%|██████████| 1/1 [00:00<00:00, 6.06it/s]
Train: 100%|██████████| 1/1 [00:00<00:00, 8.65it/s]
Train: 100%|██████████| 1/1 [00:00<00:00, 8.99it/s]
Train: 100%|██████████| 1/1 [00:00<00:00, 9.11it/s]
Train: 100%|██████████| 1/1 [00:00<00:00, 9.02it/s]
Train: 100%|██████████| 1/1 [00:00<00:00, 8.80it/s]
Train: 100%|██████████| 1/1 [00:00<00:00, 8.57it/s]
Train: 100%|██████████| 1/1 [00:00<00:00, 8.83it/s]
Train: 100%|██████████| 1/1 [00:00<00:00, 8.64it/s]
Train: 100%|██████████| 1/1 [00:00<00:00, 8.65it/s]
Test: 100%|██████████| 1/1 [00:00<00:00, 6.98it/s]
Metric Value
-------------------------- ---------
dc_single/test/accuracy 1
dc_single/test/weighted_f1 1
dc_single/test/loss 0.0277863
[15]:
{'dc_single/test/accuracy': 1.0,
'dc_single/test/weighted_f1': 1.0,
'dc_single/test/confusion_matrix': ([[0.0, 0.0], [0.0, 1.0]],
['class_0', 'class_1']),
'dc_single/test/loss': 0.0277862548828125}
We can now use the trained AutoTransformer to classify a page by simply passing the path to an image. Note that inference by default uses Google OCR to detect and recognize text, so make sure to have the necessary credentials and packages set up. Inference can therefore take a moment.
[17]:
# Select a test document path that was not used during training
test_document = filepaths[5]
document, pred = at(test_document)
print(f"Predicted '{pred.value}' with confidence {pred.confidence}.")
Predicted 'class_0' with confidence 0.21813350915908813.
Wondering why the confidence is only 0.21? We trained the model on very little data, so it’s not surprising that it isn’t very confident in its prediction. More data and more training time would certainly increase the confidence.
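The same call also works page by page over several documents. A small sketch reusing the image paths downloaded above (each call runs OCR, so expect a short delay per page):
[ ]:
# Classify several pages one by one, reusing the at(...) call from above.
for path in filepaths[:3]:
    document, pred = at(path)
    print(f"{path}: predicted '{pred.value}' (confidence {pred.confidence:.2f})")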