# Document

AutoTransformers can be used to train models on documents in order to extract information from them or to classify them. In this getting started guide, we give a brief overview of the process and show how to use AutoTransformers to train a model on a document dataset.

Documents must be preprocessed before they can be used. For the training data, this is done once for all samples. During prediction, the preprocessing is executed for each sample individually. The overall workflow is:

1. Convert the PDF to an image
2. Extract the text and bounding boxes from the image using an OCR scanner
3. Label the text of the bounding boxes for information extraction, or assign a class to the document for classification
4. Train the model using AutoTransformers
5. Predict new samples using the trained model

The overall `main` procedure is shown below:

```{eval-rst}
.. literalinclude:: document.py
   :pyobject: main
```

In the following subsections, we will go into more detail about each step.

## 1. Convert the PDF to an Image

To demonstrate this process, we use only one PDF file. In a real-world scenario, you would have a folder with multiple PDF files that can be used for training or, later on, a new PDF for which a prediction should be made. You can download this {download}`PDF file <document.pdf>`.

First of all, we load the PDF file into memory. We do not load the PDF file directly; instead, we load it as an image. This is because the OCR scanner we use later on only accepts images as input:

```{eval-rst}
.. literalinclude:: document.py
   :pyobject: load_pdf_as_image
```

## 2. Run OCR Scanner

Next, we run the OCR scanner to extract the text as well as all bounding boxes from the original document. The document may be a scanned document, which is why OCR scanning is required. We recommend using the open-source [OCR Wrapper](https://github.com/deepopinion/ocr_wrapper) that we developed, which abstracts different OCR implementations. Nevertheless, we fully support other OCR scanners as well.
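
To make the idea of "text plus bounding boxes" concrete, the following minimal sketch shows one way an OCR result could be represented in plain Python. The `OcrBox` class, the `normalize_boxes` helper and the pixel values are hypothetical and exist only for illustration; they are not part of `ocr_wrapper` or AutoTransformers, and scaling the coordinates to the range 0 to 1 is an illustrative choice rather than a documented requirement of either library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrBox:
    """Hypothetical container for one OCR result (not an ocr_wrapper type)."""

    text: str
    left: float
    top: float
    right: float
    bottom: float


def normalize_boxes(boxes: List[OcrBox], width: int, height: int) -> List[OcrBox]:
    """Scale pixel coordinates to [0, 1] so they no longer depend on the image size."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return [
        OcrBox(b.text, b.left / width, b.top / height, b.right / width, b.bottom / height)
        for b in boxes
    ]


# Made-up pixel coordinates for two text snippets on a 1000 x 2000 pixel page.
boxes = [
    OcrBox("Name:", 100, 200, 300, 240),
    OcrBox("Max Mustermann", 320, 200, 700, 240),
]
normalized = normalize_boxes(boxes, width=1000, height=2000)

for box in normalized:
    print(box)
```

Keeping the text and its position together in one record is what later allows a label such as `name` to be attached to a specific region of the page.
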
In this example, we use the `ocr_wrapper` together with `easy_ocr` and set the OCR language to English:

```{eval-rst}
.. literalinclude:: document.py
   :pyobject: run_ocr
```

## 3. Label the bounding boxes

Next, we label all bounding boxes that were extracted by the OCR scanner. To do so, we list all bounding boxes and ask the user to label the text of each one. Here you can see our label selection:

```text
Please label the text (enter to skip): 'AutoTransformers'
Please label the text (enter to skip): 'Example'
Please label the text (enter to skip): 'Name:'
Please label the text (enter to skip): 'Max Mustermann' name
Please label the text (enter to skip): 'Birthday:'
Please label the text (enter to skip): '17.05.1991' birthday
Please label the text (enter to skip): 'Country:'
Please label the text (enter to skip): 'Austria' country
Please label the text (enter to skip): 'City:'
Please label the text (enter to skip): 'Innsbruck' city
Please label the text (enter to skip): 'DeepOpinion' company
```

```{warning}
It is not recommended to label the bounding boxes manually. Instead, we recommend using [DeepOpinion Studio](https://studio.deepopinion.ai), which allows you to label the bounding boxes in a user-friendly way. The method shown here is only for demonstration purposes.
```

## 4. Train the AutoTransformer model

Next, we train the AutoTransformer model. To train such a model, we can either write a custom dataset loader or format our dataset and use the generic dataset loader, as described in the {doc}`data getting started </getting_started/data>` tutorial. In this example, we decided to implement the dataset loader ourselves. The implementation is shown below:

```{eval-rst}
.. literalinclude:: document.py
   :pyobject: DemoDocumentDatasetLoader
```

We can now use this `DemoDocumentDatasetLoader` to train an AutoTransformer model:

```{eval-rst}
.. literalinclude:: document.py
   :pyobject: train
```

For more information on how to use AutoTransformers to train new models, please refer to the {doc}`AutoTransformer getting started </getting_started/getting_started_manual>` tutorial.

## 5. Predict new data

Finally, we can use the trained model to predict new data. We simply specify the path to the PDF document, and the trained AutoTransformer model will do the rest:

```{eval-rst}
.. literalinclude:: document.py
   :pyobject: predict
```

```{note}
If an OCR scanner that is not supported by the `ocr_wrapper` is used, the `predict` method can be overwritten to use the OCR scanner of choice by manually providing an `IDocument` rather than a `str` as input. The `IDocument` then contains all bounding boxes that were created by the custom OCR scanner.
```

For demonstration purposes, we plot the results into the original document and show them below. Each bounding box is drawn in black if it has no label; otherwise, it is shown in red together with the label text:

<img src="/_static/img/doc_ocr.png" style="border: 2px solid #000;" />

## Full source code

```{eval-rst}
.. collapse:: Source Code
   :open:

   .. literalinclude:: document.py
```