Glossary¶
- AutoTransformers library¶
The library that DeepOpinion developed to automatically transform arbitrary data into ML models that can automate processes or solve tasks in different domains.
- AutoTransformer¶
A model that was automatically created with the AutoTransformers library from a dataset. An AutoTransformer can make predictions on new, unseen data in order to solve a task such as InformationExtraction in documents or text.
- AutoTransformers wizard¶
A command-line tool that can be called with
at wizard
and that helps users create the dataset as well as the training and prediction scripts.
- Model¶
A model is a file produced by training on a dataset. It captures patterns learned from the data so that it can make predictions on new, unseen data later on. Models can be used, for example, to automate processes such as text classification or document information extraction. Note that a model created with the AutoTransformers library is called an AutoTransformer.
- Hyperparameter¶
Parameters such as the learning rate that are set before training rather than learned from the data. They are usually tuned by humans to reach high performance. The AutoTransformers library is designed so that manual hyperparameter tuning is usually not required.
- Domain¶
The kind of data source, such as text, documents, images, or speech. The AutoTransformers library supports several domains so that a wide range of problems can be solved with an AutoTransformer.
- Task¶
An AutoTransformer is usually meant to solve one or more specific problems. For example, information should be extracted from a document and the document should also be classified. Classification, information extraction, etc. are each called a task. Note that a single AutoTransformer can solve several tasks of one domain in parallel.
- Skill¶
A model trained in a given domain on one or more tasks is called a skill. For example, a model for document information extraction is one skill, and a model for text classification is another.
- Classification¶
An umbrella term for tasks that assign classes to each sample. The currently supported classification tasks are single-label, multi-label, and multi-single-label classification.
- Single-label classification¶
This task enables a model to find exactly 1 class out of C classes for each sample.
- Multi-label classification¶
This task enables a model to find multiple (or only one, or even zero) classes out of C classes for each sample.
- Multi-single-label classification¶
Like multi-label classification, but additionally finds 1 out of C2 inner classes for each outer class.
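The three classification variants differ mainly in how per-class scores are turned into a prediction. The following is a minimal sketch in plain Python, not the AutoTransformers API: the function names, the 0.5 threshold, and the example class names are all assumptions made for illustration.

```python
from typing import Dict, List


def single_label(scores: Dict[str, float]) -> str:
    """Pick exactly one class: the one with the highest score."""
    return max(scores, key=scores.get)


def multi_label(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Pick every class whose score reaches the threshold (possibly none)."""
    return [name for name, score in scores.items() if score >= threshold]


def multi_single_label(
    outer_scores: Dict[str, float],
    inner_scores: Dict[str, Dict[str, float]],
    threshold: float = 0.5,
) -> Dict[str, str]:
    """Pick outer classes like multi-label, then one inner class per outer class."""
    return {
        outer: single_label(inner_scores[outer])
        for outer in multi_label(outer_scores, threshold)
    }


# Hypothetical scores for one review text (illustrative values only).
topic_scores = {"food": 0.9, "service": 0.7, "price": 0.1}
sentiment_scores = {
    "food": {"positive": 0.8, "negative": 0.2},
    "service": {"positive": 0.3, "negative": 0.7},
    "price": {"positive": 0.5, "negative": 0.5},
}

best_topic = single_label(topic_scores)
topics = multi_label(topic_scores)
topic_sentiments = multi_single_label(topic_scores, sentiment_scores)
```

Here the outer classes play the role of the C classes and the sentiments play the role of the C2 inner classes. An inner class is only chosen for outer classes that were actually selected.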
- InformationExtraction¶
This task enables a model to extract information such as names or addresses from unstructured sources such as text or documents.
- Dataset¶
All samples (text, documents, images etc.) that are used to train and evaluate the AutoTransformer.
- Sample¶
We call a single entry in the dataset a sample. Note that each sample of the dataset is labeled by the user.
- Label¶
The annotation attached to a sample that tells the model which output is expected. The samples of the dataset must be labeled in order to train the AutoTransformer correctly.
- DatasetLoader¶
The Python (iterator) implementation that loads your dataset from disk into memory in order to train an AutoTransformer on a certain node and device.
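To illustrate what an iterator-based loader can look like, here is a minimal sketch. It is not the library's DatasetLoader: the class name `JsonlSampleLoader`, the JSON Lines file layout, and the `text` and `label` fields are assumptions chosen for the example.

```python
import json
import os
import tempfile
from typing import Dict, Iterator


class JsonlSampleLoader:
    """Hypothetical loader: yields one labeled sample per line of a JSON Lines file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def __iter__(self) -> Iterator[Dict[str, str]]:
        # Read lazily so that only one sample is held in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


# Write a tiny two-sample dataset to disk, then load it again.
with tempfile.TemporaryDirectory() as tmp_dir:
    dataset_path = os.path.join(tmp_dir, "dataset.jsonl")
    with open(dataset_path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps({"text": "Great product", "label": "positive"}) + "\n")
        handle.write(json.dumps({"text": "Arrived broken", "label": "negative"}) + "\n")

    samples = list(JsonlSampleLoader(dataset_path))
```

Because the loader is an iterator, a training loop can consume samples one at a time instead of reading the whole dataset into memory up front.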
- Data pipeline¶
The dataset is not only loaded with multiple threads through the DatasetLoader, but also preprocessed by pipeline components. We call this the data pipeline.
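The idea of chaining preprocessing components and running them on several threads can be sketched as follows. This is an illustration only: the component functions, `run_pipeline`, and the sample strings are hypothetical and do not reflect the library's actual pipeline components.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Component = Callable[[str], str]


def strip_whitespace(text: str) -> str:
    """Remove leading and trailing whitespace."""
    return text.strip()


def lowercase(text: str) -> str:
    """Normalize the text to lower case."""
    return text.lower()


def run_pipeline(sample: str, components: List[Component]) -> str:
    """Pass one sample through every component, in order."""
    for component in components:
        sample = component(sample)
    return sample


raw_samples = ["  Invoice No. 42 ", "PAYMENT Due  "]
components = [strip_whitespace, lowercase]

# Process samples on two worker threads; map() keeps the input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    processed = list(pool.map(lambda s: run_pipeline(s, components), raw_samples))
```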
- Device¶
A device is hardware that executes code at runtime, e.g. a GPU, CPU, or TPU.
- Node¶
A server that has multiple devices. Nodes can also be clustered in order to train models faster through data, model or pipeline parallelism.
- Active learning¶
A feature that selects, from unlabeled data, the samples that should be labeled next because the model would benefit most from their labels.
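One common selection strategy is least-confidence sampling, which picks the samples the current model is least sure about. The sketch below shows that strategy only. The source does not state which strategy the library uses, and the function name, sample ids, and probabilities are hypothetical.

```python
from typing import Dict, List


def least_confident(predictions: Dict[str, List[float]], budget: int) -> List[str]:
    """Return the ids of the `budget` samples whose top class probability is lowest."""
    ranked = sorted(predictions, key=lambda sample_id: max(predictions[sample_id]))
    return ranked[:budget]


# Hypothetical class probabilities predicted for three unlabeled samples.
unlabeled_predictions = {
    "doc-1": [0.98, 0.02],
    "doc-2": [0.55, 0.45],
    "doc-3": [0.70, 0.30],
}

to_label_next = least_confident(unlabeled_predictions, budget=2)
```

With a labeling budget of two, the near-tie `doc-2` is selected first, and the confidently predicted `doc-1` is left unlabeled.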
- LLM¶
Large language models (LLMs) are (usually) generative models with billions of parameters that can solve a huge range of tasks through prompt inputs.
- Performance¶
How well a model performs on new, unseen data.
- Checkpointing¶
Checkpointing stores your current training state on disk so that training can be resumed after it has been interrupted (e.g. by a hardware failure).
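The save-and-resume idea can be sketched with a toy training loop. This is a minimal illustration under stated assumptions, not how the AutoTransformers library stores checkpoints: the JSON file format, the `epoch` and `best_loss` fields, and all function names are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict


def save_checkpoint(path: str, state: Dict[str, float]) -> None:
    """Write the state to a temporary file first, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    # Replace in one step so a crash is unlikely to leave a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Dict[str, float]:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"epoch": 0, "best_loss": float("inf")}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def train(path: str, total_epochs: int, stop_after: int) -> Dict[str, float]:
    """Toy training loop that checkpoints after every epoch.

    `stop_after` simulates an interruption once that many epochs are done.
    """
    state = load_checkpoint(path)
    for epoch in range(int(state["epoch"]), min(total_epochs, stop_after)):
        loss = 1.0 / (epoch + 1)  # stand-in for a real training step
        state = {"epoch": epoch + 1, "best_loss": min(state["best_loss"], loss)}
        save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, "checkpoint.json")
    # First run is interrupted after two epochs; the second run resumes from there.
    interrupted = train(checkpoint_path, total_epochs=5, stop_after=2)
    resumed = train(checkpoint_path, total_epochs=5, stop_after=5)
```

The second call does not start from epoch 0. It reads the stored state and continues with the third epoch.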