# Why you should use AutoTransformers

## High Productivity

We develop the {term}`AutoTransformers library` with the key requirement that users can access very complex technologies with only a few lines of code. It is even possible to generate this code automatically using the {term}`AutoTransformers wizard`. This shifts the focus away from a particular technology or framework (e.g. PyTorch vs. TensorFlow) to the use-case that actually has to be solved.

## One System for all Domains and Tasks

Many different methods were developed in the past for NLP, document processing or computer vision. With {term}`AutoTransformers` we offer one solution and one standardized API that can be used across all of these domains. Additionally, models are automatically generated for a given problem and dataset: for example, if information should be extracted from a document and the document should also be classified, the model will automatically learn to predict both tasks. The following list shows the tasks that can be solved for the different domains:

- Text
  - Information Extraction
  - Single Label Classification
  - Multi Label Classification
  - Multi Single Label Classification
- Document
  - Information Extraction
  - Single Label Classification
- Computer Vision
  - Object Detection
  - Object Classification

## On-Premise Execution

{term}`AutoTransformers` can be used on-premise to ensure that no data is shared with any party outside your company.

## Create High Performance Models without Expert Knowledge

{term}`AutoTransformers` is designed such that data can automatically be transformed into high-performance {term}`models` without the need for machine-learning expert knowledge. Through our easy-to-use [HighLevel API](/api/api), the {term}`AutoTransformers library` enables classical software developers to harness the latest deep-learning technologies and to solve tasks and problems that would otherwise not be solvable. Our research department, which consistently improves and adapts the {term}`AutoTransformers library`, ensures that {term}`models` created with {term}`AutoTransformers` reach the highest performance.

## Easy Access to Latest Technologies that also Perform Well in the Real World

Progress in machine learning, and especially deep learning, is incredibly fast. Often, novel findings that seem to work well and push the state of the art on scientific problems do not generalize to real-world problems. Real-world data is fundamentally different from scientific datasets: it is not perfect, it sometimes contains inconsistent labels because it is labeled by many parties, and the datasets are often smaller. Therefore, new methods that work well on scientific data sometimes perform worse when used for real-world problems. Our research department developed the real-world **D**eep**O**pinion **Be**nchmarks (DOBe), which contain more than 40 real-world customer datasets with different languages, tasks and sizes. Our researchers use these benchmarks to extensively test novel methods and technologies on real-world problems and to ensure that they also work well in practice. Our findings are then directly implemented into {term}`AutoTransformers` and released through our [HighLevel API](/api/api) to ensure that often very complex research is easily accessible without the need for hundreds of lines of code.
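To make the "few lines of code" idea concrete, the following sketch shows roughly how such a HighLevel-API workflow could look. The import name, classes and methods used here are hypothetical placeholders chosen purely for illustration and are **not** the documented {term}`AutoTransformers` interface; please refer to the [HighLevel API](/api/api) reference for the actual calls.

```python
# Illustrative sketch only: the module, class and method names below are
# assumptions made for this example, not the documented AutoTransformers API.
import autotransformers as at  # hypothetical import name

# Load a labeled dataset and train a single-label text classifier.
dataset = at.Dataset.from_csv("reviews.csv")                     # hypothetical loader
model = at.train(dataset, task="single_label_classification")    # hypothetical trainer

# Use the trained model for inference on new data.
predictions = model.predict(["The battery life is great, the screen is not."])
print(predictions)
```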
## We Have a Strong Research Base to Provide High Performance

We not only test and implement the latest machine-learning research in {term}`AutoTransformers`, but we also push the state of the art in deep learning through our own research. Here are some selected papers that we published in high-ranking journals and conferences and that not only pushed research forward but also greatly improved our library:

### Domain Adaptation {footcite:p}`domain-adaption`

Aspect-Target Sentiment Classification (ATSC) is a subtask of Aspect-Based Sentiment Analysis (ABSA), which has many applications, e.g. in e-commerce, where data and insights from reviews can be leveraged to create value for businesses and customers. Recently, deep transfer-learning methods have been applied successfully to a myriad of Natural Language Processing (NLP) tasks, including ATSC. Building on top of the prominent BERT language model, we approach ATSC using a two-step procedure: self-supervised domain-specific BERT language model finetuning, followed by supervised task-specific finetuning. Our findings on how to best exploit domain-specific language model finetuning enable us to produce new state-of-the-art performance on the SemEval 2014 Task 4 restaurants dataset. In addition, to explore the real-world robustness of our models, we perform cross-domain evaluation. We show that a cross-domain adapted BERT language model performs significantly better than strong baseline models like vanilla BERT-base and XLNet-base. Finally, we conduct a case study to interpret model prediction errors.

### Greedy Layer Pruning {footcite:p}`greedy-layer-pruning`

Fine-tuning transformer models after unsupervised pre-training reaches a very high performance on many different natural language processing tasks. Unfortunately, transformers suffer from long inference times which greatly increase costs in production. One possible solution is knowledge distillation, which addresses this problem by transferring information from large teacher models to smaller student models. Knowledge distillation maintains high performance and reaches high compression rates; nevertheless, the size of the student model is fixed after pre-training and cannot be changed individually for a given downstream task and use-case to reach a desired performance/speedup ratio. Another solution to reduce the size of models in a much more fine-grained and computationally cheaper fashion is to prune layers after pre-training. The price to pay is that the performance of layer-wise pruning algorithms is not on par with state-of-the-art knowledge distillation methods. In this paper, greedy layer pruning is introduced to (1) outperform the current state of the art for layer-wise pruning, (2) close the performance gap when compared to knowledge distillation, while (3) providing a method to adapt the model size dynamically to reach a desired performance/speedup tradeoff without the need for additional pre-training phases.
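For intuition, the following simplified sketch illustrates the greedy selection idea: at each step, every remaining layer is tentatively removed, the resulting model is scored on a validation set, and the layer whose removal costs the least performance is pruned permanently. The helpers `evaluate` and `remove_layer` are hypothetical placeholders, and the actual algorithm from the paper may score candidates differently; this is a sketch of the principle, not the reference implementation.

```python
# Simplified sketch of a greedy layer-pruning loop. "evaluate" and
# "remove_layer" are hypothetical helpers assumed for illustration; they are
# not part of the AutoTransformers API.
from copy import deepcopy


def greedy_layer_pruning(model, layers, evaluate, remove_layer, target_num_layers):
    """Iteratively drop the layer whose removal hurts the validation score the least.

    `layers` is a list of layer identifiers understood by `remove_layer`;
    `evaluate` returns a validation metric (higher is better).
    """
    kept = list(layers)
    while len(kept) > target_num_layers:
        best_score, best_layer = float("-inf"), None
        for layer in kept:
            candidate = remove_layer(deepcopy(model), layer)  # tentatively prune one layer
            score = evaluate(candidate)                       # e.g. dev-set accuracy
            if score > best_score:
                best_score, best_layer = score, layer
        kept.remove(best_layer)                               # prune greedily
        model = remove_layer(model, best_layer)
    return model, kept
```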
### The Conflicting Bundles Problem {footcite:p}`conflicting-bundle-problem` {footcite:p}`conflicting-bundle-module`

Designing neural network architectures is a challenging task, and knowing which specific layers of a model must be adapted to improve the performance is almost a mystery. In this paper, we introduce a novel theory and metric to identify layers that decrease the test accuracy of trained models; this identification is done as early as at the beginning of training. In the worst case, such a layer could lead to a network that cannot be trained at all. More precisely, we identify those layers that worsen the performance because they produce conflicting training bundles, as we show in our novel theoretical analysis, complemented by our extensive empirical studies. Based on these findings, a novel algorithm is introduced to remove performance-decreasing layers automatically. Architectures found by this algorithm achieve a competitive accuracy when compared against state-of-the-art architectures. While keeping such high accuracy, our approach drastically reduces memory consumption and inference time for different computer vision tasks.

### Layerwise Batch Entropy Regularization {footcite:p}`layerwise-batch-entropy`

Training deep neural networks is a very demanding task; especially challenging is how to adapt architectures to improve the performance of trained models. We find that sometimes shallow networks generalize better than deep networks, and that the addition of more layers results in higher training and test errors. The deep residual learning framework addresses this degradation problem by adding skip connections to several neural network layers. It would at first seem counter-intuitive that such skip connections are needed to train deep networks successfully, as the expressivity of a network grows exponentially with depth. In this paper, we first analyze the flow of information through neural networks. We introduce and evaluate the batch-entropy, which quantifies the flow of information through each layer of a neural network. We prove empirically and theoretically that a positive batch-entropy is required for gradient descent-based training approaches to optimize a given loss function successfully. Based on those insights, we introduce batch-entropy regularization to enable gradient descent-based training algorithms to optimize the flow of information through each hidden layer individually. With batch-entropy regularization, gradient descent optimizers can transform untrainable networks into trainable networks. We show empirically that we can therefore train a "vanilla" fully connected network and convolutional neural network---no skip connections, batch normalization, dropout, or any other architectural tweak---with 500 layers by simply adding the batch-entropy regularization term to the loss function. The effect of batch-entropy regularization is evaluated not only on vanilla neural networks, but also on residual networks, autoencoders, and transformer models over a wide range of computer vision as well as natural language processing tasks.

## References

```{eval-rst}
.. footbibliography::
```