Babysit the training

Several options exist for babysitting the training process. The most common one is to use WandB or ClearML. To enable those modules, simply install the corresponding packages and enable the modules in the sys_config:

sys_config = [
    ("engine/modules/tracking.Wandb/enabled", True),
    ("engine/modules/tracking.Wandb/project", "MyProject"),
    ("engine/modules/tracking.ClearML/enabled", True),
    ("engine/modules/tracking.ClearML/project", "MyProject"),
    ("engine/modules/tracking.ClearML/task", "MyTask"),
]
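The slash-separated keys address nested settings in the configuration tree. Purely as an illustration of that key structure (the helper below is hypothetical and not part of the library), the entries can be expanded into a nested dict:

```python
def to_nested(entries):
    """Expand ("a/b/c", value) tuples into a nested dict (illustrative only)."""
    tree = {}
    for path, value in entries:
        node = tree
        parts = path.split("/")
        # Walk/create intermediate levels, then set the leaf value.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return tree

sys_config = [
    ("engine/modules/tracking.Wandb/enabled", True),
    ("engine/modules/tracking.Wandb/project", "MyProject"),
]
print(to_nested(sys_config))
```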

If a custom module is required (e.g. to send information to an on-premise server), it is also possible to implement one:

from typing import Optional

class MyCustomModule:
    def on_log_dict(
        self,
        data: dict,
        is_training: bool,
        step: Optional[int],
        epoch: Optional[int],
        training_progress: Optional[float],
        **kwargs,
    ):
        """Is called after each n steps as configured in the config 'engine/log_interval'.

        Args:
            data (dict): Dictionary containing the metrics.
            is_training (bool): True if we are currently training, False if we are evaluating or testing.
            step (Optional[int]): Current step (None if not available i.e. if we evaluate or test the model).
            epoch (Optional[int]): Current epoch (None if not available i.e. if we evaluate or test the model).
            training_progress (Optional[float]): Current training progress in [0, 1]. None if it is
                not possible to compute a training progress (e.g. no max. number of steps -- training runs indefinitely).
        """
        if is_training:
            print(f"Metrics at training progress {training_progress} (step={step}, epoch={epoch}):")
            print(data)
            # During training we can also raise autotransformers.exceptions.StopTraining()
            # in order to stop the training gracefully.
        else:
            print("Eval metrics:")
            print(data)
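The StopTraining comment above suggests a natural use case: early stopping. The sketch below is hypothetical (the class name and patience logic are our own, and a stand-in exception is used so the snippet is self-contained); in a real module you would raise autotransformers.exceptions.StopTraining() instead:

```python
class StopTraining(Exception):
    """Stand-in for autotransformers.exceptions.StopTraining (illustrative)."""

class EarlyStoppingModule:
    """Hypothetical callback module: stop training once the loss has not
    improved for `patience` consecutive log calls."""

    def __init__(self, patience: int = 3) -> None:
        self.patience = patience
        self.best_loss = float("inf")
        self.stale = 0

    def on_log_dict(self, data, is_training, step=None, epoch=None,
                    training_progress=None, **kwargs):
        # Only react to training metrics that contain a loss.
        if not is_training or "loss" not in data:
            return
        if data["loss"] < self.best_loss:
            self.best_loss = data["loss"]
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                raise StopTraining()
```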

To use the sys_config or the custom module, simply pass them to the AutoTransformer init:

# Load dataset
dl = DatasetLoader(path=path)
at = AutoTransformer(sys_config)
my_custom_module = MyCustomModule()

# Initialize the autotransformer with our custom callback module
at.init(dataset_loader=dl, path=None, custom_modules=[my_custom_module])

# Train the model
at.train(dl, eval_split=0.2)

It is also possible to keep logging after restarting training from a checkpoint. Instead of calling at.init, we load the model, pass all our custom modules to the load method, and continue training:

# Load dataset
dl = DatasetLoader(path=path)
my_custom_module = MyCustomModule()
at = AutoTransformer().load("my_path", custom_modules=[my_custom_module])

# Continue training
at.train(dl, eval_split=0.2)