mednet.data.datamodule#
Classes

| CachingDataModule | A simplified version of our DataModule for a single split. |
| ConcatDataModule | A convenient DataModule with dictionary split loading, mini-batching, parallelisation and caching, all in one. |
- class mednet.data.datamodule.ConcatDataModule(splits, database_name='', split_name='', cache_samples=False, balance_sampler_by_class=False, batch_size=1, batch_chunk_count=1, drop_incomplete_batch=False, parallel=-1)[source]#
Bases: LightningDataModule

A convenient DataModule with dictionary split loading, mini-batching, parallelisation and caching, all in one.
Instances of this class can load and concatenate an arbitrary number of data-split (a.k.a. protocol) definitions for (possibly disjoint) databases, and can manage raw data-loading from disk. An optional caching mechanism stores the data in associated CPU memory, which can improve data serving while training and evaluating models.
This DataModule defines basic operations to handle data loading and mini-batch handling within this package's framework. It can return torch.utils.data.DataLoader objects for training, validation, prediction and testing conditions. Parallelisation is handled by a simple input flag. A construction sketch is shown after the parameter list below.

- Parameters:
  - splits (Mapping[str, Sequence[tuple[Sequence[Any], RawDataLoader]]]) – A dictionary that contains string keys representing dataset names, and values that are iterables over 2-tuples, each containing an iterable over arbitrary, user-configurable sample representations (potentially on disk or permanent storage) and a typing.RawDataLoader (or "sample") loader object, which concretely implements a mechanism to load such samples into memory from permanent storage.

    Sample representations on permanent storage may be of any iterable format (e.g. list, dictionary, etc.), as long as the assigned typing.RawDataLoader can properly handle them.

    Tip: To check that the split and the loader function work correctly, you may use split.check_database_split_loading().

    This class expects at least one entry called train to exist in the input dictionary. Optional entries are validation and test. Entries named monitor-... are considered extra datasets that do not influence any early-stop criteria during training, and are simply monitored in addition to the validation dataset.

  - database_name (str) – The name of the database, or aggregated database, containing the raw samples served by this data module.

  - split_name (str) – The name of the split used to group the samples into the various datasets for training, validation and testing.

  - cache_samples (bool) – If set, raw data loading is performed during prepare_data(), and samples are served from CPU memory. Otherwise, samples are loaded from disk on demand. Running from CPU memory offers increased speed in exchange for CPU memory. Sufficient CPU memory must be available before you set this attribute to True. It is typically useful for relatively small datasets.

  - balance_sampler_by_class (bool) – If set, modifies the random sampler used during training and validation to balance sample-picking probability, making sampling across classes and datasets equitable.

  - batch_size (int) – Number of samples in every training batch (this parameter affects memory requirements for the network). If the number of samples in the batch is larger than the total number of samples available for training, this value is truncated. If this number is smaller, then batches of the specified size are created and fed to the network until there are no more new samples to feed (the epoch is finished). If the total number of training samples is not a multiple of the batch size, the last batch will be smaller than the first, unless drop_incomplete_batch is set to True, in which case this batch is not used.

  - batch_chunk_count (int) – Number of chunks in every batch (this parameter affects memory requirements for the network). The number of samples loaded per iteration will be batch_size / batch_chunk_count. batch_size needs to be divisible by batch_chunk_count, otherwise an error will be raised. This parameter is used to reduce the number of samples loaded in each iteration, in order to reduce memory usage in exchange for processing time (more iterations). This is especially interesting when one is running on GPUs with limited RAM. The default of 1 forces the whole batch to be processed at once. Otherwise, the batch is broken into batch_chunk_count pieces, and gradients are accumulated to complete each batch.

  - drop_incomplete_batch (bool) – If set, the last batch of an epoch may be dropped in case it is incomplete. If you set this option, you should also consider increasing the total number of training epochs, as the total number of training steps may be reduced.

  - parallel (int) – Use multiprocessing for data loading: if set to -1 (default), disables multiprocessing data loading. Set to 0 to enable as many data loading instances as processing cores available in the system. Set to >= 1 to enable that many multiprocessing instances for data loading.
- DatasetDictionary#
A dictionary of datasets mapping names to actual datasets.
- property parallel: int#
Whether to use multiprocessing for data loading.
Use multiprocessing for data loading: if set to -1 (default), disables multiprocessing data loading. Set to 0 to enable as many data loading instances as processing cores available in the system. Set to >= 1 to enable that many multiprocessing instances for data loading.
It sets the parameter num_workers (from DataLoaders) to match the expected PyTorch representation. For macOS machines, it also sets the multiprocessing_context to use spawn instead of the default.

The mapping between the command-line interface parallel setting and the DataLoader parameters works like this:

Table 1: Relationship between parallel and DataLoader parameters

| CLI parallel | torch.utils.data.DataLoader kwargs | Comments |
| ------------ | ---------------------------------- | -------- |
| < 0          | num_workers = 0                    | Disables multiprocessing entirely; executes everything within the same processing context |
| 0            | num_workers = number of CPU cores  | Runs mini-batch data loading on as many external processes as CPUs available in the current machine |
| >= 1         | num_workers = parallel             | Runs mini-batch data loading on as many external processes as set on parallel |

- Returns:
The value of self._parallel.
- Return type: int
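As an illustration only (this is not the module's actual code), the documented mapping could be realized with logic along these lines:

```python
# Illustrative sketch of the documented mapping between ``parallel`` and
# torch.utils.data.DataLoader keyword arguments (not mednet's actual code).
import multiprocessing
import sys


def dataloader_kwargs(parallel: int) -> dict:
    kwargs = {}
    if parallel < 0:
        kwargs["num_workers"] = 0  # no multiprocessing at all
    elif parallel == 0:
        kwargs["num_workers"] = multiprocessing.cpu_count()  # one per CPU core
    else:
        kwargs["num_workers"] = parallel
    if kwargs["num_workers"] > 0 and sys.platform == "darwin":
        # macOS: use "spawn" instead of the platform default context
        kwargs["multiprocessing_context"] = multiprocessing.get_context("spawn")
    return kwargs
```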
- property model_transforms: list[Callable[[Tensor], Tensor]] | None#
Transforms required to fit data into the model.

A list of transforms (torch modules) that will be applied after raw data loading, and just before data is fed into the model or into eventual data-augmentation transformations, for all data loaders produced by this DataModule. This part of the pipeline receives data as output by the raw-data loader, or by model-related transforms (e.g. resize adaptations), if any are specified. If data is cached, it is cached after the model transforms are applied, as that is a potential memory saver (e.g., if they contain a resizing operation to smaller images).
- Returns:
A list containing the model transforms.
- Return type: list[Callable[[Tensor], Tensor]] | None
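For illustration only, the value of this property is simply a list of Tensor-to-Tensor callables; the torchvision transforms below are assumptions used as examples, and how the list is populated (typically from the model in use) is not shown here.

```python
# Hypothetical contents of ``model_transforms``: plain Tensor -> Tensor
# callables, applied right after raw data loading (and before caching).
import torch
import torchvision.transforms as T

example_model_transforms = [
    T.Resize(256),       # shrink early; cached data stays small
    T.CenterCrop(224),
    lambda x: x.to(torch.float32) / 255.0,  # any callable qualifies
]
```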
- property balance_sampler_by_class: bool#
Whether to balance samples across labels/datasets.
If set, then modifies the random sampler used during training and validation to balance sample-picking probability, making sampling across classes and datasets equitable.
Warning
This method does NOT balance the sampler per dataset, in case multiple datasets compose the same training set. It only balances samples according to their ground-truth (labels). If you'd like to have samples balanced per dataset, then implement your own data module inheriting from this one.
- Returns:
True if self._train_sample is set, else False.
- Return type: bool
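For example, balancing is requested through the matching constructor parameter (a sketch reusing the hypothetical splits dictionary from the ConcatDataModule example above):

```python
# Request label-balanced sampling at construction time (``splits`` is the
# hypothetical dictionary from the earlier ConcatDataModule sketch).
balanced_datamodule = ConcatDataModule(
    splits,
    balance_sampler_by_class=True,  # balance picking probability by label
    batch_size=4,
)
```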
- set_chunk_size(batch_size, batch_chunk_count)[source]#
Coherently set the batch-chunk-size after validation.
- Parameters:
  - batch_size (int) – Number of samples in every training batch (this parameter affects memory requirements for the network). If the number of samples in the batch is larger than the total number of samples available for training, this value is truncated. If this number is smaller, then batches of the specified size are created and fed to the network until there are no more new samples to feed (the epoch is finished). If the total number of training samples is not a multiple of the batch size, the last batch will be smaller than the first, unless drop_incomplete_batch is set to True, in which case this batch is not used.

  - batch_chunk_count (int) – Number of chunks in every batch (this parameter affects memory requirements for the network). The number of samples loaded per iteration will be batch_size / batch_chunk_count. batch_size needs to be divisible by batch_chunk_count, otherwise an error will be raised. This parameter is used to reduce the number of samples loaded in each iteration, in order to reduce memory usage in exchange for processing time (more iterations). This is especially interesting when one is running on GPUs with limited RAM. The default of 1 forces the whole batch to be processed at once. Otherwise, the batch is broken into batch_chunk_count pieces, and gradients are accumulated to complete each batch.
- Return type: None
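A usage sketch, assuming datamodule is an already-constructed ConcatDataModule:

```python
# 16-sample batches, loaded 4 samples at a time; gradients are accumulated
# over the 4 chunks to complete each batch.
datamodule.set_chunk_size(batch_size=16, batch_chunk_count=4)

# This call would raise an error: 16 is not divisible by 5.
# datamodule.set_chunk_size(batch_size=16, batch_chunk_count=5)
```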
- setup(stage)[source]#
Set up datasets for different tasks on the pipeline.
This method should set up (load, pre-process, etc.) all datasets required for a particular stage (fit, validate, test, predict), and keep them ready to be used by whichever of the _dataloader() functions are pertinent for that stage.

If you have set cache_samples, samples are loaded at this stage and cached in memory.

- Parameters:
  - stage (str) – Name of the stage in which the setup is applicable. Can be one of fit, validate, test or predict. Each stage typically uses the following data loaders:

    - fit: uses both the train and validation datasets
    - validate: uses only the validation dataset
    - test: uses only the test dataset
    - predict: uses only the test dataset
- Return type: None
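The sketch below drives the hooks manually (PyTorch Lightning's Trainer normally calls them for you); it assumes the standard Lightning dataloader hooks, such as train_dataloader(), which this DataModule serves.

```python
# Manual-driving sketch; ``datamodule`` is a configured ConcatDataModule.
datamodule.prepare_data()  # pre-loads samples if cache_samples=True
datamodule.setup("fit")    # builds the train and validation datasets

for batch in datamodule.train_dataloader():
    ...  # feed the model here

datamodule.teardown("fit")  # releases the datasets (and any cached memory)
```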
- teardown(stage)[source]#
Tear down datasets set up for different tasks on the pipeline.
This method unsets (unloads, removes from memory, etc.) all datasets required for a particular stage (fit, validate, test, predict).

If you have set cache_samples, samples were loaded into memory during setup, and this call may effectively release all the associated memory.
  - stage (str) – Name of the stage in which the teardown is applicable. Can be one of fit, validate, test or predict. Each stage typically uses the following data loaders:

    - fit: uses both the train and validation datasets
    - validate: uses only the validation dataset
    - test: uses only the test dataset
    - predict: uses only the test dataset
- Return type: None
- class mednet.data.datamodule.CachingDataModule(database_split, raw_data_loader, **kwargs)[source]#
Bases: ConcatDataModule

A simplified version of our DataModule for a single split.
Apart from construction, the behaviour of this DataModule is very similar to its more general counterpart, ConcatDataModule, serving training, validation and test sets. A construction sketch follows the parameter list below.
- Parameters:
  - database_split (Mapping[str, Sequence[Any]]) – A dictionary that contains string keys representing dataset names, and values that are iterables over sample representations (potentially on disk). These objects are passed to a unique typing.RawDataLoader for loading the typing.Sample data (and metadata) in memory. It therefore assumes the whole split is homogeneous and can be loaded in the same way.

    Tip: To check that the split and the loader function work correctly, you may use split.check_database_split_loading().

    This class expects at least one entry called train to exist in the input dictionary. Optional entries are validation and test. Entries named monitor-... are considered extra datasets that do not influence any early-stop criteria during training, and are simply monitored in addition to the validation dataset.

  - raw_data_loader (RawDataLoader) – An object instance that can load samples and labels from storage.

  - **kwargs – Named parameters matching those of ConcatDataModule, other than splits.
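A construction sketch, again with hypothetical sample representations and a placeholder my_loader standing for a concrete RawDataLoader implementation:

```python
# Minimal sketch: a single, homogeneous split served by one loader instance.
from mednet.data.datamodule import CachingDataModule

my_loader = ...  # a concrete RawDataLoader for this database (assumed)

database_split = {
    "train": [["patient-001.png", 0], ["patient-002.png", 1]],
    "validation": [["patient-003.png", 0]],
    "test": [["patient-004.png", 1]],
}

datamodule = CachingDataModule(
    database_split,
    raw_data_loader=my_loader,
    database_name="toy-database",  # forwarded to ConcatDataModule
    batch_size=4,
    cache_samples=True,            # keep loaded samples in CPU memory
)
```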