mednet.data.split#

Functions

- check_database_split_loading(database_split, loader, limit=0) – For each dataset in the split, check if all data can be correctly loaded using the provided loader function.

Classes

- CSVDatabaseSplit(directory) – Define a loader that understands a database split (train, test, etc.) in CSV format.
- JSONDatabaseSplit(path) – Define a loader that understands a database split (train, test, etc.) in JSON format.
- class mednet.data.split.JSONDatabaseSplit(path)[source]#

  Bases: Mapping[str, Sequence[Any]]

  Define a loader that understands a database split (train, test, etc.) in JSON format.

  To create a new database split, you need to provide a JSON-formatted dictionary in a file, with contents similar to the following:

      {
          "dataset1": [
              ["sample1-data1", "sample1-data2", "sample1-data3"],
              ["sample2-data1", "sample2-data2", "sample2-data3"]
          ],
          "dataset2": [
              ["sample42-data1", "sample42-data2", "sample42-data3"]
          ]
      }

  Your database split may contain any number of (raw) datasets (dictionary keys). For simplicity, we recommend formatting all sample entries similarly, so that raw-data loading is simplified. Use the function check_database_split_loading() to test raw-data loading and fine-tune the dataset split, or its loading.

  Objects of this class behave like a dictionary in which keys are dataset names in the split, and values represent sample data and meta-data. The actual JSON file contents are loaded on demand using functools.cached_property.

  - Parameters:

    path (Path | str | Traversable) – Absolute path to a .json-formatted file containing the database split to be recognized by this object.
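  The structure above can be sketched with the standard library alone. This is an illustrative example, not mednet code: it writes a split file in the documented format (with hypothetical sample data) and reads it back with the json module to show the mapping-like layout that JSONDatabaseSplit exposes.

  ```python
  import json
  from pathlib import Path
  from tempfile import TemporaryDirectory

  # A minimal split in the documented format (hypothetical sample data).
  split = {
      "train": [
          ["sample1-data1", "sample1-data2", "sample1-data3"],
          ["sample2-data1", "sample2-data2", "sample2-data3"],
      ],
      "test": [
          ["sample42-data1", "sample42-data2", "sample42-data3"],
      ],
  }

  with TemporaryDirectory() as tmp:
      path = Path(tmp) / "split.json"
      path.write_text(json.dumps(split))

      # mednet would load this via JSONDatabaseSplit(path); here plain
      # json.loads() illustrates the resulting dataset-name -> samples mapping.
      loaded = json.loads(path.read_text())
      print(sorted(loaded.keys()))  # dataset names in the split
      print(len(loaded["train"]))   # number of samples in "train"
  ```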
- class mednet.data.split.CSVDatabaseSplit(directory)[source]#

  Bases: Mapping[str, Sequence[Any]]

  Define a loader that understands a database split (train, test, etc.) in CSV format.

  To create a new database split, you need to provide one or more CSV-formatted files, each representing one dataset of this split and containing the sample data (one sample per row). Example:

  Inside the directory my-split/, one can find the files train.csv, validation.csv, and test.csv. Each file has a structure similar to the following:

      sample1-value1,sample1-value2,sample1-value3
      sample2-value1,sample2-value2,sample2-value3
      ...

  Each file in the provided directory defines the dataset name of the split: the file train.csv will contain the data of the train dataset, and so on.

  Objects of this class behave like a dictionary in which keys are dataset names in the split, and values represent sample data and meta-data.

  - Parameters:

    directory (Path | str | Traversable) – Absolute path to a directory containing the database split organized as a set of CSV files, one per dataset.
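  A minimal sketch of this layout, again using only the standard library (not mednet itself): it creates a hypothetical my-split/ directory with two CSV files and assembles the same kind of mapping that CSVDatabaseSplit exposes, keyed by file name (minus the .csv extension).

  ```python
  import csv
  from pathlib import Path
  from tempfile import TemporaryDirectory

  with TemporaryDirectory() as tmp:
      d = Path(tmp) / "my-split"
      d.mkdir()
      (d / "train.csv").write_text(
          "sample1-value1,sample1-value2,sample1-value3\n"
          "sample2-value1,sample2-value2,sample2-value3\n"
      )
      (d / "test.csv").write_text("sample3-value1,sample3-value2,sample3-value3\n")

      # Build the split mapping as CSVDatabaseSplit would expose it:
      # one key per CSV file, one row (sequence of values) per sample.
      split = {}
      for f in sorted(d.glob("*.csv")):
          with f.open(newline="") as fh:
              split[f.stem] = [row for row in csv.reader(fh)]

      print(sorted(split.keys()))   # dataset names, from the file names
      print(split["train"][0][1])   # second value of the first "train" sample
  ```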
- mednet.data.split.check_database_split_loading(database_split, loader, limit=0)[source]#

  For each dataset in the split, check if all data can be correctly loaded using the provided loader function.

  This function returns the number of errors found when loading samples, and logs more detailed information to the logging stream.

  - Parameters:

    database_split (Mapping[str, Sequence[Any]]) – A mapping that contains the database split. Each key represents the name of a dataset in the split; each value is a sequence of (potentially complex) objects, each representing a single sample.

    loader (RawDataLoader) – A loader object that knows how to handle full samples or just labels.

    limit (int) – Maximum number of samples to check (in each split/dataset combination). If set to zero, check everything.

  - Returns:

    Number of errors found.

  - Return type:

    int
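  The check boils down to attempting each load and counting failures. Below is an illustrative re-implementation, not mednet's actual code: the function name check_loading is hypothetical, and the loader is simplified to a plain callable taking one sample entry, whereas the real function receives a mednet RawDataLoader.

  ```python
  import logging

  logger = logging.getLogger(__name__)

  def check_loading(database_split, loader, limit=0):
      """Count how many samples in the split fail to load (illustrative sketch)."""
      errors = 0
      for name, samples in database_split.items():
          # A zero limit means: check every sample of every dataset.
          to_check = samples if limit == 0 else samples[:limit]
          for sample in to_check:
              try:
                  loader(sample)
              except Exception as exc:
                  logger.error("Cannot load sample %r from %s: %s", sample, name, exc)
                  errors += 1
      return errors

  # Hypothetical split and loader: loading fails on non-numeric data.
  split = {"train": [[1], [2], ["bad"]], "test": [[3]]}

  def loader(sample):
      return int(sample[0]) * 2

  print(check_loading(split, loader))           # 1 error: the "bad" sample
  print(check_loading(split, loader, limit=2))  # 0: the bad sample is past the limit
  ```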