In modern data-driven workflows, advanced file exchange formats rich with metadata and versatile multi-modal storage solutions are essential for efficient data management. These tools not only facilitate seamless sharing and integration of complex datasets across diverse systems but also enable robust version control, ensuring that data transformations and updates are traceable and reproducible. Such capabilities are critical for maintaining data integrity, supporting collaboration, and accelerating innovation in research and development.
This is the outline of the code example:
- Load a meta-data rich dataset using the Croissant exchange format
- After data retrieval store the data in a Deeplake database for further usage
- Process the data and save the subversion on a separate data branch
- (optional): Merge the branch of the subversion-controlled data back to the main data branch
To get started, install the necessary dependencies:
!pip install deeplake mlcroissant scikit-image
Next, we provide the necessary arguments. The dataset we are going to use is called HEp-2. The HEp-2 (Human Epithelial type 2) dataset is a widely used benchmark in medical image analysis, especially for the task of antinuclear antibody (ANA) pattern classification [1].
Exemplary image:

dataset_name = 'hep2_hf'
org_id = <your-org>  # CHANGE THIS ACCORDING TO YOUR ORG ON https://app.activeloop.ai/
path_to_deeplake_db = f'al://{org_id}/{dataset_name}'
path_to_croissant_file = f'/content/{dataset_name}.json'
For the sake of this example we create our own Croissant file. Croissant files are also provided by Hugging Face, Kaggle, OpenML, and TFDS. If you want to share confidential, in-house data, you can simply create your own Croissant file (either programmatically or using the Croissant Editor). The file is then read like this:
import mlcroissant as mlc
dataset = mlc.Dataset(jsonld=path_to_croissant_file)
metadata = dataset.metadata.to_json()
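For reference, a Croissant file is just JSON-LD. A minimal hand-written skeleton could be assembled with the standard library; note that the values below are hypothetical placeholders, not the actual HEp-2 metadata, and a real file would also fill in the distribution and record-set entries.

```python
import json

# A minimal, hand-written Croissant 1.0 skeleton (hypothetical placeholder
# values, not the actual HEp-2 metadata); a real file would also fill in
# "distribution" (the data files) and "recordSet" (their schema).
croissant_skeleton = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "sc:Dataset",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "hep2_hf",
    "description": "HEp-2 cell images for ANA pattern classification.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/hep2",  # placeholder URL
    "distribution": [],
    "recordSet": [],
}

# Serialize; in the tutorial this would be written to path_to_croissant_file
croissant_json = json.dumps(croissant_skeleton, indent=2)
```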
The code below takes the Croissant 🥐 file, meant for comprehensive data sharing, and turns it into a general-purpose Deep Lake object. The best of two worlds 💪!
As a next step, we save all the metadata contained in the Croissant file in a Deep Lake object.
import deeplake
from datetime import datetime

ds = deeplake.create(path_to_deeplake_db)

for key in metadata:
    if key == 'recordSet':
        continue
    print(f"Adding Croissant metadata to deeplake DB: {key}")
    croissant_obj = metadata[key]
    if isinstance(croissant_obj, datetime):
        # metadata values must be serializable, so format datetimes as strings
        croissant_obj = croissant_obj.strftime("%Y-%m-%d %H:%M:%S.%f")
    ds.metadata[key] = croissant_obj
Then store the raw data (microscopy images in our case) in the Deep Lake object. Once done, commit the data to the main branch.
record_sets = ", ".join([f"`{rs.id}`" for rs in dataset.metadata.record_sets])
record_sets = [f"{rs.id}" for rs in dataset.metadata.record_sets]
for i in record_sets:
ds.add_column("record_set", "text")
ds.add_column("filename", "text")
ds.add_column("label", "text")
ds.add_column("image", deeplake.types.Image(sample_compression="png"))
records_loaded = dataset.records(record_set=i)
print("number of images in the dataset: {}".format(len(list(records_loaded))))
for j,record in tqdm(enumerate(records_loaded), total=len(list(records_loaded))):
arr = np.asarray(record['images/image_content'])
new_arr = np.expand_dims(arr, axis=-1)
if len(new_arr.shape) == 2: continue
ds.append([{
"record_set": i,
"filename": record['images/image_filename'].decode("utf-8"),
"label": record['images/label'].decode("utf-8"),
"image": new_arr
}])
ds.commit("initial commit after importing from Croissant")
This creates a Deep Lake database with the following specs:
ds.summary()
Dataset length: 13596
Columns:
record_set: text
filename : text
label : text
image : kind=image, dtype=array(dtype=uint8, shape=(None, None, None))
As a sanity check, load an image from Deep Lake and display it:
import matplotlib.pyplot as plt
from PIL import Image

i = 12000
print(ds[i]["filename"])
print(ds[i]["label"])
array = ds[i]["image"]
img = Image.fromarray(np.squeeze(array))
plt.imshow(img)
plt.axis('off')  # Hide axes
plt.show()
03730.png
Speckled

Currently, only a single main branch exists.
print(ds.branches)
Branches: main
Next, let’s create a branch to store a subversion of the data. In this example, the background illumination is estimated, and both the background image and the background-subtracted image are added to the Deep Lake object for downstream use. Note that the Python function used for the image transformation is also saved as a metadata entry.
from skimage import restoration

# Create branch
branch = ds.branch("bg_subtraction")
# Open branch
branch_ds = branch.open()
branch_ds.add_column("bg_subtracted", deeplake.types.Image(sample_compression="png"))
branch_ds.add_column("bg", deeplake.types.Image(sample_compression="png"))
column = branch_ds["image"]
for j, entry in tqdm(enumerate(column), total=len(list(column))):
    background = restoration.rolling_ball(entry, radius=25)
    result = entry - background
    branch_ds[j]["bg_subtracted"] = result
    branch_ds[j]["bg"] = background
    if j > 100:  # for time reasons, stop after ~100 images
        break
branch_ds.metadata["python-function"] = "restoration.rolling_ball(entry, radius=25)"
branch_ds.commit("rolling ball background subtraction")
Now another branch (called “bg_subtraction”) exists next to the main branch. Besides the Croissant metadata, it also holds the Python function that was used to estimate the background.
print(branch_ds.branches)
print(branch_ds.metadata.keys())
Branches: bg_subtraction, main
['@context', '@type', 'name', 'description', 'conformsTo', 'creator', 'keywords', 'license', 'url', 'distribution', 'python-function']
This is how an exemplary plot of branch_ds[i]["image"], branch_ds[i]["bg"], and branch_ds[i]["bg_subtracted"] would look:
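A minimal sketch of how such a side-by-side plot could be produced; the helper name show_triplet is ours, and in the tutorial you would pass branch_ds[i]["image"], branch_ds[i]["bg"], and branch_ds[i]["bg_subtracted"] as arguments.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

def show_triplet(image, bg, bg_subtracted):
    """Plot original image, estimated background, and background-subtracted
    image side by side (show_triplet is a hypothetical helper name)."""
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, arr, title in zip(axes, (image, bg, bg_subtracted),
                              ("image", "bg", "bg_subtracted")):
        ax.imshow(np.squeeze(arr), cmap="gray")
        ax.set_title(title)
        ax.axis("off")
    return fig
```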

Subtracting an estimate of the background is a common technique in biomedical image analysis and is very beneficial for accurate object detection or segmentation. Background subtraction can therefore affect the performance of a trained ML model. From a data-centric ML perspective, one could then ask: how can I improve background estimation to improve the accuracy of the downstream ML model, rather than exclusively investing effort in finding the optimal model architecture? An improved and well-studied data preparation step also has the potential to compensate for distribution shifts in the input data (imagine another microscope that generates much brighter background illumination).
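To make the distribution-shift point concrete, here is a toy NumPy sketch with synthetic arrays (not the HEp-2 data): two hypothetical microscopes add different smooth background levels to the same signal, and subtracting an accurate background estimate removes the difference entirely.

```python
import numpy as np

# Synthetic "cell" signal shared by both acquisitions
signal = np.zeros((32, 32))
signal[12:20, 12:20] = 100.0

# Two hypothetical microscopes: same signal, different background illumination
img_a = signal + 10.0   # dim background
img_b = signal + 60.0   # much brighter background

# With an accurate background estimate per image, subtraction removes the
# shift, so a downstream model sees identical inputs from both microscopes
corrected_a = img_a - 10.0
corrected_b = img_b - 60.0
print(np.allclose(corrected_a, corrected_b))  # True
```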
Lastly, if desired, the newly created branch "bg_subtraction" can be merged back into the main branch. For larger teams with a clear separation of responsibilities (one expert curates the data, another trains the models), it becomes crucial to agree on procedural standards, for example by applying the same version-management logic known from software development to the domain of training and test data.
ds.merge("bg_subtraction")
print("check main branch after merging")
ds.summary()
Dataset length: 13596
Columns:
record_set : text
filename : text
label : text
image : kind=image, dtype=array(dtype=uint8, shape=(None, None, None))
bg_subtracted: kind=image, dtype=array(dtype=uint8, shape=(None, None, None))
bg : kind=image, dtype=array(dtype=uint8, shape=(None, None, None))