Building a solutionsphere for data-centric ML using Croissant and Deep Lake

Building AI solutions that work in the real world requires a data-centric approach to ML. Instead of treating the strong ties to input data as a burden, embrace the fact that being firmly grounded in data is an advantage: if real-world data is provided, the developer knows with high confidence that her solution is going to work in the wild. While data-centric ML is primarily a way of working, it also requires adapting existing ML development tools and sometimes building new ones. Over time, the community can build a so-called solutionsphere that makes it increasingly easy to conduct data-centric ML development.

As more and more compatible tools are added to a solutionsphere, previously infeasible tasks become feasible. It is important that the solutionsphere (which can be a development platform or simply a pre-configured cloud account) is shareable and extendable. A critical mass of researchers and developers needs to agree on a setup and use it; otherwise, individual contributors keep going back to square one to build their favorite tech stack, which takes time, is inefficient, and can make results irreproducible.

This article proposes an exemplary setup that combines the power of an extended metadata exchange format (Croissant) with a scalable and versatile AI database for storage (Deep Lake). While the two capabilities in isolation (data transfer alone or data storage alone) are powerful tools already, they really create new ways of working when used in combination.

More specifically, Croissant takes ML data exchange to the next level:

  • it comes with rich metadata (think responsible AI), is machine readable by default and therefore fully web compliant (it follows the schema.org syntax), so atomic data points can be searched straight from the web
  • it is easy to build custom Croissant files, making it a breeze for data scientists to cater to new, elevated data-exchange requirements (besides this, it can also be used in enterprise setups, i.e. on in-house data assets never meant to be shared publicly but exchanged across the data silos that exist in big companies)
  • it accepts a wide set of data repositories (Kaggle, Hugging Face, OpenML, Dataverse) and can serve all the common ML frameworks (TensorFlow, PyTorch, Keras, JAX)
  • it also integrates with Apache Beam, which allows building scalable compute plans, so that dedicated cloud computing services like GCP Dataflow can execute data preparation on hundreds of workers in parallel

Check out the repository for more details: https://github.com/mlcommons/croissant
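
To get a first feel for the format, here is a minimal sketch of how a Croissant file can be inspected and streamed with the mlcroissant Python package; the URL and the record-set name are placeholders that you would replace with a real dataset, e.g. one found via Google Dataset Search.

```python
import mlcroissant as mlc

# Placeholder URL: point this at any Croissant JSON-LD file,
# e.g. one discovered via Google Dataset Search.
CROISSANT_URL = "https://example.org/my-dataset/croissant.json"

ds = mlc.Dataset(jsonld=CROISSANT_URL)

# The rich, schema.org-compliant metadata travels with the data.
print(ds.metadata.name)
print(ds.metadata.description)

# Stream a few records from one of the record sets declared in the file
# ("default" is a placeholder record-set name).
for i, record in enumerate(ds.records(record_set="default")):
    print(record)
    if i == 2:
        break
```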

What comes after the exchange of data?

Well… the data needs to be stored in its original shape and form. Do not change the file format, do not resample, and most importantly, do not discard any metadata. Any modification should be considered a branch off the main branch (the same concept as code version control).

🧐

Why is it important to keep all metadata?

While ML model training usually only needs the immediate input data x and the to-be-predicted target y, much more metadata is needed down the line. Once the ML model undergoes verification (which is different from validating on the validation data or any hold-out data set), i.e. humans check for plausibility, edge cases and model blind spots, virtually any kind of metadata can be the key to uncovering hidden unknowns in model performance. In medical AI, for example, maybe a certain configuration of a radiological scanner led to consistently worse-than-average performance. Or maybe a subset of the annotators with a certain demographic background or professional training correlates with poorly performing subsets of the data. This kind of model forensics only leads to insights when contextual data (i.e. metadata) is available. Since no one knows which fraction of the data will turn out to be a confounding factor ("known unknowns" vs. "unknown unknowns"), one must not exclude any data during initial data storage.

Additional requirements for a good data storage solution, from a practitioner's point of view, are:

  • Extendability in length (i.e. more data of the same type is added over the course of the lifetime of a project)
  • Extendability in width (additional data is added as extra columns later; e.g. genetic data only becomes available later for subjects whose medical images were available right from the start)
  • data versioning without technical overhead
    • similar flow as with git (push, pull, fetch, commit)
    • possibility to protect branches
    • from an ML pipelining perspective, an easy way to tie data versions to code versions (think of a grid of versioning layers; see the sketch below)
  • easy to query (i.e. to slice subsets out of a big corpus)
    • possibility to seamlessly connect to any GenAI/LLM asset (i.e. “chat with your data”)
  • easy exploratory data analysis (getting a sense of the class distribution via plotting histograms; plus think df.summary())
  • for medical images: proper and fast viewing (w/ and w/o annotations, zooming in/out etc)
  • very easy to load into ML libraries
  • highly performant data loading (also consider random sampling and on-the-fly transformations with everything stored in a cloud bucket rather than on an SSD drive)
  • support for credential management when combining multiple protected data silos
  • possibility to do “outer joins” to build sets of datasets
Exemplary schema of how different layers of code and data can be versioned without breaking any lineage (code and data). In big machine learning projects (multiple team members, on- and off-boarding of individuals, timelines of 2 to 3 years, sub-teams looking only into data preparation, model training, or model forensics), it is crucial that any combination of the “multiple moving building blocks” used to create a model artifact is traceable and logged.
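
To make the versioning grid concrete, here is a minimal sketch assuming the Deep Lake 3.x Python API (deeplake.empty, ds.commit, ds.checkout) and a hypothetical local dataset path; the core idea is simply to record the current git revision inside every data commit, so that each model artifact maps to one exact (code version, data version) pair.

```python
import subprocess

import deeplake
import numpy as np

# Hypothetical local path; could equally be an s3://, gcs:// or hub:// location.
ds = deeplake.empty("./datasets/chest-xray-demo", overwrite=True)

with ds:
    ds.create_tensor("images", htype="image", sample_compression="png")
    ds.create_tensor("labels", htype="class_label")
    ds.images.append(np.zeros((64, 64, 3), dtype=np.uint8))  # stand-in sample
    ds.labels.append(0)

# Record the current code revision inside the data commit, so the exact
# (code version, data version) pair behind any model artifact stays traceable.
code_sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode().strip()
ds.commit(f"Raw data as received (code {code_sha})")

# Modifications live on their own branch; 'main' keeps the pristine raw data.
ds.checkout("inclusion-criteria-v1", create=True)
ds.labels[0] = 1  # stand-in for an actual filtering / re-labelling step
ds.commit(f"Applied inclusion criteria (code {code_sha})")

ds.log()             # print the full lineage of data commits
ds.checkout("main")  # back to the untouched raw data
```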

### the receiving team needs to do exploratory data analysis, it needs to integrate the data in a scalable data landscape that can serve new consumers like GenAI applications or simply scalable cloud-native data storage solutions that gracefully connect to serverless model training solutions (e.g. VertexAI, AWS Sagemaker, Google Colab [PLEASE ADD MORE] ###

Let’s do an example

  • go to https://datasetsearch.research.google.com/, search for any topic and restrict the results to “Croissant” (see the “Croissant” button at the top)
  • load one or more crazy-big health data assets already pre-curated on one of the data repositories
  • mix and match multiple assets
  • do exploratory data analysis
    • in medical imaging this means looking at pixels/voxels side by side with the descriptive metadata (vanilla cloud services are usually bad at batch-visualising medical images)
  • filter as per inclusion and exclusion criteria
    • save this as a data version (but always keep the raw data)
  • build a data pre-processing pipeline to batch process everything (e.g. background subtraction, cropping, brightness enhancement)

OR

  • build a dataloader that does pre-processing at runtime (watch for I/O bottlenecks), followed by e.g. image augmentation, shuffling etc., but in a reproducible way; all model trainings in high-stakes domains such as healthcare need to be 100% reproducible
First, we grab a pre-built Croissant file and load its records into a Deep Lake object.
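
The snippet below is a minimal sketch of these two steps, assuming the mlcroissant and deeplake packages (Deep Lake 3.x API). The Croissant URL, the record-set name ("default") and the field names ("image", "label", "scanner_model") are placeholders; use whatever the chosen Croissant file actually declares.

```python
import deeplake
import mlcroissant as mlc
import numpy as np

# Placeholder Croissant URL; replace with the dataset found on Dataset Search.
CROISSANT_URL = "https://example.org/chest-xray/croissant.json"
croissant_ds = mlc.Dataset(jsonld=CROISSANT_URL)


def as_str(value):
    # mlcroissant may return text fields as bytes; decode for readability.
    return value.decode() if isinstance(value, bytes) else str(value)


# Create a Deep Lake dataset that mirrors the record set,
# keeping the metadata columns instead of discarding them.
dl = deeplake.empty("./datasets/chest-xray-croissant", overwrite=True)
with dl:
    dl.create_tensor("images", htype="image", sample_compression="png")
    dl.create_tensor("labels", htype="class_label")
    dl.create_tensor("scanner", htype="text")  # example metadata column

    for record in croissant_ds.records(record_set="default"):
        dl.images.append(np.asarray(record["image"]))
        dl.labels.append(int(record["label"]))
        dl.scanner.append(as_str(record.get("scanner_model", "unknown")))

dl.commit("Initial ingest from Croissant file")
```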

AFTER THE LOADING, LET’S TAKE A LOOK AT THE DATA (w/o installing additional viewers)

  • show how Deep Lake can visualise images
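
A minimal sketch, assuming the Deep Lake 3.x API where ds.visualize() renders the built-in visualizer inline in a Jupyter/Colab notebook (for datasets hosted with Activeloop, the same view is also available in the web app):

```python
import deeplake
import numpy as np

ds = deeplake.load("./datasets/chest-xray-croissant")

# Inside a notebook this opens Deep Lake's built-in visualizer: browse images,
# overlay labels/annotations, zoom in and out, with no extra viewer installed.
ds.visualize()

# Quick exploratory feel for the class distribution.
labels = ds.labels.numpy().flatten()
values, counts = np.unique(labels, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```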

FILTER FOR A CERTAIN USE CASE

  • show TQL capabilities (even over a union of multiple Deep Lake objects)
  • put the subset on a separate “branch” (read-only)
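
A sketch under the same assumptions (Deep Lake 3.x with the placeholder 'scanner' and 'labels' columns from the ingestion step; the TQL query engine may require installing the deeplake[enterprise] extra):

```python
import deeplake

ds = deeplake.load("./datasets/chest-xray-croissant")

# TQL, Deep Lake's SQL-like query language, filtering on the metadata column
# and on the label at the same time.
subset = ds.query("select * where scanner == 'VendorA' and labels == 1")
print(len(subset), "samples match the inclusion criteria")

# Persist the subset as a named view pinned to the current commit, so that
# downstream training always sees exactly the same (read-only) slice.
subset.save_view(id="vendorA-positives-v1")

# Later, or from another machine, reload that exact slice.
frozen = ds.load_view("vendorA-positives-v1")
print(len(frozen))
```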

TRAIN SOME NICE TOY EXAMPLE

Now we train a small toy model directly on the Deep Lake dataset.
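
A toy sketch of a training run, assuming PyTorch and the Deep Lake 3.x ds.pytorch() dataloader (tensor names and image size follow the placeholder ingestion example above); seeds are fixed so that the run stays reproducible.

```python
import deeplake
import torch
from torch import nn
from torchvision import transforms

torch.manual_seed(42)  # trainings in high-stakes domains must be reproducible

ds = deeplake.load("./datasets/chest-xray-croissant")

tform = transforms.Compose([
    transforms.ToTensor(),                        # HWC uint8 -> CHW float in [0, 1]
    transforms.Resize((64, 64), antialias=True),
])

# Stream straight from (cloud) storage with on-the-fly transformation.
loader = ds.pytorch(
    tensors=["images", "labels"],
    transform={"images": tform, "labels": None},
    batch_size=16,
    shuffle=True,
    num_workers=0,
)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(2):                            # toy example: two short epochs
    for batch in loader:
        x = batch["images"]
        y = batch["labels"].flatten().long()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.4f}")
```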

EVAL THE MODEL AND SUPERIMPOSE WITH CROISSANT METADATA TO MAKE SENSE OF MODEL BLIND SPOTS

Finally, we slice the evaluation results by the metadata that travelled with the Croissant file.
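
A quick-and-dirty sketch of such a blind-spot analysis, reusing the hypothetical 'scanner' metadata column ingested from the Croissant file and the toy model and transform from the previous step; per-slice accuracy is aggregated with pandas so that consistently weak subgroups (e.g. one scanner configuration) stand out.

```python
import pandas as pd
import torch

# Reuses 'ds', 'model' and 'tform' from the training step above.
model.eval()
records = []
with torch.no_grad():
    for i in range(len(ds)):
        image = tform(ds.images[i].numpy()).unsqueeze(0)   # same transform as in training
        pred = int(model(image).argmax(dim=1).item())
        records.append({
            "correct": int(pred == int(ds.labels[i].numpy().item())),
            # Metadata column that travelled with the Croissant file
            # (Deep Lake 3.x text tensors expose the string via .data()["value"]).
            "scanner": ds.scanner[i].data()["value"],
        })

df = pd.DataFrame(records)

# Accuracy per metadata slice: consistently weak subgroups hint at blind spots,
# e.g. one scanner configuration or one annotator cohort underperforming.
print(df.groupby("scanner")["correct"].agg(["mean", "count"]).sort_values("mean"))
```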