Contents

What are Pipelines?

The Pipelines module of YData is a general-purpose job orchestrator with built-in scalability and modularity, as well as reporting and experiment tracking capabilities.

With automatic hardware provisioning, on-demand or scheduled execution, run fingerprinting, and a UI for review and configuration, Pipelines give the Platform operational capabilities for interfacing with upstream and downstream systems (for instance, to automate data ingestion, synthesis, and transfer workflows) and the ability to experiment at scale, which is crucial during the iterative development process of discovering the data improvement pipeline that yields the highest-quality datasets.

YData’s Pipelines are based on Kubeflow Pipelines and can be created via an interactive interface in Labs with Jupyter Lab as the IDE (recommended) or via the Kubeflow Pipelines Python SDK.
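
As a rough illustration of the SDK route, a minimal pipeline defined with the Kubeflow Pipelines (kfp) Python SDK might look like the sketch below. The decorators and compiler call follow kfp v2-style APIs, which vary slightly across kfp releases, and the step contents are hypothetical placeholders rather than YData-specific code.

```python
# Minimal sketch of a pipeline built with the Kubeflow Pipelines (kfp) Python SDK.
# Uses kfp v2-style decorators; details vary across kfp releases, and the step
# logic below is a hypothetical placeholder.
from kfp import dsl, compiler


@dsl.component(base_image="python:3.10")
def ingest_data(source: str) -> str:
    """Hypothetical ingestion step: fetch raw data and return a dataset handle."""
    print(f"Ingesting data from {source}")
    return "raw-dataset"


@dsl.component(base_image="python:3.10")
def train_model(dataset: str) -> str:
    """Hypothetical training step: consume the ingested dataset."""
    print(f"Training on {dataset}")
    return "model-v1"


@dsl.pipeline(name="example-pipeline", description="Two-step illustrative pipeline")
def example_pipeline(source: str = "s3://bucket/raw"):
    ingest_task = ingest_data(source=source)
    # Passing one block's output into the next defines the connection between blocks.
    train_model(dataset=ingest_task.output)


if __name__ == "__main__":
    # Compile to a pipeline spec that can be uploaded and run on the Platform.
    compiler.Compiler().compile(example_pipeline, package_path="example_pipeline.yaml")
```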

Fully integrated with YData’s scalable architecture and able to leverage YData’s Python SDK, Pipelines are the recommended tool for scaling notebook work into large-scale experiments or for moving from experimentation to production.

<aside> 👉 For a full deep dive on all the technicalities and details behind Pipelines (particularly its Python SDK, which offers advanced functionality like conditional execution), be sure to check out the official documentation of Kubeflow Pipelines.

</aside>
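
As a hedged illustration of the kind of advanced functionality mentioned above, the Kubeflow Pipelines SDK supports branching on a step's output. The sketch below uses `dsl.Condition` (newer kfp releases also offer `dsl.If`); the component names and logic are hypothetical.

```python
# Illustrative sketch of conditional execution with the kfp Python SDK.
# dsl.Condition is available in both kfp v1 and v2; newer v2 releases
# also provide dsl.If. Component names and logic are hypothetical.
from kfp import dsl


@dsl.component(base_image="python:3.10")
def evaluate_quality(dataset: str) -> str:
    """Hypothetical step that scores a dataset and returns 'good' or 'bad'."""
    return "good"


@dsl.component(base_image="python:3.10")
def synthesize_more_data(dataset: str):
    """Hypothetical remediation step, only run when quality is insufficient."""
    print(f"Augmenting {dataset} with synthetic records")


@dsl.pipeline(name="conditional-example")
def conditional_pipeline(dataset: str = "my-dataset"):
    quality = evaluate_quality(dataset=dataset)
    # The branch below only executes at runtime if the condition holds.
    with dsl.Condition(quality.output == "bad"):
        synthesize_more_data(dataset=dataset)
```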

Anatomy of a Pipeline

An example pipeline (as seen in the Pipelines module of the dashboard), where each single-responsibility block corresponds to a step in a typical machine learning workflow

Each Pipeline is a set of connected blocks. A block is a self-contained set of code, packaged as a container, that performs one step of the Pipeline. Usually, each block corresponds to a single-responsibility task in a workflow. In a machine learning workflow, for example, each step would map to one block: data ingestion, data cleaning, pre-processing, ML model training, and ML model evaluation.
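
As a sketch of what "packaged as a container" can mean in practice, the kfp SDK also lets a block be declared directly as a container specification. The image name, command, and arguments below are hypothetical placeholders for a self-contained data-cleaning step.

```python
# Sketch of a block declared directly as a container, using kfp v2's
# container_component API. The image, command, and arguments are hypothetical
# placeholders for a single-responsibility data-cleaning step.
from kfp import dsl


@dsl.container_component
def clean_data(input_path: str, output_path: dsl.OutputPath(str)):
    """A single-responsibility block: run a containerized cleaning script."""
    return dsl.ContainerSpec(
        image="example-registry/data-cleaning:latest",
        command=["python", "clean.py"],
        args=["--input", input_path, "--output", output_path],
    )
```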

Each block is parametrized by: