Introduction to data pipelines
CDAP is a self-service, reconfigurable, extensible framework for developing, running, automating, and operating data pipelines on Spark or Hadoop. It is completely open source and licensed under the Apache 2.0 license.
CDAP includes the Pipeline Studio, a visual click-and-drag interface for building data pipelines from a library of pre-built plugins.
CDAP provides an operational view of the resulting pipeline that allows lifecycle control and monitoring of metrics, logs, and other runtime information. A pipeline can be run directly in CDAP with tools such as the Pipeline Studio, the CDAP CLI, or other command line tools.
Data pipelines
Pipelines are applications created from artifacts, built specifically for processing flows of data.
An artifact is an "application template". CDAP creates a pipeline application from a configuration file that defines the desired application, together with the artifacts specified in that configuration. Artifacts for creating data pipelines are supplied with CDAP.
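As a sketch of what such a configuration can look like, the following JSON names the batch data pipeline artifact supplied with CDAP and an (empty, for now) pipeline definition. The version number is a placeholder that depends on the CDAP installation; the stages and connections sections are described in the sections that follow.

    {
      "artifact": {
        "name": "cdap-data-pipeline",
        "version": "6.0.0",
        "scope": "SYSTEM"
      },
      "config": {
        "stages": [ ],
        "connections": [ ]
      }
    }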
Stages and plugins
A pipeline can be viewed as a series of stages. Each stage uses a plugin, an extension to CDAP that provides a specific piece of functionality.
The configuration properties for a stage describe what that plugin is to do (read from a data source, write to a table, run a script) and depend on the particular plugin used.
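For example, a single stage entry in the configuration file might look like the following sketch. The stage name, plugin, and property values shown here are illustrative; the actual properties vary with the plugin and the plugins available in your installation.

    {
      "name": "ReadLogs",
      "plugin": {
        "name": "File",
        "type": "batchsource",
        "properties": {
          "referenceName": "rawLogs",
          "path": "/data/logs",
          "format": "text"
        }
      }
    }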
All stages are connected in a directed acyclic graph (DAG), which is shown in the Pipeline Studio and in CDAP as a connected series of plugins:
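In the configuration file, the same DAG is expressed as a list of connections between stage names. A minimal sketch, using hypothetical stage names:

    {
      "connections": [
        { "from": "ReadLogs",  "to": "ParseLogs" },
        { "from": "ParseLogs", "to": "WriteTable" }
      ]
    }

Each connection directs the output records of one stage to the input of another; because the graph is acyclic, data always flows forward from sources toward sinks.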