Skip to main content

Using the Airflow Integration

caution

This page is about Airflow's external integration that works mainly for Airflow versions <2.7. If you're using Airflow 2.7+, look at native Airflow OpenLineage provider documentation.

The ongoing development and enhancements will be focused on the apache-airflow-providers-openlineage package, while the openlineage-airflow will primarily be updated for bug fixes. See all Airflow versions supported by this integration

PREREQUISITES

To use the OpenLineage Airflow integration, you'll need a running Airflow instance. You'll also need an OpenLineage-compatible backend.

INSTALLATION

Before installing check supported Airflow versions.

To download and install the latest openlineage-airflow library run:

openlineage-airflow

You can also add openlineage-airflow to your requirements.txt for Airflow.

To install from source, run:

$ python3 setup.py install

CONFIGURATION

Next, specify where you want OpenLineage to send events.

We recommend configuring the client with an openlineage.yml file that tells the client how to connect to an OpenLineage backend.
See how to do it.

The simplest option, limited to HTTP client, is to use the environment variables. For example, to send OpenLineage events to a local instance of Marquez, use:

OPENLINEAGE_URL=http://localhost:5000
OPENLINEAGE_ENDPOINT=api/v1/lineage # This is the default value when this variable is not set, it can be omitted in this example
OPENLINEAGE_API_KEY=secret_token # This is only required if authentication headers are required, it can be omitted in this example

To set up an additional configuration, or to send events to targets other than an HTTP server (e.g., a Kafka topic), configure a client.

NOTE: If you use a version of Airflow older than 2.3.0, additional configuration is required.

Environment Variables

The following environment variables are available specifically for the Airflow integration, in addition to Python client variables.

NameDescriptionExample
OPENLINEAGE_AIRFLOW_DISABLE_SOURCE_CODESet to False if you want source code of callables provided in PythonOperator or BashOperator NOT to be included in OpenLineage events.False
OPENLINEAGE_EXTRACTORSThe optional list of extractors class (as semi-colon separated string) in case you need to use custom extractors.full.path.to.ExtractorClass;full.path.to.AnotherExtractorClass
OPENLINEAGE_NAMESPACEThe optional namespace that the lineage data belongs to. If not specified, defaults to default.my_namespace
OPENLINEAGE_AIRFLOW_LOGGINGLogging level of OpenLineage client in Airflow (the OPENLINEAGE_CLIENT_LOGGING variable from python client has no effect here).DEBUG

For backwards compatibility, openlineage-airflow also supports configuration via MARQUEZ_NAMESPACE, MARQUEZ_URL and MARQUEZ_API_KEY variables, instead of standard OPENLINEAGE_NAMESPACE, OPENLINEAGE_URL and OPENLINEAGE_API_KEY. Variables with different prefix should not be mixed together.

USAGE

When enabled, the integration will:

  • On TaskInstance start, collect metadata for each task.
  • Collect task input / output metadata (source, schema, etc.).
  • Collect task run-level metadata (execution time, state, parameters, etc.)
  • On TaskInstance complete, also mark the task as complete in Marquez.