Data Pipeline Backfill on Airflow

BQ Qiu
Mar 3, 2023

I recently worked on backfilling a large amount of historical data (~a few years’ worth) and ran into some Airflow gotchas. Airflow is an amazing tool for organizing ETL jobs and scheduling regular data processing. That said, it is not always an easy tool to use, and data wrangling at a large scale is a thorny problem in the best case.

Don’t set the max concurrency too high

You can control concurrency in Airflow with the variable max_active_tasks_per_dag, previously dag_concurrency (deprecated; docs). Assuming that you're backfilling tasks on the same DAG, this setting has a very visible impact: it caps the number of Airflow tasks in the “Running” state at any one time. However, if you set the concurrency higher than your compute cluster can handle, tasks may fail simply because they are unable to terminate gracefully.
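As a rough sketch (assuming Airflow 2.4 or newer; the argument names differ in older versions), the cap can be set globally via max_active_tasks_per_dag in airflow.cfg, or per DAG with the max_active_tasks argument:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Minimal sketch: cap how many task instances of this DAG can run at once so a
# backfill cannot overwhelm the compute cluster. The per-DAG argument is
# max_active_tasks (Airflow 2.2+; older versions used `concurrency`); the
# global default comes from [core] max_active_tasks_per_dag in airflow.cfg.
with DAG(
    dag_id="example_backfill_dag",   # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule="@daily",
    catchup=False,
    max_active_tasks=8,              # keep below what the cluster can absorb
    max_active_runs=4,               # also limits concurrently running DAG runs
) as dag:
    EmptyOperator(task_id="placeholder")
```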

This is not ideal: a lot of time is wasted manually resetting the states of failed tasks and re-running them. Instead of setting an ambitiously high concurrency value, experiment with a few settings and stick with one that doesn't cause tasks to fail. You will save more time in the long run.

When time is tight, you may also want to increase the resources dedicated to your compute cluster. However, scaling up the cluster is not always as simple as it seems and comes with its own set of gotchas, for example hitting the vCPU quota for AWS Spot Instances. In that case, you also need to factor in the time it takes to file support requests to raise such limits.

Airflow scheduler may not schedule all tasks

Say you queue up a batch of thousands of tasks to be backfilled. The Airflow scheduler will take care of it, right? Not necessarily: from experience, and as reported by numerous other users, later tasks may simply never be scheduled because the scheduler runs out of resources, and those tasks are left hanging.

Besides troubleshooting why the Airflow scheduler behaves this way and giving it additional resources, a quick fix is to improve your backfill technique: for example, use a script or chained CLI commands that trigger the next batch of tasks only after the previous batch is done, rather than backfilling everything in one go.
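A minimal sketch of that idea, assuming a hypothetical DAG id and a 30-day window size, is to drive the Airflow CLI from a small script that only moves on to the next date window once the previous one has finished:

```python
"""Rough sketch of chunked backfilling: invoke the Airflow CLI one date window
at a time, and only start the next window after the previous one exits,
instead of handing the scheduler years of tasks at once."""
import subprocess
from datetime import date, timedelta

DAG_ID = "example_backfill_dag"   # hypothetical DAG id
START = date(2019, 1, 1)
END = date(2022, 12, 31)
WINDOW = timedelta(days=30)       # backfill roughly one month at a time

current = START
while current <= END:
    chunk_end = min(current + WINDOW - timedelta(days=1), END)
    cmd = [
        "airflow", "dags", "backfill", DAG_ID,
        "--start-date", current.isoformat(),
        "--end-date", chunk_end.isoformat(),
    ]
    print("Running:", " ".join(cmd))
    # check=True stops the loop if a chunk fails, so failures do not pile up.
    subprocess.run(cmd, check=True)
    current = chunk_end + timedelta(days=1)
```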

Consider the “shape” of your backfill

From the Airflow docs, the scheduler works as follows:

Once per minute, by default, the scheduler collects DAG parsing results and checks whether any active tasks can be triggered.

Therefore, there are time savings when tasks of similar duration run together: fewer one-minute intervals go by in which an available slot sits idle because the scheduler has not yet filled it with a new task. When running a large number of tasks, these savings add up.

Remember that the backfill does not have to be done DAG run by DAG run. To shape your backfill process, consider the following:

Are you backfilling tasks with many dependencies between them within one DAG? Are there dependencies between different DAG runs (i.e. between one DAG run and the next)? Do the tasks take significantly different amounts of time to run? With these considerations in mind, you can design a backfill sequence that runs the more similar tasks together, as in the sketch below.
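For example, one way to sketch this, assuming the long-running work sits in a single task (hypothetically named build_aggregates) with no upstream dependencies of its own, is to backfill that heavy task across all dates first and then fill in the lighter tasks. The --task-regex and --ignore-dependencies flags on airflow dags backfill restrict the first pass to the matching task ids:

```python
# Sketch of "shaping" a backfill: pass 1 runs only the heavy task for every
# date, so those similar, long tasks fill the slots together; pass 2 backfills
# the rest, and previously successful tasks are not rerun.
import subprocess

DAG_ID = "example_backfill_dag"          # hypothetical DAG id
WINDOW = ("2019-01-01", "2019-12-31")    # hypothetical date range

# Pass 1: only the heavy task, across all dates in the window.
subprocess.run([
    "airflow", "dags", "backfill", DAG_ID,
    "--start-date", WINDOW[0], "--end-date", WINDOW[1],
    "--task-regex", "^build_aggregates$",
    "--ignore-dependencies",             # run just the matching task ids
], check=True)

# Pass 2: everything else in the same window.
subprocess.run([
    "airflow", "dags", "backfill", DAG_ID,
    "--start-date", WINDOW[0], "--end-date", WINDOW[1],
], check=True)
```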

Airflow redeployment interrupts the backfill

When the Airflow deployment itself is restarted or redeployed, tasks keep their respective states (“Running” / “Queued” / “Scheduled”, etc.), since Airflow tracks task state in its metadata database. What may not be obvious is that running tasks can hang when Airflow is restarted: the hung tasks are stuck in the “Running” state, with no progress even in the task duration timestamp.

While this could be attributed to a bug in Airflow itself or in the tasks (for example, a connection to another application not being closed), it is important to be aware of this issue: it may take a long time for the data engineer to realise that the same tasks are stuck in “Running”, costing precious backfill time.
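One way to catch this early, sketched below under the assumption that you can query the Airflow metadata database (the connection string and the six-hour threshold are placeholders to adapt), is to periodically look for task instances that have sat in the running state for implausibly long:

```python
"""Rough sketch for spotting tasks left hanging in "Running" after an Airflow
restart: query the metadata database for task instances that have been running
longer than the job should plausibly take."""
from datetime import datetime, timedelta, timezone

from sqlalchemy import create_engine, text

ENGINE = create_engine("postgresql://airflow:***@metadata-db/airflow")  # placeholder DSN
THRESHOLD = timedelta(hours=6)  # assumed upper bound on a healthy task's runtime

cutoff = datetime.now(timezone.utc) - THRESHOLD
query = text("""
    SELECT dag_id, task_id, run_id, start_date
    FROM task_instance
    WHERE state = 'running' AND start_date < :cutoff
    ORDER BY start_date
""")

with ENGINE.connect() as conn:
    for row in conn.execute(query, {"cutoff": cutoff}):
        # Candidates to clear and re-run (e.g. with `airflow tasks clear`).
        print(f"Possibly hung: {row.dag_id}.{row.task_id} "
              f"run={row.run_id} started={row.start_date}")
```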

Newer tasks may fail in older DAG backfill dates

As far as I’m aware, Airflow DAG versioning is still a work in progress. When the same DAG is backfilled to older dates, tasks added after the original scheduled date of the DAG run may fail due to missing data, tables, or endpoints.

While this should not block the backfill, since the data engineer should know which tasks are on the critical path to the backfilled tasks, the failures cause additional error alerts and DAG failure markers that cost more time to triage. Data teams should also set a standard for whether such new tasks should be left in a success, failed, or cleared state.
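One possible convention, sketched below with a hypothetical task and cut-over date, is to have the newer task skip itself for logical dates before its data source existed, so backfilled runs of older dates end up skipped rather than failed:

```python
from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator

# Hypothetical cut-over: the task's source data only exists from this date
# onward, so older backfilled DAG runs should skip it instead of failing.
TASK_ADDED_ON = datetime(2022, 6, 1)

def _load_new_source(*, logical_date, **_):
    if logical_date.replace(tzinfo=None) < TASK_ADDED_ON:
        # Mark the task instance as "skipped" rather than "failed".
        raise AirflowSkipException("source did not exist before 2022-06-01")
    ...  # actual work for dates where the source exists

with DAG(
    dag_id="example_backfill_dag",   # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_new_source",
        python_callable=_load_new_source,
    )
```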

Conclusion

Airflow is a powerful tool that is critical to the operations of many big data and tech organizations. As there is no comprehensive list of Airflow gotchas, nor a commonly recognised and regularly updated operations manual, teams should budget extra buffer time for backfill projects that involve intensive computation over a large amount of data.
