Data Pipelining Concepts

BQ Qiu
4 min read · May 8, 2022


I’ve been recently thinking about data pipeline designs. Here are some key concepts or “gotchas” that are frequently not handled systematically.

Idempotency

Idempotency, or idempotence, refers to the property that applying an operation once or applying it several times has the same effect. An idempotent REST API endpoint is one where calling the endpoint multiple times results in the same effect. Generally, only POST APIs are not expected to be idempotent: POST is expected to create a new resource, so when a POST request is invoked N times, N new resources are expected to be created (or rejected due to conflict). GET, PUT, and DELETE are all expected to be idempotent.
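As a minimal sketch of the distinction, consider an in-memory store standing in for a real datastore (the names here are purely illustrative, not tied to any framework):

```python
# Minimal sketch of idempotent vs non-idempotent operations,
# using an in-memory dict in place of a real datastore.
store = {}
next_id = 0

def put_item(item_id, payload):
    # PUT: repeating the call leaves the store in the same state.
    store[item_id] = payload

def post_item(payload):
    # POST: each call creates a new resource, so repeating it
    # changes the outcome (N calls -> N resources).
    global next_id
    next_id += 1
    store[next_id] = payload
    return next_id

put_item("a", {"v": 1})
put_item("a", {"v": 1})   # same effect as calling once
post_item({"v": 1})
post_item({"v": 1})       # two distinct resources created
```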

In the context of data pipelines, the most important implication of idempotency is to delete before writing to a table. Suppose your task generates processed data that is written to a table partitioned by date. If you are writing to the date X partition, the first step in the task should be to drop the existing entries in that partition. This ensures that re-running the task does not create duplicate or unexpected extra entries.
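A minimal sketch of this delete-before-write pattern, using SQLite and a table logically partitioned by a date column (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_data (date TEXT, item_id TEXT, value REAL)"
)

def write_partition(conn, date, rows):
    """Idempotent write: delete the date partition, then re-insert."""
    cur = conn.cursor()
    # Drop existing entries for this date so a re-run cannot leave
    # duplicate or stale rows behind.
    cur.execute("DELETE FROM processed_data WHERE date = ?", (date,))
    cur.executemany(
        "INSERT INTO processed_data (date, item_id, value) VALUES (?, ?, ?)",
        [(date, r["item_id"], r["value"]) for r in rows],
    )
    conn.commit()

# Re-running the task for the same date produces the same table state.
write_partition(conn, "2022-05-01", [{"item_id": "a", "value": 1.0}])
write_partition(conn, "2022-05-01", [{"item_id": "a", "value": 1.0}])
```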

Data Reference Period

This refers to the time period for which the data is valid, which is very likely different from the time when the data arrives in the system. Even in "live" systems, data may lag by up to a few minutes, for example. Hence it is important to distinguish between the data reference period and the data arrival time. For the former, common field names are timestamp, start_time & end_time, or reference_period. For the latter, a good practice is to have the system add a created_at timestamp field to keep track of when the data was inserted into the database.
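One way to keep the two notions of time separate is to carry both in the schema. A sketch, assuming SQLite and one possible naming convention:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# reference_period: the period the data is valid for (supplied by the source).
# created_at: when the row arrived in our system (set by us on insert).
conn.execute("""
    CREATE TABLE measurements (
        reference_period TEXT,
        value REAL,
        created_at TEXT
    )
""")

def insert_measurement(conn, reference_period, value):
    conn.execute(
        "INSERT INTO measurements (reference_period, value, created_at) VALUES (?, ?, ?)",
        (reference_period, value, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

# Data valid for April can arrive in May; the two timestamps will differ.
insert_measurement(conn, "2022-04", 42.0)
```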

When data arrives in batches, for example via a monthly update, one might easily assume that the data arriving this month is valid for the past month. While this is likely the case, it might not hold true all the time: the data source may supply data for even earlier months, late. It is important to clarify with the data provider whether there is a reference_period field for the data, which would allow the provider to deliver such "late" data. This then has to be taken into account in downstream data pipelines.

Backfilling and Developer Utilities

Devs frequently have to deal with bad data or late data, which requires backfilling. Conceptually, backfilling simply involves running the same pipeline on the new data, which should be a relatively straightforward action. Applications should be designed so that devs can perform backfills easily, e.g. through automatic triggers or a UI.
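If the per-date task is idempotent, a backfill utility can be as simple as re-invoking that task over a range of dates. A sketch, where run_task is a placeholder for the pipeline's actual per-date entry point:

```python
from datetime import date, timedelta

def run_task(partition_date):
    # Placeholder for the pipeline's per-date task, assumed to be
    # idempotent (delete-before-write), so re-running it is always safe.
    print(f"reprocessing partition {partition_date}")

def backfill(start, end):
    """Re-run the same task for every date in [start, end]."""
    current = start
    while current <= end:
        run_task(current)
        current += timedelta(days=1)

backfill(date(2022, 4, 1), date(2022, 4, 7))
```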

Data pipelines should also be designed with the expectation that backfilling will frequently be required. Dependencies between tasks should be as streamlined as possible, so that minimal backfilling is required downstream when one upstream task has bad data.

There may be trade-offs in exchange for dev utility. For example, if the data pipeline runs DELETE operations via API calls, more API calls may be needed for a custom set of DELETEs than for dropping the whole partition. However, this improves the efficiency of the data pipeline and dev utility, as fewer tasks need to be rerun. On balance, it is a good trade-off to make.
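To illustrate that trade-off, a targeted repair deletes only the affected keys inside a partition rather than dropping the whole partition, at the cost of issuing more delete operations. A sketch reusing the illustrative processed_data table from earlier:

```python
def rewrite_items(conn, date, bad_item_ids, fixed_rows):
    """Repair only the affected items in a partition instead of dropping it."""
    cur = conn.cursor()
    # More DELETE calls than a single partition drop, but rows that were
    # already correct stay untouched, so fewer downstream tasks need rerunning.
    for item_id in bad_item_ids:
        cur.execute(
            "DELETE FROM processed_data WHERE date = ? AND item_id = ?",
            (date, item_id),
        )
    cur.executemany(
        "INSERT INTO processed_data (date, item_id, value) VALUES (?, ?, ?)",
        [(date, r["item_id"], r["value"]) for r in fixed_rows],
    )
    conn.commit()
```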

Flexible Frontend for Data Retrieval

This is less a data pipelining consideration than an application design one. For a product whose chief function is data warehousing, the frontend should ideally allow READ operations with a similar degree of flexibility as querying the database directly.

A simple example is an application that aims to show the current profile of a set of items. However, the user may want to see not only the most up-to-date information but past information as well. The user may want to construct a historical profile and view the series of changes to date. The user may even want to view the series of changes up to not the current time, but a time in the past.

In SQL terms, these requirements can be easily met. The underlying table schema may not even need to change. The challenge is for the frontend application to allow a wide range of possible queries and to show the data in a digestible way.
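For instance, if every change is stored as a row with a valid_from timestamp, both "latest profile" and "profile as of a past time" are ordinary queries over the same table. The schema and query below are one illustrative way to lay this out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_changes (item_id TEXT, attribute TEXT, value TEXT, valid_from TEXT);
    INSERT INTO item_changes VALUES
        ('a', 'status', 'draft',     '2022-01-01'),
        ('a', 'status', 'published', '2022-03-01'),
        ('a', 'status', 'archived',  '2022-05-01');
""")

def profile_as_of(conn, item_id, as_of):
    """Latest value of each attribute at the given point in time."""
    return conn.execute(
        """
        SELECT attribute, value
        FROM item_changes c
        WHERE item_id = ? AND valid_from <= ?
          AND valid_from = (
              SELECT MAX(valid_from) FROM item_changes
              WHERE item_id = c.item_id AND attribute = c.attribute
                AND valid_from <= ?
          )
        """,
        (item_id, as_of, as_of),
    ).fetchall()

print(profile_as_of(conn, "a", "2022-04-01"))  # -> [('status', 'published')]
print(profile_as_of(conn, "a", "2022-06-01"))  # -> [('status', 'archived')]
```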

Storing Snapshots

In the first section on idempotency, we mentioned the implication of "delete before write". However, this has the side effect that the data pipelines do not retain any record of the original data sent by the data providers. In an environment of high trust, this may not pose a problem. However, for many data applications, the provision of data is contractually bound, and it becomes important to retain a record of the data provided.

This can be done in the database, for example by appending received data to a separate schema for record-keeping. Database writes may be more useful than logs when it comes to tracing through historical operations for troubleshooting. Raw data may also be retained in other forms of data storage. However, retaining data snapshots also comes with additional storage costs and data security considerations.
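A sketch of such a record-keeping write, assuming an append-only table in a separate raw/snapshot namespace (names illustrative):

```python
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# Append-only table: nothing here is ever deleted, so the payloads as
# originally provided can always be traced later.
conn.execute(
    "CREATE TABLE raw_snapshots (source TEXT, payload TEXT, received_at TEXT)"
)

def record_snapshot(conn, source, payload):
    conn.execute(
        "INSERT INTO raw_snapshots (source, payload, received_at) VALUES (?, ?, ?)",
        (source, json.dumps(payload), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

# Keep the provider's payload as received, before any delete-before-write step.
record_snapshot(conn, "provider_x", {"date": "2022-04", "items": [1, 2, 3]})
```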

Conclusion

These are some considerations that frequently pop up in data pipeline designs. Do comment with feedback if you think of others.


Written by BQ Qiu

Computer network research, data pipeline engineering, infrastructure as code
