Data engineering teams are under pressure to deliver high-quality data fast, but the job of building and operating pipelines is getting harder, not easier. We interviewed hundreds of data engineers and studied millions of real-world workloads, and found something surprising: data engineers spend most of their time not writing code but shouldering the operational burden of stitching tools together. The reason is simple: existing data engineering frameworks force engineers to manually handle orchestration, incremental processing, data quality, and backfills, all routine tasks for production pipelines. As data volumes and use cases grow, this operational burden grows with them, making data engineering a bottleneck rather than an accelerator for the business.
This is not the first time the industry has hit this wall. Early data processing required writing a new program for each question, which did not scale. SQL changed that by making queries declarative: you specify what result you want, and the engine figures out how to compute it. SQL databases are now foundational to nearly every business.
But data engineering is about more than running a single query. Pipelines repeatedly update multiple interdependent datasets over time. Because SQL engines stop at the boundary of a single query, everything beyond it – incremental processing, dependency management, backfills, data quality, retries – still has to be assembled by hand. At scale, reasoning about execution order, parallelism, and failure modes becomes a major source of complexity.
What’s missing is a way to declare the pipeline as a whole. Spark Declarative Pipelines (SDP) extends declarative data processing from individual queries to entire pipelines, allowing Apache Spark to plan and execute them end to end. Instead of manually shepherding data between steps, you declare what datasets you want to exist, and SDP takes care of how to keep them up to date over time. For example, in a pipeline that computes weekly sales, SDP infers the dependencies between datasets, builds a single execution plan, and updates results in the correct order. It automatically processes only new or changed data, lets you express data quality rules inline, and handles backfills and late-arriving data without manual intervention. Because SDP understands query semantics, it can validate pipelines ahead of time, parallelize execution safely, and recover correctly from failures – capabilities that require a first-class, pipeline-aware declarative API built directly into Apache Spark.
End-to-end declarative data engineering in SDP brings powerful benefits:
- Higher productivity: data engineers focus on writing business logic rather than glue code.
- Lower cost: the framework handles orchestration and incremental processing automatically, making pipelines cheaper to run than hand-written ones.
- Lower operational burden: common needs like backfills, data quality, and retries are built in and automated.
To illustrate the benefits of end-to-end declarative data engineering, let’s start with a weekly sales pipeline written in PySpark. Since PySpark is not end-to-end declarative, we must manually encode execution order, incremental processing, and data quality logic, and rely on an external orchestrator like Airflow for retries, alerting, and monitoring (omitted here for brevity).
This pipeline expressed as a SQL dbt project suffers from many of the same limitations: we still have to hand-code incremental processing, data quality is handled separately, and we still rely on an orchestrator like Airflow for retries and failure handling:
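For reference, the aggregation step in the dbt version might look like the following illustrative incremental model (the model name, columns, and incremental predicate are assumptions, not the original project):

```sql
-- models/weekly_sales.sql
{{ config(materialized='incremental') }}

select
    date_trunc('week', sale_date) as week,
    sum(amount) as total_sales
from {{ ref('raw_sales') }}
{% if is_incremental() %}
  -- hand-written incremental filter; dbt does not infer this for us
  where sale_date > (select max(week) from {{ this }})
{% endif %}
group by 1
```

Note that the incremental predicate, the quality checks (separate dbt tests), and the scheduling all still live outside the model itself.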
Let’s rewrite this pipeline into SDP to explore its benefits. First, let’s install SDP and create a new pipeline:
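Assuming Spark 4.1 (or its preview release) is installed and its bin directory is on your PATH, scaffolding a pipeline looks roughly like this (the pipeline name is illustrative):

```shell
# SDP ships with Apache Spark 4.1; a pip-installed PySpark includes the CLI.
pip install "pyspark>=4.1.0"

# Scaffold a new pipeline project.
spark-pipelines init --name weekly_sales
cd weekly_sales
```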
Next, define your pipeline with the following code. Note that the expect_or_drop data quality expectations API is commented out while we work with the community to open source it:
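A minimal sketch of the pipeline definition, assuming the Python API shipped in Spark 4.1; the source table and column names are illustrative, and inside pipeline files the spark session is provided by the SDP runtime:

```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F

@dp.materialized_view
def raw_sales():
    # Illustrative source; in practice this reads your upstream data.
    return spark.read.table("sales_source")

# @dp.expect_or_drop("valid_amount", "amount > 0")
# Expectations API commented out until it is open sourced.
@dp.materialized_view
def weekly_sales():
    # SDP sees that weekly_sales reads raw_sales and infers the dependency;
    # no orchestrator needs to encode the execution order.
    return (spark.read.table("raw_sales")
            .filter(F.col("amount") > 0)
            .groupBy(F.weekofyear(F.to_date("sale_date")).alias("week"))
            .agg(F.sum("amount").alias("total_sales")))
```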
To run the pipeline, type the following commands in your terminal:
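With the spark-pipelines CLI bundled in Spark 4.1, a run is issued from the pipeline project directory:

```shell
spark-pipelines run
```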
We can also validate our pipeline without running it using this command – handy for catching syntax errors and schema mismatches:
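In the 4.1 CLI this is exposed as a dry run, again executed from the project directory:

```shell
spark-pipelines dry-run
```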
Backfilling made easy – to backfill the raw_sales table, run this command:
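Assuming the full-refresh option shipped with the 4.1 CLI (the exact flag name is worth checking against your Spark version), a single-table backfill looks like:

```shell
# Recompute raw_sales from scratch; downstream tables update in order.
spark-pipelines run --full-refresh raw_sales
```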
The code is remarkably simple – just 20 lines that deliver everything the PySpark and dbt versions required external tools to provide. We also get these powerful benefits:
- Automated incremental data processing. The framework tracks what data has been processed and reads only new or changed records. No MAX queries, no checkpoint files, no conditional logic required.
- Integrated data quality. The @dp.expect_or_drop decorator automatically drops bad records. In PySpark, we had to split good and bad records manually and write them to separate tables; in dbt, we needed a separate model and manual handling.
- Automated dependency tracking. The framework detects that weekly_sales depends on raw_sales and orders the execution automatically. No external orchestrator required.
- Built-in retries and monitoring. The framework handles failures and provides visibility through the built-in UI. No external tooling required.
SDP in Apache Spark 4.1 includes the following capabilities, which make it a great choice for data pipelines:
- Python and SQL APIs for defining datasets
- Support for batch and streaming queries
- Automatic dependency tracking between datasets, and efficient parallel updates
- A CLI to create, validate, and run pipelines locally or in production
We are excited about the SDP roadmap, which is being developed openly with the Spark community. Upcoming Spark releases will build on this foundation with support for continuous execution and more efficient incremental processing. We also plan to bring core capabilities like Change Data Capture (CDC) to SDP based on real-world use cases and community feedback. Our aim is to make SDP a shared, extensible foundation for building reliable batch and streaming pipelines in the Spark ecosystem.
