Building a Scalable and Robust Data Extraction Pipeline with Apache Airflow and Cloud Platforms

EasyChair Preprint 13063

16 pages • Date: April 21, 2024

Abstract

Organizations working with large datasets must efficiently extract, transform, and load (ETL) data from many different sources. This paper presents an approach to building a scalable and resilient data extraction pipeline using Apache Airflow and cloud platforms. Apache Airflow, an open-source workflow management platform, orchestrates the pipeline's tasks, while cloud platforms supply the scalability and reliability needed for large-scale data processing.

This abstract outlines the key components and architecture of the proposed pipeline. It begins with an overview of Apache Airflow's workflow orchestration capabilities: scheduling, monitoring, and managing workflows. Through Airflow's extensibility, in the form of custom operators and hooks, the pipeline integrates with a range of data sources and destinations, including databases, APIs, and cloud storage services.
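The split between hooks (access to an external system) and operators (one unit of work that uses a hook) can be illustrated with a minimal sketch that uses only the Python standard library. It does not import Airflow and is not the authors' implementation. In an Airflow deployment the two classes would typically subclass `BaseHook` and `BaseOperator`, and the scheduler would call `execute(context)`. The names `InMemorySourceHook` and `ExtractToStoreOperator`, and the record layout, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


class InMemorySourceHook:
    """Hypothetical hook: wraps access to one external system (here, a list)."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self._rows = rows

    def get_records(self) -> List[Dict[str, Any]]:
        # A real hook would open a connection to a database or API here.
        return list(self._rows)


@dataclass
class ExtractToStoreOperator:
    """Hypothetical operator: one unit of work that relies on a hook."""

    task_id: str
    hook: InMemorySourceHook
    store: List[Dict[str, Any]] = field(default_factory=list)

    def execute(self, context: Dict[str, Any]) -> int:
        # Mirrors the shape of an Airflow operator's execute(context) method.
        records = self.hook.get_records()
        # Drop rows without an id so bad input does not reach the destination.
        valid = [r for r in records if r.get("id") is not None]
        self.store.extend(valid)
        return len(valid)


# Usage: one source row lacks an id and is filtered out.
hook = InMemorySourceHook(
    [{"id": 1, "v": "a"}, {"id": None, "v": "b"}, {"id": 3, "v": "c"}]
)
task = ExtractToStoreOperator(task_id="extract_orders", hook=hook)
loaded = task.execute(context={})
```

Keeping connection logic in the hook and task logic in the operator means a new source or destination usually needs only a new hook, while the operator stays unchanged.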

The abstract also highlights the advantages of hosting the pipeline on a cloud platform such as Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure. These environments offer elastic computing resources, so the pipeline can scale dynamically as workloads fluctuate. In addition, serverless services such as AWS Lambda, Google Cloud Functions, or Azure Functions can run individual data processing tasks, further improving scalability and cost-efficiency.
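As a rough illustration of serverless execution, the sketch below shows a small transformation step written in the shape of an AWS Lambda Python handler, which receives an `event` and a `context` argument. The `records` field of the event, the record fields `id` and `name`, and the response layout are assumptions made for this example; they are not taken from the paper.

```python
import json
from typing import Any, Dict, List


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Normalize a batch of records passed in the invocation event."""
    records = event.get("records", [])  # hypothetical event field
    transformed: List[Dict[str, Any]] = []
    failed = 0
    for record in records:
        try:
            transformed.append(
                {
                    "id": int(record["id"]),
                    "name": str(record["name"]).strip().lower(),
                }
            )
        except (KeyError, TypeError, ValueError):
            # Count malformed records instead of failing the whole batch.
            failed += 1
    body = {"processed": len(transformed), "failed": failed, "rows": transformed}
    return {"statusCode": 200, "body": json.dumps(body)}


# Local usage: the handler is a plain function, so it can be called directly.
sample_event = {"records": [{"id": "7", "name": "  Alice "}, {"name": "no id"}]}
response = lambda_handler(sample_event, context=None)
```

Because the handler holds no state between calls, the platform can run many copies in parallel when the workload grows, which is the scaling behavior the paragraph above describes.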

Keyphrases: Apache Airflow, Cloud Platforms, Custom operators, Data Extraction Pipeline, Data Quality Monitoring, Scalability, Serverless execution, data integration, distributed data processing, elastic computing, error handling, fault tolerance, robustness, workflow orchestration

Links:

https://easychair.org/publications/preprint/dqSx

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:13063,
  author    = {Dylan Stilinki and Kaledio Potter},
  title     = {Building a Scalable and Robust Data Extraction Pipeline with Apache Airflow and Cloud Platforms},
  howpublished = {EasyChair Preprint 13063},
  year      = {EasyChair, 2024}}
