CDK Pipelines and their Efficacy in Deploying ETL jobs in data lakes on AWS
Data Lake on AWS & how ETL jobs help
As data continues to be the driving force of technology, organizations are building central storage pools known as data lakes. These data lakes can be built on public clouds like Amazon Web Services (AWS) to store all the structured and unstructured data at scale.
Businesses are discovering the distinct advantage of building data lakes on AWS, as it brings in security, cost savings, efficiency, and scalability. The services that are required to easily process, transform, analyze, and manage structured and unstructured data are automatically configured with the AWS data lake. The AWS data lake makes it easy for business users to search and identify data for their various needs.
A data lake holds a large amount of data. Oftentimes, it becomes difficult to access relevant data due to complex hierarchy or the sheer size of data. In such cases, ETL jobs come to the rescue, extracting and transforming data from the pool to process and provide only transformed and meaningful data to the end-user for their consumption. With the growing size of data and business use cases, organizations develop multiple ETL jobs to ingest, process, and transform the data. This increasing number of ETL jobs requires a robust DevOps implementation to be maintained and deployed efficiently.
What are CDK and CDK Pipeline?
A Cloud Development Kit (CDK) is an open-source software development framework that is used by engineers to define cloud application resources in a programming language they are comfortable with.
A CDK pipeline can be considered a library for streamlined delivery of CDK applications. These pipelines are self-updating: when a new application stage is added, the pipeline automatically reconfigures itself to deploy it.
The Advantage of CDK Pipeline on AWS Data Lake
On AWS, CDK pipelines are part of the AWS Cloud Development Kit (AWS CDK). They automate release pipelines, so teams can focus on application development and delivery on the data lake. An AWS CDK pipeline helps implement a DevOps strategy for ETL jobs that covers continuous deployment and delivery, data processing, and test cycles, which in turn supports a smooth rollout of the data lake to the production environment.
Deploying ETL Jobs using CDK Pipeline on AWS
To deploy ETL jobs using CDK pipelines, we need to first create a data lake infrastructure. This includes Amazon Simple Storage Service (Amazon S3) buckets:
- Bronze – Raw input data is stored in this bucket exactly as it is received from the various data sources.
- Silver – Raw data is validated for quality, processed, enriched and distributed. Data format is also changed to Parquet or a similar columnar format for faster data retrieval.
- Gold – These buckets are purpose-built for a specific business case and data is transformed based on the needs of the business case and provided for consumption.
We also need AWS Key Management Service (KMS) encryption keys, Amazon Virtual Private Cloud (Amazon VPC), subnets, VPC endpoints, route tables, AWS Secrets Manager, and security groups.
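The three bucket layers can be captured in a small naming helper so that every environment gets the same layout. The sketch below is only an illustration: the naming pattern, the `bucket_name` function, and its parameters are assumptions, not part of any AWS API. The layer values (raw, conformed, purpose-built) follow the names used in this article.

```python
from enum import Enum


class Layer(Enum):
    """The three data lake layers described above."""
    BRONZE = "raw"            # data as received from the sources
    SILVER = "conformed"      # validated, enriched, columnar (e.g. Parquet)
    GOLD = "purpose-built"    # shaped for a specific business case


def bucket_name(environment: str, layer: Layer, account_id: str, region: str) -> str:
    """Build a lowercase bucket name; the naming pattern itself is hypothetical."""
    name = f"{environment}-{layer.value}-{account_id}-{region}".lower()
    # S3 bucket names must be between 3 and 63 characters long.
    if not 3 <= len(name) <= 63:
        raise ValueError(f"invalid bucket name length: {name}")
    return name


# Example: list the three bucket names for a placeholder dev account.
for layer in Layer:
    print(bucket_name("dev", layer, "111111111111", "us-east-1"))
```

Including the account ID and region in the name is one common way to keep bucket names globally unique, but any scheme that is applied consistently would work.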
The process looks like this:
- A file is uploaded to the Bronze (raw) bucket in Amazon S3.
- An AWS Lambda function is triggered and inserts an item into the Amazon DynamoDB table to track the file processing state.
- The first AWS Glue job runs, moving the input data from the Bronze (raw) bucket to the Silver (conformed) bucket in S3, and the AWS Glue Data Catalog table is updated.
- The input data is then processed by another AWS Glue job to the Gold (purpose-built) bucket.
- The input data is transformed in line with the ETL transformation rules, and the result is stored in Parquet format in the purpose-built bucket.
- Finally, the DynamoDB table is updated, and the job status is set to completed.
- Data engineers can now analyse the data via Amazon Athena or Amazon Redshift Spectrum.
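The state tracking in the steps above can be sketched in plain Python. This is a minimal illustration, not the actual Lambda code: the item attributes (`file_id`, `job_status`, `started_at`) and the status names are hypothetical, and writing the item to DynamoDB is left out so the logic can run without AWS access. The event shape follows the standard S3 event notification format.

```python
from datetime import datetime, timezone
from urllib.parse import unquote_plus


def build_tracking_item(s3_event: dict) -> dict:
    """Turn an S3 "object created" notification into a tracking record."""
    record = s3_event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])
    return {
        "file_id": f"{bucket}/{key}",    # hypothetical partition key
        "job_status": "STARTED",         # hypothetical status value
        "started_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical status flow mirroring the steps above.
NEXT_STATUS = {
    "STARTED": "RAW_TO_CONFORMED",
    "RAW_TO_CONFORMED": "CONFORMED_TO_PURPOSE_BUILT",
    "CONFORMED_TO_PURPOSE_BUILT": "COMPLETED",
}


def advance(item: dict) -> dict:
    """Return a copy of the item moved to the next processing state."""
    current = item["job_status"]
    if current not in NEXT_STATUS:
        raise ValueError(f"no transition from {current}")
    return {**item, "job_status": NEXT_STATUS[current]}
```

In a real deployment, the Lambda handler would persist the item returned by `build_tracking_item`, and each Glue job (or the orchestration around it) would apply the equivalent of `advance` once it finishes.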
Prerequisites for data lake ETL jobs deployment using CDK pipelines
The model is based on the following design principles:
- A dedicated AWS account to run CDK pipelines
- One or more AWS accounts where the data lake is deployed, depending on usage
- A dedicated code repository providing a landing zone for the data lake
- A dedicated source code repository for each ETL job, since each job has its own AWS service, orchestration, and configuration requirements
- Each dedicated source code repository is used to build, deploy, and maintain its ETL job
You will also need some resources to execute the deployment. These resources include CDK Application, CDK Pipelines stack, CDK Pipelines deploy stage, Amazon DynamoDB stack, AWS Glue stack, and AWS Step Functions stack.
Here’s how to go about the final solution.
The data lake ETL source code is divided into three branches – dev (main), test, and production. A dedicated AWS account is used to create the CDK pipelines, and each branch is mapped to its own CDK pipeline. You can then deploy data lake ETL jobs using CDK pipelines in a few steps:
- Have the DevOps administrator check the code in to the repository.
- Facilitate a one-time manual deployment on a target environment.
- Let AWS CodePipeline update itself by listening to commit events on the source code repositories.
- Changes made to the main, test, and production branches of the repository are then automatically deployed to the dev, test, and production environments of the data lake, respectively.
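The branch-to-environment mapping in the last step can be written down as a small lookup. This is a sketch only: the `target_environment` helper is hypothetical, and it assumes commit events carry Git-style refs such as `refs/heads/main`.

```python
from typing import Optional

# Branch-to-environment mapping taken from the last step above.
BRANCH_TO_ENVIRONMENT = {
    "main": "dev",
    "test": "test",
    "production": "production",
}


def target_environment(git_ref: str) -> Optional[str]:
    """Return the environment a pushed ref deploys to, or None if it is not tracked."""
    prefix = "refs/heads/"
    branch = git_ref[len(prefix):] if git_ref.startswith(prefix) else git_ref
    return BRANCH_TO_ENVIRONMENT.get(branch)
```

Returning `None` for untracked branches means that feature branches never trigger a deployment by accident.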
Advantages of data lake ETL jobs deployment using CDK pipelines
Using CDK pipelines to deploy data lake ETL jobs has quite a lot of benefits. Here are some of them:
- The deployment model is scalable and centralized, which helps deliver end-to-end automation. Engineers can maintain control over the deployment strategy and code by following the single responsibility principle.
- The fact that the model is scalable means that it can easily be expanded into multiple accounts with the pipelines being responsive to custom control within each environment.
- You get consistent management of all global configurations, such as resource names, regions, VPC CIDR ranges, and AWS account IDs, because the model allows configuration-driven deployment.
- You can also easily repeat the model for consistent deployment of new ETL jobs. All code changes are safely and securely propagated through all environments, allowing rapid iteration on data processing.
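One way to picture configuration-driven deployment is a single structure that holds the global settings for every environment, plus a check that runs before anything is deployed. The sketch below is an assumption about how such a file could look; the `EnvironmentConfig` fields, the account IDs, and the CIDR ranges are all placeholders.

```python
import ipaddress
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EnvironmentConfig:
    account_id: str
    region: str
    vpc_cidr: str
    resource_prefix: str


# All values below are placeholders for illustration.
CONFIG: Dict[str, EnvironmentConfig] = {
    "dev": EnvironmentConfig("111111111111", "us-east-1", "10.20.0.0/24", "dev-datalake"),
    "test": EnvironmentConfig("222222222222", "us-east-1", "10.40.0.0/24", "test-datalake"),
    "production": EnvironmentConfig("333333333333", "us-east-1", "10.60.0.0/24", "prod-datalake"),
}


def find_problems(config: Dict[str, EnvironmentConfig]) -> List[str]:
    """List malformed account IDs, invalid CIDRs, and overlapping VPC ranges."""
    problems: List[str] = []
    seen = []
    for name, env in config.items():
        # AWS account IDs are 12-digit numbers.
        if not (len(env.account_id) == 12 and env.account_id.isdigit()):
            problems.append(f"{name}: account ID must be 12 digits")
        try:
            network = ipaddress.ip_network(env.vpc_cidr)
        except ValueError:
            problems.append(f"{name}: invalid VPC CIDR {env.vpc_cidr}")
            continue
        for other_name, other in seen:
            if network.overlaps(other):
                problems.append(f"{name}: VPC CIDR overlaps with {other_name}")
        seen.append((name, network))
    return problems
```

Keeping these values in one place means a new ETL repository can read the same configuration instead of hard-coding account IDs or network ranges.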
Using CDK pipelines to deploy data lake ETL jobs in dev, test, and production AWS environments is a scalable and configuration-driven deployment model. By following the steps mentioned above, you can continuously deliver your data lake ETL jobs and automate the system using CDK pipelines.