As data continues to be the driving force of technology, organizations are building central storage pools known as data lakes. These data lakes can be built on public clouds like Amazon Web Services (AWS) to store structured and unstructured data at scale.
Businesses are discovering the distinct advantages of building data lakes on AWS: security, cost savings, efficiency, and scalability. The services required to process, transform, analyze, and manage structured and unstructured data are automatically configured as part of the AWS data lake, which also makes it easy for business users to search for and identify the data they need.
A data lake holds a large amount of data, and it often becomes difficult to reach the relevant pieces because of complex hierarchies or sheer volume. ETL jobs address this by extracting data from the pool, transforming it, and delivering only meaningful, processed data to end users. As data volumes and business use cases grow, organizations develop multiple ETL jobs to ingest, process, and transform that data, and this growing number of jobs requires a robust DevOps implementation so they can be maintained and deployed efficiently.
The Cloud Development Kit (CDK) is an open-source software development framework that lets engineers define cloud application resources in a programming language they are already comfortable with.
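For example, a minimal CDK application written in Python (one of the languages the CDK supports) might declare a single S3 bucket; the stack and bucket names here are purely illustrative:

```python
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct


class SampleBucketStack(cdk.Stack):
    """Declares one S3 bucket as infrastructure-as-code."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # An encrypted, versioned bucket defined in ordinary Python
        s3.Bucket(
            self,
            "SampleBucket",
            versioned=True,
            encryption=s3.BucketEncryption.S3_MANAGED,
        )


app = cdk.App()
SampleBucketStack(app, "SampleBucketStack")
app.synth()
```

Running cdk deploy against this application synthesizes a CloudFormation template and provisions the bucket.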
CDK Pipelines can be thought of as a library for the streamlined, continuous delivery of CDK applications. These pipelines are self-mutating: when a new application stage is added to the code, the pipeline automatically reconfigures itself to deploy the addition.
On AWS, CDK Pipelines ships as part of the AWS Cloud Development Kit (AWS CDK). It automates the release pipeline so that attention can stay on application development and delivery for the data lake. A CDK pipeline helps implement a DevOps strategy for ETL jobs, providing the continuous deployment and delivery, data processing, and test cycles needed to promote the data lake smoothly into the production environment.
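A minimal sketch of such a pipeline stack in Python follows; the repository name, branch handling, and connection ARN are assumptions made for illustration, not the exact stack from this solution:

```python
import aws_cdk as cdk
from aws_cdk import pipelines
from constructs import Construct


class EtlPipelineStack(cdk.Stack):
    """Self-mutating CDK pipeline that releases the data lake ETL application."""

    def __init__(self, scope: Construct, construct_id: str, *, branch: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        pipeline = pipelines.CodePipeline(
            self,
            "Pipeline",
            self_mutation=True,  # the default, shown here for emphasis
            synth=pipelines.ShellStep(
                "Synth",
                # Hypothetical repository and CodeStar connection ARN
                input=pipelines.CodePipelineSource.connection(
                    "my-org/data-lake-etl",
                    branch,
                    connection_arn=(
                        "arn:aws:codestar-connections:us-east-1:"
                        "111111111111:connection/example"
                    ),
                ),
                commands=["pip install -r requirements.txt", "npx cdk synth"],
            ),
        )
        # Application stages (the ETL stacks) are attached later with
        # pipeline.add_stage(...); adding a new stage to the code is all that is
        # needed, and the pipeline reconfigures itself on the next run.
        self.pipeline = pipeline
```

Because self-mutation is enabled, committing a change that adds a stage causes the pipeline to update its own definition before running the new deployment steps.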
To deploy ETL jobs using CDK Pipelines, we first need to create the data lake infrastructure. This includes Amazon Simple Storage Service (Amazon S3) buckets, AWS Key Management Service (AWS KMS) encryption keys, an Amazon Virtual Private Cloud (Amazon VPC) with subnets, VPC endpoints, and route tables, AWS Secrets Manager secrets, and security groups.
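A sketch of what such an infrastructure stack could look like in CDK Python is shown below; the bucket zone names and the choice of three buckets are illustrative assumptions, and the Vpc construct creates the subnets and route tables implicitly:

```python
import aws_cdk as cdk
from aws_cdk import (
    aws_ec2 as ec2,
    aws_kms as kms,
    aws_s3 as s3,
    aws_secretsmanager as secretsmanager,
)
from constructs import Construct


class DataLakeInfrastructureStack(cdk.Stack):
    """Illustrative data lake infrastructure: buckets, encryption, networking, secrets."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # KMS key used to encrypt the data lake buckets
        key = kms.Key(self, "DataLakeKey", enable_key_rotation=True)

        # Example data lake zones (names are assumptions for this sketch)
        for zone in ("Raw", "Conformed", "Curated"):
            s3.Bucket(
                self,
                f"{zone}Bucket",
                encryption=s3.BucketEncryption.KMS,
                encryption_key=key,
                versioned=True,
            )

        # VPC with subnets; route tables are created implicitly by the construct
        vpc = ec2.Vpc(self, "DataLakeVpc", max_azs=2)

        # Gateway endpoint so ETL jobs reach S3 without traversing the internet
        vpc.add_gateway_endpoint(
            "S3Endpoint", service=ec2.GatewayVpcEndpointAwsService.S3
        )

        # Security group for the ETL compute
        ec2.SecurityGroup(self, "EtlSecurityGroup", vpc=vpc, allow_all_outbound=True)

        # Placeholder secret, for example source database credentials
        secretsmanager.Secret(self, "SourceDbSecret")
```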
The process looks like this:
The deployment model is based on the following design principles:
You will also need some resources to execute the deployment: a CDK application, a CDK Pipelines stack, a CDK Pipelines deploy stage, an Amazon DynamoDB stack, an AWS Glue stack, and an AWS Step Functions stack.
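One way these pieces can fit together, sketched below, is a cdk.Stage that groups the DynamoDB, Glue, and Step Functions stacks so the pipeline can deploy them as a single unit with pipeline.add_stage(...); the table, job, and state machine definitions are illustrative placeholders rather than the exact resources of this solution:

```python
import aws_cdk as cdk
from aws_cdk import (
    aws_dynamodb as dynamodb,
    aws_glue as glue,
    aws_stepfunctions as sfn,
)
from constructs import Construct


class DynamoDbStack(cdk.Stack):
    """Audit/metadata table that ETL jobs write run details to (illustrative)."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        dynamodb.Table(
            self,
            "EtlAuditTable",
            partition_key=dynamodb.Attribute(
                name="job_run_id", type=dynamodb.AttributeType.STRING
            ),
            billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
        )


class GlueStack(cdk.Stack):
    """AWS Glue job that transforms raw data; role and script location are placeholders."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        glue.CfnJob(
            self,
            "TransformJob",
            role="arn:aws:iam::111111111111:role/example-glue-role",
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",
                script_location="s3://example-bucket/scripts/transform.py",
            ),
        )


class StepFunctionsStack(cdk.Stack):
    """State machine that orchestrates the ETL jobs (minimal placeholder definition)."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        sfn.StateMachine(
            self,
            "EtlWorkflow",
            definition_body=sfn.DefinitionBody.from_chainable(sfn.Pass(self, "Start")),
        )


class DataLakeEtlStage(cdk.Stage):
    """Deploy stage the pipeline promotes through dev, test, and production."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        DynamoDbStack(self, "DynamoDb")
        GlueStack(self, "Glue")
        StepFunctionsStack(self, "StepFunctions")
```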
Here is how the final solution comes together.
The data lake ETL source code is divided into three branches: dev, test, and production. A dedicated AWS account is used to host the CDK pipelines, and each branch is mapped to its own pipeline. With that mapping in place, you can deploy the data lake ETL jobs using CDK pipelines in a few steps, as sketched below.
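A configuration-driven sketch of that branch-to-environment mapping, building on the pipeline stack and deploy stage sketched earlier; the module names, account IDs, and regions are hypothetical:

```python
import aws_cdk as cdk

# These imports refer to the stacks sketched earlier; the module names are hypothetical.
from pipeline_stack import EtlPipelineStack
from etl_stage import DataLakeEtlStage

# Branch-to-environment mapping; the account IDs and regions are placeholders.
TARGET_ENVS = {
    "dev": cdk.Environment(account="111111111111", region="us-east-1"),
    "test": cdk.Environment(account="222222222222", region="us-east-1"),
    "production": cdk.Environment(account="333333333333", region="us-east-1"),
}

# The pipelines themselves live in the dedicated deployment account.
DEPLOYMENT_ENV = cdk.Environment(account="999999999999", region="us-east-1")

app = cdk.App()
for branch, target_env in TARGET_ENVS.items():
    # One pipeline per branch; each target account must be CDK-bootstrapped to
    # trust the deployment account for cross-account deployments to work.
    stack = EtlPipelineStack(
        app, f"DataLakeEtlPipeline-{branch}", branch=branch, env=DEPLOYMENT_ENV
    )
    # The pipeline promotes the ETL application into the environment mapped to its branch.
    stack.pipeline.add_stage(DataLakeEtlStage(stack, branch.title(), env=target_env))

app.synth()
```

With this layout, merging a change into a given branch triggers only the pipeline mapped to that branch, which builds the application, self-mutates if needed, and deploys the ETL stacks into the corresponding environment.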
Using CDK pipelines to deploy data lake ETL jobs into dev, test, and production AWS environments gives you a scalable, configuration-driven deployment model. By following the steps above, you can continuously deliver your data lake ETL jobs and keep the whole system automated with CDK pipelines.