Airflow on Kubernetes with Helm
In the cloud-native world, Kubernetes and Helm are becoming the de-facto standards for deploying, managing, and scaling applications. As a data engineer, you may often find yourself in a situation where you need to manage complex data pipelines. This is where Apache Airflow comes in. Airflow is a robust open-source platform that lets you programmatically author, schedule, and monitor workflows. Combining the power of Airflow with the resilience and scalability of Kubernetes, we can create a highly reliable data pipeline management system.
Why Deploy Airflow on Kubernetes?
Deploying Airflow on Kubernetes has several advantages over other deployment methods. Traditionally, Airflow is deployed on virtual machines or bare-metal servers. As your data processing needs grow, you need to manually handle the scaling, which can be tedious and error-prone.
On the other hand, Kubernetes is an open-source platform for automating the deployment, scaling, and management of containerized applications. It provides a more efficient and seamless way to deploy, scale, and manage Airflow.
Benefits of the Kubernetes Executor
Airflow ships with several types of executors, each with its own advantages. When deploying Airflow on Kubernetes, the Kubernetes executor brings significant benefits: it creates a new pod for every task instance, so each task runs in isolation and uses only the resources it needs. You don’t have to worry about one task affecting another through shared resources.
This level of isolation makes debugging simpler. If a task fails, you can examine the pod’s logs and status without worrying about other tasks’ interference. Scaling becomes a breeze with the Kubernetes executor. It scales up when there are many tasks to run and scales down when there are fewer tasks. You only use the resources you need, leading to cost efficiency.
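If you want to see this isolation in action later, you can watch the namespace while a DAG runs: with the Kubernetes executor, a short-lived worker pod appears for each task instance and disappears when the task finishes (assuming Airflow ends up installed in the airflow namespace, as in the steps below):
# Watch per-task worker pods come and go while a DAG is running
kubectl get pods -n airflow --watch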
Deploying Airflow on Kubernetes with Helm
Helm is a package manager for Kubernetes that simplifies deploying applications on a cluster. It uses a packaging format called charts. A Helm chart is a collection of files that describe a related set of Kubernetes resources. In this tutorial, I will install and run Airflow on Google Kubernetes Engine (GKE).
To deploy Airflow on Kubernetes using the official Airflow Helm chart, you need to follow these steps:
Note: the community also maintains a separate Helm chart for Airflow, but this article only uses the official Helm chart from Apache Airflow.
1. Install Helm
Depending on your operating system, you can find different installation instructions in the official Helm documentation.
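For example, on macOS (or Linux with Homebrew) you can typically install Helm like this; other platforms have their own packages and install scripts in the Helm docs:
# Install Helm via Homebrew
brew install helm
# Verify the client is available
helm version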
2. Add the Helm chart repository for Airflow
Use the command below to add the official Airflow chart repository:
helm repo add apache-airflow https://airflow.apache.org
3. Update your Helm repository
Use the command below to make sure you have the latest version of the chart:
helm repo update
4. Customize your installation and install Airflow
The Helm chart comes with default values that might not fit your needs.
You can override these values with a custom YAML file.
# values.yaml
# Airflow executor
executor: "KubernetesExecutor"
Use the command below to create the airflow namespace and set it as the default for your current context:
kubectl create namespace airflow && kubectl config set-context --current --namespace=airflow
Use the command below to install Airflow in the airflow namespace:
helm upgrade --install airflow apache-airflow/airflow --namespace airflow -f values.yaml
Once the installation completes, you should see output similar to this:
-> kubectl get pods,svc -n airflow
NAME                                     READY   STATUS    RESTARTS   AGE
pod/airflow-postgresql-0                 1/1     Running   0          1d
pod/airflow-scheduler-8598d7458f-2bw44   3/3     Running   0          1d17h
pod/airflow-statsd-665cc8554c-6jqc4      1/1     Running   0          1d
pod/airflow-triggerer-0                  3/3     Running   0          1d17h
pod/airflow-webserver-77cd74fb86-2xvhv   1/1     Running   0          1d17h

NAME                            TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
service/airflow-postgresql      ClusterIP   10.72.3.250    <none>        5432/TCP            1d
service/airflow-postgresql-hl   ClusterIP   None           <none>        5432/TCP            1d
service/airflow-statsd          ClusterIP   10.72.5.171    <none>        9125/UDP,9102/TCP   1d
service/airflow-triggerer       ClusterIP   None           <none>        8794/TCP            1d
service/airflow-webserver       ClusterIP   10.72.10.125   <none>        8080/TCP            1d
Now you can access the Airflow UI by using kubectl port-forward:
kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow
The web server can now be accessed on localhost:8080. The default credentials are username admin and password admin.
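For anything beyond a local test you should change these credentials. The official chart lets you override the default user in values.yaml; here is a minimal sketch, assuming the webserver.defaultUser keys of recent chart versions (confirm the exact names with helm show values):
# values.yaml
webserver:
  defaultUser:
    enabled: true
    username: admin
    password: change-me   # placeholder, pick a strong password
    role: Admin
    email: admin@example.com
    firstName: Air
    lastName: Flow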
Automatically pull Airflow DAGs from a private GitHub repository with the git-sync feature
In the dynamic world of data engineering and workflow automation, staying agile and organized is essential. Imagine having your DAGs always up-to-date, seamlessly accessible by your team, and securely stored in a version-controlled environment. That’s where the magic of automatically pulling Airflow DAGs from a private GitHub repository comes into play.
1. Creating a private git repository and setting up the connection
To synchronize your Airflow DAGs, you need to establish a code repository where you can save your local DAGs. You are free to choose any code repository, but for this guide, GitHub will be used.
First, form a private repository to keep your DAGs.
airflow-on-k8s
└──airflow
└──dags
└── example_bash_operator.py
After this, create a deploy key for your repository to enable SSH access. You can generate an SSH key pair using ssh-keygen as shown below:
-> ssh-keygen -t rsa -b 4096 -C "your-mail@gmail.com"
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/hungnguyen/.ssh/id_rsa): airflow_ssh_key
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in airflow_ssh_key
Your public key has been saved in airflow_ssh_key.pub
Now that you have generated the key pair (airflow_ssh_key and airflow_ssh_key.pub), go to your GitHub repository, open Settings, and find ‘Deploy keys’.
To add a new deploy key, you need the contents of the public key, which is stored in airflow_ssh_key.pub. Copy it and paste it into GitHub to create your deploy key.
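A quick way to get the public key is simply to print it and copy it from the terminal:
# Print the public key to paste into GitHub's Deploy keys page
cat airflow_ssh_key.pub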
After creating the deploy key, you need to create a Kubernetes Secret inside your cluster. To do that, create the following YAML file and apply it with kubectl:
# airflow-ssh-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: airflow-ssh-secret
  namespace: airflow
data:
  gitSshKey: <contents of running 'base64 airflow_ssh_key'> # modify here
kubectl apply -f airflow-ssh-secret.yaml
After creating the Secret, you can verify it from the command line:
kubectl describe secret airflow-ssh-secret -n airflow
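As an alternative to base64-encoding the key yourself, kubectl can build an equivalent Secret directly from the private key file; a one-liner sketch using the same secret name and data key:
# Create the same Secret from the private key file (kubectl handles the base64 encoding)
kubectl create secret generic airflow-ssh-secret --from-file=gitSshKey=airflow_ssh_key -n airflow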
2. Editing the Airflow Helm YAML file to configure the gitSync feature
Now that you have created a Git repository with a deploy key and a Kubernetes Secret using the kubectl CLI, it’s time to edit the YAML file that is used to configure the Airflow deployment. In the official chart, the gitSync settings live under the dags section:
# values.yaml
dags:
  gitSync:
    enabled: true
    repo: git@github.com:hungngph/airflow-on-k8s.git
    branch: main
    subPath: "airflow/dags"
    sshKeySecret: airflow-ssh-secret
To apply the changes, just run the command:
helm upgrade airflow apache-airflow/airflow --namespace airflow --values values.yaml
This command will redeploy Airflow using the configuration settings inside the values.yaml file.
And that’s it! Every time you change a DAG locally and push it to your Git repository, git-sync will automatically pull the change into the Airflow DAGs folder.
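To confirm that syncing works, you can tail the git-sync sidecar the chart runs next to the scheduler; a quick sanity check, assuming the sidecar container keeps the chart’s default name git-sync:
# Tail the git-sync sidecar logs on the scheduler
kubectl logs deployment/airflow-scheduler -c git-sync -n airflow --tail=50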
Integrate Google Cloud Storage for remote logging
Airflow tasks on Kubernetes run in pods, which are transient and can be started or terminated on demand. If remote logging isn’t set up, you risk being unable to view the logs of running tasks, or losing them entirely once the pods are terminated. Setting up remote logging ensures the logs persist beyond the lifespan of the individual pods.
1. Creating a GCP service account
Create a service account and grant it GCP access that includes the Storage Object Admin role. Then generate a JSON key named “k8s-services-airflow-sc.json” for this service account.
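If you prefer the command line over the Cloud Console, the same can be done with gcloud; a sketch, assuming a service account named airflow-logs and your own project ID in place of <PROJECT_ID>:
# Create the service account (the name airflow-logs is just an example)
gcloud iam service-accounts create airflow-logs --display-name="Airflow remote logging"
# Grant it the Storage Object Admin role on the project
gcloud projects add-iam-policy-binding <PROJECT_ID> \
  --member="serviceAccount:airflow-logs@<PROJECT_ID>.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"
# Export a JSON key for the service account
gcloud iam service-accounts keys create k8s-services-airflow-sc.json \
  --iam-account="airflow-logs@<PROJECT_ID>.iam.gserviceaccount.com"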
Next, create a Kubernetes Secret from the JSON key (k8s-services-airflow-sc.json) with this command:
kubectl create secret generic sc-key --from-file=key.json=/<path>/<to>/<sc>/k8s-services-airflow-sc.json -n airflow
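Also make sure the GCS bucket you will reference in remote_base_log_folder exists. The values below use gs://airflow/logs/, but bucket names are globally unique, so you will likely need your own bucket name (and should adjust remote_base_log_folder to match); for example:
# Create a bucket for remote task logs (replace name and location with your own)
gsutil mb -l US gs://<your-logs-bucket>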
2. Editing the Airflow Helm YAML file to configure the remote logging feature
# values.yaml
# Environment variables for all Airflow containers
env:
  - name: GOOGLE_APPLICATION_CREDENTIALS
    value: "/opt/airflow/secrets/key.json"

# Airflow scheduler settings
scheduler:
  extraVolumeMounts:
    - name: google-cloud-key
      mountPath: /opt/airflow/secrets
  extraVolumes:
    - name: google-cloud-key
      secret:
        secretName: sc-key

# Airflow webserver settings
webserver:
  extraVolumeMounts:
    - name: google-cloud-key
      mountPath: /opt/airflow/secrets
  extraVolumes:
    - name: google-cloud-key
      secret:
        secretName: sc-key

# Airflow triggerer settings
triggerer:
  extraVolumeMounts:
    - name: google-cloud-key
      mountPath: /opt/airflow/secrets
  extraVolumes:
    - name: google-cloud-key
      secret:
        secretName: sc-key

config:
  logging:
    remote_logging: 'True'
    remote_base_log_folder: 'gs://airflow/logs/'
    remote_log_conn_id: 'sc-key'
    google_key_path: "/opt/airflow/secrets/key.json"
To apply the changes, just run the command:
helm upgrade airflow apache-airflow/airflow --namespace airflow --values values.yaml
This command will redeploy Airflow using the configuration settings inside the values.yaml file.
Now you can view the remote logs from GCS in the Airflow UI.
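You can also confirm from the command line that task logs are landing in the bucket (using the path from remote_base_log_folder):
# List the remote log folder in GCS
gsutil ls gs://airflow/logs/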
Additionally, the official Helm chart documentation has a production guide covering things to consider when running this Airflow Helm chart in a production environment.
Conclusion
Deploying Airflow on Kubernetes with Helm is a powerful combination that brings scalability, resilience, and efficiency to your data pipelines. It leverages the best of both worlds: the workflow management capabilities of Airflow and the container orchestration capabilities of Kubernetes. With the added benefits of using the Kubernetes executor, you can rest assured that your data pipelines will be robust and reliable.