KB: Airflow on AKS

To deploy Apache Airflow to Azure Kubernetes Service (AKS), you can follow the general steps below, based on the reference architecture in the diagram. The guide covers high-level steps and common practices for integrating with components such as Redis, Azure PostgreSQL, Azure File Share, and monitoring tools like Dynatrace.

Prerequisites

  1. Azure Kubernetes Service (AKS): Set up an AKS cluster and ensure you have access.
  2. Azure PostgreSQL: For the metadata database.
  3. Azure File Share: For log storage.
  4. Redis: For Airflow’s task queue.
  5. GitLab: For CI/CD pipeline integration and version control.
  6. Monitoring Tool (e.g., Dynatrace): Set up for observability.

Step-by-Step Deployment Guide

1. Set Up AKS Cluster

  • Create an AKS cluster using the Azure CLI or the Azure portal (a CLI example follows after this list).
  • Ensure the cluster has sufficient nodes for Airflow components (webserver, scheduler, workers).
  • Set up a network load balancer if needed for on-premises or other integrations.
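
For example, a small cluster can be created with the Azure CLI along the following lines (resource group, cluster name, region, and node size are placeholders to adapt):

bash
# Resource group and AKS cluster -- names, region, and node size are placeholders
az group create --name rg-airflow --location westeurope

az aks create \
  --resource-group rg-airflow \
  --name aks-airflow \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --enable-managed-identity \
  --generate-ssh-keys

# Fetch credentials so kubectl and helm can talk to the cluster
az aks get-credentials --resource-group rg-airflow --name aks-airflow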

2. Deploy Airflow on AKS using Helm

Helm provides an easy way to deploy Airflow on Kubernetes. The official Apache Airflow Helm chart is recommended:

bash
helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm install airflow apache-airflow/airflow --namespace airflow --create-namespace
  • Customize the Helm chart values file (values.yaml) to configure the number of workers, resource requests/limits, the Redis connection, and the PostgreSQL connection; a minimal override file is sketched below.
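
A starting point for such an override file, assuming the Celery executor (values are illustrative; apply it with helm upgrade --install airflow apache-airflow/airflow -n airflow -f values-override.yaml):

yaml
# values-override.yaml -- illustrative starting point, adjust to your workload
executor: CeleryExecutor
workers:
  replicas: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 2Gi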

3. Configure the Database (Azure PostgreSQL)

  • Set up an Azure PostgreSQL instance for Airflow metadata.
  • In the Helm values.yaml file, configure Airflow to connect to this PostgreSQL instance.
    yaml
    metadataConnection:
      url: postgresql://<username>:<password>@<your-postgres-url>:5432/airflow
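
Note that in recent versions of the official chart, an external metadata database is usually configured under data.metadataConnection as individual fields (with the bundled PostgreSQL disabled); the exact keys can vary by chart version, so treat the following as a sketch:

yaml
# Sketch for the official chart -- verify key names against your chart version
postgresql:
  enabled: false          # do not deploy the bundled PostgreSQL
data:
  metadataConnection:
    user: <username>
    pass: <password>
    protocol: postgresql
    host: <your-postgres-url>
    port: 5432
    db: airflow
    sslmode: require      # Azure PostgreSQL typically enforces SSL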

4. Configure Redis as Celery Broker

  • Set up Redis on Azure (e.g., Azure Cache for Redis) or within your AKS cluster.
  • Specify Redis connection information in the values.yaml for Airflow's Celery executor:
    yaml
    redis:
      host: <redis-host>
      port: 6379
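
If you run an external Redis (such as Azure Cache for Redis) instead of the chart's bundled one, the broker is typically pointed at it under data.brokerConnection, with credentials supplied via a secret when authentication/TLS is required. Key names may differ between chart versions, so treat this as a sketch:

yaml
# Sketch: external Redis as the Celery broker -- verify keys against your chart version
redis:
  enabled: false          # do not deploy the bundled Redis
data:
  brokerConnection:
    protocol: redis
    host: <redis-host>
    port: 6379
    db: 0
  # For a password/TLS-protected Redis (e.g. Azure Cache for Redis on port 6380),
  # a Kubernetes secret holding the full rediss:// connection URL can be referenced
  # instead, e.g. via data.brokerConnectionSecretName.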

5. Configure DAGs and Objects Sync

  • Use GitLab CI/CD for DAG management, data transformations (dbt), and object storage.
  • Set up a pipeline to sync DAGs from GitLab to the DAG folder in Airflow.
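
As an alternative (or complement) to a push-based pipeline, the official chart can pull DAGs straight from a Git repository with a git-sync sidecar. A sketch with a placeholder GitLab repository:

yaml
# Sketch: git-sync pulling DAGs from GitLab -- repo URL and branch are placeholders
dags:
  gitSync:
    enabled: true
    repo: https://gitlab.com/<group>/<dags-repo>.git
    branch: main
    subPath: dags
    # For private repositories, reference a Kubernetes secret with Git credentials,
    # e.g. dags.gitSync.credentialsSecret (HTTPS) or dags.gitSync.sshKeySecret (SSH).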

6. Set Up Persistent Storage for Logs (Azure File Share)

  • Create an Azure File Share for log storage.
  • Mount the Azure File Share to the AKS pods for logs by configuring the values.yaml file in the Airflow Helm chart.
    yaml
    logs:
      persistence:
        enabled: true
        storageClass: azurefile
        accessMode: ReadWriteMany
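
AKS ships built-in Azure Files storage classes (azurefile / azurefile-csi) that provision the file share dynamically. If you prefer to manage the claim yourself, a PersistentVolumeClaim like the sketch below can be created and referenced from the chart (e.g. via logs.persistence.existingClaim):

yaml
# Sketch: RWX PersistentVolumeClaim backed by Azure Files -- size and names are placeholders
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-logs
  namespace: airflow
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-csi
  resources:
    requests:
      storage: 10Gi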

7. Configure Monitoring with Dynatrace

  • Use Dynatrace or another monitoring tool to monitor the Airflow components. This might include deploying the Dynatrace OneAgent to the cluster (typically as a DaemonSet rolled out by the Dynatrace Operator) or using Kubernetes monitoring integrations.
  • Configure alerts to notify relevant teams (e.g., SRE, Data Engineering, Operations) based on specific incidents or performance thresholds.
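
The chart also ships a StatsD exporter for Airflow-level metrics (scheduler heartbeats, task durations, etc.), which the monitoring stack can scrape alongside the standard Kubernetes metrics, depending on your ingestion setup:

yaml
# Expose Airflow metrics via the chart's StatsD exporter for the monitoring tool to scrape
statsd:
  enabled: true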

8. Set Up Load Balancer and Nginx Ingress

  • Configure a load balancer (e.g., Azure Load Balancer) and an Nginx ingress controller for external access to Airflow; an install example follows below.
  • Update the Helm values.yaml file to enable the ingress:
    yaml
    ingress:
      enabled: true
      hosts:
        - airflow.<yourdomain>.com
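
If an ingress controller is not already running in the cluster, ingress-nginx can be installed with Helm (its LoadBalancer service makes AKS provision an Azure Load Balancer automatically). Note that in recent versions of the Airflow chart the web UI ingress settings are nested under ingress.web.*, so check your chart version:

bash
# Install the ingress-nginx controller -- namespace and release name are just examples
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace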

9. Configure Security and Access Control

  • Secure access to the Airflow UI and API with proper authentication (for example, Azure AD OAuth or LDAP via the webserver's Flask AppBuilder configuration).
  • Configure Kubernetes RBAC within AKS and restrict access to the Airflow namespace and its endpoints; a minimal RBAC sketch follows below.
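
On the Kubernetes side, a namespace-scoped Role and RoleBinding can limit who may inspect the Airflow workloads. A minimal read-only sketch (the group name is a placeholder for, e.g., an Azure AD group object ID):

yaml
# Sketch: read-only access to the airflow namespace for a platform/SRE group
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-readonly
  namespace: airflow
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/log", "services", "deployments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow-readonly-binding
  namespace: airflow
subjects:
  - kind: Group
    name: <azure-ad-group-object-id>
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: airflow-readonly
  apiGroup: rbac.authorization.k8s.io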

10. Test and Deploy

  • Test the deployment by accessing the Airflow webserver and triggering DAG runs.
  • Verify that logs are stored in Azure File Share and that the metadata is saved in PostgreSQL.
  • Confirm monitoring and alerting with Dynatrace.
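
A quick smoke test might look like the following (the service and deployment names assume the helm install from step 2; the DAG id is a placeholder):

bash
# Check that all Airflow components are running
kubectl get pods -n airflow

# Reach the web UI locally if the ingress is not wired up yet
kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow

# Trigger a DAG from the scheduler pod, then verify logs appear in the Azure File Share
kubectl exec -n airflow deploy/airflow-scheduler -- airflow dags trigger <dag_id>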

This high-level guide should help you get started. Let me know if you'd like more details on any specific step!
