Building an Enterprise Grade Azure Kubernetes Service


You have no doubt heard about the recent trend of dockerizing everything and running your applications in Kubernetes, with the promise of easier deployment and management of your services. Many of our clients are converting their legacy services to take advantage of this trend now that Kubernetes has matured and built a proven track record.


With all the big cloud providers now offering managed Kubernetes services, it has never been easier to run your services in a cluster that abstracts away many of the intricacies of containerized workloads. However, I find myself constantly reminding our clients that running a managed service inside a cloud provider like Azure does not give your services complete protection from common Kubernetes attack vectors, nor does it add unlimited scaling or high availability. These are all things that you, the customer, are responsible for configuring on your own.


Below is a checklist that we run through with all of our clients, using Azure Kubernetes Service (AKS) as the example of a managed Kubernetes service. Consider each item against your enterprise needs when configuring your cluster. Some of these features must be turned on at cluster creation time, so make sure you do your research beforehand.


Cluster Security

We always tell our clients there is no one-size-fits-all approach to securing an AKS cluster. Instead, we usually recommend security in layers. This way, if an attacker breaches one layer, they still have several more to get through, which greatly decreases their chance of success. Below are some of our recommended “layers”:



  1. Try to refrain from exposing any public endpoints on your cluster. Use Private Endpoints wherever possible to ensure the traffic between AKS and other services remains on the private network. AKS now offers a private endpoint for your API server to ensure traffic between your API server and node pools remains secure.
  2. Building off of point 1, there are obviously some services you may still wish to access from the public internet without directly exposing your pods. To accomplish this, we often suggest adding an Application Gateway Ingress Controller (AGIC) in front of your pods to protect your ingress traffic. AGIC further protects your cluster by offering TLS policy and Web Application Firewall (WAF) functionality.
  3. In certain scenarios you may also want to protect your cluster’s egress traffic. Many of AKS’ outbound dependencies are defined with FQDNs, which don’t have static addresses behind them. The lack of static addresses means that Network Security Groups (NSGs) can’t be used to lock down the outbound traffic from an AKS cluster. In this scenario, the simplest way to secure outbound traffic is to use a firewall device, such as Azure Firewall, that can control outbound traffic based on domain names.
  4. Any container you deploy to your cluster that includes out-of-date base images or unpatched application runtimes introduces a security risk and a possible attack vector. Use an image scanning tool in your CI/CD pipeline to automate image scans, verification, and deployments. My go-to image scanning tools are Sysdig and Qualys, both of which have excellent integration with Azure DevOps pipelines.
  5. AKS is one of the fastest evolving resources on the Azure platform. Each update includes new features as well as bug and security fixes, so it is important to stay up to date. Upgrading your AKS Kubernetes instance to the latest version is as simple as pressing a button inside the Azure Portal. I always make sure to read the release notes first to plan for any potential breaking changes.
  6. We advise clients to avoid storing sensitive data directly in their container images. Instead, we recommend using an external secret store like Azure Key Vault to manage all your Kubernetes secrets, credentials, and certificates. Using AAD Pod Identity, you can easily authenticate your pods with Azure Active Directory to access your key vault.
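
To make point 3 more concrete, here is a minimal Python sketch of the kind of decision a domain-name-aware firewall makes and an IP-based NSG cannot: the rule is written against a name pattern, not an address. The allow-list, the `is_egress_allowed` helper, and the wildcard handling are all assumptions made for illustration. This is not how Azure Firewall is implemented, and the real list of FQDNs your cluster needs should come from the AKS documentation.

```python
from fnmatch import fnmatchcase

# Illustrative allow-list only (an assumption for this sketch). Check the
# current AKS documentation for the FQDNs your cluster actually requires.
ALLOWED_FQDNS = [
    "mcr.microsoft.com",
    "*.data.mcr.microsoft.com",
    "management.azure.com",
    "login.microsoftonline.com",
]


def is_egress_allowed(host, allowed=ALLOWED_FQDNS):
    """Return True if the destination host name matches an allowed FQDN pattern."""
    # Normalise case and drop a trailing root dot before matching.
    normalized = host.lower().rstrip(".")
    return any(fnmatchcase(normalized, pattern) for pattern in allowed)


if __name__ == "__main__":
    for destination in ("mcr.microsoft.com", "eastus.data.mcr.microsoft.com", "example.com"):
        verdict = "allow" if is_egress_allowed(destination) else "deny"
        print(f"{destination}: {verdict}")
```

Because the rule matches on the name, it keeps working when the addresses behind that name change, which is exactly the case an NSG cannot handle.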

Cluster Scaling, High Availability & Disaster Recovery

One of the beautiful things about AKS is its ability to provide high availability using multiple nodes in a Virtual Machine Scale Set (VMSS). However, by default, AKS does not enable cluster or pod autoscaling, meaning you’re likely to have only a fixed amount of resources to work with. Furthermore, having multiple nodes doesn’t protect your system from a region failure. To avoid these situations and maximize your uptime, it is important to plan ahead to maintain business continuity and prepare for disaster recovery.



  1. If high availability of your services takes priority over cost, then we normally recommend deploying two or more separate clusters in separate regions. You can then use Azure Traffic Manager to route traffic to the closest AKS cluster.
  2. Make use of monitoring tools to ensure the overall health of your cluster. When I originally started working with AKS, I used Prometheus with Grafana as my default monitoring and alerting tool. At the time, Prometheus was the industry standard. However, in the last year AKS has really improved its built-in Kubernetes monitoring, which can be found in the Azure Portal under the “Container Insights” tab. Make sure to enable Container Insights and configure alerts for any events you wish to be notified of.
  3. We almost always recommend the Cluster Autoscaler to our clients, no matter the size or scale of their cluster. Running cloud VMs can get quite costly, so why should you pay full price for machines that are only being used at 10% of their capacity? The AKS cluster autoscaler adds nodes when pods can’t be scheduled due to a lack of resources, and then removes them again once that demand passes.
  4. Once you have turned on Cluster Autoscaling, your next step should be to configure Horizontal Pod Autoscaling. The horizontal pod autoscaler uses the Metrics Server in a Kubernetes cluster to monitor the resource demand of pods. Similar to the cluster autoscaler, if an application needs more resources, the number of pods is automatically increased to meet the demand, and then decreased again once that demand passes.
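
To give a feel for how the horizontal pod autoscaler in point 4 decides on a replica count, below is a small Python sketch of the scaling rule described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0 and clamped to the configured minimum and maximum. The function name, parameter names, and default bounds are my own choices for illustration, and the real controller also handles details this sketch ignores, such as pods that are not ready and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go outside the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 50% target: scale out.
    print(desired_replicas(4, current_metric=90, target_metric=50))
    # 4 pods averaging 20% CPU against a 50% target: scale in.
    print(desired_replicas(4, current_metric=20, target_metric=50))
```

The maximum bound is worth setting deliberately: once the pods no longer fit on the existing nodes, it is the cluster autoscaler from point 3 that has to add machines, and that is where the cost comes from.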

Cluster Permissions & Policies

The Principle of Least Privilege (PoLP) is something we always push to our clients in every aspect of their business. This is no different when working with Kubernetes. Kubernetes allows you to create roles that permit actions such as creating or modifying resources, or viewing logs from running application workloads. Using Kubernetes role-based access control (Kubernetes RBAC), you can grant users, groups, and service accounts access to only the resources they need.




  1. If you’re already working in Azure, chances are you are using Azure Active Directory (Azure AD). I highly recommend integrating Azure AD with Kubernetes. Azure allows you to easily integrate your existing groups, roles, and service accounts, giving you granular access control right from cluster creation.
  2. Once you’ve integrated Azure AD with your cluster, you can further restrict access to your API server using Azure AD-based roles. You can give your developers extremely granular permissions, such as access to only specific namespaces or the ability to run only certain kubectl commands.
  3. Certain resources in AKS, in addition to being deployed to the cluster, require an identity in order to create Azure resources, such as an internal load balancer for an ingress controller or managed disks for a StatefulSet. For these types of operations it is best to use a service principal or managed identity that can deploy resources to AKS and also has the permissions needed to create the underlying Azure resources.
  4. Finally, as low-hanging fruit, we usually recommend turning on AKS control plane logging. With RBAC roles enabled, no commands should ever be run anonymously. Instead, you should be able to see exactly who performed a specific action.
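
As an illustration of points 1 and 2, the following Python sketch generates a namespace-scoped Kubernetes Role and a RoleBinding that grants it to an Azure AD group, referenced by the group's object ID. The helper name, the role name `pod-reader`, the namespace `dev`, and the all-zero group ID are hypothetical placeholders, and the chosen verbs are just one example of a read-only permission set. Treat it as a starting point under those assumptions, not a recommended policy.

```python
import json

RBAC_API_GROUP = "rbac.authorization.k8s.io"


def build_namespace_reader(namespace, aad_group_object_id, role_name="pod-reader"):
    """Build a Role and RoleBinding giving a group read-only access to pods."""
    role = {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }
    binding = {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
        # With Azure AD integration, the group is identified by its object ID.
        "subjects": [
            {"kind": "Group", "name": aad_group_object_id, "apiGroup": RBAC_API_GROUP}
        ],
        "roleRef": {"kind": "Role", "name": role_name, "apiGroup": RBAC_API_GROUP},
    }
    # Wrap both objects in a List so they can be applied as one file.
    return {"apiVersion": "v1", "kind": "List", "items": [role, binding]}


if __name__ == "__main__":
    # Placeholder object ID: substitute the ID of a real Azure AD group.
    manifests = build_namespace_reader("dev", "00000000-0000-0000-0000-000000000000")
    print(json.dumps(manifests, indent=2))
```

The JSON output can be saved to a file and applied with `kubectl apply -f`. Members of that group can then view pods and their logs in the `dev` namespace and nothing else, which is the least-privilege behaviour we are after.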