At Venafi and Jetstack, we care about doing Machine Identity Management right. Kubernetes environments are an ever-growing part of this challenge. Securing Kubernetes workloads presents a new set of challenges and opportunities. In this post, we’ll talk about how best to secure communication between Kubernetes workloads and how to evaluate your options.
A workload is a running instance of an application. Workload identities are how workloads trust and get trusted by each other. Workloads need to communicate with other workloads to function and, in doing so, need a mechanism to prove their identity to others. At the same time, they often need to be able to validate the identity of callers. There are many ways to solve this problem, but it can be a tricky one to get right.
We created cert-manager at Jetstack, a Venafi company, to make issuing certificates easier in Kubernetes environments. Certificates form part of a system used to prove and verify identities. Through our work on this project, and with our customers, we have seen many different identity problems and helped build many different solutions. Some things don’t change though, and we still get regular questions about securing workloads and workload identities.
This post is for platform team members thinking about standardizing workload identities in their environments. It’s also for anyone working in a microservices world fed up with manual tasks like rotating certificates, or manually managing shared secrets. This post should make you aware of your options and how to evaluate them. It also presents a system design you might want to implement or borrow from as you improve your workload identity story.
To evaluate a workload identity system, we need to know what a good one looks like. A best practice workload identity system would have these properties:
1. Workload identities can’t be captured in transit and replayed by a bad actor.
2. Trust for a given workload identity should be configurable in order to:
3. Serving workloads know their caller’s identity. Seeing the caller identity allows workloads to:
4. Workload identities are short lived and regularly rotated automatically. Rotated identities and identities for new workloads can be added without needing to update other systems.
Meeting all of these properties might be more complicated than you realise. Many approaches to workload identities fall short in one or more ways.
Later in this post, we outline a workload identity system based on SPIFFE and Trust Domains which achieves this. SPIFFE is a carefully designed, cloud agnostic standard for representing workload identities based on X.509 certificates (SVIDs). The system uses Trust Domains to define boundaries of where an SVID is trusted.
Before getting to SPIFFE and Trust Domains in more detail, we’ll go over some more familiar systems and see where they fall short of a gold-standard for workload identity.
Going back to basics, one way for workloads to identify themselves is with a shared secret. You have likely used this when configuring a workload to use a secret such as an API token, key or username & password as part of the call. The workload serving the request is then required to look up the token in some way to ensure that it’s authorized to perform the operation being carried out in the request.
This simple mechanism fails on three out of four goals above:
We can do a lot better than this.
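To make the weakness concrete, here is a minimal Go sketch of a shared-secret check using only the standard library. The token value and the function name authorizeToken are hypothetical and chosen purely for illustration. Notice that the server learns only that the caller holds the token, not which workload is calling, and that anyone who captures the token in transit can present it again.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// sharedToken is a hypothetical static secret configured on both the
// client and the server.
const sharedToken = "example-api-token"

// authorizeToken reports whether the presented token matches the shared
// secret. The comparison is constant-time to avoid leaking information
// through timing differences.
func authorizeToken(presented string) bool {
	return subtle.ConstantTimeCompare([]byte(presented), []byte(sharedToken)) == 1
}

func main() {
	// A legitimate client and an attacker replaying a captured token
	// are indistinguishable to the server: both calls look the same.
	fmt.Println("valid token accepted:", authorizeToken("example-api-token"))
	fmt.Println("wrong token accepted:", authorizeToken("wrong-token"))
}
```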
What about using certificates? They can’t be replayed and they’re easier to rotate…
It’s easy to get certificates in Kubernetes using cert-manager. While maintaining cert-manager, we’ve seen some unusual applications of the tool for workload identity use cases with public certs.
Most cert-manager users first deploy cert-manager to set up certificates for public ingress. cert-manager’s ACME support is well documented and using the ACME issuer type with Let’s Encrypt is a common use case users learn and understand.
Users sometimes continue down this path when looking to set up workload identities with certificates - this is where things start to go wrong. Firstly, they need to be able to respond to ACME HTTP or DNS challenges, which involves automating an HTTP endpoint or DNS configuration. Once over that hurdle, the next challenge is distributing the certificate and key material to all workload instances - and updating them all when the certificate needs to be renewed.
On top of that, this solution is also hard to scale. As it’s using a public CA for a use case it wasn’t intended for, users can run into rate limiting which impacts the uptime of their private workload identity infrastructure. In addition, issued certificates are published to Certificate Transparency Logs, which can expose details about internal infrastructure to malicious third parties.
There are some big problems, but we’re getting closer to our goal by using certificates. However, this elementary implementation still falls short:
Deploying a service mesh can quickly improve our certificate-based workload identity system. Service meshes can automate the renewal and distribution of certificates as well as using private certificate authorities to avoid the rate-limiting issue with public CAs. cert-manager even has support for Istio via istio-csr. So what’s missing from a service mesh implementation?
Service meshes get us really close, but still fall short in two important ways:
We’re seeing a pattern. There are many ways to implement machine identities yet few of them are entirely suitable. In addition, identities are often implemented in many ways in a single company platform. This only further hampers our ability to build the production ready workload identity system we need.
To begin with, we need a standard identity framework. SPIFFE is a CNCF project which is just that.
SPIFFE describes a standardized X.509 certificate (SVID) format containing a SPIFFE ID. This ID is a workload’s identity and is used by client and server workloads to verify each other. SPIFFE IDs contained in a certificate can be the basis for an mTLS connection between two workload instances where both workloads know the identity of each other.
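A SPIFFE ID is a URI of the form spiffe://trust-domain/path. As a rough illustration, the Go sketch below splits such an ID into its trust domain and path using only the standard library. The function name parseSPIFFEID and the example ID are assumptions made for this sketch, and the checks are deliberately basic rather than the full validation the SPIFFE specification requires.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// parseSPIFFEID splits a SPIFFE ID of the form spiffe://<trust-domain>/<path>
// into its trust domain and path. It performs only basic checks, not the
// complete validation described by the SPIFFE specification.
func parseSPIFFEID(raw string) (trustDomain, path string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	if u.Scheme != "spiffe" {
		return "", "", errors.New("SPIFFE ID must use the spiffe:// scheme")
	}
	if u.Host == "" {
		return "", "", errors.New("SPIFFE ID must include a trust domain")
	}
	return u.Host, u.Path, nil
}

func main() {
	// Hypothetical ID for a workload running as a service account in a namespace.
	td, path, err := parseSPIFFEID("spiffe://example.org/ns/sandbox/sa/example-client")
	if err != nil {
		panic(err)
	}
	fmt.Println("trust domain:", td)
	fmt.Println("path:", path)
}
```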
Note that this is different from the service mesh use case where the ID is hidden from workloads by proxies (even though in Istio’s case, SPIFFE is used under the hood). A system where authorization decisions can be made based on workload identities has the potential to be more secure.
SPIFFE doesn’t get us there on its own though. SPIFFE mTLS only addresses points 1 & 3. We now have identities which can’t be replayed and have mTLS connections between workloads with identities. We’re not done yet: we still need a means of provisioning and rotating SPIFFE IDs as well as a system to control the scope at which the ID is valid. Let’s see how we can do that.
Now that we have a good standard for identity we can use everywhere in our platform, we need a way to provision it. cert-manager / csi-driver-spiffe is a great way to get up and running with SPIFFE for Kubernetes workloads.
New workload deployments and workload instances can get SPIFFE IDs (contained in SVIDs) automatically. These SVIDs are also automatically rotated by cert-manager.
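To give a feel for what automatic rotation involves, the Go sketch below decides when a short-lived certificate is due for renewal. The two-thirds-of-lifetime policy, the one hour lifetime and the function names are assumptions made for illustration. They are not a statement of how cert-manager or the CSI driver actually schedules renewals.

```go
package main

import (
	"fmt"
	"time"
)

// renewAfter returns how long after issuance a certificate should be
// renewed, under a hypothetical policy of renewing once two thirds of
// its lifetime has elapsed.
func renewAfter(lifetime time.Duration) time.Duration {
	return lifetime * 2 / 3
}

// dueForRenewal reports whether a certificate with the given validity
// window should be rotated at time now.
func dueForRenewal(notBefore, notAfter, now time.Time) bool {
	renewAt := notBefore.Add(renewAfter(notAfter.Sub(notBefore)))
	return !now.Before(renewAt)
}

func main() {
	issued := time.Now()
	expires := issued.Add(time.Hour) // a one hour SVID, assumed for this sketch

	fmt.Println("renew after:", renewAfter(expires.Sub(issued)))
	fmt.Println("due at 30m:", dueForRenewal(issued, expires, issued.Add(30*time.Minute)))
	fmt.Println("due at 45m:", dueForRenewal(issued, expires, issued.Add(45*time.Minute)))
}
```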
It’s also possible to issue SPIFFE IDs to workloads using SPIRE. SPIRE is more complicated to deploy but can support non-Kubernetes workloads as well as various other use cases.
This solves for point 4; however, we still need a way to correctly scope mutual trust for workloads.
So far, workloads have identities, they’re rotated, they’re not vulnerable to replay attacks and they’re standardized with SPIFFE. However, we’re still missing an important part - they’re not trusted. Workload SVIDs (certificates containing a SPIFFE ID) must be signed by a CA and the CA public key needs to be distributed to other workload instances (likely as part of a bundle of trusted public CA certificates) that might be interacting with that workload.
cert-manager/trust helps us here. Trust can make sure that the public CA certificates for a Trust Domain are present in each workload prior to it starting.
Workloads operate in one or more Trust Domains. The group of public CA certificates that workloads in a Trust Domain need in order to trust each other is called a Trust Bundle.
Trust Bundles might need to be distributed to another trust domain if workloads need to communicate between different trust domains. Note here though that the default is for identities not to be trusted and trust is only granted as trust bundles are synced to other locations. This makes it easy to strictly control the scope of identities and limit the blast radius of a compromised workload.
This solves for point 2: the scope of trust for a workload identity can be carefully controlled by how Trust Bundles are distributed.
Armed with Trust Bundles and their own identities in SVIDs, workloads can verify identities of other workloads and open mTLS sessions with them successfully.
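The deny-by-default behaviour can be sketched in Go with the standard library alone. The example below creates two throwaway CAs standing in for two Trust Domains, issues an SVID-like certificate from the first, and checks it against each Trust Bundle. All names here (newCA, newSVID, trustedBy and the trust domain names) are hypothetical. In a real deployment the CA lives in your issuer and the bundle is distributed for you rather than built in process.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net/url"
	"time"
)

// newCA creates a self-signed CA standing in for one trust domain's root.
func newCA(name string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: name},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newSVID issues a short-lived leaf certificate carrying a SPIFFE ID as a URI SAN.
func newSVID(ca *x509.Certificate, caKey *ecdsa.PrivateKey, spiffeID string) (*x509.Certificate, error) {
	id, err := url.Parse(spiffeID)
	if err != nil {
		return nil, err
	}
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		URIs:         []*url.URL{id},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// trustedBy reports whether the SVID chains to a CA in the given trust bundle.
func trustedBy(svid *x509.Certificate, bundle ...*x509.Certificate) bool {
	pool := x509.NewCertPool()
	for _, ca := range bundle {
		pool.AddCert(ca)
	}
	_, err := svid.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err == nil
}

func main() {
	caA, keyA, err := newCA("trust-domain-a")
	if err != nil {
		panic(err)
	}
	caB, _, err := newCA("trust-domain-b")
	if err != nil {
		panic(err)
	}
	svid, err := newSVID(caA, keyA, "spiffe://trust-domain-a.example/ns/sandbox/sa/example-server")
	if err != nil {
		panic(err)
	}
	fmt.Println("trusted with bundle A:", trustedBy(svid, caA))
	fmt.Println("trusted with bundle B:", trustedBy(svid, caB))
}
```

A workload holding only bundle B rejects the SVID until bundle A is explicitly added to its Trust Bundle, which is the property that keeps the scope of an identity under control.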
We’ve seen how our proposed system meets all our workload identity goals. Now let’s see it in practice. We have an example of a simple system consisting of two workloads which use SPIFFE IDs for workload identities.
In our example, there are two workloads: a client and a server. The client will call the server and the server will respond with a message. Both the client and the server will be aware of each other’s identities when communicating.
Before we can get there though, we need to put some infrastructure in place to make it all possible.
First, we’re going to need to install cert-manager. You can follow the instructions on the cert-manager website to get up and running with the latest version.
This demo also makes use of the approval feature in cert-manager. This means the default certificaterequests-approver must be disabled. There are instructions here which explain this in more detail.
If you have an existing deployment, make sure that the controllers flag is set on the cert-manager controller and that the certificaterequests-approver is disabled, e.g.:
We also need a means to share the Trust Bundle with each pod. While the CSI driver includes this along with the SPIFFE ID for pods for us, we need to get the trust domain material into a place where the CSI driver can use it. cert-manager/trust can help here by making sure that a CA certificate is replicated.
The instructions to install this component can be found here.
We need a root CA for two reasons:
First we can create a self signed root CA and then use this in a CA issuer. In production you’d likely have this CA managed elsewhere, Vault or Venafi for example.
Note that since we disabled the automatic approval above, you’ll also need to manually approve the CertificateRequest here:
We can then create that CA issuer like this:
Finally, we need to configure cert-manager/trust to replicate this Trust Bundle. This will make it available to the CSI driver when we install it next.
Next, we’re going to make use of another cert-manager component which can mount SPIFFE IDs into pods. This is a special CSI driver which makes use of the Kubernetes CSI integration to make getting a certificate into each pod really easy. The CSI driver will also use our issuer and Trust Bundle that we created in the previous step to provide X.509 SVIDs to pods along with the CA public keys.
The instructions to install this component can be found here.
When setting up this component, make sure that you have:
The certificates issued by this integration will be used by our workloads to communicate securely and verify each other. That’s all the infrastructure we need to get up and running with our example applications.
Our X.509 certificates containing SPIFFE IDs (SVIDs) can be used by any workload configured to use them for mTLS. In this post, we’ll use a simple gRPC client and server. Each will use go-spiffe to help make and verify the mTLS connections. Note however that this is not a requirement; it just makes things easier when writing Go applications using SPIFFE.
Let’s have a high level look into the anatomy of our SPIFFE enabled workloads. The code here is presented to give an overview of the example apps. You might prefer to have a look at the full demo yourself. You can find the code on GitHub here.
The first thing both apps need to do is load their X.509 certificate containing their SPIFFE ID:
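For readers who want to see what reading a SPIFFE ID out of a certificate amounts to, here is a minimal sketch using only the Go standard library rather than the go-spiffe helpers the demo relies on. The function names are hypothetical, and certWithURIs builds an in-memory certificate purely for illustration. A real workload would instead parse the certificate mounted into the pod by the CSI driver.

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
	"net/url"
)

// spiffeIDFromCert returns the SPIFFE ID stored in the certificate's URI SAN.
// An X.509 SVID carries exactly one URI SAN, and it must be a SPIFFE ID.
func spiffeIDFromCert(cert *x509.Certificate) (string, error) {
	if len(cert.URIs) != 1 {
		return "", errors.New("expected exactly one URI SAN")
	}
	if cert.URIs[0].Scheme != "spiffe" {
		return "", errors.New("URI SAN is not a SPIFFE ID")
	}
	return cert.URIs[0].String(), nil
}

// certWithURIs builds an unsigned, in-memory certificate with the given
// URI SANs. It exists only to keep this sketch self-contained.
func certWithURIs(raw ...string) *x509.Certificate {
	cert := &x509.Certificate{}
	for _, r := range raw {
		u, err := url.Parse(r)
		if err != nil {
			panic(err)
		}
		cert.URIs = append(cert.URIs, u)
	}
	return cert
}

func main() {
	// Hypothetical SPIFFE ID for the client workload.
	cert := certWithURIs("spiffe://example.org/ns/sandbox/sa/example-client")
	id, err := spiffeIDFromCert(cert)
	if err != nil {
		panic(err)
	}
	fmt.Println("workload identity:", id)
}
```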
This config is loaded into a singleton store so it can be accessed by parts of the program that need it. We can create a server in this way and use it like this:
Note how the server is set to AuthorizeAny. We can see how to be more selective in our client example later. The server is still aware of the client SPIFFE ID and it can still be used in authorization decisions.
Our server has a single function: saying hello to clients and logging this. Our hello world handler code looks like this:
Now let’s see what the client looks like. The client also needs to load its SPIFFE ID and trusted CA data as the server does - that much is the same. What we see here is that the client has also been configured with the SPIFFE ID of the server, so it knows which ID to expect from the server.
This works with a custom ‘Authorizer’ that takes a known ID.
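The difference between accepting any verified peer and accepting one known ID can be sketched without any dependencies. The Authorizer type and the two constructors below are hypothetical stand-ins that mirror the idea behind go-spiffe’s authorizers. They are not the library’s API, and they assume the peer’s SVID has already been verified against the Trust Bundle.

```go
package main

import (
	"fmt"
)

// Authorizer decides whether a peer with the given SPIFFE ID may connect.
// It is a simplified stand-in for illustration, not go-spiffe's own type.
type Authorizer func(peerID string) error

// authorizeAny accepts any peer whose SVID has already been verified
// against the trust bundle, as the server in this post does.
func authorizeAny() Authorizer {
	return func(string) error { return nil }
}

// authorizeID accepts only the peer with exactly the expected SPIFFE ID,
// as the client in this post does for the server.
func authorizeID(expected string) Authorizer {
	return func(peerID string) error {
		if peerID != expected {
			return fmt.Errorf("unexpected peer ID %q, want %q", peerID, expected)
		}
		return nil
	}
}

func main() {
	// Hypothetical SPIFFE IDs for the two workloads.
	serverID := "spiffe://example.org/ns/sandbox/sa/example-server"
	otherID := "spiffe://example.org/ns/sandbox/sa/other-workload"

	strict := authorizeID(serverID)
	fmt.Println("expected server:", strict(serverID))
	fmt.Println("other workload:", strict(otherID))
	fmt.Println("any verified peer:", authorizeAny()(otherID))
}
```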
We can then use this authorizer in our connection settings and use the connection to set up a new client:
Our client also isn’t very exciting. It just sends a Hello World message to the server every second and logs the error or message each time:
Remember to try out the end-to-end demo in kind on GitHub if you’d like to see how it works in detail.
In summary, we’ve talked about how workload identity implementations are often lacking and proposed a set of ideals for a modern workload identity system. We then presented a means of achieving them for workloads on a Kubernetes cluster based on cert-manager and its SPIFFE integration.
We’re really interested in SPIFFE at Jetstack. We’re experimenting with it and have presented some of our work at KubeCon EU 2022. If you’re working with SPIFFE too and would like to chat you can find us on the Kubernetes Slack, Twitter or via the Jetstack website.