Kubernetes 1.28: Beta support for using swap on Linux
The 1.22 release introduced Alpha support for configuring swap memory usage for Kubernetes workloads running on Linux on a per-node basis. Now, in release 1.28, support for swap on Linux nodes has graduated to Beta, along with many new improvements.
Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems. This was due to the inherent difficulty in guaranteeing and accounting for pod memory utilization when swap memory was involved. As a result, swap support was deemed out of scope in the initial design of Kubernetes, and the default behavior of a kubelet was to fail to start if swap memory was detected on a node.
In version 1.22, the swap feature for Linux was initially introduced in its Alpha stage. This represented a significant advancement, providing Linux users with the opportunity to experiment with the swap feature for the first time. However, as an Alpha version, it was not fully developed and had several issues, including inadequate support for cgroup v2, insufficient metrics and summary API statistics, inadequate testing, and more.
Swap in Kubernetes has numerous use cases for a wide range of users. As a result, the node special interest group within the Kubernetes project has invested significant effort into supporting swap on Linux nodes for Beta. Compared to the Alpha, the kubelet's support for running with swap enabled is more stable and robust, more user-friendly, and addresses many known shortcomings. This graduation to Beta represents a crucial step towards achieving the goal of fully supporting swap in Kubernetes.
How do I use it?
To use swap on a node where it has already been provisioned, enable the NodeSwap feature gate on the kubelet. You must also set the failSwapOn configuration setting to false, or disable the deprecated --fail-swap-on command line flag.
You can set the memorySwap.swapBehavior option to define how a node uses swap memory. For instance:
# this fragment goes into the kubelet's configuration file
memorySwap:
  swapBehavior: UnlimitedSwap
The available configuration options for swapBehavior are:
- UnlimitedSwap (default): Kubernetes workloads can use as much swap memory as they request, up to the system limit.
- LimitedSwap: The utilization of swap memory by Kubernetes workloads is subject to limitations. Only Pods of Burstable QoS are permitted to employ swap.
If configuration for memorySwap is not specified and the feature gate is enabled, by default the kubelet will apply the same behaviour as the UnlimitedSwap setting.
Note that NodeSwap is supported for cgroup v2 only. As of Kubernetes v1.28, using swap along with cgroup v1 is no longer supported.
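Because NodeSwap depends on cgroup v2, it can be worth checking which cgroup version a node uses before enabling the feature. The following is a minimal sketch, not part of Kubernetes: the helper name cgroup_v2_mounted is hypothetical, and it assumes the unified hierarchy is mounted at /sys/fs/cgroup, which is the common layout but not guaranteed on every distribution.

```python
def cgroup_v2_mounted(mounts_text: str, mount_point: str = "/sys/fs/cgroup") -> bool:
    """Return True if mounts_text (in /proc/mounts format) shows a cgroup2
    filesystem mounted at mount_point (assumed default: /sys/fs/cgroup)."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[1] == mount_point and fields[2] == "cgroup2":
            return True
    return False
```

On a node, this could be called as cgroup_v2_mounted(open("/proc/mounts").read()). A node running cgroup v1 or a hybrid layout would return False, because the cgroup2 filesystem is then absent or mounted elsewhere.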
Install a swap-enabled cluster with kubeadm
Before you begin
This demo requires the kubeadm tool to be installed, following the steps outlined in the kubeadm installation guide. If swap is already enabled on the node, you can proceed to cluster creation. If swap is not enabled, follow the instructions below to enable it.
Create a swap file and turn swap on
I'll demonstrate creating 4GiB of unencrypted swap.
dd if=/dev/zero of=/swapfile bs=128M count=32
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile # enable the swap file only until this node is rebooted
swapon -s # show a summary of the active swap devices
To start the swap file at boot time, add a line like /swapfile swap swap defaults 0 0 to the /etc/fstab file.
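To confirm that the swap file is active before creating the cluster, one option is to read the swap counters from /proc/meminfo. The sketch below is illustrative only; the function name swap_from_meminfo is hypothetical, and it takes the file contents as a string so that it can be tried on sample text.

```python
def swap_from_meminfo(meminfo_text: str) -> dict:
    """Extract SwapTotal and SwapFree (in KiB) from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            # Lines look like "SwapTotal:  4194300 kB"; the first token is the size.
            values[key] = int(rest.split()[0])
    return values
```

On a node, swap_from_meminfo(open("/proc/meminfo").read()) should report a non-zero SwapTotal once swapon has succeeded.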
Set up a Kubernetes cluster that uses swap-enabled nodes
To make things clearer, here is an example kubeadm configuration file kubeadm-config.yaml for the swap-enabled cluster.
---
apiVersion: "kubeadm.k8s.io/v1beta3"
kind: InitConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
featureGates:
  NodeSwap: true
memorySwap:
  swapBehavior: LimitedSwap
Then create a single-node cluster using kubeadm init --config kubeadm-config.yaml.
During init, kubeadm prints a warning that swap is enabled on the node, along with a note about the case where the kubelet failSwapOn setting is set to true. We plan to remove this warning in a future release.
How is the swap limit being determined with LimitedSwap?
The configuration of swap memory, including its limitations, presents a significant challenge. Not only is it prone to misconfiguration, but as a system-level property, any misconfiguration could potentially compromise the entire node rather than just a specific workload. To mitigate this risk and ensure the health of the node, we have implemented Swap in Beta with automatic configuration of limitations.
With LimitedSwap, Pods that do not fall under the Burstable QoS classification (i.e. BestEffort/Guaranteed QoS Pods) are prohibited from utilizing swap memory. BestEffort QoS Pods exhibit unpredictable memory consumption patterns and lack information regarding their memory usage, making it difficult to determine a safe allocation of swap memory. Conversely, Guaranteed QoS Pods are typically employed for applications that rely on the precise allocation of resources specified by the workload, with memory being immediately available. To maintain the aforementioned security and node health guarantees, these Pods are not permitted to use swap memory when LimitedSwap is in effect.
Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:
- nodeTotalMemory: The total amount of physical memory available on the node.
- totalPodsSwapAvailable: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).
- containerMemoryRequest: The container's memory request.
Swap limitation is configured as:
(containerMemoryRequest / nodeTotalMemory) × totalPodsSwapAvailable
In other words, the amount of swap that a container is able to use is proportional to its memory request as a share of the node's total physical memory, scaled by the total amount of swap memory on the node that is available for use by Pods.
It is important to note that, for containers within Burstable QoS Pods, it is possible to opt-out of swap usage by specifying memory requests that are equal to memory limits. Containers configured in this manner will not have access to swap memory.
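The calculation and the opt-out rule described above can be sketched as a small function. This is an illustration under stated assumptions, not the kubelet's actual code: the function name limited_swap_bytes and its parameter names are hypothetical, all quantities are in bytes, and rounding down with integer arithmetic is an assumption.

```python
GIB = 1024 ** 3


def limited_swap_bytes(
    qos_class: str,
    container_memory_request: int,
    container_memory_limit,
    node_total_memory: int,
    total_pods_swap_available: int,
) -> int:
    """Estimate the swap limit (bytes) for one container under LimitedSwap."""
    # Only containers in Burstable QoS Pods may use swap.
    if qos_class != "Burstable":
        return 0
    # A container whose memory request equals its memory limit opts out of swap.
    if container_memory_limit is not None and container_memory_request == container_memory_limit:
        return 0
    # (containerMemoryRequest / nodeTotalMemory) x totalPodsSwapAvailable,
    # computed with integers and rounded down (the rounding is an assumption).
    return container_memory_request * total_pods_swap_available // node_total_memory
```

For example, on a node with 64 GiB of physical memory and 4 GiB of swap available to Pods, a Burstable container that requests 8 GiB of memory and sets no equal limit would get (8 / 64) × 4 GiB = 0.5 GiB of swap: limited_swap_bytes("Burstable", 8 * GIB, None, 64 * GIB, 4 * GIB) returns 536870912.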
How does it work?
There are a number of possible ways that one could envision swap use on a node. When swap is already provisioned and available on a node, SIG Node have proposed the kubelet should be able to be configured so that:
- It can start with swap on.
- It will direct the Container Runtime Interface to allocate zero swap memory to Kubernetes workloads by default.
Swap configuration on a node is exposed to a cluster admin via the memorySwap field in the KubeletConfiguration. As a cluster administrator, you can specify the node's behaviour in the presence of swap memory by setting memorySwap.swapBehavior.
The kubelet employs the CRI (container runtime interface) API to direct the container runtime to configure specific cgroup v2 parameters (such as memory.swap.max) in a manner that will enable the desired swap configuration for a container. The container runtime is then responsible for writing these settings to the container-level cgroup.
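To check what ended up in a container's cgroup, you can read its memory.swap.max file. In cgroup v2 this file holds either the literal string max (no limit) or a byte count. The helper below is a minimal sketch with a hypothetical name, parse_swap_max; it takes the file contents rather than a path, because the location of a container's cgroup directory depends on the cgroup driver and container runtime and is not specified here.

```python
from typing import Optional


def parse_swap_max(content: str) -> Optional[int]:
    """Interpret the contents of a cgroup v2 memory.swap.max file.

    Returns None when the file holds "max" (no swap limit),
    otherwise the limit in bytes (0 means the cgroup may not use swap).
    """
    value = content.strip()
    return None if value == "max" else int(value)
```

Under LimitedSwap, a container that is not allowed to swap would be expected to show 0 here, and a Burstable container a value matching the calculated limit.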
How can I monitor swap?
A notable deficiency in the Alpha version was the inability to monitor and introspect swap usage. This issue has been addressed in the Beta version introduced in Kubernetes 1.28, which now provides the capability to monitor swap usage through several different methods.
The beta version of kubelet now collects node-level metric statistics, which can be accessed at the /metrics/resource and /stats/summary kubelet HTTP endpoints. This allows clients who can directly interrogate the kubelet to monitor swap usage and remaining swap memory when using LimitedSwap. Additionally, a machine_swap_bytes metric has been added to cadvisor to show the total physical swap capacity of the machine.
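As an illustration of consuming the /stats/summary endpoint, the sketch below extracts node-level swap figures from an already-decoded JSON response. The function name node_swap_stats is hypothetical, and the field names (node.swap.swapUsageBytes and node.swap.swapAvailableBytes) are assumptions based on my reading of the kubelet Summary API, so verify them against the output of your own kubelet. One way to obtain the response is through the API server proxy, for example kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary".

```python
def node_swap_stats(summary: dict) -> dict:
    """Pull node-level swap figures out of a decoded /stats/summary response.

    Field names are assumptions based on the kubelet Summary API;
    missing fields are reported as None rather than raising.
    """
    swap = summary.get("node", {}).get("swap", {})
    return {
        "usage_bytes": swap.get("swapUsageBytes"),
        "available_bytes": swap.get("swapAvailableBytes"),
    }
```

Returning None for missing fields keeps the helper usable against nodes where swap statistics are not reported.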
Caveats
Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.
The performance of a node with swap memory enabled depends on the underlying physical storage. When swap memory is in use, performance will be significantly worse in an I/O operations per second (IOPS) constrained environment, such as a cloud VM with I/O throttling, when compared to faster storage mediums like solid-state drives or NVMe.
As such, we do not advocate the utilization of swap memory for workloads or environments that are subject to performance constraints. Furthermore, it is recommended to employ LimitedSwap, as this significantly mitigates the risks posed to the node.
Cluster administrators and developers should benchmark their nodes and applications before using swap in production scenarios, and we need your help with that!
Security risk
Enabling swap on a system without encryption poses a security risk, as critical information, such as volumes that represent Kubernetes Secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain this confidential data. To mitigate this risk, the Kubernetes project strongly recommends that you encrypt your swap space. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. It is the administrator's responsibility to provision encrypted swap to mitigate this risk.
Furthermore, as previously mentioned, with LimitedSwap the user has the option to completely disable swap usage for a container by specifying memory requests that are equal to memory limits. This will prevent the corresponding containers from accessing swap memory.
Looking ahead
The Kubernetes 1.28 release introduced Beta support for swap memory on Linux nodes, and we will continue to work towards general availability for this feature. I hope that this will include:
- Adding the ability to set a system-reserved quantity of swap from what kubelet detects on the host.
- Adding support for controlling swap consumption at the Pod level via cgroups.
  - This point is still under discussion.
- Collecting feedback from test use cases.
  - We will consider introducing new configuration modes for swap, such as a node-wide swap limit for workloads.
How can I learn more?
You can review the current documentation for using swap with Kubernetes.
For more information, and to assist with testing and provide feedback, please see KEP-2400 and its design proposal.
How do I get involved?
Your feedback is always welcome! SIG Node meets regularly and can be reached via Slack (channel #sig-node), or the SIG's mailing list. A Slack channel dedicated to swap is also available at #sig-node-swap.
Feel free to reach out to me, Itamar Holder (@iholder101 on Slack and GitHub) if you'd like to help or ask further questions.