Host FS Plan for K8S Cluster
A Kubernetes node-friendly FS layout that follows the FHS spirit, but also reflects how kubeadm-style clusters and container runtimes actually behave in the wild.
🔹 Core node layout

| Path | Role | Notes |
|---|---|---|
| `/` | Root FS | Keep lean; don't fill it with containers or logs. |
| `/etc/kubernetes` | Control plane + kubelet config | From kubeadm (manifests, kubeconfig, certs). |
| `/var/lib/kubelet` | Kubelet state + pod sandboxes + volume mounts | This is critical; make it its own FS if you want crash isolation. |
| `/var/lib/containerd` or `/var/lib/docker` | Container runtime layers + images | Put on fast disk (NVMe/SSD) for image pull and unpack speed. |
| `/var/log/pods` | Per-pod log symlinks | Kubelet links container logs here. |
| `/var/log/containers` | Symlinks to container runtime logs | Used by logging agents (fluent-bit, promtail, etc.). |
| `/var/log` | System logs | journald, syslog, kernel. Don't let app logs flood it. |
| `/srv/nfs` or `/srv/storage` | Volumes this node exports (NFS, Gluster, Ceph gateways, etc.) | Clean separation from kubelet's internals. |
| `/data` | General bulk storage for PV backends | For CSI drivers or hostPath experiments. |
| `/backup` | Backups of etcd, manifests, configs | Keep isolated from /var. |
🔹 Mount strategy (fstab style)
Example /etc/fstab for a worker:
# Root + boot
UUID=... / xfs defaults,noatime 0 1
UUID=... /boot ext4 defaults 0 2
# Separate FS for kubelet (pods, volumes)
UUID=... /var/lib/kubelet xfs defaults 0 2
# Separate FS for container runtime
UUID=... /var/lib/containerd xfs defaults 0 2
# Log partition
UUID=... /var/log xfs defaults,nodev,noexec,nosuid 0 2
# Data partition for PV backends
UUID=... /data xfs defaults 0 2
# Backups (etcd snapshots, configs)
UUID=... /backup xfs defaults,noatime 0 2
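
Before relying on these entries, it's worth confirming they actually mount. A minimal sketch, assuming a hypothetical /dev/nvme0n1p3 backs the kubelet filesystem (adjust device names and repeat per mount):

```bash
# Create the filesystem and its mount point (device name is an example)
mkfs.xfs -L k8s-kubelet /dev/nvme0n1p3
mkdir -p /var/lib/kubelet

# Grab the UUID for the fstab entry
blkid /dev/nvme0n1p3

# Mount everything from fstab now, and sanity-check fstab before the next reboot
mount -a
findmnt --verify
findmnt /var/lib/kubelet
```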
🔹 Why this helps

Blast radius control
- If `/var/lib/kubelet` fills up (e.g. stuck PVs), it won't choke `/`.
- If `/var/log` fills up, kubelet still runs.

Performance
- Container images (`/var/lib/containerd`) on SSD → faster pulls & launches.
- PV backends (`/data`) on slower disks are fine.

Ops clarity
- `/srv` → for things exported by the node (if you run an NFS server or Ceph).
- `/backup` → easy scripting, obvious intent.
🔹 Control plane nodes (extra)

- `/var/lib/etcd` → etcd database.
  - Put this on fast, durable disk (low fsync latency).
  - Often its own volume/partition so noisy workloads don't spike etcd I/O.
- `/etc/kubernetes/pki` → cluster certs.
  - Small, but back it up.
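
A minimal backup sketch for those two items, assuming kubeadm's default cert paths and a local /backup filesystem (adjust endpoints and paths to your cluster):

```bash
# Snapshot etcd to /backup (etcdctl v3 API; cert paths are kubeadm defaults)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save "/backup/etcd-$(date +%F).db"

# Archive the cluster PKI alongside it
tar czf "/backup/k8s-pki-$(date +%F).tar.gz" /etc/kubernetes/pki
```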
✅ TL;DR:
- `/var/lib/kubelet` and `/var/lib/containerd` → dedicated FS.
- `/var/log` → separate FS with noexec/nodev/nosuid.
- `/var/lib/etcd` (control plane only) → its own fast FS.
- `/data` and `/srv` → your playground for persistent volumes and service exports.
🔹 Key directories & their SELinux types
RHEL/Kubernetes gotcha: splitting these paths onto separate partitions means a freshly created filesystem can come up without the expected SELinux labels (`system_u:object_r:container_file_t:s0`, etc.). That can break kubelet, containerd, or logging.
The fix is to assign fcontext rules so mounts inherit the right labels:
| Directory | Purpose | Expected SELinux type |
|---|---|---|
| `/var/lib/kubelet` | Pod dirs, volumes | `container_file_t` |
| `/var/lib/containerd` or `/var/lib/docker` | Images, layers | `container_var_lib_t` (RHEL 8/9), sometimes `container_file_t` |
| `/var/lib/etcd` | etcd DB | `etcd_var_lib_t` |
| `/var/log/containers` | Symlinks to container logs | `container_log_t` |
| `/var/log/pods` | Per-pod log dirs | `container_log_t` |
| `/var/log` (generic system logs) | journald, syslog | `var_log_t` |
| `/srv/nfs` (if exporting) | NFS data | `public_content_rw_t` (or `nfs_t` for exports) |
| `/data` (CSI/PV backends) | App volumes | Usually `container_file_t` if kubelet uses it directly |
Sizing
A capacity planning sketch for a generic (control or worker) Kubernetes node given 1 TB total disk to allocate per node:
🔹 Kubernetes Node Disk Allocation (1 TB total)

| Mount point | Size (GB) | % of total | Notes |
|---|---|---|---|
| `/` (root) | 50–75 | ~7% | OS, packages, /etc, system libs. Keep lean. |
| `/var/lib/kubelet` | 200 | 20% | Pod sandboxes, ephemeral volumes, secrets/configs. Needs breathing room. |
| `/var/lib/containerd` | 300 | 30% | Container images & unpacked layers. Image-heavy clusters chew disk here. |
| `/var/lib/etcd` | 50 | ~5% | Control-plane only. Needs low latency, not huge size. |
| `/var/log` | 50–75 | ~7% | System + container logs. With log rotation, 50–75 GB is comfortable. |
| `/data` | 250–300 | 25–30% | Bulk storage for PersistentVolumes, NFS-backed paths, testing hostPath. |
| `/backup` | 50–75 | ~7% | Etcd snapshots, configs, small dataset archives. |

🔹 Why these sizes

- Root (`/`): Modern RHEL installs with GNOME and full tools can bloat past 20 GB. 50 GB gives you a buffer but avoids waste.
- `/var/lib/kubelet`:
  - Pods mount emptyDirs, configMaps, secrets: all of it lives here.
  - Bursty workloads (CI/CD, batch jobs) fill it quickly. 200 GB is safe.
- `/var/lib/containerd`:
  - Pulling large images (e.g. AI/ML or Java stacks) eats disk fast.
  - If you keep multiple versions/tags, you want headroom. 300 GB is a healthy balance.
- `/var/lib/etcd`:
  - Each etcd member stores a compacted history. Even large clusters rarely need >20 GB.
  - The real requirement is low fsync latency: give it SSD/NVMe if possible.
- `/var/log`:
  - Journal logs + kubelet/containerd logs.
  - With logrotate or fluent-bit shipping, 50–75 GB is safe.
- `/data`:
  - Largest flexible bucket.
  - Good for app PVs, experimental workloads, or serving NFS.
- `/backup`:
  - Keeps etcd snapshots & config archives separate.
  - If you offload backups elsewhere (NAS, object store), 50 GB is plenty.
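
To see how a node is actually tracking against these budgets, a quick check sketch (assumes crictl is configured against the node's CRI socket; `<node-name>` is a placeholder):

```bash
# Usage per planned mount point
df -h / /var/lib/kubelet /var/lib/containerd /var/log /data /backup

# What the container runtime reports for its image filesystem
crictl imagefsinfo

# Whether the kubelet already sees disk pressure (run wherever kubectl is configured)
kubectl describe node <node-name> | grep -i diskpressure
```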
The kubelet and the container runtime are really greedy about disk, especially once a cluster is busy. Here's why they need so much breathing room:
🔹 /var/lib/kubelet (pod sandbox + ephemeral volumes)

- Every pod gets a "sandbox" directory under here.
- emptyDir volumes all live on the node's disk under `/var/lib/kubelet/pods/.../volumes/...`. Think CI jobs unpacking tarballs, ML jobs writing scratch data, etc.
- Secrets and ConfigMaps get materialized here too (lots of small files).
- If a pod crashes and restarts, kubelet may keep the old dirs until garbage collection runs.

👉 On a busy node, this fills up shockingly fast, hence giving it 150–200 GB is sane.
🔹 /var/lib/containerd (image storage + layers)

- Each image you pull gets unpacked into multiple layers under here.
- Multiple tags of the same base image = more layers.
- Even after a container exits, old layers stay around until GC purges them.
- Large images (e.g. AI/ML with CUDA, or Java stacks) can be 5–10 GB each. Multiply by dozens of apps and versions and you reach hundreds of GB easily.

👉 250–400 GB is very normal in real-world clusters with active pipelines.
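
When either directory starts filling up, a couple of commands make it obvious where the space went and reclaim what the runtime no longer needs (assumes crictl talks to containerd's CRI socket):

```bash
# Which pod sandboxes are eating kubelet's disk?
du -sh /var/lib/kubelet/pods/* 2>/dev/null | sort -rh | head

# What images are cached, and how big are they?
crictl images

# Remove images not referenced by any container
crictl rmi --prune
```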
🔹 Why it bites ops

- If `/var/lib/containerd` or `/var/lib/kubelet` fills up, kubelet goes into image GC or eviction mode. That means pods get killed to free space.
- Worse, if it fills root (`/`) because you didn't split partitions, the node can hard crash (read-only file system remounts).
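
You can also make the kubelet clean up and evict earlier, before a filesystem is critically full. A sketch of the relevant KubeletConfiguration knobs (thresholds are illustrative; merge them into the existing /var/lib/kubelet/config.yaml by hand rather than appending blindly):

```bash
# Illustrative KubeletConfiguration fields, written to a scratch file for manual merging
cat <<'EOF' > /tmp/kubelet-gc-snippet.yaml
# Image GC starts at 80% imagefs usage and frees space down to 70%
imageGCHighThresholdPercent: 80
imageGCLowThresholdPercent: 70
# Hard eviction: kill pods before nodefs/imagefs run completely dry
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "15%"
EOF

# After merging the snippet into /var/lib/kubelet/config.yaml:
systemctl restart kubelet
```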
🔹 Real-world anecdotes

- GitLab CI/CD runners in Kubernetes constantly pull different images for pipelines. Nodes without big `/var/lib/containerd` partitions churned through disk in hours.
- ML workloads pulling PyTorch/TensorFlow images (10–15 GB each) plus checkpoints in emptyDir: 200 GB per node vanished almost overnight.
- Default `/` partition only: kubelet crashes because journald + container images + pods fight for the same disk.
✅ That's why in your 1 TB plan, giving ~50% of the disk (500 GB) to kubelet + containerd combined is advised. It's not waste, it's survival.
🔹 Example partitioning table
Single 1TB physical disk (sda)
/dev/sda1 50G / (xfs)
/dev/sda2 200G /var/lib/kubelet (xfs)
/dev/sda3 300G /var/lib/containerd (xfs)
/dev/sda4 50G /var/lib/etcd (xfs) # control-plane only
/dev/sda5 75G /var/log (xfs)
/dev/sda6 250G /data (xfs)
/dev/sda7 75G /backup (xfs)
🔹 Variations

- Workers only: drop `/var/lib/etcd` and give that 50 GB to `/data`.
- Control-plane only: keep `/var/lib/etcd` small but fast.
- Storage-heavy nodes: bias more towards `/data` (e.g. 400 GB) if you host PVs directly.
- Image-heavy CI/CD nodes: increase `/var/lib/containerd` up to 400 GB.
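
If you expect these ratios to drift over a node's lifetime, carving the disk with LVM instead of fixed partitions keeps rebalancing cheap. A sketch, assuming a hypothetical data disk /dev/sdb and the worker sizes above:

```bash
# One volume group over the disk, one logical volume per mount point
pvcreate /dev/sdb
vgcreate vg_k8s /dev/sdb
lvcreate -L 200G -n kubelet    vg_k8s
lvcreate -L 300G -n containerd vg_k8s
lvcreate -L 75G  -n log        vg_k8s
lvcreate -L 250G -n data       vg_k8s
mkfs.xfs /dev/vg_k8s/kubelet        # repeat for each LV, then add fstab entries

# Rebalance later without repartitioning (-r grows the XFS filesystem too)
lvextend -r -L +100G /dev/vg_k8s/containerd
```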
🔹 SELinux: Setting persistent mappings
Example fcontext rules
# Kubelet
semanage fcontext -a -t container_file_t "/var/lib/kubelet(/.*)?"
# Container runtime (containerd)
semanage fcontext -a -t container_var_lib_t "/var/lib/containerd(/.*)?"
# Docker alternative
semanage fcontext -a -t container_var_lib_t "/var/lib/docker(/.*)?"
# etcd DB
semanage fcontext -a -t etcd_var_lib_t "/var/lib/etcd(/.*)?"
# Pod & container logs
semanage fcontext -a -t container_log_t "/var/log/containers(/.*)?"
semanage fcontext -a -t container_log_t "/var/log/pods(/.*)?"
# PV backends
semanage fcontext -a -t container_file_t "/data(/.*)?"
# Service exports
semanage fcontext -a -t public_content_rw_t "/srv/nfs(/.*)?"
Apply them
restorecon -Rv /var/lib/kubelet
restorecon -Rv /var/lib/containerd
restorecon -Rv /var/lib/etcd
restorecon -Rv /var/log/containers
restorecon -Rv /var/log/pods
restorecon -Rv /data
restorecon -Rv /srv/nfs
🔹 Verify labels
ls -Zd /var/lib/kubelet
ls -Zd /var/lib/containerd
ls -Zd /var/log/containers
Example output:
drwx------. root root system_u:object_r:container_file_t:s0 /var/lib/kubelet
🔹 Why this matters

- Without these, a new FS mounted at `/var/lib/kubelet` could inherit `default_t` or `var_lib_t`, and then kubelet fails to start pods with AVC denials.
- Same for container logs: if they're not `container_log_t`, your log collector (fluent-bit, promtail) might get blocked.
- With fcontext rules, SELinux auto-applies the right labels after every reboot/remount.
✅ Best practice on RHEL-based Kubernetes nodes:
Always run semanage fcontext + restorecon after introducing new partitions for kubelet, containerd, etcd, or PV backends.
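
If pods or the log shipper still misbehave after a new mount, check whether SELinux is the culprit before debugging anything else. A quick sketch (ausearch needs auditd running; setroubleshoot is optional):

```bash
# Recent AVC denials, if any
ausearch -m AVC -ts recent

# Human-readable explanations, when setroubleshoot is installed
journalctl -t setroubleshoot --since "1 hour ago"

# Spot-check the labels on the new mounts
ls -Zd /var/lib/kubelet /var/lib/containerd /data
```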
A clean provision script to run on RHEL-based Kubernetes nodes to set all the right SELinux fcontext mappings in one go.
It's idempotent:
- If a `semanage` rule already exists, it won't duplicate it (it falls back to modifying the existing rule).
- It runs `restorecon` afterwards to apply labels immediately.

#!/usr/bin/env bash
#
# provision-selinux-k8s.sh
# Ensure SELinux contexts are correct for Kubernetes node directories.
#
# RHEL / CentOS / Rocky / Alma / Fedora compatible.

set -euo pipefail

# Check for semanage
if ! command -v semanage >/dev/null 2>&1; then
  echo "ERROR: semanage not found. Install policycoreutils-python-utils (RHEL 8/9)."
  exit 1
fi

# Add a rule if it is new, otherwise modify the existing one.
# ('semanage fcontext -a' fails when the rule is already defined, which would
# abort the script under 'set -e'; falling back to -m keeps re-runs idempotent.)
set_fcontext() {
  local setype="$1" spec="$2"
  semanage fcontext -a -t "$setype" "$spec" 2>/dev/null \
    || semanage fcontext -m -t "$setype" "$spec"
}

echo "▶ Setting SELinux fcontext rules for Kubernetes node directories..."

# Kubelet
set_fcontext container_file_t "/var/lib/kubelet(/.*)?"

# Container runtime (containerd or docker)
set_fcontext container_var_lib_t "/var/lib/containerd(/.*)?"
set_fcontext container_var_lib_t "/var/lib/docker(/.*)?"

# etcd DB (control-plane nodes only, harmless elsewhere)
set_fcontext etcd_var_lib_t "/var/lib/etcd(/.*)?"

# Pod & container logs
set_fcontext container_log_t "/var/log/containers(/.*)?"
set_fcontext container_log_t "/var/log/pods(/.*)?"

# PV backends (generic /data volume)
set_fcontext container_file_t "/data(/.*)?"

# Service exports (if node also exports via NFS/HTTP/etc.)
set_fcontext public_content_rw_t "/srv/nfs(/.*)?"

echo "▶ Applying SELinux contexts..."
restorecon -Rv /var/lib/kubelet || true
restorecon -Rv /var/lib/containerd || true
restorecon -Rv /var/lib/docker || true
restorecon -Rv /var/lib/etcd || true
restorecon -Rv /var/log/containers || true
restorecon -Rv /var/log/pods || true
restorecon -Rv /data || true
restorecon -Rv /srv/nfs || true

echo "✅ SELinux fcontexts applied successfully."
🔹 Usage

- Save it as `provision-selinux-k8s.sh`.
- Run it once on each node (or push it via Ansible):

  sudo bash provision-selinux-k8s.sh

- Verify:
ls -Zd /var/lib/kubelet /var/lib/containerd /var/lib/etcd /var/log/containers
Ansible Playbook
A clean Ansible role to drop into a bootstrap playbook.
It uses Ansible's community.general.sefcontext and ansible.builtin.command
modules to ensure SELinux mappings are persistent and applied.
🔹 Role structure

roles/
└── selinux_fcontext_k8s/
    ├── tasks/
    │   └── main.yml
    └── meta/
        └── main.yml
🔹 tasks/main.yml
---
- name: Ensure policycoreutils-python-utils installed (RHEL 8/9)
  ansible.builtin.package:
    name: policycoreutils-python-utils
    state: present

- name: Define SELinux fcontexts for kubelet
  community.general.sefcontext:
    target: "/var/lib/kubelet(/.*)?"
    setype: container_file_t
    state: present

- name: Define SELinux fcontexts for containerd
  community.general.sefcontext:
    target: "/var/lib/containerd(/.*)?"
    setype: container_var_lib_t
    state: present

- name: Define SELinux fcontexts for docker (if used)
  community.general.sefcontext:
    target: "/var/lib/docker(/.*)?"
    setype: container_var_lib_t
    state: present

- name: Define SELinux fcontexts for etcd (control-plane only)
  community.general.sefcontext:
    target: "/var/lib/etcd(/.*)?"
    setype: etcd_var_lib_t
    state: present

- name: Define SELinux fcontexts for container logs
  community.general.sefcontext:
    target: "/var/log/containers(/.*)?"
    setype: container_log_t
    state: present

- name: Define SELinux fcontexts for pod logs
  community.general.sefcontext:
    target: "/var/log/pods(/.*)?"
    setype: container_log_t
    state: present

- name: Define SELinux fcontexts for generic data PVs
  community.general.sefcontext:
    target: "/data(/.*)?"
    setype: container_file_t
    state: present

- name: Define SELinux fcontexts for NFS exports
  community.general.sefcontext:
    target: "/srv/nfs(/.*)?"
    setype: public_content_rw_t
    state: present
- name: Restore SELinux contexts recursively
  ansible.builtin.command: restorecon -Rv {{ item }}
  loop:
    - /var/lib/kubelet
    - /var/lib/containerd
    - /var/lib/docker
    - /var/lib/etcd
    - /var/log/containers
    - /var/log/pods
    - /data
    - /srv/nfs
  register: restorecon_out
  # restorecon -Rv only prints output when it actually relabels something
  changed_when: restorecon_out.stdout | length > 0
  # directories that don't exist on this node (e.g. /var/lib/docker on a containerd host) are not fatal
  failed_when:
    - restorecon_out.rc != 0
    - "'No such file or directory' not in restorecon_out.stderr"
🔹 meta/main.yml
---
dependencies: []
🔹 Playbook example
- hosts: k8s_nodes
  become: true
  roles:
    - selinux_fcontext_k8s
✅ This ensures:
- fcontext mappings are permanent (in SELinux policy).
- contexts are immediately applied with `restorecon`.
- it works idempotently across re-runs.
Ansible tasks for systemd drop-ins
These drop-ins make the kubelet and containerd (or docker) service units depend on the correct mounts being present before they start, which ties neatly into the partitioning + SELinux scheme above. Wiring them into the Ansible workflow avoids race conditions at boot, where services fail because /var/lib/kubelet or /var/lib/containerd wasn't mounted yet.
🔹 systemd drop-in strategy

Use `ansible.builtin.template` plus a `systemctl daemon-reload` handler to create drop-ins under:
- `/etc/systemd/system/kubelet.service.d/10-requires-mounts.conf`
- `/etc/systemd/system/containerd.service.d/10-requires-mounts.conf`

These add:

[Unit]
RequiresMountsFor=/var/lib/kubelet
and similar for containerd/docker.
Systemd then ensures the mount unit is active before starting the service.
🔹 Updated role structure

roles/
└── selinux_fcontext_k8s/
    ├── tasks/
    │   ├── main.yml
    │   └── systemd.yml
    ├── templates/
    │   ├── 10-requires-mounts-kubelet.conf.j2
    │   └── 10-requires-mounts-containerd.conf.j2
    ├── handlers/
    │   └── main.yml
    └── meta/
        └── main.yml
🔹 tasks/systemd.yml
---
- name: Ensure drop-in directory for kubelet
  ansible.builtin.file:
    path: /etc/systemd/system/kubelet.service.d
    state: directory
    mode: "0755"

- name: Ensure drop-in directory for containerd
  ansible.builtin.file:
    path: /etc/systemd/system/containerd.service.d
    state: directory
    mode: "0755"

- name: Deploy kubelet mount requirement drop-in
  ansible.builtin.template:
    src: 10-requires-mounts-kubelet.conf.j2
    dest: /etc/systemd/system/kubelet.service.d/10-requires-mounts.conf
    mode: "0644"
  notify: Reload systemd

- name: Deploy containerd mount requirement drop-in
  ansible.builtin.template:
    src: 10-requires-mounts-containerd.conf.j2
    dest: /etc/systemd/system/containerd.service.d/10-requires-mounts.conf
    mode: "0644"
  notify: Reload systemd
🔹 templates/10-requires-mounts-kubelet.conf.j2
[Unit]
RequiresMountsFor=/var/lib/kubelet
RequiresMountsFor=/var/log/pods
RequiresMountsFor=/var/log/containers
🔹 templates/10-requires-mounts-containerd.conf.j2
[Unit]
RequiresMountsFor=/var/lib/containerd
(If using Docker, just change to /var/lib/docker.)
🔹 handlers/main.yml

Handlers for a role live in handlers/main.yml (a `handlers:` key inside tasks/main.yml is not picked up):

---
- name: Reload systemd
  ansible.builtin.systemd:
    daemon_reload: true

- name: Restart kubelet
  ansible.builtin.service:
    name: kubelet
    state: restarted

- name: Restart containerd
  ansible.builtin.service:
    name: containerd
    state: restarted
🔹 Playbook snippet
- hosts: k8s_nodes
  become: true
  roles:
    - selinux_fcontext_k8s
  tasks:
    - ansible.builtin.include_role:
        name: selinux_fcontext_k8s
        tasks_from: systemd.yml
✅ Now at boot:
- systemd guarantees `/var/lib/kubelet`, `/var/lib/containerd`, `/var/log/pods`, and `/var/log/containers` are mounted before kubelet/containerd start.
- Combined with SELinux fcontext setup, you get correct labeling + reliable startup.
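
A quick post-reboot sanity check tying it all together, assuming the mounts, fcontext rules, and drop-ins from this plan are in place:

```bash
# Mounts actually present?
findmnt /var/lib/kubelet
findmnt /var/lib/containerd
findmnt /var/log

# Drop-ins wired into the units?
systemctl show kubelet -p RequiresMountsFor
systemctl show containerd -p RequiresMountsFor

# Labels and services healthy?
ls -Zd /var/lib/kubelet /var/lib/containerd /var/log/pods
systemctl is-active kubelet containerd
```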