Host FS Plan for K8S Cluster

A Kubernetes node-friendly FS layout that follows the FHS spirit, but also reflects how kubeadm-style clusters and container runtimes actually behave in the wild.

🔹 Core node layout

| Path | Role | Notes |
|---|---|---|
| / | Root FS | Keep lean; don't fill it with containers or logs. |
| /etc/kubernetes | Control plane + kubelet config | From kubeadm (manifests, kubeconfig, certs). |
| /var/lib/kubelet | Kubelet state + pod sandboxes + volume mounts | This is critical; make it its own FS if you want crash isolation. |
| /var/lib/containerd or /var/lib/docker | Container runtime layers + images | Put on fast disk (NVMe/SSD) for image pull and unpack speed. |
| /var/log/pods | Per-pod log symlinks | Kubelet links container logs here. |
| /var/log/containers | Symlinks to container runtime logs | Used by logging agents (fluent-bit, promtail, etc.). |
| /var/log | System logs | journald, syslog, kernel. Don't let app logs flood it. |
| /srv/nfs or /srv/storage | If this node exports volumes (NFS, Gluster, Ceph gateways, etc.) | Clean separation from kubelet's internals. |
| /data | General bulk storage for PV backends | For CSI drivers or hostPath experiments. |
| /backup | Backups of etcd, manifests, configs | Keep isolated from /var. |

🔹 Mount strategy (fstab style)

Example /etc/fstab for a worker:

# Root + boot
UUID=...  /                  xfs  defaults,noatime  0 1
UUID=...  /boot              ext4 defaults          0 2

# Separate FS for kubelet (pods, volumes)
UUID=...  /var/lib/kubelet   xfs  defaults          0 2

# Separate FS for container runtime
UUID=...  /var/lib/containerd xfs defaults          0 2

# Log partition
UUID=...  /var/log           xfs  defaults,nodev,noexec,nosuid  0 2

# Data partition for PV backends
UUID=...  /data              xfs  defaults          0 2

# Backups (etcd snapshots, configs)
UUID=...  /backup            xfs  defaults,noatime  0 2
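
If the filesystems are being built by hand before going into fstab, a minimal sketch of wiring them up and verifying the split looks like this (device creation is assumed to have happened already; paths match the layout above):

# create the mount points, mount everything declared in /etc/fstab
mkdir -p /var/lib/kubelet /var/lib/containerd /var/log /data /backup
mount -a

# confirm each path really is its own filesystem
findmnt /var/lib/kubelet
findmnt /var/lib/containerd
df -hT /var/lib/kubelet /var/lib/containerd /var/log /data /backup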

🔹 Why this helps

- Crash isolation: a runaway pod or image pull fills /var/lib/kubelet or /var/lib/containerd without taking the root FS or /var/log down with it.
- Mount options can be targeted per concern (noatime on the runtime-heavy filesystems, nodev,noexec,nosuid on /var/log).
- Capacity problems stay local: logs, images, pod state, PV data, and backups each get their own headroom and can be monitored and grown independently.

🔹 Control plane nodes (extra)

Control plane nodes additionally carry /var/lib/etcd; give it its own low-latency filesystem (see the sizing and partition tables below) and ship etcd snapshots to /backup rather than leaving them under /var.
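
A hedged example of taking such a snapshot into /backup, assuming a kubeadm-style stacked etcd with the default certificate paths under /etc/kubernetes/pki/etcd (adjust endpoints and paths for your cluster):

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
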
✅ TL;DR:

Keep / lean; give kubelet, the container runtime, logs, PV data, and backups their own filesystems; and size the kubelet and runtime mounts generously.

🔹 Key directories & their SELinux types

RHEL/Kubernetes gotchas: Splitting things into separate partitions may result in loss of the expected SELinux labels (system_u:object_r:container_file_t:s0, etc.). That can break kubelet, containerd, or logging.

The fix is to assign fcontext rules so mounts inherit the right labels:

| Directory | Purpose | Expected SELinux type |
|---|---|---|
| /var/lib/kubelet | Pod dirs, volumes | container_file_t |
| /var/lib/containerd or /var/lib/docker | Images, layers | container_var_lib_t (RHEL 8/9), sometimes container_file_t |
| /var/lib/etcd | etcd DB | etcd_var_lib_t |
| /var/log/containers | Symlinks to container logs | container_log_t |
| /var/log/pods | Per-pod log dirs | container_log_t |
| /var/log (generic system logs) | journald, syslog | var_log_t |
| /srv/nfs (if exporting) | NFS data | public_content_rw_t (or nfs_t for exports) |
| /data (CSI/PV backends) | App volumes | Usually container_file_t if kubelet uses it directly |
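
To check what the loaded policy currently expects for a given path before defining any rules, matchpathcon (from libselinux-utils) is handy; a quick sketch:

matchpathcon /var/lib/kubelet /var/lib/containerd /var/log/containers
# list any container-related fcontext rules already defined
semanage fcontext -l | grep -E 'kubelet|containerd|container_file_t'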

Sizing

A capacity-planning sketch for a generic Kubernetes node (control plane or worker), assuming 1 TB of total disk to allocate per node:

🔹 Kubernetes Node Disk Allocation (1 TB total)

| Mount point | Size (GB) | % of total | Notes |
|---|---|---|---|
| / (root) | 50–75 | ~7% | OS, packages, /etc, system libs. Keep lean. |
| /var/lib/kubelet | 200 | 20% | Pod sandboxes, ephemeral volumes, secrets/configs. Needs breathing room. |
| /var/lib/containerd | 300 | 30% | Container images & unpacked layers. Image-heavy clusters chew disk here. |
| /var/lib/etcd | 50 | ~5% | Control-plane only. Needs low latency, not huge size. |
| /var/log | 50–75 | ~7% | System + container logs. With log rotation, 50–75 GB is comfortable. |
| /data | 250–300 | 25–30% | Bulk storage for PersistentVolumes, NFS-backed paths, testing hostPath. |
| /backup | 50–75 | ~7% | Etcd snapshots, configs, small dataset archives. |

[Figure: k8s-node-disk-allocation.webp (per-mount disk allocation breakdown)]

🔹 Why these sizes

The kubelet and the container runtime are really greedy about disk, especially once a cluster is busy. Here's why they need so much breathing room:

🔹 /var/lib/kubelet (pod sandbox + ephemeral volumes)

Every pod's sandbox plus its emptyDir volumes, projected secrets/configmaps, and ephemeral container storage land under /var/lib/kubelet/pods.

👉 On a busy node, this fills up shockingly fast, hence giving it 150–200 GB is sane.

🔹 /var/lib/containerd (image storage + layers)

Every pulled image and its unpacked snapshot layers live here, and image-heavy clusters with active CI/CD pipelines churn through them constantly.

👉 250–400 GB is very normal in real-world clusters with active pipelines.
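
A quick way to see how much of that space images actually consume, and to reclaim some of it, is crictl (the --prune flag needs a reasonably recent crictl); a sketch:

# image filesystem usage as the runtime reports it
crictl imagefsinfo

# raw on-disk usage of the runtime directory
du -sh /var/lib/containerd
df -h /var/lib/containerd

# remove images not referenced by any container
crictl rmi --prune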

🔹 Why it bites ops

🔹 Real-world anecdotes

✅ That's why in your 1 TB plan, giving ~50% of the disk (500 GB) to kubelet + containerd combined is advised. It's not waste; it's survival.


🔹 Example partitioning table

Single 1TB physical disk (sda)

/dev/sda1   50G    /                   (xfs)
/dev/sda2  200G    /var/lib/kubelet    (xfs)
/dev/sda3  300G    /var/lib/containerd (xfs)
/dev/sda4   50G    /var/lib/etcd       (xfs)   # control-plane only
/dev/sda5   75G    /var/log            (xfs)
/dev/sda6  250G    /data               (xfs)
/dev/sda7   75G    /backup             (xfs)
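
A seven-partition layout needs GPT rather than MBR. Here is a sketch of carving it onto a blank disk with parted (assumes /dev/sda is empty; boundaries are approximate, and LVM is a reasonable alternative if you want to resize later):

parted -s /dev/sda mklabel gpt
parted -s /dev/sda mkpart root xfs 1MiB 50GiB
parted -s /dev/sda mkpart kubelet xfs 50GiB 250GiB
parted -s /dev/sda mkpart containerd xfs 250GiB 550GiB
# ...continue for the etcd, log, data, and backup partitions per the table above

mkfs.xfs /dev/sda1
mkfs.xfs /dev/sda2
mkfs.xfs /dev/sda3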

🔹 Variations

🔹 SELinux: Setting persistent mappings

Example fcontext rules

# Kubelet
semanage fcontext -a -t container_file_t "/var/lib/kubelet(/.*)?"

# Container runtime (containerd)
semanage fcontext -a -t container_var_lib_t "/var/lib/containerd(/.*)?"

# Docker alternative
semanage fcontext -a -t container_var_lib_t "/var/lib/docker(/.*)?"

# etcd DB
semanage fcontext -a -t etcd_var_lib_t "/var/lib/etcd(/.*)?"

# Pod & container logs
semanage fcontext -a -t container_log_t "/var/log/containers(/.*)?"
semanage fcontext -a -t container_log_t "/var/log/pods(/.*)?"

# PV backends
semanage fcontext -a -t container_file_t "/data(/.*)?"

# Service exports
semanage fcontext -a -t public_content_rw_t "/srv/nfs(/.*)?"

Apply them

restorecon -Rv /var/lib/kubelet
restorecon -Rv /var/lib/containerd
restorecon -Rv /var/lib/etcd
restorecon -Rv /var/log/containers
restorecon -Rv /var/log/pods
restorecon -Rv /data
restorecon -Rv /srv/nfs

🔹 Verify labels

ls -Zd /var/lib/kubelet
ls -Zd /var/lib/containerd
ls -Zd /var/log/containers

Example output:

drwx------. root root system_u:object_r:container_file_t:s0 /var/lib/kubelet

🔹 Why this matters

✅ Best practice on RHEL-based Kubernetes nodes:

Always run semanage fcontext + restorecon after introducing new partitions for kubelet, containerd, etcd, or PV backends.


Below is a clean provision script for RHEL-based Kubernetes nodes that sets all the right SELinux fcontext mappings in one go.

It's idempotent: re-running it only (re)defines mappings as needed and relabels in place, so it's safe to run on every provisioning pass.
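
The script body isn't reproduced above, so here is a minimal sketch of what it could look like, assembled from the fcontext rules in the previous section (trim the path list to what actually exists on your nodes):

#!/usr/bin/env bash
# provision-selinux-k8s.sh - define and apply SELinux fcontext mappings for Kubernetes paths
set -euo pipefail

# path-regex -> SELinux type pairs, matching the table above
declare -A FCONTEXTS=(
  ["/var/lib/kubelet(/.*)?"]="container_file_t"
  ["/var/lib/containerd(/.*)?"]="container_var_lib_t"
  ["/var/lib/docker(/.*)?"]="container_var_lib_t"
  ["/var/lib/etcd(/.*)?"]="etcd_var_lib_t"
  ["/var/log/containers(/.*)?"]="container_log_t"
  ["/var/log/pods(/.*)?"]="container_log_t"
  ["/data(/.*)?"]="container_file_t"
  ["/srv/nfs(/.*)?"]="public_content_rw_t"
)

for pattern in "${!FCONTEXTS[@]}"; do
  setype="${FCONTEXTS[$pattern]}"
  # -a adds the rule; if it already exists, fall back to -m so re-runs stay idempotent
  semanage fcontext -a -t "$setype" "$pattern" 2>/dev/null \
    || semanage fcontext -m -t "$setype" "$pattern"
done

# relabel only the directories that actually exist on this node
for dir in /var/lib/kubelet /var/lib/containerd /var/lib/docker /var/lib/etcd \
           /var/log/containers /var/log/pods /data /srv/nfs; do
  if [ -d "$dir" ]; then
    restorecon -Rv "$dir"
  fi
done
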
🔹 Usage

  1. Save it as provision-selinux-k8s.sh.
  2. Run once on each node (or push via Ansible):

    sudo bash provision-selinux-k8s.sh
    
  3. Verify:

    ls -Zd /var/lib/kubelet /var/lib/containerd /var/lib/etcd /var/log/containers
    

Ansible Playbook

Here is a clean Ansible role to drop into a bootstrap playbook. It uses the community.general.sefcontext and ansible.builtin.command modules to ensure SELinux mappings are persistent and applied.

🔹 Role structure

roles/
└── selinux_fcontext_k8s/
    ├── tasks/
    │   └── main.yml
    └── meta/
        └── main.yml

🔹 tasks/main.yml

---
- name: Ensure policycoreutils-python-utils installed (RHEL 8/9)
  ansible.builtin.package:
    name: policycoreutils-python-utils
    state: present

- name: Define SELinux fcontexts for kubelet
  community.general.sefcontext:
    target: "/var/lib/kubelet(/.*)?"
    setype: container_file_t
    state: present

- name: Define SELinux fcontexts for containerd
  community.general.sefcontext:
    target: "/var/lib/containerd(/.*)?"
    setype: container_var_lib_t
    state: present

- name: Define SELinux fcontexts for docker (if used)
  community.general.sefcontext:
    target: "/var/lib/docker(/.*)?"
    setype: container_var_lib_t
    state: present

- name: Define SELinux fcontexts for etcd (control-plane only)
  community.general.sefcontext:
    target: "/var/lib/etcd(/.*)?"
    setype: etcd_var_lib_t
    state: present

- name: Define SELinux fcontexts for container logs
  community.general.sefcontext:
    target: "/var/log/containers(/.*)?"
    setype: container_log_t
    state: present

- name: Define SELinux fcontexts for pod logs
  community.general.sefcontext:
    target: "/var/log/pods(/.*)?"
    setype: container_log_t
    state: present

- name: Define SELinux fcontexts for generic data PVs
  community.general.sefcontext:
    target: "/data(/.*)?"
    setype: container_file_t
    state: present

- name: Define SELinux fcontexts for NFS exports
  community.general.sefcontext:
    target: "/srv/nfs(/.*)?"
    setype: public_content_rw_t
    state: present

- name: Restore SELinux contexts recursively
  ansible.builtin.command: restorecon -Rv {{ item }}
  loop:
    - /var/lib/kubelet
    - /var/lib/containerd
    - /var/lib/docker
    - /var/lib/etcd
    - /var/log/containers
    - /var/log/pods
    - /data
    - /srv/nfs
  register: restorecon_out
  # restorecon -Rv prints a line per relabeled file, so an empty stdout means nothing changed
  changed_when: restorecon_out.stdout | length > 0
  # paths absent on a given node (e.g. /var/lib/docker, or /var/lib/etcd on workers) shouldn't fail the run
  failed_when:
    - restorecon_out.rc != 0
    - "'No such file or directory' not in restorecon_out.stderr"

🔹 meta/main.yml

---
dependencies: []

🔹 Playbook example

- hosts: k8s_nodes
  become: true
  roles:
    - selinux_fcontext_k8s
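
Running it is the usual ansible-playbook invocation; inventory.ini and site.yml are placeholder names for whatever your bootstrap playbook actually uses:

ansible-playbook -i inventory.ini site.yml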

✅ This ensures the fcontext mappings are defined persistently in SELinux policy and the labels are actually applied on disk, on every node in the k8s_nodes group.

Ansible tasks for systemd drop-ins

Tying systemd drop-ins into the Ansible workflow makes the kubelet and containerd (or docker) service units depend on their required mount points being present before they start, which fits neatly with this partitioning + SELinux scheme. It avoids race conditions at boot, where a service fails because /var/lib/kubelet or /var/lib/containerd wasn't mounted yet.

🔹 systemd drop-in strategy

Each service gets a drop-in that declares RequiresMountsFor= for the paths it needs (see the kubelet template below), and similar for containerd/docker.

Systemd then ensures the mount unit is active before starting the service.

🔹 Updated role structure

roles/
└── selinux_fcontext_k8s/
    ├── tasks/
    │   ├── main.yml
    │   └── systemd.yml
    ├── handlers/
    │   └── main.yml
    ├── templates/
    │   ├── 10-requires-mounts-kubelet.conf.j2
    │   └── 10-requires-mounts-containerd.conf.j2
    └── meta/
        └── main.yml

🔹 tasks/systemd.yml

---
- name: Ensure drop-in directory for kubelet
  ansible.builtin.file:
    path: /etc/systemd/system/kubelet.service.d
    state: directory
    mode: "0755"

- name: Ensure drop-in directory for containerd
  ansible.builtin.file:
    path: /etc/systemd/system/containerd.service.d
    state: directory
    mode: "0755"

- name: Deploy kubelet mount requirement drop-in
  ansible.builtin.template:
    src: 10-requires-mounts-kubelet.conf.j2
    dest: /etc/systemd/system/kubelet.service.d/10-requires-mounts.conf
    mode: "0644"
  notify: Reload systemd

- name: Deploy containerd mount requirement drop-in
  ansible.builtin.template:
    src: 10-requires-mounts-containerd.conf.j2
    dest: /etc/systemd/system/containerd.service.d/10-requires-mounts.conf
    mode: "0644"
  notify: Reload systemd

🔹 templates/10-requires-mounts-kubelet.conf.j2

[Unit]
RequiresMountsFor=/var/lib/kubelet
RequiresMountsFor=/var/log/pods
RequiresMountsFor=/var/log/containers

🔹 templates/10-requires-mounts-containerd.conf.j2

[Unit]
RequiresMountsFor=/var/lib/containerd

(If using Docker, just change to /var/lib/docker.)

🔹 handlers/main.yml

In a role, handlers live in handlers/main.yml (not in tasks/main.yml) so the notify calls above can find them:

---
- name: Reload systemd
  ansible.builtin.systemd:
    daemon_reload: true

- name: Restart kubelet
  ansible.builtin.service:
    name: kubelet
    state: restarted

- name: Restart containerd
  ansible.builtin.service:
    name: containerd
    state: restarted

🔹 Playbook snippet

- hosts: k8s_nodes
  become: true
  roles:
    - selinux_fcontext_k8s
  tasks:
    - name: Apply systemd mount drop-ins
      ansible.builtin.include_role:
        name: selinux_fcontext_k8s
        tasks_from: systemd.yml

✅ Now at boot, systemd waits for the /var/lib/kubelet and /var/lib/containerd mounts (and the log mounts) before starting kubelet and containerd, so the services no longer race their filesystems.
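
To confirm the drop-ins took effect on a node, something like this works (systemctl cat prints the merged unit including drop-ins):

systemctl daemon-reload
systemctl cat kubelet | grep -A4 '10-requires-mounts'
systemctl show kubelet -p RequiresMountsFor
systemctl show containerd -p RequiresMountsFor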