
emptyDir with medium: Memory mounts a tmpfs volume without nosuid,nodev,noexec #48912

Open
nrvnrvn opened this issue Jul 13, 2017 · 47 comments
Labels
area/security kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@nrvnrvn

nrvnrvn commented Jul 13, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:

$ kubectl exec -ti demo-1986931840-cxt6m -- sh
/ # mount | grep tmpfs
tmpfs on /dev type tmpfs (rw,nosuid,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,relatime,mode=755)
tmpfs on /tmp type tmpfs (rw,relatime)  # mounted using emptyDir
tmpfs on /var/run type tmpfs (rw,relatime)  # mounted using emptyDir
tmpfs on /var/run/secrets/kubernetes.io/serviceaccount type tmpfs (ro,relatime)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,mode=755)
tmpfs on /proc/timer_stats type tmpfs (rw,nosuid,mode=755)

What you expected to happen:

$ docker run --rm --read-only --tmpfs /tmp debian:9 mount | grep "nosuid,nodev,noexec" | grep tmpfs
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,relatime,mode=755)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noexec,relatime)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
It is recommended to mount tmpfs with nosuid,noexec,nodev options.
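This recommendation is easy to check mechanically against the mount listing shown above. A minimal Go sketch (the helper name missingFlags is invented here) that inspects the comma-separated options field of a /proc/mounts line:

```go
package main

import (
	"fmt"
	"strings"
)

// hardeningFlags are the tmpfs mount options recommended in this report.
var hardeningFlags = []string{"nosuid", "nodev", "noexec"}

// missingFlags reports which recommended flags are absent from a mount's
// option string as it appears in /proc/mounts (4th field, comma-separated).
func missingFlags(opts string) []string {
	present := map[string]bool{}
	for _, o := range strings.Split(opts, ",") {
		present[o] = true
	}
	var missing []string
	for _, f := range hardeningFlags {
		if !present[f] {
			missing = append(missing, f)
		}
	}
	return missing
}

func main() {
	// Options copied from the emptyDir-backed /tmp mount above.
	fmt.Println(missingFlags("rw,relatime")) // [nosuid nodev noexec]
	// Options from the plain docker --tmpfs mount shown below.
	fmt.Println(missingFlags("rw,nosuid,nodev,noexec,relatime")) // []
}
```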

Environment:

  • Kubernetes version (use kubectl version): Client Version: v1.7.0 Server Version: v1.6.4
  • Cloud provider or hardware configuration: minikube
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 13, 2017
@k8s-github-robot

@nicorevin There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
e.g., @kubernetes/sig-api-machinery-* for API Machinery
(2) specifying the label manually: /sig <label>
e.g., /sig scalability for sig/scalability

Note: method (1) will trigger a notification to the team. You can find the team list here and label list here

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 13, 2017
@nrvnrvn
Author

nrvnrvn commented Jul 13, 2017

@kubernetes/sig-storage-misc

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jul 13, 2017
@k8s-ci-robot
Contributor

@nicorevin: Reiterating the mentions to trigger a notification:
@kubernetes/sig-storage-misc.

In response to this:

@kubernetes/sig-storage-misc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 13, 2017
@wongma7
Contributor

wongma7 commented Jul 14, 2017

I don't follow why this is needed. emptyDir creates a new tmpfs "device" from nothing by doing mount -t tmpfs tmpfs <dir>, and the only thing that then has access to that device is the pod. I think the advice about "noexec" etc. is about /tmp specifically, not tmpfs in general.

@nrvnrvn
Author

nrvnrvn commented Jul 14, 2017

@wongma7 then it is Docker that is wrong. Try docker run --tmpfs /lol smth

@wongma7
Contributor

wongma7 commented Jul 15, 2017

Good point, I dug up the issue that led to that: moby/moby#12143. I'm hardly qualified to talk about security issues, so I'll defer to the reviewers :). The only thing I'll add for the reviewers to consider is that the docker options are easily overridable, whereas an emptyDir user would be stuck with noexec no matter what.

@nrvnrvn
Author

nrvnrvn commented Jul 15, 2017

the docker options are easily overridable, whereas an emptyDir user will be stuck with noexec no matter what.

This is what I was thinking about as well.
On the one hand, it would probably be good to allow users to mount tmpfs with the ability to run executables.
On the other hand, one container == one process. Kubernetes has elaborated this principle by introducing pods as a means of composing container applications, thus encouraging people to follow the one-process-per-container principle while addressing the need to run helper tools in scenarios where that is necessary.

Running a container with a read-only root filesystem and noexec writable mounts is just one of the steps in eliminating possible attack vectors and ensuring that only the stuff pre-baked into the container image will be executed. Here is a bunch of examples.

@smarterclayton
Contributor

smarterclayton commented Jul 17, 2017 via email

@nrvnrvn
Author

nrvnrvn commented Jul 17, 2017

Ok, to summarize:

  1. Add support for emptyDir.mountOptions.
  2. Extend PSP to allow users to pass mount options along with the volume types.
  3. Leave the defaults as they are right now.

@hach-que

Is there any action on this? We currently have this item in our risk register.

While almost all of our pod filesystems are read-only, we are required to mount an emptyDir temporary filesystem at /tmp so that .NET Core can operate correctly (it requires writing temporary files to disk), and we want to ensure that any potential attacker cannot use this emptyDir volume as a way of staging executables of their choice.

@msau42
Member

msau42 commented Oct 19, 2017

Can someone explain the attack vector that this is trying to prevent?

@hach-que

It's part of a defense-in-depth strategy. Writing executables to disk and executing them generally has a lower threshold for an attacker than convincing an application to allocate a block of memory, mark it as executable, and jump to it.

@msau42
Member

msau42 commented Oct 19, 2017

So if I understand correctly, you want to prevent a scenario like:

  1. You're running a pod using emptydir with tmpfs
  2. Your pod gets compromised.
  3. Attacker writes an executable into your emptyDir volume, and runs it

So if kubelet mounted the emptydir as noexec, then it would avoid this scenario.

But do you even need emptydir to do this? Couldn't you do this also with the container writable layer?

@hach-que

We run all our pods with readOnlyRootFilesystem. The only thing writable is the emptyDir.

@nrvnrvn
Author

nrvnrvn commented Oct 19, 2017

@msau42
We are willing to do the same: running a container in read-only mode with tmpfs mounts for some runtime-necessary stuff (/run or /tmp). There is a general recommendation to mount tmpfs with noexec,nosuid,nodev flags. If you run a plain docker run --tmpfs /lol something you will see it in action.

@wongma7
Contributor

wongma7 commented Oct 19, 2017

@kubernetes/sig-auth-feature-requests

imo this feature doesn't have to take the form of emptyDir.mountOptions. It doesn't even need to be an emptyDir field; it could be a pod.securityContext field that says "all of this pod's emptyDirs must be mounted noexec,nosuid,nodev". imo that would satisfy this use-case well enough without bringing in any of the complications of having PSP validations parse all pods' volumes.

edit: didn't mean to recategorize as feature. Read 'feature' as solution to the bug, and 'use-case' as scenario.

@k8s-ci-robot k8s-ci-robot added sig/auth Categorizes an issue or PR as relevant to SIG Auth. kind/feature Categorizes issue or PR as related to a new feature. labels Oct 19, 2017
@nrvnrvn
Author

nrvnrvn commented Oct 19, 2017 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 17, 2018
@nrvnrvn
Author

nrvnrvn commented Jan 17, 2018

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 17, 2018
@nikhita
Member

nikhita commented Mar 4, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 4, 2018
@glb
Contributor

glb commented Jul 29, 2019

I think I've got a change that might do what the OP was looking for, but haven't yet been able to do a proper e2e test.

Having gone through this, though, I'm wondering if this is really the right solution, or if the right solution is more general and involves adding mountOptions to VolumeMount.

Wanting to have noexec,nosuid,nodev (or other options) isn't just something that you would want on emptyDir: {medium: "Memory"} or emptyDir: {medium:"HugePages"}; you'd want it on emptyDir: {} as well, and likely on most volumes if you're security-conscious and not able to mount them read-only.

Briefly digging into that rabbit hole seems to lead quickly to a gap in the runtime API, where we can't pass arbitrary mount options to volumes, and even docker run doesn't support this.

Not sure if this is worth pursuing further, especially given the limitation to the memory-based emptyDir implementations. I suppose a person could rewrite the directory-based emptyDir so that it does a bind mount and applies the options...

Thoughts?

@adampl

adampl commented Jul 30, 2019

@glb but Docker does support this:

# docker volume create --driver local --opt type=tmpfs --opt device=tmpfs \
                       --opt o=size=100m,uid=1000,noexec,nosuid,nodev foo
# docker run -d -p 8000:80 -v foo:/usr/share/nginx/html --name bar --rm nginx
# mount | grep foo
tmpfs on /var/lib/docker/volumes/foo/_data type tmpfs (rw,nosuid,nodev,noexec,relatime,size=102400k,uid=1000)

@glb
Contributor

glb commented Jul 30, 2019

@adampl true, and in the branch I created to try this out, it seems pretty easy to extend emptyDir: { medium: "Memory" } to set the mount options. This is a rough equivalent to creating a tmpfs volume like you showed.

It's just as easy to do the same for emptyDir: { medium: "HugePages" }.

However, for the default emptyDir which just uses a directory (see here and here), it's not so simple. Maybe I could hack something with a bind mount, and I'm tempted now to try that, but I keep getting dragged back to the feeling that this isn't the right solution. It doesn't seem right to be putting mount options on the volume spec; they should be on the volume mount, and you should be able to use them for any kind of volume, not just emptyDir.

@adampl

adampl commented Jul 31, 2019

@glb Yes, I agree, this should be more generic.

@glb
Contributor

glb commented Jul 31, 2019

Because I couldn't leave it alone: today I learned that yes, you can bind-mount a directory to itself with options, so for example:

# mkdir foo
# cat > foo/x.sh <<EOHD
#!/bin/sh
echo hi
EOHD
# chmod 755 foo/x.sh
# foo/x.sh
hi
# mount -o bind foo foo
# mount -o remount,bind,noexec foo foo
# foo/x.sh
bash: foo/x.sh: Permission denied
# umount foo
# foo/x.sh
hi 

so it would be possible to make emptyDir work with mountOptions for the default medium case where it mounts a directory instead of tmpfs / hugepages.

I still don't think that's the right solution here... the right answer feels like having mount options for all VolumeMounts.
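For reference, the remount step in the transcript above maps directly onto mount(2) flags. A sketch of the flag computation a kubelet-side implementation might do; the constant values mirror <linux/fs.h>, and the actual unix.Mount(dir, dir, "", flags, "") call needs root and is Linux-only, so it is only mentioned in the comment:

```go
package main

import "fmt"

// Linux mount(2) flag values (from <linux/fs.h>); declared locally so the
// arithmetic is visible without the linux-only syscall constants.
const (
	msNosuid  = 0x2    // MS_NOSUID
	msNodev   = 0x4    // MS_NODEV
	msNoexec  = 0x8    // MS_NOEXEC
	msRemount = 0x20   // MS_REMOUNT
	msBind    = 0x1000 // MS_BIND
)

// remountFlags builds the flag word for the second step of the trick above,
// i.e. mount -o remount,bind,noexec <dir> <dir>. A real implementation would
// pass this to unix.Mount(dir, dir, "", flags, "") after the initial bind.
func remountFlags(noexec, nosuid, nodev bool) uintptr {
	flags := uintptr(msRemount | msBind)
	if noexec {
		flags |= msNoexec
	}
	if nosuid {
		flags |= msNosuid
	}
	if nodev {
		flags |= msNodev
	}
	return flags
}

func main() {
	fmt.Printf("%#x\n", remountFlags(true, true, true)) // 0x102e
}
```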


@wongma7
Contributor

wongma7 commented Jul 31, 2019

Mount options in volumeMounts were shot down for security reasons. For example, there is some option that can halt the node (errors=panic?).

@glb
Contributor

glb commented Jul 31, 2019

@wongma7 yeah, that would be bad. Maybe there needs to be some allow list; perhaps this is what was meant in the earlier comments about PodSecurityPolicy. A sufficiently advanced admission controller could also look at mount options and either reject pods/containers with bad specs, or even add the desired options (I'm thinking noexec,nosuid,nodev, maybe others?) for containers that weren't explicitly allowed to work otherwise.

@adampl

adampl commented Aug 1, 2019

Yep, I don't like the idea of ditching useful options for security reasons. The dangerous options should be disallowed by default, but possible to allow whichever options the cluster admin deems appropriate.

@liggitt
Member

liggitt commented Aug 1, 2019

Yep, I don't like the idea of ditching useful options for security reasons. The dangerous options should be disallowed by default, but possible to allow whichever options the cluster admin deems appropriate.

There was significant discussion around this when introducing mount options in kubernetes/community#321 (comment)

The resolution was that mount options would only be supported for PV/PVC volume sources, not inline pod volumes.

@glb
Contributor

glb commented Aug 1, 2019

@liggitt I may have skimmed too quickly, but the result of the conversation you linked seems to mean that a) "mount options" are really on the PV and would therefore apply equally to any container that mounts it (I can't think of a reasonable real-life counterexample); b) setting options for emptyDir seems to contradict this decision because it is defined inline.

Any thoughts on what the right answer could be for emptyDir then? It sounds like the idea of a list of allowed mount options in the PSP wouldn't fly for the reasons you described; having allowedEmptyDirMountOptions in the PSP seems extremely fine-grained. Would it be insane to have a hard-coded allow list in the plugin itself that's initially limited to a reasonable set?

I don't think it would be better to have Yet Another Boolean Option useMoreSecureMountOptions: true (name negotiable 🙂) on the emptyDir spec, as the next people to come along with even more nuanced understanding of what they want will add useEvenMoreSecureMountOptions rather than extending the allow list.

@adampl

adampl commented Aug 2, 2019

@liggitt First of all, I don't really understand why inline pod volumes are regarded as something separate from normal PVs. For example, why can't we have an emptyDir PV/PVC. See: #75378.

The potential to crash a node can be a problem only for a subset of Kubernetes clusters working on shared (cloud) infrastructure. That for me is not a sufficient reason to block other use cases.

@tallclair
Member

/area security

Without following all the context around PV/PVC mountOptions, I'm generally +1 on the idea of adding these (noexec,nosuid,nodev) as options on per-container volume mounts, or as options in the securityContext. I think defaulting those to true might be a hard sell with Kubernetes' stance on backwards compatibility.

@flaupretre

Those options are probably useful (especially 'exec'), but would it be possible to add a 'uid' volume mount parameter? For better security, I don't want anything to run as root in my container, and I also want my root FS to be read-only. I just need a small subdirectory where I dynamically generate a configuration file before launching the daemon. This small volume does not have to be persisted, and emptyDir would fit perfectly if the tmpfs directory it provides could belong to a non-root user.

Using bare docker, I can use the --tmpfs /dir:rw,uid=$UID option. In Kubernetes, I have no way to set the uid= option. So I cannot combine using a read-only root FS and running as non-root.

A more common case is the need for a writable /tmp. But enabling non-root users to write to /tmp requires a chmod, which requires running as root. Same case: need to start as root and su/sudo to start the daemon -> lower security.

@php-coder
Contributor

Using bare docker, I can use the --tmpfs /dir:rw,uid=$UID option. In Kubernetes, I have no way to set the uid= option.

It sounds like a use case for the securityContext.fsGroup field (see kubectl explain Pod.spec.securityContext.fsGroup).

A more common case is the need for a writable /tmp. But enabling non-root users to write to /tmp requires a chmod which requires to be executed as root.

Could you elaborate a little more on the last sentence, please? I don't see where you would need chmod:

$ cat test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  generateName: test-pod-
spec:
  containers:
  - image: ubuntu
    name: test-container
    command: [ "/bin/bash" ]
    args: [ "-c", "id; touch /tmp/test-file; ls -l /tmp" ]
    securityContext:
      runAsUser: 1000
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: tmp-volume
      mountPath: /tmp
  restartPolicy: Never
  volumes:
    - name: tmp-volume
      emptyDir: {}
$ kubectl.exe create -f test-pod.yaml
pod/test-pod-77q5h created
$ kubectl.exe logs pod/test-pod-77q5h
uid=1000 gid=0(root) groups=0(root)
total 0
-rw-r--r-- 1 1000 root 0 Oct  7 11:57 test-file

@flaupretre

You're right. I thought emptyDir would set the mount point to 755 permissions, but it is 777. So, no chmod needed.

@sneko

sneko commented Feb 13, 2020

Hi,

I installed Velero, and it tries to exec something on a volume that is mounted by default from an emptyDir, but it fails because of permissions. Checking with mount, I can see this partition has the noexec flag.

Am I the only one seeing this? I just ran the commands listed in the how-to.

Note that a year ago I had problems with Jenkins too: I had to mount its volumes from the host, otherwise it was unable to run its custom Jenkins scripts (the partition still had noexec when coming from emptyDir).

I'm on a Kubernetes cluster v1.15 managed by Kops on GCE... What could make that possible? It's so strange that your help is welcome :) :)

@rosskusler

On a kubeadm v1.18.15 cluster I created a pod with a read-only root filesystem and an emptyDir-backed /tmp. I was able to exec into the container, create a script in /tmp, and execute it. It would be great if we could specify mountOptions for emptyDir volumes.

@ashutoshvsingh

Hi, sorry,
it is a long thread, difficult to follow. Someone asked about the attack vector, so I can say that it is unsafe in circumstances where we are running web services that work on user-generated content (UGC). Like, say there is an attachment in your mail that I am processing for a virus scan; that executable can then execute in that tmpfs. I want to mount it with no execute permission.
Ash

@mfriedenhagen

Just to offer an IMO valid use case which would easily break with noexec: we always runAsUser different than root with readOnlyRootFilesystem, and we use the single-file option for pyinstaller. pyinstaller creates standalone executables for Python programs which contain a stripped-down Python environment. The produced binary unpacks the Python environment to /tmp by default.

@tspearconquest

tspearconquest commented Jun 7, 2022

On a kubeadm v1.18.15 cluster I created a pod with a read-only root filesystem and an emptyDir-backed /tmp. I was able to exec into the container, create a script in /tmp, and execute it. It would be great if we could specify mountOptions for emptyDir volumes.

This is what I want to prevent in my clusters. We have some pods that run non-root with a read-only rootfs but use a writable /tmp from an emptyDir: {} directory-backed inline pod volume, and we want to use the three options mentioned by the OP on the writable mountpoint inside the container to prevent the scenario described in the quote above.

The main concern is that being able to execute any file marked as executable (or worse, suid and executable) in a writable directory is a big problem that needs more attention; it would be really appreciated if this were prioritized higher, as it seems to me to be a security weakness in Kubernetes.

I'd also like to cast my vote that the most common use case desired is where the mount options are defined in the container spec, not the volume spec. I care less about the backing storage medium; what matters is that the container has the mount options set so that anything running in the container has no ability to execute from any of the writable volumes.

@nrvnrvn
Author

nrvnrvn commented Mar 11, 2023

Sweeping dust off this issue.

Motivation

CIS Benchmark for distribution-independent Linux has the following controls (and those controls propagate to other distribution-dependent benchmarks):

  • 1.1.3 Ensure nodev option set on /tmp partition. The nodev mount option specifies that the filesystem cannot contain special devices. Rationale: since the /tmp filesystem is not intended to support devices, set this option to ensure that users cannot attempt to create block or character special devices in /tmp.
  • 1.1.4 Ensure nosuid option set on /tmp partition. The nosuid mount option specifies that the filesystem cannot contain setuid files. Rationale: since the /tmp filesystem is only intended for temporary file storage, set this option to ensure that users cannot create setuid files in /tmp.
  • 1.1.5 Ensure noexec option set on /tmp partition. The noexec mount option specifies that the filesystem cannot contain executable binaries. Rationale: since the /tmp filesystem is only intended for temporary file storage, set this option to ensure that users cannot run executable binaries from /tmp.

Although the aforementioned controls were created with a generic Linux OS in mind and cover the /tmp partition specifically, they are applicable to containers as well.

In the ideal scenario one would like to ensure that only one executable runs inside the container, that this executable comes from the image's Entrypoint or Cmd (command or args in the Kubernetes container spec, respectively), and that there is no place left to download and execute code apart from those pre-defined in the Pod spec. Unfortunately, you cannot mandate that all mounted volumes have the noexec flag. Example use cases:

  1. As @tspearconquest described: run a non-root container with a read-only rootfs and one or more inline tmpfs (emptyDir) volumes mounted for storing data the main application needs in order to function properly. In the normal use case, mounting these types of inline volumes with nosuid,noexec,nodev options is completely valid and will work flawlessly.
  2. Some applications though, due to the way they have been designed, do execute some code from those "ephemeral" locations. I won't provide any specific examples, but I have run into this while running some third-party systems with docker and --tmpfs mounts, where I had to remove the noexec flag to resolve the issue.

Possible solution

We could reconsider the decision to disallow mountOptions for inline volumes and re-use the github.com/moby/sys logic to validate the mount options set on each emptyDir volume individually. This would address all emptyDir use cases while preserving the current defaults to maintain backwards compatibility in the meantime.

Rationale

I have read through kubernetes/community#321, and as far as I understand, the decision not to allow users to define mountOptions for inline volumes comes from the risk of a node crash, as suggested by @wongma7.

Polluting securityContext with Yet Another Boolean Option is not an option, as suggested by @glb, as it would introduce yet another special case; we are already in the position where inline volumes are treated differently from persistent volumes, otherwise this issue and discussion would probably be non-existent.

Docker and nerdctl, however, allow passing mount options to tmpfs mounts (which can be considered an equivalent of emptyDir in particular and inline volumes in general). Docker does it by maintaining a list of valid mount options and does not accept silly things like errors=panic, thus protecting the host.

Example:

❯ docker run --rm --tmpfs /lol:errors=panic busybox                                         
docker: Error response from daemon: Invalid tmpfs option ["errors" "panic"].

See

https://github.com/moby/moby/blob/2fa66cfce2523f6e38bcccc1161055b46c3c6a90/daemon/oci_linux.go#L584-L587

and

https://github.com/moby/sys/blob/c161267cd2212985f81a50a029001f82c91bca7f/mount/flags_unix.go#L103-L106

and nerdctl, which re-uses the github.com/moby/sys logic and also defaults to nosuid,noexec,nodev mount options for tmpfs:

https://github.com/containerd/nerdctl/blob/0a125c1a3346162a31b3c2bee3e205f97fd28c6b/pkg/mountutil/mountutil_linux.go#L265-L273
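The kind of validation both projects perform can be sketched as a simple allowlist lookup. This is illustrative only: the option set below is a plausible subset, and validateTmpfsOptions is an invented helper, not moby/sys's actual API.

```go
package main

import (
	"fmt"
	"strings"
)

// validTmpfsOptions mirrors the idea (not the exact contents) of the
// allowlist consulted before accepting a tmpfs mount option.
var validTmpfsOptions = map[string]bool{
	"ro": true, "rw": true, "nosuid": true, "suid": true,
	"nodev": true, "dev": true, "noexec": true, "exec": true,
	"relatime": true, "noatime": true,
	// key=value options are matched on their key
	"size": true, "mode": true, "uid": true, "gid": true,
}

// validateTmpfsOptions rejects anything outside the allowlist, so a
// dangerous option like errors=panic never reaches the host kernel.
func validateTmpfsOptions(opts string) error {
	for _, o := range strings.Split(opts, ",") {
		key := strings.SplitN(o, "=", 2)[0]
		if !validTmpfsOptions[key] {
			return fmt.Errorf("invalid tmpfs option %q", o)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateTmpfsOptions("rw,noexec,nosuid,nodev,size=100m")) // <nil>
	fmt.Println(validateTmpfsOptions("errors=panic"))
}
```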

Apart from docker and nerdctl, I have not performed a prior-art analysis of other container runtimes and clients, but I am curious and will check how podman handles this.

References

@keymandll

I also desperately need to be able to set mountOptions on emptyDir volumes. I'd need to add the noexec and nosuid mount options. Has there been any progress on this issue lately?

@carlory
Member

carlory commented Jul 28, 2023

Has there been any progress on this issue?
