What happened: I've been seeing a number of evictions recently that appear to be due to disk pressure:
```
$$$ kubectl get pod kumo-go-api-d46f56779-jl6s2 --namespace=kumo-main -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: 2018-12-06T10:05:25Z
  generateName: kumo-go-api-d46f56779-
  labels:
    io.kompose.service: kumo-go-api
    pod-template-hash: "802912335"
  name: kumo-go-api-d46f56779-jl6s2
  namespace: kumo-main
  ownerReferences:
  - apiVersion: extensions/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: kumo-go-api-d46f56779
    uid: c0a9355e-f780-11e8-b336-42010aa80057
  resourceVersion: "11617978"
  selfLink: /api/v1/namespaces/kumo-main/pods/kumo-go-api-d46f56779-jl6s2
  uid: 7337e854-f93e-11e8-b336-42010aa80057
spec:
  containers:
  - env:
    - redacted...
    image: gcr.io/<redacted>/kumo-go-api@sha256:c6a94fc1ffeb09ea6d967f9ab14b9a26304fa4d71c5798acbfba5e98125b81da
    imagePullPolicy: Always
    name: kumo-go-api
    ports:
    - containerPort: 5000
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-t6jkx
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-t6jkx
    secret:
      defaultMode: 420
      secretName: default-token-t6jkx
status:
  message: 'The node was low on resource: nodefs.'
  phase: Failed
  reason: Evicted
  startTime: 2018-12-06T10:05:25Z
```
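In case it helps anyone triaging something similar, this is roughly how I've been spotting affected pods and checking whether the node is under disk pressure (the node name here is just the one from this report):

```
# Evicted pods end up in phase Failed, so this lists them across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Failed

# Check whether the node currently reports a DiskPressure condition
kubectl describe node gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s | grep -A 10 'Conditions:'
```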
Taking a look at `kubectl get events`, I see these warnings:
```
$$$ kubectl get events
LAST SEEN   FIRST SEEN   COUNT   NAME                                                                    KIND   SUBOBJECT   TYPE      REASON          SOURCE                                                         MESSAGE
2m          13h          152     gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91   Node               Warning   ImageGCFailed   kubelet, gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s   (combined from similar events): failed to garbage collect required amount of images. Wanted to free 473948979 bytes, but freed 0 bytes
37m         37m          1       gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e3127ebc715c3   Node               Warning   ImageGCFailed   kubelet, gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s   failed to garbage collect required amount of images. Wanted to free 473674547 bytes, but freed 0 bytes
```
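To gauge how widespread this is, filtering events by reason should also work (assuming the events haven't aged out yet; `reason` and `type` are supported field selectors for events as far as I know):

```
kubectl get events --all-namespaces --field-selector reason=ImageGCFailed,type=Warning
```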
Digging a bit deeper:
```
$$$ kubectl get event gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91 -o yaml
apiVersion: v1
count: 153
eventTime: null
firstTimestamp: 2018-12-07T11:01:06Z
involvedObject:
  kind: Node
  name: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
  uid: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
kind: Event
lastTimestamp: 2018-12-08T00:16:09Z
message: '(combined from similar events): failed to garbage collect required amount
  of images. Wanted to free 474006323 bytes, but freed 0 bytes'
metadata:
  creationTimestamp: 2018-12-07T11:01:07Z
  name: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91
  namespace: default
  resourceVersion: "381976"
  selfLink: /api/v1/namespaces/default/events/gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91
  uid: 65916e4b-fa0f-11e8-ae9a-42010aa80058
reason: ImageGCFailed
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
type: Warning
```
There's actually remarkably little here. This message doesn't say anything about why ImageGC was initiated or why it was unable to recover more space.
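For context, my understanding is that the kubelet starts image GC once usage of the image filesystem crosses `--image-gc-high-threshold` (85% by default) and deletes unused images until it is back under `--image-gc-low-threshold` (80% by default); "freed 0 bytes" presumably means every image was still referenced by a pod, or the space is actually being consumed by things image GC can't touch (writable container layers, logs, emptyDirs). A rough way to see what the kubelet on the node is running with and what is eating the disk (just a sketch; paths assume the default Docker-on-COS layout, and it needs a shell on the node, e.g. via gcloud compute ssh):

```
# GC/eviction flags the kubelet was actually started with
ps aux | grep '[k]ubelet' | tr ' ' '\n' | grep -E 'image-gc|eviction'

# Overall usage of the filesystem backing images/containers, and the big consumers
df -h /var/lib/docker
sudo du -sh /var/lib/docker/* /var/lib/kubelet/pods 2>/dev/null | sort -h
```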
What you expected to happen: Image GC to work correctly, or at the very least for pods not to be scheduled onto nodes that don't have sufficient disk space.
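As far as I can tell, the default scheduler only compares ephemeral-storage requests against the node's allocatable value and (beyond backing off nodes already marked with DiskPressure) doesn't look at actual free disk, so with no requests set it will keep placing pods on an almost-full node. For reference, the allocatable figure it budgets against shows up in:

```
kubectl describe node gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s | grep -A 6 'Allocatable:'
```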
How to reproduce it (as minimally and precisely as possible): Run and stop as many pods as possible on a node to build up disk pressure, then observe these errors.
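A sketch of what I mean, with entirely hypothetical pod names: it pins one-shot pods to the node from this report via --overrides and cycles through a few large public images so there is plenty of image data for GC to (in theory) reclaim once the pods are gone:

```
NODE=gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s

# Run a few one-shot pods on the same node, each pulling a different large image
for tag in 2.7 3.5 3.6 3.7; do
  kubectl run "disk-churn-${tag//./-}" --image="python:$tag" --restart=Never \
    --overrides="{\"apiVersion\":\"v1\",\"spec\":{\"nodeName\":\"$NODE\"}}" \
    -- python -c 'print("done")'
done

# Once they complete, delete them so their images are no longer "in use" and become GC candidates
for tag in 2.7 3.5 3.6 3.7; do
  kubectl delete pod "disk-churn-${tag//./-}"
done
```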
Anything else we need to know?: n/a
Environment:
- Kubernetes version (use `kubectl version`):
```
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.7-gke.11", GitCommit:"fa90543563c9cfafca69128ce8cd9ecd5941940f", GitTreeState:"clean", BuildDate:"2018-11-08T20:22:21Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
```
- Cloud provider or hardware configuration: GKE
- OS (e.g. from /etc/os-release): I'm running macOS 10.14; the nodes are running Container-Optimized OS (cos).
- Kernel (e.g. `uname -a`): Darwin D-10-19-169-80.dhcp4.washington.edu 18.0.0 Darwin Kernel Version 18.0.0: Wed Aug 22 20:13:40 PDT 2018; root:xnu-4903.201.2~1/RELEASE_X86_64 x86_64
- Install tools: n/a
- Others: n/a
/kind bug