publish an event when the container is restarted #126474
base: master
Conversation
/sig api-machinery
The first commit in this PR is meant to give an idea of the desired output for the resolution of issue #123176.
/retitle publish an event when the container is restarted
/remove-sig api-machinery
Force-pushed from 25a633a to 96a9b5b
@Ritikaa96 please squash your commits into one, thanks!
Sure, @bart0sh.
Force-pushed from 96a9b5b to 4aa6b80
Signed-off-by: Ritikaa96 <[email protected]>
Force-pushed from 4aa6b80 to b26d7a8
Hi @SergeyKanzhelev @tallclair, our use case is based on utilising events, and our users depend on them. I understand and agree that events are not guaranteed to be available, but they certainly help us get our results and observations. I'll be happy to work on a long-term solution for better container state clarity; however, adding this event can solve our issue for this use case.
Any word on this? Please let me know your opinion.
@SergeyKanzhelev This addition will also be useful for users to differentiate container states, as it is for us. Still waiting for further discussion from the reviewers.
Hi @tallclair @SergeyKanzhelev @bart0sh |
Speaking for myself, I'm not against adding events for something which is supposed to happen rarely, like a container restart. We should take care not to cause extra load on the apiserver, as always, but conceptually I don't see any blockers. But I second the considerations about the usefulness of the events. Meaning, events are garbage collected and, if memory serves, even the most relaxed settings still don't retain events past a few hours, so events would be of little help in troubleshooting issues from, say, last week. I hit a similar condition in the past few months and this significantly cooled down my interest in events. There's another side though. Events are most useful when someone is actively monitoring them, but that someone doesn't need to be a human. It may (and probably should) be a program. So, is the use case we're trying to fix ultimately related to observability? Do we need a signal to detect more easily that a container has restarted? Can other observability mechanisms be more useful?
Hi @ffromani, thanks for your insight.
You are right; I believe the default event retention is 1 hour. There are also ways to customise event retention. However, real-time changes are also usable for automation built on top of them, and in my case I only need the real-time changes.
Currently, Kubernetes just shows a ContainerStarted --> ContainerStarted event sequence in the event record when the container dies for some reason; the sequence should instead include the death, i.e. ContainerStarted --> ContainerDied --> ContainerStarted.
This will provide real-time insight into container deaths, enabling better root cause analysis and faster troubleshooting.
@@ -239,6 +239,10 @@ func (m *kubeGenericRuntimeManager) startContainer(ctx context.Context, podSandb
 		}
 	}
+
+	if restartCount > 0 {
+		m.recordContainerEvent(pod, container, "", v1.EventTypeNormal, events.ContainerDied, fmt.Sprintf("Container died. Restarting container %s", container.Name))
+	}
This is a little bit strange, seeing a ContainerDied event in the startContainer function. Is there a better place to do this, e.g. when the container actually dies?
It was previously named containerRestarted, and the event reason was changed during review. There might well be some other place where the ContainerDied event could be recorded, but as the community is still contemplating whether adding this event message is needed at all, it's better to confirm that first before working on anything further.
Yeah, agreed this isn't the right place to record this. The PLEG emits a ContainerDied lifecycle event, so the simplest approach would probably be to record it in that path, e.g. here:
kubernetes/pkg/kubelet/kubelet.go, line 2494 in ec80dca: if isSyncPodWorthy(e) {
That way you'll also get a more accurate timestamp, since it wouldn't need to wait for a prior SyncPod iteration to finish.
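A minimal sketch of that suggestion, assuming the event would be emitted from the kubelet's PLEG handling path via client-go's standard EventRecorder; the function name onPLEGContainerDied and its parameters are hypothetical, not the kubelet's actual helpers:

// Hypothetical sketch: record the event where the kubelet handles the PLEG
// ContainerDied lifecycle event (near the isSyncPodWorthy check), instead of
// inside startContainer, so the timestamp tracks the actual container exit.
package kubelet

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

func onPLEGContainerDied(recorder record.EventRecorder, pod *v1.Pod, containerName string) {
	// Normal event: the death of a restartable container is expected
	// behaviour under the pod's restart policy.
	recorder.Eventf(pod, v1.EventTypeNormal, "ContainerDied",
		"Container %s died and will be restarted", containerName)
}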
OK, so it seems to me you need a more explicit signal when a container dies. Were alternatives to events evaluated and then discarded? If so, why? I think this is an overarching thought shared by a few reviewers.
In our production env, we usually use metrics to detect container restarts. 🤔
Hi @googs1025, I understand there are various ways to achieve this. In my script, events prove fruitful, and as I said, not all roles can access pod logs. Still, adding an event message to show the right sequence of container state changes should clarify what Kubernetes already does. Even as a normal user, I would not expect container states to go from ContainerStarted straight to ContainerStarted again. If there is a record of the sudden failure, the user can understand the container state better, which in turn makes it easier for developers and operators to explain these small pod failures. Will adding this event message cause issues or extra load for Kubernetes performance?
Having symmetry in the reporting of container start and stop is an interesting argument for this change (once the inline review comments have been addressed)! Granted, the lack of an explicit "container died" signal is irking, but dead containers are supposed to restart ASAP, so one can reasonably expect a rapid back-to-back sequence of "container died" + "container started". Perhaps, but I'm really thinking aloud, we should coalesce such events into a "container restarted" event right from the beginning? I'm not sure we have this already. Maybe this is premature optimization? I didn't explore this line of thought very much. However, I still feel there has not been a good explanation of why the alternatives are insufficient or unsuitable. I think this is where most of the concern is.
I have shared a screenshot of multiple container restarts, and it doesn't show any cumbersome back-to-back sequence of "container died" + "container started"; it used "container restarted (x times the restart occurred in the past XX min)".
Hi @ffromani, does this clarify the questions about the motivation for the change?
@ffromani any thoughts on this? From a user's point of view as well, having a clean sequence of events (which exist to reflect real-time container state changes for base-level monitoring and troubleshooting) is a big motivation. Regarding back-to-back ContainerStarted + ContainerDied events, I tested the code change and the event record will not show a back-to-back sequence; it will show the latest container restart or container death along with the number of times it has already happened. If this is still not acceptable as motivation for the change, then please do let me know.
@SergeyKanzhelev @dchen1107 @tallclair any final points on this? I hope the use case is clear from my explanation.
@@ -20,6 +20,7 @@ package events
 const (
 	CreatedContainer = "Created"
 	StartedContainer = "Started"
+	ContainerDied = "Died"
Do we have this concept? What is "died": crashed, or exited, or both? Or is it that it was killed?
@aojea Hi, in my opinion it is "exited", as after that event a new container will start. WDYT?
This should be Terminated to align with the ContainerState.
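As a sketch of that naming suggestion (an assumption based on this review comment, not merged code), the constant in pkg/kubelet/events could look like:

package events

const (
	CreatedContainer    = "Created"
	StartedContainer    = "Started"
	ContainerTerminated = "Terminated" // instead of the proposed ContainerDied = "Died"
)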
I understand the community's point of view about observability tools, but as there are multiple events for container state, it gets confusing to see, or to explain, why there is no event for a container's death (even one that only reports the last time the container died).
In cases where implementing observability tooling or checking pod logs for this one trigger would be costly, an event recording the container's death would provide a simpler solution. Let me know WDYT about it.
I thought adding one event message would not be difficult, and it also simplifies the user experience; however, would this create any apiserver load issue if we merge this PR?
If we can differentiate the ContainerStarted event emitted when a container is newly created from the ContainerStarted event emitted when a container has actually restarted, it will resolve the issue I'm facing.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I'm OK with adding the container terminated event (although the implementation needs some changes), though I think the ContainerStateTerminated part of the container status would have a more accurate timestamp.
I'd also be OK with adding a restart count to the container started event message (here).
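A minimal sketch of that second alternative, folding the restart count into the existing Started event message; the helper names (startedEventMessage, recordStartedEvent) are hypothetical and only illustrate the message change using client-go's EventRecorder:

package kuberuntime

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// startedEventMessage distinguishes a restart from the first start by
// including the restart count in the message.
func startedEventMessage(containerName string, restartCount int) string {
	if restartCount > 0 {
		return fmt.Sprintf("Started container %s (restart #%d)", containerName, restartCount)
	}
	return fmt.Sprintf("Started container %s", containerName)
}

// recordStartedEvent emits the existing Started reason with the augmented message.
func recordStartedEvent(recorder record.EventRecorder, pod *v1.Pod, containerName string, restartCount int) {
	recorder.Event(pod, v1.EventTypeNormal, "Started", startedEventMessage(containerName, restartCount))
}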
What type of PR is this?
/kind feature
What this PR does / why we need it:
This adds a container event for container restarts.
Which issue(s) this PR fixes:
Fixes #123176
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
Not needed.