publish an event when the container is restarted #126474

Open
Ritikaa96 wants to merge 2 commits into master from the add-container-died-event branch

Conversation

Ritikaa96 (Contributor)

What type of PR is this?

/kind feature

What this PR does / why we need it:

This adds a container event for container restarts.

Which issue(s) this PR fixes:

Fixes #123176

Does this PR introduce a user-facing change?

Added a new container event for container restarts, visible in the Events section of kubectl describe pod output.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Not needed.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 31, 2024
@Ritikaa96 (Contributor, Author)

/sig api-machinery
/sig node
/do-not-merge work-in-progress

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jul 31, 2024
@Ritikaa96 (Contributor, Author)

The first commit of this PR is meant to give an idea of the desired output for resolving issue #123176.
The approach might change after further community discussion.

@bart0sh (Contributor) commented Jul 31, 2024

/cc @harche @saschagrunert

@harche (Contributor) commented Jul 31, 2024

/retitle publish an event when the container is restarted

@k8s-ci-robot k8s-ci-robot changed the title Add container died event publish an event when the container is restarted Jul 31, 2024
@seans3 (Contributor) commented Jul 31, 2024

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jul 31, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. label Sep 16, 2024
@Ritikaa96 Ritikaa96 force-pushed the add-container-died-event branch from 25a633a to 96a9b5b Compare November 27, 2024 07:03
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. label Nov 27, 2024
@bart0sh (Contributor) commented Nov 27, 2024

@Ritikaa96 please squash your commits into one, thanks!

@Ritikaa96 (Contributor, Author)

@Ritikaa96 please squash your commits into one, thanks!

Sure, @bart0sh.

@Ritikaa96 Ritikaa96 force-pushed the add-container-died-event branch from 96a9b5b to 4aa6b80 Compare November 29, 2024 04:09
Signed-off-by: Ritikaa96 <[email protected]>
@Ritikaa96 Ritikaa96 force-pushed the add-container-died-event branch from 4aa6b80 to b26d7a8 Compare February 7, 2025 19:32
@Ritikaa96 (Contributor, Author) commented Feb 12, 2025

Hi @SergeyKanzhelev @tallclair, our use case is based on utilising the events, and users depend on them. I understand and agree that events are not guaranteed to be available, but they certainly help us in getting our results and observations.
Adding a ContainerDied or ContainerTerminated event is essential for our issue here.

I'll be happy to work on a long-term solution for better container state clarity; however, this event addition can solve our issue as per the use case.

@Ritikaa96 (Contributor, Author)

Hi @SergeyKanzhelev @tallclair, our use case is based on utilising the events, and users depend on them. I understand and agree that events are not guaranteed to be available, but they certainly help us in getting our results and observations. Adding a ContainerDied or ContainerTerminated event is essential for our issue here.

I'll be happy to work on a long-term solution for better container state clarity; however, this event addition can solve our issue as per the use case.

Any word on this? Please let me know your opinion.

@SergeyKanzhelev SergeyKanzhelev moved this from PRs - Needs Approver to PRs - Needs Reviewer in SIG Node CI/Test Board Mar 5, 2025
@Ritikaa96 (Contributor, Author)

@SergeyKanzhelev This addition will also be useful for users to differentiate container states, as it is for us. Still waiting for further discussion from the reviewers.

@Ritikaa96 (Contributor, Author)

Hi @tallclair @SergeyKanzhelev @bart0sh
To explain the need for this event: our requirement lies in the timestamp and, specifically, the state change, which we want to access from the event; further details like the reason can be filtered out from the logs. I understand that pod logs are more reliable, but events are real-time insights into state changes and other significant occurrences within the cluster. I believe events exist for the benefit of the user, and they can at least help in identifying issues early on.

@ffromani (Contributor)

I myself am not against adding events for something which is supposed to happen rarely, like a container restart. We should take care not to cause extra load on the apiserver, as always, but conceptually I don't see any blockers.

But I second the considerations about the usefulness of the events. Meaning, events are garbage collected and, if memory serves, even the most relaxed settings still don't retain events past a few hours, so events would be of little help in troubleshooting issues from, say, last week. I hit a similar condition in the past few months and this significantly cooled down my interest in events.

There's another side though. Events are most useful when someone is actively monitoring them, but that someone doesn't need to be a human. It may (and probably should) be a program. So, is the use case we're trying to fix ultimately related to observability? Do we need a signal to detect more easily that a container has restarted? Can other observability mechanisms be more useful?

@Ritikaa96 (Contributor, Author)

Hi @ffromani, thanks for your insight.

Events are most useful when someone is actively monitoring them, but that someone doesn't need to be a human.

You are right. I believe the default event retention is 1 hour, although there are also ways to customise event retention.

However, the real-time changes are also usable for automation built on top of them, and here I need only the real-time changes.
A cluster using automation scripts to detect pod failures can utilise such an event trigger; the script can be triggered more effectively by discrete container state changes, such as container death, as in the sketch below.
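
For illustration only (not part of the PR), a minimal client-go watcher sketch under stated assumptions: the event reason is "ContainerDied", the namespace is "default", and the kubeconfig path is a placeholder.

// Hypothetical automation sketch: watch for the proposed event and react to it.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a local kubeconfig (path is an assumption).
	config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch only events whose reason matches the proposed reason name.
	w, err := clientset.CoreV1().Events("default").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("reason", "ContainerDied").String(),
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		if e, ok := ev.Object.(*corev1.Event); ok {
			// React in near real time, e.g. trigger a remediation script or alert.
			fmt.Printf("%s/%s: %s (%s)\n", e.InvolvedObject.Namespace, e.InvolvedObject.Name, e.Message, e.LastTimestamp)
		}
	}
}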

@Ritikaa96 (Contributor, Author)

@ffromani

Currently, k8s just shows a ContainerStarted --> ContainerStarted event sequence in the event record when the container dies for some reason; however, the sequence should be:

  • ContainerStarted --> ContainerDied --> ContainerStarted (or ContainerRestarted), if it is restarted again after it died once.

It will provide real-time insights into container deaths, enabling better root cause analysis and faster troubleshooting.
I hope I'm able to explain why I raised the need for it.
I don't see any issue in adding one event message if this also aligns with the Kubernetes goal of usability; I'll leave the decision to the community though, as they know best.

@@ -239,6 +239,10 @@ func (m *kubeGenericRuntimeManager) startContainer(ctx context.Context, podSandb
}
}

if restartCount > 0 {
m.recordContainerEvent(pod, container, "", v1.EventTypeNormal, events.ContainerDied, fmt.Sprintf("Container died. Restarting container %s", container.Name))
Contributor

It is a little strange to see a ContainerDied event in the startContainer function. Is there a better place to do this, e.g. when the container actually dies?

@Ritikaa96 (Contributor, Author) Mar 13, 2025

It was previously named containerRestarted, and during review the event reason was changed. There may well be a better place where the containerDied event should be recorded. But as the community is still contemplating whether this event message is needed at all, it's better to confirm that first before working on anything further.

Member

Yeah, agreed this isn't the right place to record this. The PLEG emits a ContainerDied lifecycle event, so the simplest approach would probably be to record it in that path, e.g. here:

if isSyncPodWorthy(e) {

That way you'll also get a more accurate timestamp, since it wouldn't need to wait for a prior SyncPod iteration to finish.
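
To make the suggestion concrete, a rough sketch of recording the event in the PLEG-handling path. This is not the PR's code: the surrounding names (plegCh, handler, kl.podManager, kl.recorder) follow the kubelet's syncLoopIteration, but the exact wiring and message are assumptions.

// Fragment of the kubelet's PLEG event handling (hypothetical placement).
case e := <-plegCh:
	if isSyncPodWorthy(e) {
		if pod, ok := kl.podManager.GetPodByUID(e.ID); ok {
			if e.Type == pleg.ContainerDied {
				// Hypothetical addition: emit the event as soon as PLEG observes the
				// death, instead of waiting for the next startContainer call.
				kl.recorder.Eventf(pod, v1.EventTypeNormal, events.ContainerDied,
					"Container %v terminated and will be restarted if the restart policy allows", e.Data)
			}
			handler.HandlePodSyncs([]*v1.Pod{pod})
		}
	}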

@ffromani (Contributor)

@ffromani

Currently, k8s just shows a ContainerStarted --> ContainerStarted event sequence in the event record when the container dies for some reason; however, the sequence should be:

* ContainerStarted --> ContainerDied --> ContainerStarted (or ContainerRestarted), if it is restarted again after it died once.

It will provide real-time insights into container deaths, enabling better root cause analysis and faster troubleshooting. I hope I'm able to explain why I raised the need for it. I don't see any issue in adding one event message if this also aligns with the Kubernetes goal of usability; I'll leave the decision to the community though, as they know best.

OK, so it seems to me you need a more explicit signal when a container dies. Were alternatives to events evaluated and then discarded? If so, why? I think this is the overarching thought shared by a few reviewers.

@googs1025 (Member)

It will provide real-time insights into container deaths, enabling better root cause analysis and faster troubleshooting.

In our production env, we usually use metrics to detect container restarts. 🤔

@Ritikaa96 (Contributor, Author)

Hi @googs1025, I understand; there are various ways to achieve it.

In my script events prove fruitful and, as I said, not all roles can access pod logs. Still, adding an event message to show the right sequence of container state changes would clarify what Kubernetes is actually doing. Even as a normal user I would not expect container states to go from ContainerStarted straight to ContainerStarted again. If there is a record of the sudden failure, the user can understand the container state better, which in turn makes it easier for developers and operators to explain these small pod failures.

Will adding this event message cause any issue or load on Kubernetes' performance?

@ffromani (Contributor) commented Mar 12, 2025

Having symmetry in the reporting of container start and stop is an interesting argument for this change (once the inline review comments have been addressed)!

Granted, the lack of an explicit "container died" signal is irking, but dead containers are supposed to restart ASAP, so one can reasonably expect a rapid back-to-back sequence of "container died" + "container started". Perhaps, but I'm really thinking aloud, we should coalesce such events right from the beginning into "container restarted"? Not sure we have this already. Maybe this is premature optimization? I didn't explore this line of thought too much.

However, I still feel there was no good explanation of why other alternatives are insufficient or unsuitable. I think this is where the most concern is.

@Ritikaa96 (Contributor, Author)

Granted, the lack of an explicit "container died" signal is irking, but dead containers are supposed to restart ASAP, so one can reasonably expect a rapid back-to-back sequence of "container died" + "container started". Perhaps, but I'm really thinking aloud, we should coalesce such events right from the beginning into "container restarted"? Not sure we have this already. Maybe this is premature optimization? I didn't explore this line of thought too much.

I have shared a screenshot of multiple container restarts and it doesn't show any cumbersome back-to-back sequence of "container died" + "container started"; here I used "container restarted" (x times the restart occurred in the past XX min).
Even if the only purpose of this change is to make the events more readable, consider this situation:

  • having a list of events in a system with multiple pods, where only one pod is failing;
  • when we look at it, it shows "container started" again and again as the pod's container restarts;
  • however, there are other pods we are introducing, and their containers are also starting;
  • what, then, is the difference between a container that has just started afresh and a container that has restarted?
  • wouldn't introducing a different term for the two save some time when a person is monitoring the system themselves?

@Ritikaa96 (Contributor, Author) commented Mar 13, 2025

However, I still feel there was no good explanation of why other alternatives are insufficient or unsuitable. I think this is where the most concern is.
Were alternatives to events evaluated and then discarded? If so, why? I think this is the overarching thought shared by a few reviewers.

Hi @ffromani
The concern is not about using other alternative tools, but about using the other observability tools along with events to capture changes in real time. When events are used together with observability tools, most container state changes are captured almost in real time. My motivation is to capture these changes with minimum latency, which understandably grows when we rely on observability tools alone. Using events alongside these tools works best for me; having no record in the event list for a container restart or container death is what causes the issue, hence I tried the change.

Does this clarify the questions behind the motivation for the change?

@Ritikaa96 (Contributor, Author)

I still feel there was no good explanation of why other alternatives are insufficient or unsuitable. I think this is where the most concern is.

@ffromani any thoughts on this, also from a user's point of view? Having a clean sequence of events (which exist to reflect real-time container state changes for base-level monitoring and troubleshooting) is a big motivation. Regarding back-to-back containerStarted + containerDied events, I tested the code change and the event record will not show a back-to-back sequence; it will show the latest containerRestart or containerDeath with the number of times it has already happened.

If this is still not acceptable as motivation for the change, then please do let me know.
Thanks for your responses.
Regards,
Ritika

@Ritikaa96 (Contributor, Author)

@SergeyKanzhelev @dchen1107 @tallclair any final points on this? I hope the use case is clear from my explanation.

@@ -20,6 +20,7 @@ package events
const (
CreatedContainer = "Created"
StartedContainer = "Started"
ContainerDied = "Died"
@aojea (Member) Mar 21, 2025

Do we have this concept? What does "died" mean: crashed, exited, or both? Or that it was killed?

@Ritikaa96 (Contributor, Author) Mar 24, 2025

@aojea Hi, in my opinion it is "exited", as after that event a new container will start. WDYT?

Member

This should be Terminated to align with the ContainerState
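
For reference, a sketch of that suggestion against the constants shown in the diff above; the name ContainerTerminated is an assumption, not an existing upstream constant.

// Suggested alignment with v1.ContainerStateTerminated instead of introducing "Died".
const (
	CreatedContainer    = "Created"
	StartedContainer    = "Started"
	ContainerTerminated = "Terminated" // hypothetical; the PR currently adds ContainerDied = "Died"
)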

@Ritikaa96 (Contributor, Author)

I understand the community's POV about observability tools, but as there are multiple events for container state, it gets confusing to see, or to explain, why there is no event for container death (even one that only shows the last time the container died).
k8s defines many events, e.g.:

  • Scheduled
  • AddedInterface
  • Pulling
  • Pulled
  • Created
  • Started
  • SuccessfulCreate
  • Killing

In cases where observability tools and checking pod logs for this one trigger would be costly to implement, an event recording container death would provide a simpler solution. Let me know WDYT about it.

@Ritikaa96 (Contributor, Author)

I thought adding one event message would not be difficult, as it also simplifies the user experience; however, would merging this PR create any apiserver load issue?

@Ritikaa96 (Contributor, Author)

If we can differentiate the ContainerStarted event emitted when the container is newly created from the ContainerStarted event emitted when the container has actually restarted, it will resolve the issue I'm facing.
How about having the event after a container restart show ContainerRestarted, with ContainerStarted used only for the first start of the container? It can be done by considering the restart count in the code: if it is >= 1, the event recorded will be ContainerRestarted, as in the sketch below.
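
A hedged sketch of that idea, mirroring the recordContainerEvent call shown in the diff above; the ContainerRestarted reason and the message wording are assumptions, not existing Kubernetes constants.

// Inside startContainer: pick the event reason from the restart count so a
// restart is distinguishable from a first start.
reason := events.StartedContainer
message := fmt.Sprintf("Started container %s", container.Name)
if restartCount > 0 {
	reason = "ContainerRestarted" // hypothetical reason, not defined upstream
	message = fmt.Sprintf("Restarted container %s (restart #%d)", container.Name, restartCount)
}
m.recordContainerEvent(pod, container, "", v1.EventTypeNormal, reason, message)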

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2025
@tallclair (Member) left a comment

I'm OK with adding the container terminated event (although the implementation needs some changes), though I think the ContainerStateTerminated part of the container status would have a more accurate timestamp.

I'd also be OK with adding a restart count to the container started event message (here).

Labels

  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/kubelet
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/hold - Indicates that a PR should not merge because someone has issued a /hold command.
  • kind/feature - Categorizes issue or PR as related to a new feature.
  • lifecycle/stale - Denotes an issue or PR has remained open with no activity and has become stale.
  • priority/important-longterm - Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • release-note - Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/node - Categorizes an issue or PR as relevant to SIG Node.
  • size/XS - Denotes a PR that changes 0-9 lines, ignoring generated files.
  • triage/accepted - Indicates an issue or PR is ready to be actively worked on.
Projects
Status: PRs - Needs Reviewer