[WIP] Dra device health status #130606
base: master
Hi @Jpsassine. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the corresponding triage label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
@Jpsassine Thank you for your PR. Please sign the CLA to proceed further, thanks. BTW, is there any KEP or another design document describing/discussing these changes? If so, please provide links in the PR description.
/easycla
206e688 to 87d4a5d
		}
		return err
	}
	return json.Unmarshal(data, cache.HealthInfo)
Should we wrap the error with a specific message to make it easier to troubleshoot?
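One way the suggestion above could look is sketched below. It is only an illustration: `healthState`, `loadHealthState`, and the message text are hypothetical names chosen for this example, not the PR's actual types or wording.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// healthState is a hypothetical stand-in for the checkpointed health data;
// the PR's real type is not visible in the quoted diff.
type healthState struct {
	Devices map[string]string `json:"devices"`
}

// loadHealthState decodes checkpoint bytes and wraps any decode error with
// context, so a log line shows which step failed.
func loadHealthState(data []byte) (*healthState, error) {
	state := &healthState{}
	if err := json.Unmarshal(data, state); err != nil {
		// %w keeps the original error reachable via errors.Is / errors.As.
		return nil, fmt.Errorf("failed to unmarshal DRA health checkpoint: %w", err)
	}
	return state, nil
}

func main() {
	_, err := loadHealthState([]byte("{not json"))
	fmt.Println(err)

	// The underlying JSON error is still inspectable after wrapping.
	var syntaxErr *json.SyntaxError
	fmt.Println(errors.As(err, &syntaxErr))
}
```

Wrapping with `%w` rather than `%v` keeps the cause available to callers that need to branch on it, while the prefix gives operators a searchable message.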
85fc598 to 4512f49
This commit implements the Pod Resource Health Status feature as described in KEP-4680. This allows Dynamic Resource Allocation (DRA) plugins to report the health of their devices back to the Kubelet.
The primary changes include:
- A new, optional, server-streaming gRPC service, `v1alpha1.NodeHealth`, is introduced. DRA drivers can implement the `WatchResources` RPC on this service to stream health updates.
- The Kubelet's `dra.Manager` is updated to act as a client to this service. It listens for health updates, maintains an internal health cache for all devices, and triggers pod status updates when a device's health changes.
- The Pod's `AllocatedResourcesStatus` field is now populated with the health ("Healthy", "Unhealthy", "Unknown") of its assigned devices, making this information visible via the Kubernetes API.
Add first e2e test & helper funcs trying to fix pod syncing status & add debug logs
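The "update the cache, then trigger a pod status update on change" flow described in this commit message could be sketched roughly as follows. This is a simplified illustration under stated assumptions: `healthCache`, `update`, and `notify` are hypothetical names, and the real manager tracks more state (pools, drivers, timestamps) than this example does.

```go
package main

import (
	"fmt"
	"sync"
)

// healthCache is a hypothetical, simplified stand-in for the kubelet's
// per-device health cache.
type healthCache struct {
	mu      sync.Mutex
	devices map[string]string // device name -> "Healthy" | "Unhealthy" | "Unknown"
}

// update stores the reported health and reports whether the value changed.
func (c *healthCache) update(device, health string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.devices[device] == health {
		return false
	}
	c.devices[device] = health
	return true
}

// notify signals that pod status should be re-synced. The send is
// non-blocking: if a signal is already pending in the buffer, this one is
// dropped, since the pending signal already covers the change.
func notify(updates chan<- struct{}) bool {
	select {
	case updates <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	cache := &healthCache{devices: map[string]string{}}
	updates := make(chan struct{}, 1)

	for _, h := range []string{"Healthy", "Healthy", "Unhealthy"} {
		if cache.update("gpu-0", h) {
			fmt.Println("changed to", h, "notified:", notify(updates))
		}
	}
}
```

A non-blocking send keeps the stream-reading goroutine from stalling when the consumer is slow, at the cost of coalescing several changes into one signal.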
4512f49 to e538497
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: Jpsassine. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
9ac3a2a to 0e72a66
507632d to 94bc6a5
Current logic:
1. Create DeviceClass.
2. Create ResourceClaim (without status).
3. Create Pod.
4. Update ResourceClaim status (too late).
Updated logic:
1. Create DeviceClass.
2. Create ResourceClaim (without status).
3. Update ResourceClaim status.
4. Wait for the dra.Manager to see the updated claim.
5. Create Pod.
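The "wait for the manager to see the updated claim" step in the updated ordering amounts to polling a condition with a deadline before creating the Pod. A minimal standard-library sketch is shown below; `waitFor` and `claimVisible` are hypothetical names for illustration, and the actual e2e test presumably relies on the test framework's own polling helpers.

```go
package main

import (
	"fmt"
	"time"
)

// waitFor polls cond every interval until it returns true or the timeout
// elapses. cond is always checked at least once.
func waitFor(timeout, interval time.Duration, cond func() bool) error {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %v", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	// claimVisible stands in for "the manager has observed the updated
	// ResourceClaim status"; here it simply becomes true after 30ms.
	start := time.Now()
	claimVisible := func() bool { return time.Since(start) > 30*time.Millisecond }

	if err := waitFor(time.Second, 5*time.Millisecond, claimVisible); err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("claim observed; safe to create the Pod")
}
```

Gating Pod creation on an observed condition, rather than on a fixed sleep, removes the race that the original ordering had.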
94bc6a5 to d999489
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR implements KEP-4680 by adding device health tracking for DRA plugins and integrating this status into the PodStatus. It allows Kubelet and users to observe the health of allocated DRA resources.
- Optional gRPC Health Service (`dra-health/v1alpha1`):
  - `NodeHealthService` with a `WatchResources` stream RPC in `staging/src/k8s.io/kubelet/pkg/apis/dra-health/v1alpha1/api.proto`.
  - Streams device health (`pool_name`, `device_name`, `health`, `Last_updated`) to Kubelet.
- Kubelet Plugin Integration (`cm/dra/plugin`):
  - `RegistrationHandler` (`registration.go`) updated to:
    - Start the `WatchResources` stream upon plugin registration via `plugin.WatchResources`.
    - Handle plugins that do not implement `NodeHealthService` (or other stream startup errors) by logging the error and proceeding with registration without health monitoring.
    - Call the handler (`StreamHandler` interface implemented by `dra.ManagerImpl`) in the DRA Manager upon successful stream initiation.
    - Cancel the health stream (`HealthStreamCancel`) during plugin deregistration (`DeregisterPlugin`) or replacement.
- Health Cache (`cm/dra/healthinfo.go`, `cm/dra/state/state.go`):
  - `healthInfoCache` for persistent (`dra_health_state` file), thread-safe storage of device health (`Healthy`, `Unhealthy`, `Unknown`) and timestamps.
  - `updateHealthInfo` for full-state reconciliation based on plugin updates, handling timeouts (`healthTimeout` constant, e.g., 30s) by marking stale devices as "Unknown". Saves checkpoint on change.
  - `getHealthInfo` to retrieve current status (returns "Unknown" if stale/missing) and `clearDriver` for cleanup.
- DRA Manager Integration (`cm/dra/manager.go`):
  - `ManagerImpl` implements the `plugin.StreamHandler` interface.
  - `HandleWatchResourcesStream` goroutine consumes updates from the plugin stream (`NodeHealth.WatchResourcesClient`), calls `healthInfoCache.updateHealthInfo`, finds affected Pod UIDs from `claimInfoCache`, and sends notifications via an internal update channel (non-blocking).
  - `defer healthInfoCache.clearDriver(pluginName)` within `HandleWatchResourcesStream` to ensure cache cleanup for the driver upon goroutine exit (due to error, cancellation, or EOF).
  - `Updates()` method to return the update channel.
  - `UpdateAllocatedResourcesStatus` to read health from `healthInfoCache` and populate `pod.Status.ContainerStatuses[].AllocatedResourcesStatus` using the KEP-specified structure: `v1.ResourceStatus` (named by claim (`Name` field), containing a `Resources` slice where each element is a `v1.ResourceHealth` struct per device (`ResourceID` and `Health` fields)).
- Container Manager Integration (`cm/container_manager_linux.go`):
  - `Updates()` method merges update signals from Device Manager and DRA Manager (via `draManager.Updates()`).
  - `UpdateAllocatedResourcesStatus()` method calls both Device Manager and DRA Manager update functions (DRA call guarded by feature gate `DynamicResourceAllocation` and nil check).
- Testing:
  - Unit tests for `healthinfo.go` (`healthinfo_test.go`).
  - Unit tests for `manager.go` (`manager_test.go`) covering `HandleWatchResourcesStream` and `UpdateAllocatedResourcesStatus`. Fixes existing tests (`TestPrepareResources`, `TestUnprepareResources`) to align with updated signatures/logic.
Which issue(s) this PR fixes:
Fixes #126243
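As a rough illustration of the staleness rule the description gives for the health cache (return "Unknown" when an entry is missing or older than the timeout), a lookup could be sketched like this. `deviceHealth` and `getHealth` are hypothetical names, and the 30s value only mirrors the "e.g., 30s" in the description; the PR's actual constant and types may differ.

```go
package main

import (
	"fmt"
	"time"
)

// healthTimeout mirrors the 30s example from the description; the real
// constant's value is an assumption here.
const healthTimeout = 30 * time.Second

// deviceHealth is a hypothetical record of the last report for one device.
type deviceHealth struct {
	Health      string
	LastUpdated time.Time
}

// getHealth returns the recorded health, or "Unknown" when the device has
// no entry or its last update is older than healthTimeout.
func getHealth(devices map[string]deviceHealth, name string, now time.Time) string {
	d, ok := devices[name]
	if !ok || now.Sub(d.LastUpdated) > healthTimeout {
		return "Unknown"
	}
	return d.Health
}

func main() {
	now := time.Now()
	devices := map[string]deviceHealth{
		"gpu-0": {Health: "Healthy", LastUpdated: now.Add(-5 * time.Second)},
		"gpu-1": {Health: "Healthy", LastUpdated: now.Add(-2 * time.Minute)},
	}
	// gpu-0 is fresh, gpu-1 is stale, gpu-2 was never reported.
	for _, name := range []string{"gpu-0", "gpu-1", "gpu-2"} {
		fmt.Println(name, getHealth(devices, name, now))
	}
}
```

Passing `now` in as a parameter, instead of calling `time.Now()` inside the lookup, makes the timeout behaviour straightforward to unit test.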
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: