Skip to content

feat: allow users to pause prebuilt workspace reconciliation #18700

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 2, 2025

Conversation

SasSwart
Copy link
Contributor

@SasSwart SasSwart commented Jul 1, 2025

This PR provides two commands:

  • coder prebuilds pause
  • coder prebuilds resume

These allow the suspension of all prebuilds activity, intended for use if prebuilds are misbehaving.

@SasSwart SasSwart changed the title feat: add cli commands to pause and resume prebuilt workspace reconci… feat: provide a mechanism to pause prebuilt workspace reconciliation Jul 1, 2025
@SasSwart SasSwart changed the title feat: provide a mechanism to pause prebuilt workspace reconciliation feat: allow users to pause prebuilt workspace reconciliation Jul 1, 2025
@SasSwart SasSwart requested review from johnstcn and ssncferreira July 1, 2025 19:02
Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so far, just missing the final piece of plumbing

@SasSwart SasSwart requested a review from johnstcn July 2, 2025 08:26
@SasSwart SasSwart marked this pull request as ready for review July 2, 2025 08:29
@johnstcn
Copy link
Member

johnstcn commented Jul 2, 2025

Might be good to also avoid the term prebuilds in the CLI command unless there is prior art

EDIT: prior art exists with codersdk.PrebuildsConfig and in numerous other places in the codebase. I don't have any better naming suggestion, so leave it as coder prebuilds

@@ -29,6 +29,7 @@ const (
MetricEligibleGauge = namespace + "eligible"
MetricPresetHardLimitedGauge = namespace + "preset_hard_limited"
MetricLastUpdatedGauge = namespace + "metrics_last_updated"
MetricReconciliationPausedGauge = namespace + "reconciliation_paused"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paused is difficult to reason about; I'd suggest inverting and make it _running so that a good value is 1 and a bad value is 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paused in this case has a technical meaning relating to a specific setting. If we rename this to running then I think we need to change the setting to be {"reconciliation_running": true}, which doesn't communicate the intent of the API clearly.

I'd like to keep it how it currently is, but as compensatory measures I will:

  • update the metric documentation to be clear about its meaning
  • add a second metric to represent a general "check engine light" for prebuilds

Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be pretty handy, LGTM!

return
}

if bytes.Equal(settingsJSON, []byte(currentSettingsJSON)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: not a problem now but when we add more settings, map ordering may cause this check to fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch.

require.Len(t, workspaces, 2, "should have created 2 prebuilds")

// Now pause prebuilds reconciliation
err = db.UpsertPrebuildsSettings(ctx, `{"reconciliation_paused": true}`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest going via the API here to avoid duplicating this business logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

Copy link
Contributor

@ssncferreira ssncferreira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🎉 Thanks for creating this.
There is just a question about rbac: with this change we require the permission rbac.ResourceDeploymentConfig, so that means that only the owner of the resouce or a user with role auditorRole can upsert this setting. Should we extend it to Admins?

@@ -5101,6 +5105,13 @@ func (q *querier) UpsertOAuthSigningKey(ctx context.Context, value string) error
return q.db.UpsertOAuthSigningKey(ctx, value)
}

func (q *querier) UpsertPrebuildsSettings(ctx context.Context, value string) error {
if err := q.authorizeContext(ctx, policy.ActionUpdate, rbac.ResourceDeploymentConfig); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify here on the rbac side, this means that only a user with this permission can update the setting. From https://github.com/coder/coder/blob/main/coderd/rbac/roles.go currently, only the auditorRole has this permission. Is that intended?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auditor only has read though?

ResourceDeploymentConfig.Type: {policy.ActionRead},

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch! 👀 So it means that only the owner of the ResourceDeploymentConfig can update the settings, right? Should we maybe extend it to the Admins? 🤔

Copy link
Member

@johnstcn johnstcn Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there exists any other suitable role that should have permission to edit site-wide deployment config? Our choices are site-wide template admin or site-wide user admin, neither of which feel right to me. Site-wide auditor should only have read permissions, and all of the org-scoped admins are out of site-wide scope.

This might be something to explore in a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this could definitely be addressed in a separate PR. You might be right that adding the permission to admin roles isn’t the best approach. My main concern was that currently only users with the Owner role can update this setting. In our dogfood environment, this works fine since many users have the Owner role, but this is not the typical case.

On second thought, limiting this permission to the Owner role does make sense, since it’s an important command and users should be fully aware of the implications when using it.

@@ -35,6 +35,11 @@ type NotificationsSettings struct {
NotifierPaused bool `db:"notifier_paused" json:"notifier_paused"`
}

type PrebuildsSettings struct {
ID uuid.UUID `db:"id" json:"id"`
ReconciliationPaused bool `db:"reconciliation_paused" json:"reconciliation_paused"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would also be useful to add a timestamp?


if bytes.Equal(settingsJSON, []byte(currentSettingsJSON)) {
// See: https://www.rfc-editor.org/rfc/rfc7232#section-4.1
httpapi.Write(ctx, rw, http.StatusNotModified, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: according to the RFC the 304 Not Modified is normally a response to a GET request.
Maybe a 204 No Content makes more sense in this case, wdyt? https://datatracker.ietf.org/doc/html/rfc7231#section-6.3.5

return xerrors.Errorf("unable to pause prebuilds: %w", err)
}

_, _ = fmt.Fprintln(inv.Stderr, "Prebuilds are now paused.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not?

Suggested change
_, _ = fmt.Fprintln(inv.Stderr, "Prebuilds are now paused.")
_, _ = fmt.Fprintln(inv.Stdout, "Prebuilds are now paused.")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning was that this command doesn't actually produce any output of its own.
It reports its own status, but that's more logging than output.

assert.Contains(t, buf.String(), "Prebuilds are now paused.")

// Verify the settings were actually updated
//nolint:gocritic // Only owners can change deployment settings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the //nolint:gocritic? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a rule to check that we test with non-owner roles as they're usually more appropriate. In this case, owner is the appropriate role.

err = db.UpsertPrebuildsSettings(ctx, `{"reconciliation_paused": false}`)
require.NoError(t, err)

// Run reconciliation again - it should now recreate the prebuilds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test 🤩

@SasSwart SasSwart enabled auto-merge (squash) July 2, 2025 14:55
@SasSwart SasSwart merged commit 01163ea into main Jul 2, 2025
39 checks passed
@SasSwart SasSwart deleted the jjs/internal-368 branch July 2, 2025 15:05
@github-actions github-actions bot locked and limited conversation to collaborators Jul 2, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants