-
Notifications
You must be signed in to change notification settings - Fork 928
feat: allow users to pause prebuilt workspace reconciliation #18700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM so far, just missing the final piece of plumbing
…whether prebuilds are currently paused
Might be good to also avoid the term |
@@ -29,6 +29,7 @@ const ( | |||
MetricEligibleGauge = namespace + "eligible" | |||
MetricPresetHardLimitedGauge = namespace + "preset_hard_limited" | |||
MetricLastUpdatedGauge = namespace + "metrics_last_updated" | |||
MetricReconciliationPausedGauge = namespace + "reconciliation_paused" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Paused is difficult to reason about; I'd suggest inverting and make it _running
so that a good value is 1 and a bad value is 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
paused
in this case has a technical meaning relating to a specific setting. If we rename this to running
then I think we need to change the setting to be {"reconciliation_running": true}
, which doesn't communicate the intent of the API clearly.
I'd like to keep it how it currently is, but as compensatory measures I will:
- update the metric documentation to be clear about its meaning
- add a second metric to represent a general "check engine light" for prebuilds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be pretty handy, LGTM!
return | ||
} | ||
|
||
if bytes.Equal(settingsJSON, []byte(currentSettingsJSON)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: not a problem now but when we add more settings, map ordering may cause this check to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch.
mc.reconciliationPausedMu.RLock() | ||
var pausedValue float64 | ||
if mc.reconciliationPaused { | ||
pausedValue = 1 | ||
} | ||
mc.reconciliationPausedMu.RUnlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: simpler to use an atomic.Bool
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I did it this way to stick to the pattern already used above. I agree with the nit, but I'm going to prioritize getting this deployed.
var settings codersdk.PrebuildsSettings | ||
if len(settingsJSON) > 0 { | ||
if err := json.Unmarshal([]byte(settingsJSON), &settings); err != nil { | ||
return xerrors.Errorf("unmarshal prebuilds settings: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we fail to unmarshal the settings, this will effectively pause the reconciliation - so it's probably worth hardcoding paused=true in the metric collector if this happens.
Or we need to have another metric which indicates the recloop is broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we need to have another metric which indicates the recloop is broken.
Going to opt for this.
See my comment about a "check engine light" above.
require.Len(t, workspaces, 2, "should have created 2 prebuilds") | ||
|
||
// Now pause prebuilds reconciliation | ||
err = db.UpsertPrebuildsSettings(ctx, `{"reconciliation_paused": true}`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest going via the API here to avoid duplicating this business logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point
This PR provides two commands:
coder prebuilds pause
coder prebuilds resume
These allow the suspension of all prebuilds activity, intended for use if prebuilds are misbehaving.