-
Notifications
You must be signed in to change notification settings - Fork 936
chore: run macOS, windows, and race tests with Postgres in CI #15520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0733229
to
b1c3974
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but It'd be good to hear from Dean who has a lot more experience working with our CI
.github/workflows/ci.yaml
Outdated
api-key: ${{ secrets.DATADOG_API_KEY }} | ||
|
||
# NOTE: this could instead be defined as a matrix strategy, but we want to | ||
# temporarily allow windows tests to fail. Using a matrix strategy here makes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the Windows tests are broken, I'm not sure if we want to merge a CI suite that often fails, even if it's not required (could slightly obscure real issues + not great optics). Perhaps it should just get skipped/commented out until we're able to improve the reliability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we do have the Windows tests covered by in memory for now, maybe these should be disabled for now to avoid ❌ appearing everywhere. Are there any issues tracking the windows tests breaking in postgres?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any issues tracking the windows tests breaking in postgres?
I just added #15560.
for now, maybe these should be disabled for now to avoid ❌ appearing everywhere
Makes sense. I'll postpone merging this PR until Windows tests are fixed. I'll try to get it done this week.
.github/workflows/ci.yaml
Outdated
runs-on: ${{ github.repository_owner == 'coder' && 'depot-ubuntu-22.04-8' || 'ubuntu-latest' }} | ||
needs: | ||
- changes | ||
runs-on: ${{ matrix.os == 'ubuntu-latest' && github.repository_owner == 'coder' && 'depot-ubuntu-22.04-4' || matrix.os == 'macos-latest' && github.repository_owner == 'coder' && 'macos-latest-xlarge' || matrix.os }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Holy mother of github actions interpolation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also have this on go-test
😭
// We execute these queries instead of using the embeddedpostgres | ||
// StartParams because it doesn't work on Windows. The library | ||
// seems to have a bug where it sends malformed parameters to | ||
// pg_ctl. It encloses each parameter in single quotes, which | ||
// Windows can't handle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we open a PR upstream to fix this bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I opened fergusstrange/embedded-postgres#145 and I could submit a PR later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the issue tracking link to this comment?
We're going to postpone merging this PR until Windows tests are fixed. #15560 |
Patches tests that caused Windows Postgres CI in #15520 to consistently fail. I tested this by temporarily adding Postgres Windows CI to this PR. However, I reverted those changes to merge them with #15520. For reference, here's [a passing CI run](https://github.com/coder/coder/actions/runs/11918816662/job/33219786238) from an earlier commit. **Note:** Although Windows tests now pass, they remain quite flaky. I recommend running Postgres Windows CI to gather data on these flakes, but I don’t think it should be a required job just yet.
b3e3fac
to
2739223
Compare
f1394a8
to
160e77c
Compare
@deansheather @ethanndickson The Windows Postgres tests seem stable now. I made one major change in this PR to fix them: Windows now sets up a RAM disk for Postgres. I found that the flakiness was due to timing-sensitive tests failing because the suite ran slowly on Windows. The root cause was IO performance on GitHub-hosted Windows runners. Using a RAM disk bypasses this issue, and test times are back to normal. I’ve also made the Windows PG tests required since they’re no longer that flaky - I see an odd flake every couple of runs. I’d appreciate a second look before merging this into main. Thanks! |
curl -L -o files.cab https://github.com/coder/imdisk-artifacts/raw/92a17839ebc0ee3e69be019f66b3e9b5d2de4482/files.cab | ||
curl -L -o install.bat https://github.com/coder/imdisk-artifacts/raw/92a17839ebc0ee3e69be019f66b3e9b5d2de4482/install.bat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sucks that people still use sourceforge lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's Windows for you
// We execute these queries instead of using the embeddedpostgres | ||
// StartParams because it doesn't work on Windows. The library | ||
// seems to have a bug where it sends malformed parameters to | ||
// pg_ctl. It encloses each parameter in single quotes, which | ||
// Windows can't handle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the issue tracking link to this comment?
We have an effort underway to replace `dbmem` (#15109), and consequently we've begun running our full test-suite (with Postgres) on all supported OSs - Windows, MacOS, and Linux, since #15520. Since this change, we've seen a marked decrease in the success rate of our builds on `main` (note how the Windows/MacOS failures account for the vast majority of failed builds):  We're still investigating why these OSs are a lot less reliable. It's likely that the VMs on which the builds are run have different characteristics from our Ubuntu runners such as disk I/O, network latency, or something else. **In the meantime, we need to start trusting CI failures in `main` again, as the current failures are too noisy / vague for us to correct.** We've also considered hosting our own runners where possible so we can get OS-level observability to rule out some possibilities. See the [meeting notes](https://www.notion.so/coderhq/CI-Investigation-Call-Notes-17dd579be59280d8897cc9fe4bb46695?pvs=6&utm_content=17dd579b-e592-80d8-897c-c9fe4bb46695&utm_campaign=T1ZPT2FL0&n=slack&n=slack_link_unfurl) where we linked into this for more detail. This PR introduces several changes: 1. Moves the full test-suite with Postgres on Windows/MacOS to the `nightly-gauntlet` workflow tradeoff: this means that any regressions may be more difficult to discover since we merge to main several times a day 2. Run only the CLI test-suite on each PR / merge to `main` on Windows/MacOS 3. `test-go` is still running the full test-suite against all OSs (including the CLI ones), but will soon be removed once #15109 is completed since it uses `dbmem` 4. Changes `nightly-gauntlet` to run at 4AM: we've seen several instances of the runner being stopped externally, and we're _guessing_ this may have something to do with the midnight UTC execution time, when other cron jobs may run 5. Removes the existing `nightly-gauntlet` jobs since they haven't passed in a long time, indicating that nobody cares enough to fix them and they don't provide diagnostic value; we can restore them later if necessary I've manually run both these new workflows successfully: - `ci`: https://github.com/coder/coder/actions/runs/12825874176/job/35764724907 - `nightly-gauntlet`: https://github.com/coder/coder/actions/runs/12825539092 --------- Signed-off-by: Danny Kopping <[email protected]> Co-authored-by: Muhammad Atif Ali <[email protected]>
We have an effort underway to replace `dbmem` (#15109), and consequently we've begun running our full test-suite (with Postgres) on all supported OSs - Windows, MacOS, and Linux, since #15520. Since this change, we've seen a marked decrease in the success rate of our builds on `main` (note how the Windows/MacOS failures account for the vast majority of failed builds):  We're still investigating why these OSs are a lot less reliable. It's likely that the VMs on which the builds are run have different characteristics from our Ubuntu runners such as disk I/O, network latency, or something else. **In the meantime, we need to start trusting CI failures in `main` again, as the current failures are too noisy / vague for us to correct.** We've also considered hosting our own runners where possible so we can get OS-level observability to rule out some possibilities. See the [meeting notes](https://www.notion.so/coderhq/CI-Investigation-Call-Notes-17dd579be59280d8897cc9fe4bb46695?pvs=6&utm_content=17dd579b-e592-80d8-897c-c9fe4bb46695&utm_campaign=T1ZPT2FL0&n=slack&n=slack_link_unfurl) where we linked into this for more detail. This PR introduces several changes: 1. Moves the full test-suite with Postgres on Windows/MacOS to the `nightly-gauntlet` workflow tradeoff: this means that any regressions may be more difficult to discover since we merge to main several times a day 2. Run only the CLI test-suite on each PR / merge to `main` on Windows/MacOS 3. `test-go` is still running the full test-suite against all OSs (including the CLI ones), but will soon be removed once #15109 is completed since it uses `dbmem` 4. Changes `nightly-gauntlet` to run at 4AM: we've seen several instances of the runner being stopped externally, and we're _guessing_ this may have something to do with the midnight UTC execution time, when other cron jobs may run 5. Removes the existing `nightly-gauntlet` jobs since they haven't passed in a long time, indicating that nobody cares enough to fix them and they don't provide diagnostic value; we can restore them later if necessary I've manually run both these new workflows successfully: - `ci`: https://github.com/coder/coder/actions/runs/12825874176/job/35764724907 - `nightly-gauntlet`: https://github.com/coder/coder/actions/runs/12825539092 --------- Signed-off-by: Danny Kopping <[email protected]> Co-authored-by: Muhammad Atif Ali <[email protected]>
This PR is the second in a series aimed at closing #15109.
Changes
scripts/embedded-pg/main.go
, which can start a native Postgres database. This is used to set up PG on Windows and macOS, as these platforms don't support Docker in Github Actions.test-go-pg
job on macOS and Windows tootest-go-race-go
job, which runs race tests with Postgres on Linux