Skip to content

chore: run macOS, windows, and race tests with Postgres in CI #15520

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Dec 3, 2024

Conversation

hugodutka
Copy link
Contributor

@hugodutka hugodutka commented Nov 14, 2024

This PR is the second in a series aimed at closing #15109.

Changes

  • adds scripts/embedded-pg/main.go, which can start a native Postgres database. This is used to set up PG on Windows and macOS, as these platforms don't support Docker in Github Actions.
  • runs the test-go-pg job on macOS and Windows too
  • adds the test-go-race-go job, which runs race tests with Postgres on Linux

@hugodutka hugodutka force-pushed the hugodutka/enable-pg-ci-macos-windows branch 2 times, most recently from 0733229 to b1c3974 Compare November 15, 2024 12:50
@hugodutka hugodutka changed the title chore: run Postgres tests on macos and windows in CI chore: run macOS, windows, and race tests with Postgres in CI Nov 15, 2024
@hugodutka hugodutka marked this pull request as ready for review November 15, 2024 16:38
Copy link
Member

@ethanndickson ethanndickson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but It'd be good to hear from Dean who has a lot more experience working with our CI

api-key: ${{ secrets.DATADOG_API_KEY }}

# NOTE: this could instead be defined as a matrix strategy, but we want to
# temporarily allow windows tests to fail. Using a matrix strategy here makes
Copy link
Member

@ethanndickson ethanndickson Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the Windows tests are broken, I'm not sure if we want to merge a CI suite that often fails, even if it's not required (could slightly obscure real issues + not great optics). Perhaps it should just get skipped/commented out until we're able to improve the reliability?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we do have the Windows tests covered by in memory for now, maybe these should be disabled for now to avoid ❌ appearing everywhere. Are there any issues tracking the windows tests breaking in postgres?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any issues tracking the windows tests breaking in postgres?

I just added #15560.

for now, maybe these should be disabled for now to avoid ❌ appearing everywhere

Makes sense. I'll postpone merging this PR until Windows tests are fixed. I'll try to get it done this week.

runs-on: ${{ github.repository_owner == 'coder' && 'depot-ubuntu-22.04-8' || 'ubuntu-latest' }}
needs:
- changes
runs-on: ${{ matrix.os == 'ubuntu-latest' && github.repository_owner == 'coder' && 'depot-ubuntu-22.04-4' || matrix.os == 'macos-latest' && github.repository_owner == 'coder' && 'macos-latest-xlarge' || matrix.os }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holy mother of github actions interpolation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also have this on go-test 😭

Comment on lines +32 to +45
// We execute these queries instead of using the embeddedpostgres
// StartParams because it doesn't work on Windows. The library
// seems to have a bug where it sends malformed parameters to
// pg_ctl. It encloses each parameter in single quotes, which
// Windows can't handle.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we open a PR upstream to fix this bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I opened fergusstrange/embedded-postgres#145 and I could submit a PR later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the issue tracking link to this comment?

@hugodutka
Copy link
Contributor Author

We're going to postpone merging this PR until Windows tests are fixed. #15560

hugodutka added a commit that referenced this pull request Nov 20, 2024
Patches tests that caused Windows Postgres CI in
#15520 to consistently fail.

I tested this by temporarily adding Postgres Windows CI to this PR.
However, I reverted those changes to merge them with
#15520. For reference, here's [a
passing CI
run](https://github.com/coder/coder/actions/runs/11918816662/job/33219786238)
from an earlier commit.

**Note:** Although Windows tests now pass, they remain quite flaky. I
recommend running Postgres Windows CI to gather data on these flakes,
but I don’t think it should be a required job just yet.
@hugodutka hugodutka force-pushed the hugodutka/enable-pg-ci-macos-windows branch 4 times, most recently from b3e3fac to 2739223 Compare November 26, 2024 12:57
@hugodutka hugodutka force-pushed the hugodutka/enable-pg-ci-macos-windows branch from f1394a8 to 160e77c Compare December 2, 2024 14:50
@hugodutka
Copy link
Contributor Author

hugodutka commented Dec 2, 2024

@deansheather @ethanndickson The Windows Postgres tests seem stable now. I made one major change in this PR to fix them: Windows now sets up a RAM disk for Postgres. I found that the flakiness was due to timing-sensitive tests failing because the suite ran slowly on Windows. The root cause was IO performance on GitHub-hosted Windows runners. Using a RAM disk bypasses this issue, and test times are back to normal.

I’ve also made the Windows PG tests required since they’re no longer that flaky - I see an odd flake every couple of runs. I’d appreciate a second look before merging this into main. Thanks!

Comment on lines +14 to +15
curl -L -o files.cab https://github.com/coder/imdisk-artifacts/raw/92a17839ebc0ee3e69be019f66b3e9b5d2de4482/files.cab
curl -L -o install.bat https://github.com/coder/imdisk-artifacts/raw/92a17839ebc0ee3e69be019f66b3e9b5d2de4482/install.bat
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sucks that people still use sourceforge lol

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's Windows for you

Comment on lines +32 to +45
// We execute these queries instead of using the embeddedpostgres
// StartParams because it doesn't work on Windows. The library
// seems to have a bug where it sends malformed parameters to
// pg_ctl. It encloses each parameter in single quotes, which
// Windows can't handle.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the issue tracking link to this comment?

@hugodutka hugodutka merged commit c7c35ef into main Dec 3, 2024
31 of 33 checks passed
@hugodutka hugodutka deleted the hugodutka/enable-pg-ci-macos-windows branch December 3, 2024 12:33
dannykopping added a commit that referenced this pull request Jan 20, 2025
We have an effort underway to replace `dbmem` (#15109), and consequently
we've begun running our full test-suite (with Postgres) on all supported
OSs - Windows, MacOS, and Linux, since #15520.

Since this change, we've seen a marked decrease in the success rate of
our builds on `main` (note how the Windows/MacOS failures account for
the vast majority of failed builds):


![image](https://github.com/user-attachments/assets/a02c15b7-037d-428a-a600-2aed60553ac0)

We're still investigating why these OSs are a lot less reliable. It's
likely that the VMs on which the builds are run have different
characteristics from our Ubuntu runners such as disk I/O, network
latency, or something else.

**In the meantime, we need to start trusting CI failures in `main`
again, as the current failures are too noisy / vague for us to
correct.**

We've also considered hosting our own runners where possible so we can
get OS-level observability to rule out some possibilities.

See the [meeting
notes](https://www.notion.so/coderhq/CI-Investigation-Call-Notes-17dd579be59280d8897cc9fe4bb46695?pvs=6&utm_content=17dd579b-e592-80d8-897c-c9fe4bb46695&utm_campaign=T1ZPT2FL0&n=slack&n=slack_link_unfurl)
where we linked into this for more detail.

This PR introduces several changes:

1. Moves the full test-suite with Postgres on Windows/MacOS to the
`nightly-gauntlet` workflow
tradeoff: this means that any regressions may be more difficult to
discover since we merge to main several times a day
2. Run only the CLI test-suite on each PR / merge to `main` on
Windows/MacOS
3. `test-go` is still running the full test-suite against all OSs
(including the CLI ones), but will soon be removed once #15109 is
completed since it uses `dbmem`
4. Changes `nightly-gauntlet` to run at 4AM: we've seen several
instances of the runner being stopped externally, and we're _guessing_
this may have something to do with the midnight UTC execution time, when
other cron jobs may run
5. Removes the existing `nightly-gauntlet` jobs since they haven't
passed in a long time, indicating that nobody cares enough to fix them
and they don't provide diagnostic value; we can restore them later if
necessary

I've manually run both these new workflows successfully:

- `ci`:
https://github.com/coder/coder/actions/runs/12825874176/job/35764724907
- `nightly-gauntlet`:
https://github.com/coder/coder/actions/runs/12825539092

---------

Signed-off-by: Danny Kopping <[email protected]>
Co-authored-by: Muhammad Atif Ali <[email protected]>
SasSwart pushed a commit that referenced this pull request Jan 22, 2025
We have an effort underway to replace `dbmem` (#15109), and consequently
we've begun running our full test-suite (with Postgres) on all supported
OSs - Windows, MacOS, and Linux, since #15520.

Since this change, we've seen a marked decrease in the success rate of
our builds on `main` (note how the Windows/MacOS failures account for
the vast majority of failed builds):


![image](https://github.com/user-attachments/assets/a02c15b7-037d-428a-a600-2aed60553ac0)

We're still investigating why these OSs are a lot less reliable. It's
likely that the VMs on which the builds are run have different
characteristics from our Ubuntu runners such as disk I/O, network
latency, or something else.

**In the meantime, we need to start trusting CI failures in `main`
again, as the current failures are too noisy / vague for us to
correct.**

We've also considered hosting our own runners where possible so we can
get OS-level observability to rule out some possibilities.

See the [meeting
notes](https://www.notion.so/coderhq/CI-Investigation-Call-Notes-17dd579be59280d8897cc9fe4bb46695?pvs=6&utm_content=17dd579b-e592-80d8-897c-c9fe4bb46695&utm_campaign=T1ZPT2FL0&n=slack&n=slack_link_unfurl)
where we linked into this for more detail.

This PR introduces several changes:

1. Moves the full test-suite with Postgres on Windows/MacOS to the
`nightly-gauntlet` workflow
tradeoff: this means that any regressions may be more difficult to
discover since we merge to main several times a day
2. Run only the CLI test-suite on each PR / merge to `main` on
Windows/MacOS
3. `test-go` is still running the full test-suite against all OSs
(including the CLI ones), but will soon be removed once #15109 is
completed since it uses `dbmem`
4. Changes `nightly-gauntlet` to run at 4AM: we've seen several
instances of the runner being stopped externally, and we're _guessing_
this may have something to do with the midnight UTC execution time, when
other cron jobs may run
5. Removes the existing `nightly-gauntlet` jobs since they haven't
passed in a long time, indicating that nobody cares enough to fix them
and they don't provide diagnostic value; we can restore them later if
necessary

I've manually run both these new workflows successfully:

- `ci`:
https://github.com/coder/coder/actions/runs/12825874176/job/35764724907
- `nightly-gauntlet`:
https://github.com/coder/coder/actions/runs/12825539092

---------

Signed-off-by: Danny Kopping <[email protected]>
Co-authored-by: Muhammad Atif Ali <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants