[Resolve OOM When Reading Large Logs in Webserver] Refactor to Use K-Way Merge for Log Streams Instead of Sorting Entire Log Records #45129

Conversation

jason810496
Member

related: #45079



@potiuk potiuk force-pushed the refactor/webserver-oom-for-large-log-read branch from ed334f7 to 1d0e6ed Compare December 21, 2024 08:32
@potiuk
Member

potiuk commented Dec 21, 2024

Rebased after we fixed main issue

@jason810496 jason810496 force-pushed the refactor/webserver-oom-for-large-log-read branch from 1d0e6ed to 8617e5b Compare December 21, 2024 10:09
@jason810496 jason810496 force-pushed the refactor/webserver-oom-for-large-log-read branch from 0f19a8b to ef3450b Compare December 23, 2024 12:44
@jason810496
Member Author

CI is failing due to: Please ask the maintainer to assign the 'legacy api' label to the PR in order to continue.

Since the get_log endpoint in both the legacy API and FastAPI uses the read_log_chunks method, it’s necessary to fix the endpoints and their corresponding tests.

@potiuk potiuk added legacy ui Whether legacy UI change should be allowed in PR legacy api Whether legacy API changes should be allowed in PR labels Dec 23, 2024
@potiuk potiuk closed this Dec 23, 2024
@potiuk potiuk reopened this Dec 23, 2024
@potiuk
Member

potiuk commented Dec 23, 2024

Applied and closed/reopened to trigger the build

@jason810496 jason810496 force-pushed the refactor/webserver-oom-for-large-log-read branch from ef3450b to 0aaf0ab Compare December 25, 2024 04:18
@jason810496
Member Author

Fix the provider tests that explicitly use the read or _read methods.

@jason810496 jason810496 force-pushed the refactor/webserver-oom-for-large-log-read branch from 3aac539 to 46e30e3 Compare December 26, 2024 04:54
@jason810496 jason810496 changed the title WIP: [Resolve OOM When Reading Large Logs in Webserver] Refactor to Use K-Way Merge for Log Streams Instead of Sorting Entire Log Records [Resolve OOM When Reading Large Logs in Webserver] Refactor to Use K-Way Merge for Log Streams Instead of Sorting Entire Log Records Dec 26, 2024
@jason810496 jason810496 force-pushed the refactor/webserver-oom-for-large-log-read branch from 46e30e3 to 1802ed1 Compare December 26, 2024 05:46
@jason810496 jason810496 marked this pull request as ready for review December 26, 2024 06:39
@jason810496
Member Author

Finally fixed the tests!

This is the first (and likely the largest) PR for resolving OOM issues when reading large logs in the webserver.
Further PRs will only focus on refactoring each provider, as listed in the TODO tasks in #45079.

Even though the providers haven't yet been refactored to support stream-based log reading, the compatibility utility will transform the old read log method (which returns the entire list of logs) into a stream-based approach. Once all providers are refactored to use stream-based reading, the compatibility utility can be removed.
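
To illustrate the idea of the compatibility utility described above, here is a minimal sketch. The name `as_log_stream` and the `legacy_read` callable are hypothetical and are not taken from this PR; the sketch only shows how a method that returns a full list can be exposed through a stream-shaped interface.

```python
from collections.abc import Callable, Iterator


def as_log_stream(legacy_read: Callable[[], list[str]]) -> Iterator[str]:
    """Expose a list-returning reader as a lazy stream of log lines.

    Hypothetical helper for illustration only. The legacy reader still
    builds the whole list in memory; the benefit is that callers can be
    written against a single stream-based interface.
    """
    # Because this is a generator function, legacy_read() is not called
    # until the first item is requested from the returned iterator.
    yield from legacy_read()


def old_style_read() -> list[str]:
    # Stand-in for a provider method that has not been refactored yet.
    return ["line 1", "line 2", "line 3"]


for line in as_log_stream(old_style_read):
    print(line)
```

A wrapper like this does not reduce memory use for providers that have not been refactored; it only lets the caller treat old and new providers uniformly until each one yields lines natively.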

For the testing part:
Since the CI will run provider compatibility tests for versions 2.9.3 and 2.10.3, my approach is to copy the old test cases related to log reading into new stream-based tests. I’ve added the mark_test_for_old_read_log_method and mark_test_for_stream_based_read_log_method pytest decorators to selectively skip the corresponding test runs.
From my perspective, this approach is simpler and minimizes changes to the original test logic. Additionally, tests marked with mark_test_for_old_read_log_method can be safely removed once all providers migrate to stream-based reading.

@jason810496 jason810496 force-pushed the refactor/webserver-oom-for-large-log-read branch from 1802ed1 to 091407e Compare January 1, 2025 16:00
- Add utils to check the log_stream type
- Fix type checking for
    - test_file_task_handler_when_ti_value_is_invalid
    - test_file_task_handler
    - test_file_task_handler_running
    - test_file_task_handler_rotate_size_limit
    - test__read_when_local
    - test__read_served_logs_checked_when_done_and_no_local_or_remote_logs
- also test compatible interface for test__read_served_logs_checked_when_done_and_no_local_or_remote_logs
    - which might call _read_remote_logs
- Since read_log_chunks is a public method, keep the same return
  type as the original implementation to avoid a breaking change
- The `host` should only show once in read_log_stream
- Fix mock_read to use the new stream-based reading in test_log_reader
- Fix the test that expects the stdout of a callable to be in the log lines
- Logs might not be added to the heap in the first round, so consider
  log_streams instead of the heap
- Make it compatible with providers that implement the _read method
- Handle the case where the input list is empty
- Copy old test cases that use the read or _read methods
- Add mark_test_for_stream_based_read_log_method and
  mark_test_for_old_read_log_method to skip corresponding CI tests
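
For readers unfamiliar with the technique named in the PR title, the following is a minimal sketch of a k-way merge over log streams using a heap. It is not the PR's implementation: the function name `merge_log_streams` and the `(timestamp, message)` record shape are assumptions, each input stream is assumed to be already sorted by timestamp, and timestamps are assumed to be ISO-8601 strings in one format so that string comparison matches chronological order.

```python
import heapq
from collections.abc import Iterable, Iterator

# Assumed record shape: (ISO-8601 timestamp, message).
LogRecord = tuple[str, str]


def merge_log_streams(streams: Iterable[Iterable[LogRecord]]) -> Iterator[LogRecord]:
    """Lazily merge already-sorted log streams into one sorted stream.

    Only one pending record per stream is held in memory, instead of
    collecting every record and sorting the whole list.
    """
    iterators = [iter(stream) for stream in streams]
    heap: list[tuple[str, int, str]] = []

    # Seed the heap with the first record of each non-empty stream.
    # The stream index breaks ties between equal timestamps.
    for idx, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heap.append((first[0], idx, first[1]))
    heapq.heapify(heap)

    while heap:
        timestamp, idx, message = heapq.heappop(heap)
        yield timestamp, message
        # Refill from the stream that produced the record just emitted.
        following = next(iterators[idx], None)
        if following is not None:
            heapq.heappush(heap, (following[0], idx, following[1]))


local = [("2025-01-01T00:00:01", "local: start"), ("2025-01-01T00:00:04", "local: done")]
remote = [("2025-01-01T00:00:02", "remote: fetch"), ("2025-01-01T00:00:03", "remote: ok")]

for record in merge_log_streams([local, remote]):
    print(record)
```

An empty input list simply produces an empty stream, because the heap is never seeded. The standard library's `heapq.merge` provides the same lazy merge for sorted inputs; the explicit version is shown only to make the per-stream bookkeeping visible.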
@jason810496 jason810496 force-pushed the refactor/webserver-oom-for-large-log-read branch from 8aad65a to 8f89f5c Compare February 18, 2025 07:51
@jason810496
Member Author

I'm still halfway through reviewing it. I left a few nitpicks, but the PR is great, to be honest.

Thanks, @Lee-W, for reviewing! I’ve just resolved those nits.

The CI failure is due to a flaky test:

FAILED tests/operators/test_trigger_dagrun.py::TestDagRunOperator::test_trigger_dagrun - AssertionError: assert equals failed
  '2025-02-18T08:18:13'  '2025-02-18T08:18:14'


github-actions bot commented Apr 5, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Apr 5, 2025
@jason810496
Member Author

Since the TaskHandler logger is being migrated to structlog, I will create another PR for the refactor instead of resolving conflicts on this one (there have been too many code changes and conflicts on this path recently).

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Apr 8, 2025
@Lee-W
Member

Lee-W commented Apr 8, 2025

Since the TaskHandler logger is being migrated to structlog, I will create another PR for the refactor instead of resolving conflicts on this one (there have been too many code changes and conflicts on this path recently).

If that's the case, maybe we could mark this as draft or close and create a new one instead?

@jason810496 jason810496 marked this pull request as draft April 8, 2025 09:58
@kaxil kaxil modified the milestones: Airflow 2.10.6, Airflow 2.11.0 Apr 29, 2025
@jason810496
Member Author

Closing this PR since it’s superseded by:

Labels
area:logging legacy api Whether legacy API changes should be allowed in PR legacy ui Whether legacy UI change should be allowed in PR
5 participants