Support multiple file writers writing at once. #1063

Open
chihuahua opened this issue Mar 19, 2018 · 12 comments
Labels
core:backend core:summaries tf-2.0 theme:performance Performance, scalability, large data sizes, slowness, etc.

Comments

@chihuahua
Member

Background

TensorBoard assigns one DirectoryWatcher per run. The DirectoryWatcher watches for new events being written to the run directory and loads them into TensorBoard's multiplexer. Today, the DirectoryWatcher iterates through the events files for a run in lexicographic order. Once it has finished reading an events file, it never reads from it again; it simply moves on to the next events file.

This behavior means that TensorBoard does not support multiple file writers writing to the same run directory: TensorBoard would move on to a different events file B once it finishes reading events file A, and would never go back to reading A, even if A keeps being updated.

In turn, that problem blocks TensorFlow+TensorBoard's effort to migrate Estimator, tflearn hooks, and other high-level TensorFlow constructs to tf.contrib.summary-style summaries. tf.contrib.summary-style summaries are written via a type of SummaryWriter different from the one that writes traditional-style (tf.summary.*) summaries, so users who intermingle the two kinds of summaries will necessarily have to use 2 summary writers concurrently. Many users throughout google3 intermingle the summaries, and the transition must happen soon.
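
As a rough illustration of how this comes up (the logdir path below is made up, and the exact writer APIs depend on the TF version in use):

import tensorflow as tf

logdir = "/tmp/run1"   # hypothetical run directory

# Traditional-style summaries go through tf.summary.FileWriter, which writes
# its own events file into the run directory.
legacy_writer = tf.summary.FileWriter(logdir)

# tf.contrib.summary-style summaries go through a different writer type,
# which writes a second events file into the same directory.
contrib_writer = tf.contrib.summary.create_file_writer(logdir)

# Now /tmp/run1 contains two events files being appended to concurrently.
# Today's DirectoryWatcher reads whichever sorts first, then moves on and
# never revisits it.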

@nfelt
Collaborator

nfelt commented Mar 20, 2018

Just jotting down what we discussed while it's fresh in my head. We keep the essential structure of DirectoryWatcher - Load() spools events from the current _loader until it's exhausted, and then moves on to a new file.

However, we change the algorithm for moving on to a new file to do the following (a rough sketch in code follows the list):

  1. (setup) As state, we keep a "stream map" mapping from file suffixes to "stream entry" values that hold a loader instance, a timestamp, and a list of filenames
  2. List the directory and get all filenames
  3. Filter out any filenames that are "dead" (using a set of known "dead" files)
  4. Find the next filename, per the following substeps:
    4a. Order the filenames by timestamp, then by suffix.
    4b. Seek ahead in the list to the first element with timestamp and suffix >= the current timestamp and suffix.
    4c. Check if element.timestamp is >= the timestamp value stored for element.suffix in the stream map. If not, log an "out of order files" error and keep going until this is true.
    4d. If this process failed to find a filename, null out the current timestamp and suffix and return. (We'll resume with the next Load() call.)
  5. Yield any remaining events from the current stream's loader (to handle the sequence described here).
  6. If the last event to be yielded from the current stream's loader was > MAX_INACTIVE_AGE ago, add the current filename to the set of "dead" files, close the loader, and purge the current suffix's entry from the stream map
  7. Set current timestamp and current suffix to those from the new filename
  8. Look up the current suffix in the stream map
  9. If there's an entry, check if the last file in list-of-files is the same as the new filename. If so, we're done early and the routine ends here.
  10. If there's an entry but the filename is different, do the following substeps:
    10a. Close the loader in the stream entry, stat the stream entry's filename, and record its size as the finalized size for that filename in a separate finalized_sizes map
    10b. Add that filename to the set of dead files
    10c. [Optional] check the last OOO_WRITE_CHECK_COUNT filenames in the list of files to see if they have out-of-order writes (meaning, they exist and their size != their finalized_sizes size) and if so log an error
  11. If the current suffix didn't have a stream map entry, create one.
  12. With the stream map entry:
    12a. Update the timestamp to the current timestamp
    12b. Append the new filename to the list-of-files
    12c. Open a new loader for the current filename and set it as the stream map entry's loader

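To make the bookkeeping above a bit more concrete, here's a very loose Python sketch of just the file-selection and stream-map state. It omits the event-yielding and MAX_INACTIVE_AGE expiry steps and is not the real implementation; StreamMapWatcher, StreamEntry, loader_factory, and the filename parsing are illustrative stand-ins.

import collections
import os

MAX_INACTIVE_AGE = 24 * 60 * 60   # proposed value: 24 hours, in seconds
OOO_WRITE_CHECK_COUNT = 2

StreamEntry = collections.namedtuple("StreamEntry", "loader timestamp paths")

def _parse_timestamp_and_suffix(name):
    # Event filenames look roughly like "events.out.tfevents.<timestamp>.<suffix>";
    # this toy parser is an assumption made just for the sketch.
    rest = name.split(".tfevents.", 1)[-1]
    timestamp, _, suffix = rest.partition(".")
    return (float(timestamp), suffix)

class StreamMapWatcher(object):
    def __init__(self, directory, loader_factory):
        self._dir = directory
        self._loader_factory = loader_factory   # path -> events file loader
        self._streams = {}           # suffix -> StreamEntry            (step 1)
        self._dead = set()           # filenames we never read again    (step 3)
        self._finalized_sizes = {}   # filename -> size when finalized  (step 10a)
        self._current = None         # (timestamp, suffix) of the active file

    def _find_next(self):
        # Steps 2-4: list the directory, drop dead files, then seek to the
        # first (timestamp, suffix) at or after the current position.
        names = [n for n in os.listdir(self._dir) if n not in self._dead]
        for key, name in sorted((_parse_timestamp_and_suffix(n), n) for n in names):
            if self._current is not None and key < self._current:
                continue
            entry = self._streams.get(key[1])
            if entry is not None and key[0] < entry.timestamp:
                continue   # step 4c: out-of-order file; the real thing would log
            return key, name
        self._current = None   # step 4d: nothing found; resume on the next Load()
        return None, None

    def _advance(self, key, name):
        # Steps 7-12: switch the stream for this suffix over to the new file.
        timestamp, suffix = key
        self._current = key                        # step 7
        entry = self._streams.get(suffix)          # step 8
        if entry is not None and entry.paths[-1] == name:
            return entry                           # step 9: already on this file
        if entry is not None:                      # step 10: finalize the old file
            old = entry.paths[-1]
            self._finalized_sizes[old] = os.stat(
                os.path.join(self._dir, old)).st_size
            self._dead.add(old)
        paths = (entry.paths if entry else []) + [name]       # steps 11-12
        loader = self._loader_factory(os.path.join(self._dir, name))
        new_entry = StreamEntry(loader, timestamp, paths)
        self._streams[suffix] = new_entry
        return new_entry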
We're proposing that MAX_INACTIVE_AGE be set to 24 hours (or perhaps slightly more to allow for files written exactly once every 24 hours). OOO_WRITE_CHECK_COUNT is currently 20 but I think it could be lower (like 1 or 2, really).

The goal of this new algorithm is that we still proceed with loading files in timestamp order, but once we get to the end of that order, we loop back to the beginning and re-read files again. This handles the case where there are multiple active streams of data in different files.

To keep from having to re-read every file, we mark files as "dead" in two different ways: 1) if a file hasn't been written to in MAX_INACTIVE_AGE (as determined by its last event time), and 2) if a file is preempted by a newer-timestamped file with the same suffix. However, files that are dead for reason 2 we still keep around, so we can optionally check whether they have been modified in case we want to detect out-of-order writes. The end result should be that for a static set of files (e.g. a terminated experiment) we do read all the data, but once we've gone through the initial set of files, we only reload data from the set of active "head" files of the streams that might still be open, which is one stream per suffix for files written to in the last 24 hours. This keeps us from having to keep event file readers open, and querying for new data, for every single event file, even for very old logdirs.

@chihuahua
Member Author

@nfelt, thank you for that recap and outline! #1064

@chihuahua
Member Author

@nfelt, your plan as outlined above, plus adding tests, sounds great. We probably want to merge this change soon to unblock the migration to tf.contrib summaries. Since I'm OOO, do you want to take over this effort?

@nfelt
Collaborator

nfelt commented Mar 21, 2018

@chihuahua oh shoot, I forgot you were out starting tomorrow... hmm we probably should have started the code in a separate file so we could check it in and then keep working on tests, etc. I guess I'll just grab a copy of the code from your PR and make a new PR? Unless you have time to quickly change the current PR to target a separate file, and then we could merge it as work-in-progress and I could take over from there.

@chihuahua
Member Author

Ah yes. Within #1064, I moved the changes to a new suffix_based_directory_watcher.py file (the name is definitely tentative). We can check it in and iterate. Or, you could merge the changes into your own new PR. Thanks for the idea.

@andrewzu

Is this feature still planned to be supported?

@gogasca

gogasca commented Jun 22, 2019

# throttle_secs controls how soon evaluation may run again; save_checkpoints_steps
# controls how often the training run writes checkpoints for it to pick up.
eval_spec = tf.estimator.EvalSpec(
    eval_input,
    steps=args.eval_steps,
    throttle_secs=1,
    exporters=[exporter])

run_config = tf.estimator.RunConfig(
    session_config=_get_session_config_from_env_var(),
    save_checkpoints_steps=100)

throttle_secs and save_checkpoints_steps did the trick for me

@nfelt
Collaborator

nfelt commented Jul 29, 2019

I've merged #1867, which introduces the experimental --reload_multifile=true option. This can now be used to poll all "active" files in a directory for new data, rather than only the most recent one, and it addresses the core concern of this issue. Note that a file counts as "active" as long as it received new data within the last --reload_multifile_inactive_secs seconds, a new configurable flag that defaults to 4000, i.e. just over an hour.

If you've been waiting on this functionality, thanks for your patience, and please try out the new flag in tb-nightly starting with tomorrow's release and let us know how it works for you.
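
For example (the logdir path is just a placeholder):

tensorboard --logdir=/path/to/logs --reload_multifile=true

If your writers flush less often than roughly once an hour, you may also want to raise --reload_multifile_inactive_secs accordingly.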

@somedadaism

Just in time!
Thanks a lot!

@dsmic

dsmic commented Aug 21, 2019

Since in tf-2.0 beta1 the tensorflow.keras.callbacks.TensorBoard() callback seems to rely on that feature, it might be a good idea to enable it by default or (if that is not an option) point to this feature in the E0821 warning.

@bileschi bileschi added the theme:performance label Jan 2, 2020
tomaskrehlik added a commit to tomaskrehlik/GPflow that referenced this issue Apr 10, 2020
Tensorboard does not work well when multiple file writers write into
the same directory, see [related issue](tensorflow/tensorboard#1063).
The monitor creates a file writer with every summary. This pull
request resolves this issue by creating a single writer per run and
depending on that.
@pindinagesh

@chihuahua

Closing this issue since the PR is merged. Please feel free to reopen if this still exists. Thanks

@nfelt
Collaborator

nfelt commented Nov 30, 2021

This issue should stay open, since the flag is still experimental and not on by default yet.

@nfelt nfelt reopened this Nov 30, 2021