mnist_distributed.py example doesn't work with TF 1.11 #42

Closed
oliverhu opened this issue Oct 17, 2018 · 11 comments

@oliverhu
Member

The mnist_distributed.py example doesn't work with TF 1.11. It complains that FLAGS doesn't have ports or logdir.

@oliverhu
Member Author

@erwa does this work with 1.12 rc?

@erwa
Contributor

erwa commented Oct 18, 2018

I will try it out when I get a chance.

@gogasca
Contributor

gogasca commented Nov 20, 2018

I tested with TF 1.12 and hit the same issue. We will be releasing TF 2.0 early next year, and in preparation we are removing tf.flags and tf.app.
We probably need to define the flags manually (via argparse/absl). I will take a look and submit a PR.

2018-11-20 08:15:04.552459: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-20 08:15:04.559516: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> tony-staging-w-0.c.dpe-cloud-mle.internal:41051}
2018-11-20 08:15:04.559583: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:43585, 1 -> tony-staging-w-0.c.dpe-cloud-mle.internal:33949}
2018-11-20 08:15:04.560233: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:43585
Traceback (most recent call last):
  File "/usr/local/src/jobs/TFJob/src/mnist_distributed.py", line 246, in <module>
    tf.app.run()
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1542587994073_0025/container_1542587994073_0025_01_000003/venv/tf112/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/local/src/jobs/TFJob/src/mnist_distributed.py", line 214, in main
    start_tensorboard(FLAGS.working_dir)
  File "/usr/local/src/jobs/TFJob/src/mnist_distributed.py", line 176, in start_tensorboard
    FLAGS.logdir = checkpoint_dir
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1542587994073_0025/container_1542587994073_0025_01_000003/venv/tf112/lib/python3.5/site-packages/tensorflow/python/platform/flags.py", line 88, in __setattr__
    return self.__dict__['__wrapped'].__setattr__(name, value)
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1542587994073_0025/container_1542587994073_0025_01_000003/venv/tf112/lib/python3.5/site-packages/absl/flags/_flagvalues.py", line 498, in __setattr__
    return self._set_unknown_flag(name, value)
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1542587994073_0025/container_1542587994073_0025_01_000003/venv/tf112/lib/python3.5/site-packages/absl/flags/_flagvalues.py", line 374, in _set_unknown_flag
    raise _exceptions.UnrecognizedFlagError(name, value)
absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'logdir'
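
The traceback above comes from assigning to a flag that was never defined (FLAGS.logdir). As a minimal sketch of the "define flags manually" idea, assuming argparse is used, something like the following could replace the tf.flags usage. The helper name parse_flags and the default values are hypothetical and not taken from mnist_distributed.py; only working_dir and logdir appear in the traceback.

```python
import argparse


def parse_flags(argv=None):
    """Parse flags with argparse instead of tf.flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Distributed MNIST example")
    # Hypothetical defaults; the real script may define a different set of flags.
    parser.add_argument("--working_dir", default="/tmp/mnist",
                        help="Directory for checkpoints and summaries.")
    parser.add_argument("--logdir", default=None,
                        help="TensorBoard log directory.")
    # parse_known_args tolerates extra arguments that a launcher may inject.
    flags, unparsed = parser.parse_known_args(argv)
    # Fall back to the working directory when no log directory is given.
    if flags.logdir is None:
        flags.logdir = flags.working_dir
    return flags, unparsed


# Passing argv explicitly keeps the example independent of the real command line.
FLAGS, extra = parse_flags(["--working_dir", "/tmp/run1", "--unknown_flag"])
print(FLAGS.working_dir, FLAGS.logdir, extra)
```

Because argparse returns a plain Namespace, a later assignment such as FLAGS.logdir = checkpoint_dir is an ordinary attribute write and does not go through absl's unknown-flag check.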

@oliverhu
Member Author

That would be great, thanks @gogasca

@gogasca
Contributor

gogasca commented Nov 30, 2018

Hi Oliver,
TF 1.12 changed how flags are handled.
I opened this issue to track it: tensorflow/tensorboard#1642

This is the new way to launch TB:

import logging
import os

from tensorboard.plugins.core import core_plugin
import tensorboard.program as tb_program

def start_tensorboard(logdir):
    tb = tb_program.TensorBoard(plugins=[core_plugin.CorePluginLoader()])
    # os.getenv returns a string when the variable is set, so cast to int.
    port = int(os.getenv('TB_PORT_ENV_VAR', 6006))
    tb.configure(logdir=logdir, port=port)
    tb.launch()
    logging.info("Starting TensorBoard with --logdir=" + logdir)

The workaround for the official TF 1.12 pip install is to edit tensorboard/program.py and add a "not" to the conditional:

    for k, v in kwargs.items():
      if not hasattr(flags, k):
        raise ValueError('Unknown TensorBoard flag: %s' % k)
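
To see why the missing "not" matters, here is a standalone sketch of the same check. It is not the TensorBoard source: apply_overrides and the Namespace standing in for the parsed flags are hypothetical. With the "not" in place, only names that the flags object lacks are rejected; without it, the condition would be inverted and known names such as logdir would be the ones that raise.

```python
from argparse import Namespace


def apply_overrides(flags, **kwargs):
    """Copy keyword overrides onto flags, rejecting names that flags lacks."""
    for k, v in kwargs.items():
        # Same shape as the patched conditional: raise only for unknown names.
        if not hasattr(flags, k):
            raise ValueError('Unknown TensorBoard flag: %s' % k)
        setattr(flags, k, v)
    return flags


# Hypothetical stand-in for the flags object that TensorBoard builds internally.
flags = Namespace(logdir='', port=6006)
apply_overrides(flags, logdir='/tmp/logs', port=6007)
print(flags.logdir, flags.port)

try:
    apply_overrides(flags, no_such_flag=1)
except ValueError as err:
    print(err)
```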

Options:

  1. We are looking into whether we can release a TB 1.12.1 patch. Otherwise, the file needs to be edited and the env then zipped.
  2. If you try tf-nightly, this should work as well.

I will hold off on the PR until the new TF gets released.

@oliverhu
Member Author

oliverhu commented Dec 1, 2018

Hey @gogasca, do you mean the latest released TB is not compatible with the latest released TF?

@oliverhu
Member Author

oliverhu commented Dec 1, 2018

Got it. I feel TF evolves really fast (one patch/rc version a week), so we can wait for 1.12.1 if it is coming soon.

@erwa
Contributor

erwa commented Dec 3, 2018

Thanks for the investigation, @gogasca. I verified that after installing tf-nightly and changing start_tensorboard to

import logging
import os

from tensorboard.plugins.core import core_plugin
import tensorboard.program as tb_program

def start_tensorboard(logdir):
    tb = tb_program.TensorBoard(plugins=[core_plugin.CorePluginLoader()])
    # os.getenv returns a string when the variable is set, so cast to int.
    port = int(os.getenv('TB_PORT_ENV_VAR', 6006))
    tb.configure(logdir=logdir, port=port)
    tb.launch()
    logging.info("Starting TensorBoard with --logdir=" + logdir)

TensorBoard works with the MNIST distributed code.
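
One detail worth handling when reading the port this way: os.getenv returns a string whenever the variable is set, while the fallback 6006 is an int. A small sketch of normalizing the value before passing it on is below. The helper name resolve_port is hypothetical, and the environment variable name is copied from the snippet in this thread.

```python
import os


def resolve_port(env_var='TB_PORT_ENV_VAR', default=6006):
    """Return the port from env_var as an int, or default if unset or empty."""
    raw = os.getenv(env_var)
    if not raw:
        return default
    # int() raises ValueError on a non-numeric value, which surfaces a bad
    # configuration early instead of at server start.
    return int(raw)


os.environ['TB_PORT_ENV_VAR'] = '7007'
print(resolve_port())
```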

@gogasca
Contributor

gogasca commented Jan 9, 2019

TF 1.13 (the last 1.x version before 2.x) will be released at the end of next week.

@gogasca
Contributor

gogasca commented Jan 24, 2019

TF 1.13rc0 was released.

@oliverhu
Member Author

Fixed in #216.
