
CUDA OOM Issues #8

Open

rikabi89 opened this issue Feb 17, 2023 · 10 comments

Comments

@rikabi89

Sorry to be here again.

I have a 3070 8GB

Now my dataset is fine, but I keep getting CUDA errors. I've identified three places in the yml where I can reduce batch sizes, but even setting them to 1 gets me an error.

I've also tried changing mega_batch_factor: as per your notes.

I tried a much smaller dataset of 600 wav files.

I get this:

```
Traceback (most recent call last):
  H:\DL-Art-School\codes\train.py:370, in <module>
    trainer.do_training()
  H:\DL-Art-School\codes\train.py:325, in do_training
    self.do_step(train_data)
  H:\DL-Art-School\codes\train.py:206, in do_step
    gradient_norms_dict = self.model.optimize_parameters(self.current_step, return_g
  H:\DL-Art-School\codes\trainer\ExtensibleTrainer.py:302, in optimize_parameters
    ns = step.do_forward_backward(state, m, step_num, train=train_step, no_d
  H:\DL-Art-School\codes\trainer\steps.py:214, in do_forward_backward
    local_state[k] = v[grad_accum_step]
IndexError: list index out of range
```

@152334H
Owner

152334H commented Feb 17, 2023

> I have a 3070 8GB

That's a big problem... batch size and VRAM usage are only partially correlated; there is a minimum amount of VRAM needed just to load the full optimizer states of the GPT model.
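To make that concrete with back-of-the-envelope numbers (the parameter count below is a hypothetical placeholder, not an exact figure for this model):

```python
# Illustrative arithmetic only: plain AdamW with fp32 weights keeps, per parameter,
#   4 B weights + 4 B gradients + 4 B exp_avg + 4 B exp_avg_sq = 16 B
# before counting activations, which are what batch size actually controls.
n_params = 400e6  # hypothetical parameter count for a GPT-sized acoustic model

bytes_per_param = 4 + 4 + 4 + 4
static_gb = n_params * bytes_per_param / 1e9
print(f"~{static_gb:.1f} GB just for weights + grads + optimizer state")
# With ~6.4 GB already spoken for under that assumption, an 8GB card leaves
# very little headroom for activations, no matter how small the batch is.
```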

There is one immediate thing you could attempt: enable FP16 training. Keep the batch size at a reasonable level (some multiple of mega_batch_factor) to prevent the IndexError above from occurring.
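For context, this is what fp16/mixed-precision training boils down to in generic PyTorch terms (a sketch of the mechanism, not the actual DLAS code path; in DLAS it's just a flag in the yml):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()        # stand-in for the real network
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()               # scales losses to avoid fp16 underflow

x = torch.randn(8, 1024, device="cuda")
with torch.cuda.amp.autocast():                    # activations computed in half precision
    loss = model(x).pow(2).mean()

scaler.scale(loss).backward()                      # backward on the scaled loss
scaler.step(opt)                                   # unscales grads, skips step on inf/nan
scaler.update()
opt.zero_grad(set_to_none=True)
```

Most of the savings come from the half-precision activations, which is why it helps even when the optimizer states stay in fp32.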

If that is not sufficient, then more complicated efforts will be required to reduce VRAM usage. One option would be to preprocess the dataset into quantized mels ahead of time, rather than running the VQVAE on the fly. But personally, at that point I would recommend using the new colab notebook.
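Rough shape of the preprocessing idea, for reference (a sketch only: `load_dvae`, the paths, and the mel settings are placeholders, and `get_codebook_indices` assumes a lucidrains-style DiscreteVAE, not the exact DLAS entry points):

```python
import torch
import torchaudio
from pathlib import Path

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)
dvae = load_dvae("dvae.pth").cuda().eval()  # hypothetical loader for the pretrained VQVAE

out_dir = Path("precomputed_codes")
out_dir.mkdir(exist_ok=True)

with torch.no_grad():
    for wav_path in Path("dataset_wavs").glob("*.wav"):
        wav, sr = torchaudio.load(wav_path)
        wav = torchaudio.functional.resample(wav, sr, 22050)
        # Discrete mel codes, so the VQVAE never has to be run during training.
        codes = dvae.get_codebook_indices(mel(wav).cuda())
        torch.save(codes.cpu(), out_dir / (wav_path.stem + ".pth"))
```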

@rikabi89
Author

Ah, ok, fair enough. I did mess around with FP16 and other settings.

Now when I changed heads to 1, I got this:

[screenshot]
I guess this means nothing is happening?

@152334H
Owner

152334H commented Feb 17, 2023

> changed heads to 1,

That will not work. The architecture of the model must not be adjusted; you will get nonsense results if the model isn't fully loaded.

@152334H
Owner

152334H commented Feb 17, 2023

if someone knew how to implement LORA, it might be applicable to this situation
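For anyone curious, the core of LoRA is small: freeze the original weights and learn a low-rank update on top of them. A minimal generic PyTorch sketch, not wired into DLAS:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and learns a rank-r update on top of it."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # original GPT weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: starts as identity
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
```

Only the tiny A/B matrices (and their optimizer state) would need gradients, which is what could make an 8GB card plausible.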

but I think colab is the best option for now. I will close this issue until the situation changes.

152334H closed this as completed Feb 17, 2023
@Anomyous1

Anomyous1 commented Feb 22, 2023

Have you tried implementing it in Colossal-AI? It claims to get a 1.5x to 8x speedup on consumer PCs for training OPT- and GPT-type models through larger RAM/pagefile offloading magic.

> if someone knew how to implement LORA, it might be applicable to this situation
>
> but I think colab is the best option for now. I will close this issue until the situation changes.

@152334H
Owner

152334H commented Feb 23, 2023

> have you tried implementing it in colossal AI?

The primary problem with using ColossalAI, or any other "GPT-2 infer/train speedup" project, is that the GPT model here is not exactly the same as a normal GPT-2 model: it injects the conditional latents into the input embeddings on the first forward pass (or on every forward pass when there is no kv_cache). A speedup framework that doesn't expose callbacks at the forward pass (which is all I have seen so far) would have to be reworked in some manner.
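Roughly what that injection looks like, sketched with HuggingFace-style kwargs (names and shapes here are illustrative, not the actual tortoise/DLAS code):

```python
import torch

def forward_with_conditioning(gpt, text_emb, cond_latent, kv_cache=None):
    # cond_latent: (B, C, D) conditioning latents computed from reference audio.
    # On the first pass (no cache yet) they are prepended to the input embeddings,
    # which a framework that only accepts token ids can't reproduce.
    if kv_cache is None:
        inputs_embeds = torch.cat([cond_latent, text_emb], dim=1)
    else:
        inputs_embeds = text_emb          # later steps reuse the cached keys/values
    return gpt(inputs_embeds=inputs_embeds, past_key_values=kv_cache)
```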

It is possible I am missing some obvious performance gains, but so far integration has not been a straightforward process for me.

@152334H
Owner

152334H commented Feb 23, 2023

bitsandbytes

152334H reopened this Feb 23, 2023
@152334H
Owner

152334H commented Feb 23, 2023

Following the mrq implementation, I have added 8-bit training with bitsandbytes in 091c6b1.
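The change itself is essentially an optimizer swap; the pattern looks something like this (a sketch of the idea, not the exact code in 091c6b1):

```python
import bitsandbytes as bnb
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()  # stand-in for the GPT network being trained

# AdamW8bit stores both moment buffers quantized to int8, so the optimizer-state
# part of the VRAM floor drops to roughly a quarter of its fp32 size;
# weights and gradients are unaffected.
opt = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4, weight_decay=0.01)
```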

However, this will only work on Linux, because Windows has issues with direct pip installation of bnb. Paging @devilismyfriend for help here.

@152334H
Owner

152334H commented Feb 23, 2023

no bnb, bs=125:

[screenshot]

bnb (without fp16), bs=125:

[screenshot]

For some reason, my training seems to be substantially slower when I apply bnb with fp16 checked, while also not lowering memory use at all. To investigate.

@devilismyfriend
Collaborator

Yeah, should be an easy fix for Windows.
