Flux1 Kontext (Dev) support #707
Conversation
I wonder if we should use the input image dimensions as defaults for the output dimensions.
You are right. I guess this is good behavior then.
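For illustration, a minimal sketch of that defaulting behavior; the struct and function names are hypothetical, not the actual sd.cpp API:

```cpp
// Hypothetical sketch: fall back to the reference image's dimensions
// when the user did not request an explicit output size.
struct GenParams {
    int width  = 0;  // 0 means "not set by the user"
    int height = 0;
};

void apply_default_dims(GenParams& params, int ref_width, int ref_height) {
    if (params.width <= 0) {
        params.width = ref_width;
    }
    if (params.height <= 0) {
        params.height = ref_height;
    }
}
```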
How does the kontext img interact with an img2img source image? How does the flow go? I also noticed you're not actually limiting the kontext imgs from being used with regular flux; I wonder how that looks. Anyway, it's working well, very good work. Merging was a bit of a pain though, since the chroma PR isn't accepted yet, so there are a bunch of conflicts, and meanwhile your bleedingedge branch also has other stuff. But I managed.
Thank you for your contribution. Interestingly, I had also added support for kontext dev and added the edit mode. I pushed some changes to your branch to make the code easier to maintain.
That's because regular Flux and Flux Kontext have the exact same architecture, so I haven't found a way to tell them apart at runtime. Regular Flux gets very confused by the reference images, but it does its best.
To be honest, I haven't tried that configuration yet, but I don't think it should cause any weird interactions. The latent image is initialized with the source image (instead of being empty), then the model adds some noise and denoises from an advanced timestep like normal img2img; it's just also conditioned by the reference image. (Though with the edit mode it's no longer a concern.)
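For what it's worth, a rough sketch of that initialization under flow-matching assumptions (Flux-style rectified flow); all names here are illustrative, not the actual sd.cpp API:

```cpp
#include <cstddef>

// Illustrative sketch: instead of starting from pure noise, blend the
// VAE-encoded source latent with noise at the timestep chosen by the
// denoising strength, then denoise from there as in normal img2img.
void init_img2img_latent(float* latent,      // out: starting latent
                         const float* src,   // VAE encoding of the source image
                         const float* noise, // gaussian noise, same size
                         size_t n,           // number of latent elements
                         float strength) {   // 1.0 = full noise (plain txt2img)
    float t = strength;  // advanced start timestep
    for (size_t i = 0; i < n; i++) {
        // Linear interpolation between the clean source latent and noise.
        latent[i] = (1.0f - t) * src[i] + t * noise[i];
    }
}
```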
@stduhpf I noticed your … My prompt is "display the images side by side". Result: [image] Tried a few more times with equally odd results; often the second image is just completely ignored.
In the paper, they say: […]
@leejet I'm not convinced it's useful to make such a distinction between the "edit" and "txt2img" modes. Isn't edit mode just txt2img with image conditioning?
```cpp
// Compute RoPE position ids for each reference latent. Instead of giving
// every reference its own index "slice", the references are virtually
// stitched into one mosaic on slice 1: each new reference is placed to the
// right of or below the existing mosaic, choosing whichever direction keeps
// the combined extent smaller.
uint64_t curr_h_offset = 0;
uint64_t curr_w_offset = 0;
for (ggml_tensor* ref : ref_latents) {
    uint64_t h_offset = 0;
    uint64_t w_offset = 0;
    if (ref->ne[1] + curr_h_offset > ref->ne[0] + curr_w_offset) {
        w_offset = curr_w_offset;  // append to the right
    } else {
        h_offset = curr_h_offset;  // append below
    }

    auto ref_ids = gen_img_ids(ref->ne[1], ref->ne[0], patch_size, bs, 1, h_offset, w_offset);
    ids          = concat_ids(ids, ref_ids, bs);

    curr_h_offset = std::max(curr_h_offset, ref->ne[1] + h_offset);
    curr_w_offset = std::max(curr_w_offset, ref->ne[0] + w_offset);
}
```
If I understand correctly, this is "stitching" the reference images together (in the same 3D positional-encoding slice) instead of putting each one on its own "slice" like in the paper? It seems to work very well with this model, but it will need to be changed again in the future if a model with "true" support for multiple references gets released.
This is based on the ComfyUI implementation. In my tests, this implementation performed better when dealing with multiple reference images. I think the current kontext dev model supports multiple reference images.
> I think the current kontext dev model supports multiple reference images.

Well, as I understand it, with this implementation it kind of acts like all the reference images are just one big reference mosaic (well, not quite, since the VAE encodes them separately, but they are all positioned on the same "RoPE plane" of index 1, if that makes sense). Anyway, I agree that this implementation is better, at least for as long as there is no model that supports reference images with different indices.
@stduhpf Bear with me a bit, this is just some rambling. I did some searching on reddit and saw this comfyui workflow: https://www.reddit.com/r/comfyui/comments/1l2zsz2/flux_kontext_is_amazing/ Here, Flux Kontext ingests multiple separate images and can then generate a composite image containing all 3 subjects. What is not entirely clear is whether it's fed into the model as one source reference image, or several. I tried the current implementation in sd.cpp with the same prompt and these 2 images: [images] but instead, I received Walter Ramsay.
@LostRuins This is with commit 8967889?
Nope, this is with your earlier changes. Should I merge the latest?
@leejet's changes kind of fixed multiple references by virtually stitching the reference images together using RoPE offsets, so the model "sees" them as one mosaic, rather than implementing it like suggested in the original paper, with each image clearly separated along a third dimension. (It seems that ComfyUI does the same.)
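To make the contrast concrete, here is an illustrative sketch of the two indexing schemes (the names are hypothetical; the real code builds flat id tensors rather than structs). Each latent patch gets an (index, h, w) RoPE id:

```cpp
#include <cstdint>
#include <vector>

// Illustrative only, not the sd.cpp implementation.
struct ImgId {
    uint64_t index, h, w;
};

std::vector<ImgId> gen_ids(uint64_t h_patches, uint64_t w_patches,
                           uint64_t index, uint64_t h_off, uint64_t w_off) {
    std::vector<ImgId> ids;
    ids.reserve(h_patches * w_patches);
    for (uint64_t y = 0; y < h_patches; y++) {
        for (uint64_t x = 0; x < w_patches; x++) {
            ids.push_back({index, y + h_off, x + w_off});
        }
    }
    return ids;
}

// Paper-style: reference k lives on its own slice -> gen_ids(h, w, 1 + k, 0, 0)
// Mosaic-style (this PR / ComfyUI): all references share slice 1 and are
// spread out spatially instead            -> gen_ids(h, w, 1, h_off, w_off)
```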
I agree, it seems unnecessary to duplicate a whole different flow for this particular case. After all, we already add photomaker, controlnet and others directly in txt2img and img2img; I don't see why this should be different.
Alright, it seems mostly working now, very nice. Are there any guidelines to follow regarding the dimensions of the kontext input images? Should they be resized to the output dimensions, matched in aspect ratio, or is that not needed?
https://github.com/comfyanonymous/ComfyUI/blob/master/comfy_extras/nodes_flux.py#L60-L100
Because the edit model and the txt2img model are different models, distinguishing between the edit mode and the txt2img mode is more user-friendly from the user's perspective, although the two workflows are very similar.
According to my test results, the resolution of the reference image has little impact. Actually, I suggest reducing the image size when VRAM is limited, in order to reduce VRAM usage and improve generation speed.
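As a rough illustration of such downscaling, a hedged sketch; the pixel budget and the multiple-of-16 rounding (8x VAE downsampling times the 2x2 patch size) are assumptions, not values from this PR:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Shrink a reference image to a pixel budget while preserving aspect ratio.
void fit_to_budget(int& w, int& h, int64_t max_pixels = 1024 * 1024) {
    int64_t pixels = (int64_t)w * h;
    if (pixels > max_pixels) {
        double scale = std::sqrt((double)max_pixels / (double)pixels);
        w = (int)(w * scale);
        h = (int)(h * scale);
    }
    // Keep dimensions compatible with the latent/patch grid (assumed stride).
    w = std::max(16, w - w % 16);
    h = std::max(16, h - h % 16);
}
```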
@stduhpf That civitai clothing lora got deleted, so I didn't see your reply. But I was saying that I don't think flux loras are actually working correctly on Kontext, similar to your own observation.
It should be expected for Flux [Dev] LoRAs not to work very well with Flux Kontext [Dev]: Flux Kontext [Dev] is distilled from Flux Kontext [Pro] rather than fine-tuned from Flux [Dev] on edit tasks, so the two models can be quite different. But I noticed that some loras, like the ones to reduce the number of steps, seem to work somewhat, which is interesting. They probably used Flux [Dev] as a base for the distillation.
It seems that this PR can be merged now. Thank you everyone!
https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev/
Usage:
sd.exe -M edit --diffusion-model ..\models\unet\flux1-kontext-dev.safetensors --clip_l .\models\clip\clip_l\clip_l.safetensors --t5xxl .\models\clip\t5\t5xxl_fp16.safetensors --vae .\models\vae\flux\ae.f16.gguf --cfg-scale 1 --sampling-method euler --steps 20 --color -v --guidance 2.5 -p 'Prompt' -r reference.png
Example outputs: