Flux1 Kontext (Dev) support #707


Merged
merged 2 commits into leejet:master on Jun 29, 2025

Conversation

stduhpf
Contributor

@stduhpf stduhpf commented Jun 27, 2025

https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev/

Usage:

sd.exe -M edit --diffusion-model ..\models\unet\flux1-kontext-dev.safetensors --clip_l .\models\clip\clip_l\clip_l.safetensors --t5xxl .\models\clip\t5\t5xxl_fp16.safetensors --vae .\models\vae\flux\ae.f16.gguf --cfg-scale 1 --sampling-method euler --steps 20 --color -v --guidance 2.5 -p 'Prompt' -r reference.png

Example outputs:

reference image, plus outputs for the prompts: "Replace the text with 'KONTEXT'", "Change the background to a mountain backdrop", "The cat is walking on the roof, remove the sign"
[reference image and three output images]

@Green-Sky
Contributor

I wonder if we should use the input image dimensions as defaults for the output dimensions.

@Green-Sky
Contributor

Green-Sky commented Jun 27, 2025

input: flux 1-lite-8B-f16 image, prompt: "add a plane into the sky" → [output image]

Looks like if the input image is smaller than the (default) specified image resolution, it gets cropped to the upper left.

768x768:

input: flux 1-lite-8B-f16 image, prompt: "add a plane into the sky" → [output image]

edit: Oh, and this is with CUDA.

@stduhpf
Contributor Author

stduhpf commented Jun 27, 2025

Looks like if the input image is smaller than the (default) specified image resolution, it gets cropped to the upper left.

Hmm, it looks like it tends to re-frame the image if the resolutions don't match, but depending on the prompt/seed the exact framing changes, and it doesn't seem to always just crop the top left of the reference image to the target resolution.

Example: you can see the big cloud and part of the slope on the right were included in this 512x512 image; these were missing from your example with the plane.
output

Here's what a simple 512x512 crop would look like:
image
That's even less than what's included in the plane image.

@Green-Sky
Contributor

That's even less than what's included in the plane image.

You are right. I guess this is good behavior then.

@bssrdf
Contributor

bssrdf commented Jun 27, 2025

input: [image], prompt: "a beautiful girl model holding it" → [output image]

@LostRuins
Contributor

LostRuins commented Jun 28, 2025

How does the kontext img interact with an img2img source image? How does the flow go?

I also noticed you're not actually limiting the kontext imgs from being used with regular flux. I wonder how that looks.

Anyway, it's working well, very good work. Merging was a bit of a pain though, since the chroma PR isn't accepted yet so there's a bunch of conflicts. Meanwhile your bleedingedge branch also has other stuff. But I managed.

@leejet
Owner

leejet commented Jun 28, 2025

Thank you for your contribution. Interestingly, I had also added support for kontext dev, along with the edit mode, on my side. I pushed some of the changes to your branch to make the code easier to maintain.

@leejet
Owner

leejet commented Jun 28, 2025

.\bin\Release\sd.exe -M edit -r .\kontext_input.png --diffusion-model ..\models\flux1-kontext-dev-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\ae.sft --clip_l ..\..\ComfyUI\models\clip\clip_l.safetensors --t5xxl ..\..\ComfyUI\models\clip\t5xxl_fp16.safetensors -p "change 'flux.cpp' to 'kontext.cpp'" --cfg-scale 1.0 --sampling-method euler -v
ref_image: kontext_input, prompt: "change 'flux.cpp' to 'kontext.cpp'" → [output image]

@stduhpf
Contributor Author

stduhpf commented Jun 28, 2025

I also noticed you're not actually limiting the kontext imgs from being used with regular flux. I wonder how that looks.

That's because regular Flux and Flux Kontext have the exact same architecture, so I haven't found a way to tell them apart at runtime. Regular Flux gets very confused by the reference images, but it does its best.

How does the kontext img interact with an img2img source image? How does the flow go?

To be honest, I haven't tried that configuration yet, but I don't think it should cause any weird interactions. The latent image is initialized with the source image (instead of empty), and then the model starts adding some noise and denoising at an advanced timestep like normal img2img; it's just also conditioned on the reference image. (Though with the edit mode it's no longer a concern.)
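
The img2img flow described here can be sketched as follows. This is a rough sketch of the usual strength/steps split; the function name and rounding are assumptions for illustration, not the exact sd.cpp code:

```cpp
#include <algorithm>

// Sketch of the usual img2img schedule split: with N sampling steps and
// a denoising strength in [0, 1], the source-image latent is noised up
// to the timestep at step N - t_enc, and only the last t_enc steps are
// actually denoised. strength = 1 behaves like txt2img (all steps run),
// strength = 0 returns the source image unchanged (no steps run).
int img2img_start_step(int steps, float strength) {
    float s = std::min(std::max(strength, 0.0f), 1.0f);
    int t_enc = static_cast<int>(s * steps + 0.5f); // steps actually run
    return steps - t_enc;                           // index where denoising starts
}
```

Under this view the Kontext reference image is orthogonal to the flow: it only changes the conditioning at every step, not where denoising starts.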

@LostRuins
Contributor

@stduhpf I noticed your kontext_imgs is a vector, i.e. it supports multiple images, but I'm not sure if I am doing it right.

I used 2 kontext_imgs:
ball
Walter_White_S5B

My prompt is "display the images side by side"

Result:

image

Tried a few more times with equally odd results; often the second image is just completely ignored.
Am I using it correctly?

@stduhpf
Contributor Author

stduhpf commented Jun 28, 2025

@stduhpf I noticed your kontext_imgs is a vector i.e. it supports multiple images, but I'm not sure if I am doing it right.

I used 2 kontext_imgs: ball Walter_White_S5B

My prompt is "display the images side by side"

Result:

image

Tried a few more times with equally odd results; often the second image is just completely ignored. Am I using it correctly?

In the paper, they say:
image
image
So my understanding is that while the currently released model wasn't trained to work with multiple images, a future release might be able to do that.

@stduhpf
Contributor Author

stduhpf commented Jun 28, 2025

@leejet I'm not convinced it's useful to make such a distinction between the "edit" and "txt2img" modes. Isn't edit mode just txt2img with image conditioning?

Comment on lines +625 to +641
uint64_t curr_h_offset = 0;
uint64_t curr_w_offset = 0;
for (ggml_tensor* ref : ref_latents) {
uint64_t h_offset = 0;
uint64_t w_offset = 0;
if (ref->ne[1] + curr_h_offset > ref->ne[0] + curr_w_offset) {
w_offset = curr_w_offset;
} else {
h_offset = curr_h_offset;
}

auto ref_ids = gen_img_ids(ref->ne[1], ref->ne[0], patch_size, bs, 1, h_offset, w_offset);
ids = concat_ids(ids, ref_ids, bs);

curr_h_offset = std::max(curr_h_offset, ref->ne[1] + h_offset);
curr_w_offset = std::max(curr_w_offset, ref->ne[0] + w_offset);
}
Contributor Author

@stduhpf stduhpf Jun 28, 2025

If I understand correctly, this is "stitching" the reference images together (in the same 3D positional-encoding slice) instead of putting them each on their own "slice" like in the paper? It seems to work very well with this model, though it will need to be changed again in the future if a model with "true" support for multiple references gets released.

Owner

This is based on the implementation of comfyui. In my tests, this implementation performed better when dealing with multiple reference images. I think the current kontext dev model supports multiple reference images.

Contributor Author

I think the current kontext dev model supports multiple reference images.

Well, as I understand it, with this implementation it kind of acts like all the reference images are just one big reference mosaic (well, not quite, since the VAE encodes them separately, but they are positioned on the same "RoPE plane" of index 1, if that makes sense). Anyway, I agree that this implementation is better, at least as long as there is no model that supports reference images with different indices.
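
The offset logic from the quoted diff can be reproduced as a standalone function, to see where each reference lands in the virtual mosaic without needing ggml. `place_refs` is a sketch for illustration, not an sd.cpp function; sizes are (width, height) in latent patches, matching `ne[0]`/`ne[1]` in the diff:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Each reference is appended either to the right of or below the
// references placed so far, whichever keeps the virtual mosaic squarer:
// if the mosaic is currently taller than it is wide, the next reference
// goes to the right, otherwise below.
std::vector<std::pair<uint64_t, uint64_t>> // (h_offset, w_offset) per ref
place_refs(const std::vector<std::pair<uint64_t, uint64_t>>& sizes) {
    std::vector<std::pair<uint64_t, uint64_t>> offsets;
    uint64_t curr_h_offset = 0, curr_w_offset = 0;
    for (const auto& [w, h] : sizes) {
        uint64_t h_offset = 0, w_offset = 0;
        if (h + curr_h_offset > w + curr_w_offset) {
            w_offset = curr_w_offset; // mosaic would get too tall: go right
        } else {
            h_offset = curr_h_offset; // otherwise stack downward
        }
        offsets.emplace_back(h_offset, w_offset);
        curr_h_offset = std::max(curr_h_offset, h + h_offset);
        curr_w_offset = std::max(curr_w_offset, w + w_offset);
    }
    return offsets;
}
```

With two square references the second one stacks below the first; if the first reference is taller than it is wide, the second goes to the right instead.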

@LostRuins
Contributor

@stduhpf bear with me a bit, this is just some rambling. I did some searching on reddit and saw this comfyui workflow: https://www.reddit.com/r/comfyui/comments/1l2zsz2/flux_kontext_is_amazing/

image

Here, Flux Kontext ingests multiple separate images and then can generate a composite image containing all 3 subjects.

image

What is not entirely clear is whether it's fed into the model as one source reference image, or several.

I tried the current implementation in sd.cpp, with the same prompt and these 2 images:

zzz Walter_White_S5B

but instead, I have received Walter Ramsay.

image

@stduhpf
Contributor Author

stduhpf commented Jun 28, 2025

@LostRuins Is this with commit 8967889?

@LostRuins
Contributor

Nope, this is with your earlier changes. Should I merge the latest?

@stduhpf
Contributor Author

stduhpf commented Jun 28, 2025

@leejet's changes kind of fixed multiple references by virtually stitching the reference images together using RoPE offsets, so the model "sees" them as one mosaic, rather than implementing it like the original paper suggests, with each image clearly separated in a third dimension. (It seems that ComfyUI does the same.)

image

@LostRuins
Contributor

@leejet I'm not convinced it's useful to make such a distinction between the "edit" and "txt2img" modes. Isn't edit mode just txt2img with image conditioning?

I agree, it seems unnecessary to duplicate a whole separate flow for this particular case. After all, we already add photomaker, controlnet and others directly in txt2img and img2img; I don't see why this should be different.

@LostRuins
Contributor

Alright seems mostly working now, very nice.

Are there any guidelines to follow regarding the size dimensions of the kontext input images? Should they be resized to output dims, aspect ratio, or not needed?

@stduhpf
Contributor Author

stduhpf commented Jun 28, 2025

Alright seems mostly working now, very nice.

Are there any guidelines to follow regarding the size dimensions of the kontext input images? Should they be resized to output dims, aspect ratio, or not needed?

https://github.com/comfyanonymous/ComfyUI/blob/master/comfy_extras/nodes_flux.py#L60-L100
Maybe this is what you're looking for? But I tried a lot of resolutions that don't match these "optimal" ones and didn't have any issues.
Also, reference image size increases the compute buffer as much as the output image size does, and on Vulkan at least, compute buffer size is a scarce resource (because of the allocation limit).
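
A back-of-the-envelope way to see why each reference costs as much as the output: with the Flux VAE's 8x downsampling followed by 2x2 latent patches, every image contributes roughly one token per 16x16 pixel block, and attention cost grows with the total sequence length. A rough estimate for illustration, not the actual compute-buffer math:

```cpp
#include <cstdint>

// Approximate number of sequence tokens a (w x h)-pixel image adds:
// 8x VAE downsampling, then patch size 2 -> one token per 16x16 pixels.
uint64_t flux_image_tokens(uint64_t w, uint64_t h) {
    return (w / 16) * (h / 16);
}
```

So a 1024x1024 reference adds about four times the tokens of a 512x512 one, which is why shrinking references helps so much on Vulkan.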

@leejet
Owner

leejet commented Jun 28, 2025

@leejet I'm not convinced it's useful to make such a distinction between the "edit" and "txt2img" modes. Isn't edit mode just txt2img with image conditioning?

Because the edit model and the txt2img model are different models, distinguishing between edit mode and txt2img mode is a more user-friendly approach from the user's perspective, even though the two workflows are largely similar.

@leejet
Owner

leejet commented Jun 28, 2025

@leejet's changes kind of fixed multiple references by virtually stitching the reference images together using RoPE offsets so the model "sees" them as one mosaic, rather than implementing it like suggested in the original paper with each image being clearly separated in a third dimension. (It seems that ComfyUI does the same)

image

This is based on the implementation of comfyui. In my tests, this implementation performed better when dealing with multiple reference images.

@leejet
Owner

leejet commented Jun 28, 2025

Alright seems mostly working now, very nice.

Are there any guidelines to follow regarding the size dimensions of the kontext input images? Should they be resized to output dims, aspect ratio, or not needed?

According to my test results, the resolution of the reference image has little impact. Actually, I suggest reducing the image size when VRAM is limited, to reduce VRAM usage and improve generation speed.
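
One way to follow that advice is to cap the longer side of the reference before encoding. `shrink_to_fit` is a hypothetical helper, not an sd.cpp function; the snap to multiples of 16 assumes the 8x VAE downsampling plus patch size 2:

```cpp
#include <algorithm>
#include <cmath>
#include <utility>

// Shrink (w, h) so the longer side is at most max_side, preserving the
// aspect ratio, then snap both sides down to multiples of 16 so the
// image maps cleanly onto latent patches. Images already small enough
// are only snapped, never upscaled.
std::pair<int, int> shrink_to_fit(int w, int h, int max_side) {
    int longer = std::max(w, h);
    if (longer > max_side) {
        double scale = static_cast<double>(max_side) / longer;
        w = static_cast<int>(std::round(w * scale));
        h = static_cast<int>(std::round(h * scale));
    }
    w = std::max(16, (w / 16) * 16); // snap down, keep at least one patch row
    h = std::max(16, (h / 16) * 16);
    return {w, h};
}
```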

@LostRuins
Contributor

@stduhpf that civitai clothing lora got deleted so I didn't see your reply - but I was saying that I don't think flux loras are actually working correctly on Kontext, similar to your own observation.

@stduhpf
Contributor Author

stduhpf commented Jun 28, 2025

@stduhpf that civitai clothing lora got deleted so I didn't see your reply - but I was saying that I don't think flux loras are actually working correctly on Kontext, similar to your own observation.

It should be expected for Flux [Dev] LoRAs not to work very well with Flux Kontext [Dev]: Flux Kontext [Dev] is distilled from Flux Kontext [Pro] rather than fine-tuned from Flux [Dev] on edit tasks, so the two models can be quite different. But I noticed that some LoRAs, like the ones that reduce the number of steps, seem to work somewhat, which is interesting. They probably used Flux [Dev] as a base for the distillation.

@leejet
Owner

leejet commented Jun 29, 2025

It seems that this PR can be merged now. Thank you everyone!

@leejet leejet merged commit c9b5735 into leejet:master Jun 29, 2025
9 checks passed